# questions
m
What’s the difference between the pandas generic dataset and, say, pandas csv dataset classes? From what I can tell, they offer the same functionality for reading csv files. Is one a legacy version that was supposed to be replaced by the other?
m
It's right in the docs:
```
Creates a new instance of ``GenericDataSet`` pointing to a concrete data file
on a specific filesystem. The appropriate pandas load/save methods are
dynamically identified by string matching on a best effort basis.
```
vs:
```
loads/saves data from/to a CSV file using an underlying
filesystem (e.g.: local, S3, GCS). It uses pandas to handle the CSV file.
```
So you can use the first one and potentially have parquet/CSV, but also things like feather, hdf and so on (which don't have their own datasets in Kedro but are supported by pandas). In `pandas.GenericDataSet`, you just specify the `file_format` argument in the catalog and then the dataset will basically use `pandas.read_{file_format}` to load the thing you want.
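To make the `pandas.read_{file_format}` idea concrete, here is a minimal sketch of that kind of lookup in plain pandas. It is not Kedro's actual code: the helper name `load_with_format` and the in-memory CSV are made up for illustration, and the real dataset also handles filesystems, saving and so on.

```python
import io

import pandas as pd


def load_with_format(source, file_format, **load_args):
    """Hypothetical helper: find pandas.read_<file_format> by name and call it."""
    reader = getattr(pd, f"read_{file_format}", None)
    if reader is None:
        raise ValueError(f"pandas has no read_{file_format} function")
    return reader(source, **load_args)


# An in-memory CSV stands in for a real file path here.
csv_text = "a,b\n1,2\n3,4\n"
df = load_with_format(io.StringIO(csv_text), "csv")
```

Switching `"csv"` to another format name would pick a different `pd.read_*` function, as long as pandas provides one and any optional dependency it needs is installed.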
m
I know, but why is there a need for a specific csv dataset then? I would assume that one is legacy?
m
Some history can be found here: https://github.com/kedro-org/kedro/pull/987. Maybe someone else from the Kedro team will chip in here.
d
@marrrcin covered everything for the most part. To add a bit more context around:
> I know, but why is there a need for a specific csv dataset then? I would assume that one is legacy?
You probably don't need to use the CSV dataset specifically if you use the generic one. However, the implementation of something like `pandas.ParquetDataSet` or `pandas.ExcelDataSet` is not just a wrapper around `pd.read_parquet` or `pd.read_excel`, respectively. Maybe these could converge at some point by having `pandas.GenericDataSet` (or, if there are no specific datasets left, you could just call it `pandas.PandasDataset`) include special logic for some of these file formats. That would provide a single, simpler entry point, but I don't know that this has really been discussed.
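Purely as an illustration of what such a single entry point could look like, here is a rough sketch. Everything in it is hypothetical: `SingleEntryDataset`, its `register` decorator and the column-stripping "special logic" for CSV are invented for the example and are not part of Kedro.

```python
import io

import pandas as pd


class SingleEntryDataset:
    """Hypothetical single entry point; not a Kedro class."""

    # Maps a file format to a custom loader holding format-specific logic.
    _special_loaders = {}

    def __init__(self, source, file_format, load_args=None):
        self.source = source
        self.file_format = file_format
        self.load_args = load_args or {}

    @classmethod
    def register(cls, file_format):
        """Decorator that registers a custom loader for one file format."""
        def decorator(func):
            cls._special_loaders[file_format] = func
            return func
        return decorator

    def load(self):
        # Prefer a registered loader; otherwise fall back to pandas.read_<format>.
        loader = self._special_loaders.get(self.file_format)
        if loader is None:
            loader = getattr(pd, f"read_{self.file_format}")
        return loader(self.source, **self.load_args)


@SingleEntryDataset.register("csv")
def _load_csv(source, **load_args):
    # Made-up special logic: strip whitespace from column names after reading.
    return pd.read_csv(source, **load_args).rename(columns=str.strip)


csv_df = SingleEntryDataset(io.StringIO(" a , b \n1,2\n"), "csv").load()
json_df = SingleEntryDataset(io.StringIO('[{"x": 1}]'), "json").load()
```

Formats with a registered loader get their extra handling, and everything else goes through the generic lookup, so the catalog would only ever need one dataset type.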