Matthias Roels
05/19/2023, 6:59 AM
marrrcin
05/19/2023, 7:12 AM
```Creates a new instance of ``GenericDataSet`` pointing to a concrete data file
on a specific filesystem. The appropriate pandas load/save methods are
dynamically identified by string matching on a best effort basis.```
vs:
```loads/saves data from/to a CSV file using an underlying
filesystem (e.g.: local, S3, GCS). It uses pandas to handle the CSV file.```
So you can use the first one and potentially have parquet/CSV, but also things like feather, hdf and so on (which don't have their own datasets in Kedro but are supported by pandas). In `pandas.GenericDataSet`, you just specify the `file_format` argument in the catalog, and then the dataset will basically use `pandas.read_{file_format}` to load the thing you want.
Matthias Roels
05/19/2023, 7:35 AM
marrrcin
05/19/2023, 8:57 AM
Deepyaman Datta
05/19/2023, 9:16 AM
I know, but why is there a need for a specific CSV dataset then? I would assume that one is legacy?
You probably don't need to use the CSV dataset specifically if you use the generic one. However, the implementation of something like `pandas.ParquetDataSet` or `pandas.ExcelDataSet` is different from just a wrapper around `pd.read_parquet` or `pd.read_excel`, respectively.
Maybe these could converge at some point, with `pandas.GenericDataSet` (or, if there are no format-specific datasets left, you can just call it `pandas.PandasDataset`) including special logic for some of these file formats. That would provide a single, simpler entry point, but I don't know that this has really been discussed.
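For anyone skimming the thread, here is a rough sketch of the `file_format` lookup described earlier (dispatching to `read_{file_format}` by name). It is an illustration, not Kedro's actual code: the function name `resolve_reader` and the `fake_pandas` stand-in are made up for this example so it runs without pandas installed. In real use you would pass the `pandas` module itself.

```python
import csv
import io
from types import SimpleNamespace


def resolve_reader(namespace, file_format):
    """Look up namespace.read_<file_format> by name; raise if it does not exist."""
    reader = getattr(namespace, f"read_{file_format}", None)
    if reader is None:
        raise ValueError(
            f"no read_{file_format} function found for file_format={file_format!r}"
        )
    return reader


# Hypothetical stand-in for the pandas module (assumption for this sketch only).
# With pandas available you would call resolve_reader(pd, "csv") instead.
fake_pandas = SimpleNamespace(
    read_csv=lambda buffer: list(csv.DictReader(buffer)),
)

# "csv" is the value you would put under file_format in the catalog entry.
reader = resolve_reader(fake_pandas, "csv")
rows = reader(io.StringIO("a,b\n1,2\n"))
```

The same name-based lookup is why any format pandas can read works with the generic dataset, and also why an unknown `file_format` only fails at lookup time.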