What's the difference between the pandas generic dataset and Kedro #questions

Matthias Roels

05/19/2023, 6:59 AM

What’s the difference between the pandas generic dataset and, say, pandas csv dataset classes? From what I can tell, they offer the same functionality for reading csv files. Is one a legacy version that was supposed to be replaced by the other?

marrrcin

05/19/2023, 7:12 AM

It's right in the docs:

```
Creates a new instance of ``GenericDataSet`` pointing to a concrete data file
on a specific filesystem. The appropriate pandas load/save methods are
dynamically identified by string matching on a best effort basis.
```

vs:

```
loads/saves data from/to a CSV file using an underlying
filesystem (e.g.: local, S3, GCS). It uses pandas to handle the CSV file.
```

So you can use the first one and potentially have parquet/CSV but also things like feather, hdf and so on (which don't have their own datasets in Kedro but are supported by pandas). In `pandas.GenericDataSet`, you just specify the `file_format` argument in the catalog, and then the dataset will basically use `pandas.read_{file_format}` to load the thing you want.
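
As a rough illustration of the "string matching on a best effort basis" lookup described above, here is a minimal sketch. It is not Kedro's actual implementation: `resolve_loader` and `fake_pandas` are hypothetical names made up for this example, and the stand-in namespace is used only so the snippet runs without third-party packages.

```python
from types import SimpleNamespace
from typing import Any, Callable


def resolve_loader(namespace: Any, file_format: str) -> Callable[..., Any]:
    """Look up ``read_<file_format>`` on ``namespace`` (e.g. the pandas module)."""
    name = f"read_{file_format}"
    loader = getattr(namespace, name, None)
    if not callable(loader):
        # Best effort only: an unknown format fails here, at lookup time.
        raise ValueError(f"No loader named {name!r} for file_format={file_format!r}")
    return loader


# Hypothetical stand-in for the pandas module (assumption for this sketch).
fake_pandas = SimpleNamespace(
    read_csv=lambda path, **kwargs: f"loaded {path} as csv",
)

print(resolve_loader(fake_pandas, "csv")("data.csv"))
```

With pandas installed, passing the `pandas` module itself as `namespace` and `"csv"` as `file_format` would resolve to `pandas.read_csv` in the same way.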

Matthias Roels

05/19/2023, 7:35 AM

I know, but why is there a need for a specific csv dataset then? I would assume that one is legacy?

marrrcin

05/19/2023, 8:57 AM

Some history can be found here: https://github.com/kedro-org/kedro/pull/987. Maybe someone else from the Kedro team will chip in here.

👍 1

Deepyaman Datta

05/19/2023, 9:16 AM

@marrrcin covered everything for the most part. To add a bit more context around this question:

> I know, but why is there a need for a specific csv dataset then? I would assume that one is legacy?

You probably don't need to use the CSV dataset specifically if you use the generic one. However, the implementation of something like `pandas.ParquetDataSet` or `pandas.ExcelDataSet` is different from just a wrapper around `pd.read_parquet` or `pd.read_excel`, respectively. Maybe these could converge at some point by having `pandas.GenericDataSet` (or, if there are no specific datasets, you can just call it `pandas.PandasDataset`) include special logic based on some of these file formats. That would provide a single, simpler entry point, but I don't know that this has really been discussed.
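
To make the "single entry point with special logic per file format" idea concrete, here is a hypothetical sketch. None of these names (`SPECIAL_LOADERS`, `special_loader`, `load`, `fake_pandas`) exist in Kedro, and the Excel default shown is purely illustrative rather than a statement of what the real dataset does.

```python
from types import SimpleNamespace
from typing import Any, Callable, Dict

# Hypothetical registry of formats that need more than a plain read_<format> call.
SPECIAL_LOADERS: Dict[str, Callable[..., Any]] = {}


def special_loader(file_format: str) -> Callable[[Callable[..., Any]], Callable[..., Any]]:
    """Register a format-specific loader under ``file_format``."""
    def register(func: Callable[..., Any]) -> Callable[..., Any]:
        SPECIAL_LOADERS[file_format] = func
        return func
    return register


@special_loader("excel")
def load_excel(namespace: Any, path: str, **kwargs: Any) -> Any:
    # Illustrative format-specific step: fill in a default option before reading.
    kwargs.setdefault("engine", "openpyxl")
    return namespace.read_excel(path, **kwargs)


def load(namespace: Any, file_format: str, path: str, **kwargs: Any) -> Any:
    """Use special logic when registered, else fall back to read_<file_format>."""
    handler = SPECIAL_LOADERS.get(file_format)
    if handler is not None:
        return handler(namespace, path, **kwargs)
    return getattr(namespace, f"read_{file_format}")(path, **kwargs)


# Hypothetical stand-in for the pandas module so the sketch runs on its own.
fake_pandas = SimpleNamespace(
    read_csv=lambda path, **kw: ("csv", path, kw),
    read_excel=lambda path, **kw: ("excel", path, kw),
)

print(load(fake_pandas, "csv", "a.csv"))
print(load(fake_pandas, "excel", "a.xlsx"))
```

Formats without a registered handler go through the generic lookup, so the specialised paths stay opt-in and the caller only ever sees one `load` function.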

👍 2

👍🏼 1
