Hello is there a reason why `pandas ParquetDataSet` does not Kedro #questions

FlorianGD

02/14/2023, 4:32 PM

Hello, is there a reason why `pandas.ParquetDataSet` does not use pandas all the time? I would like to use it for partitioned data, and I want to use the `filters` that `pandas.read_parquet` provides, but it is not available for `pyarrow.parquet.ParquetDataset.read`. Doing a quick test and using `pd.read_parquet` every time seems to work ok, even though it does not behave exactly the same.

FlorianGD

02/14/2023, 4:32 PM

If needed, I could try and make a PR

datajoely

02/14/2023, 5:06 PM

It actually predates parquet support in Pandas

datajoely

02/14/2023, 5:06 PM

so we should actually just adopt their API

datajoely

02/14/2023, 5:07 PM

fancy raising a PR to clean this up?

datajoely

02/14/2023, 5:07 PM

you could also use `pandas.GenericDataSet` if you definitely want to use the pandas API

FlorianGD

02/14/2023, 5:20 PM

OK, I do not have the time tonight, but I will open an issue and try to make a PR afterwards. Thanks for pointing out the `GenericDataSet`, I was not aware of it!

👍 1

John Melendowski

02/14/2023, 10:58 PM

Chiming in here: pyarrow does support this, I think you're just using the wrong portion of the API. The function below returns a pyarrow `Table`, which has a method for converting to pandas: https://arrow.apache.org/docs/python/generated/pyarrow.parquet.read_table.html

👍 1

FlorianGD

02/15/2023, 8:30 AM

I think using pandas everywhere would make the `load_args` consistent and not depend on whether we try to read a folder or a file

FlorianGD

02/15/2023, 8:30 AM

and I think pandas delegates to this method under the hood
