#questions

FlorianGD

02/14/2023, 4:32 PM
Hello, is there a reason why `pandas.ParquetDataSet` does not use pandas all the time? I would like to use it for partitioned data, and I want to use the `filters` that `pandas.read_parquet` provides, but it is not available for `pyarrow.parquet.ParquetDataset.read`. Doing a quick test and using `pd.read_parquet` every time seems to work OK, even though it does not behave exactly the same.
If needed, I could try and make a PR.

datajoely

02/14/2023, 5:06 PM
It actually predates parquet support in pandas,
so we should just adopt their API.
Fancy raising a PR to clean this up?
You could also use `pandas.GenericDataSet` if you definitely want to use the pandas API.

FlorianGD

02/14/2023, 5:20 PM
OK, I do not have the time tonight, but I will open an issue and try to make a PR afterwards. Thanks for pointing out `GenericDataSet`, I was not aware of it!
👍 1

John Melendowski

02/14/2023, 10:58 PM
Chiming in here: pyarrow does support this, I think you're just using the wrong portion of the API. The function below returns a pyarrow `Table`, which has a method for converting to pandas: https://arrow.apache.org/docs/python/generated/pyarrow.parquet.read_table.html
👍 1

FlorianGD

02/15/2023, 8:30 AM
I think using pandas everywhere would make the `load_args` consistent, so they would not depend on whether we try to read a folder or a file,
and I think pandas delegates to this method under the hood.