Hello, is there a reason why `pandas.ParquetDataSe...
# questions
f
Hello, is there a reason why `pandas.ParquetDataSet` does not use pandas all the time? I would like to use it for partitioned data, and I want to use the `filters` that `pandas.read_parquet` provides, but it is not available for `pyarrow.parquet.ParquetDataset.read`. Doing a quick test and using `pd.read_parquet` every time seems to work ok, even though it does not behave exactly the same.
If needed, I could try and make a PR
d
It actually predates parquet support in Pandas
so we should actually just adopt their API
Fancy raising a PR to clean this up?
you could also use `pandas.GenericDataSet` if you definitely want to use the pandas API
f
OK, I do not have the time tonight, but I will open an issue and try to make a PR afterwards. Thanks for pointing out the `GenericDataSet`, I was not aware of it!
👍 1
j
Chiming in here: pyarrow does support this, I think you're just using the wrong portion of the API. The function below returns a pyarrow `Table`, which has a method for converting to pandas: https://arrow.apache.org/docs/python/generated/pyarrow.parquet.read_table.html
👍 1
f
I think using pandas everywhere would make the `load_args` consistent and not depend on whether we try to read a folder or a file
and I think pandas delegates to this method under the hood