# questions
i
Hi all, does anybody have any experience implementing a custom PyArrow dataset? We're thinking of replacing our partitioned pandas datasets with PyArrow's hive partitioning implementation and dropping the vanilla partitioned datasets
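For reference, a minimal sketch of what PyArrow's hive partitioning gives out of the box; the paths and partition column names here are illustrative, not from an actual dataset:

```python
# Minimal sketch of reading/writing a hive-partitioned Parquet dataset with
# pyarrow.dataset. Paths and column names are made up for illustration.
import pyarrow.dataset as ds

# e.g. a layout like data/year=2024/month=1/part-0.parquet
dataset = ds.dataset("data/", format="parquet", partitioning="hive")

# Partition columns show up as regular columns; filters on them are pushed
# down so only the matching directories/files are read.
table = dataset.to_table(filter=(ds.field("year") == 2024) & (ds.field("month") == 1))

# Writing back out with hive-style partitioning on chosen columns.
ds.write_dataset(
    table,
    "out/",
    format="parquet",
    partitioning=["year", "month"],
    partitioning_flavor="hive",
)
```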
d
The very early versions of the parquet dataset worked this way because they predated pandas adding partitioning support
I think approaches have moved on since then, so it's best to start from a modern starting point. I do wonder if Ibis may also help here…
i
Yep, arrow -> duckdb -> ibis is part of why I'm looking into it
potentially directly arrow -> ibis
but I think the underlying dataset implementation for a partitioned Parquet dataset should be an Arrow definition, which you then filter using Ibis (@Cody Peterson)
❤️ 1
c
Ibis can consume and produce PyArrow objects or Parquet files
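A small sketch of that round trip, with made-up column names and paths; execution here happens on whatever Ibis backend is the default:

```python
# Round-tripping between PyArrow and Ibis. Table contents and paths are
# illustrative only.
import ibis
import pyarrow as pa

pa_table = pa.table({"date": ["2024-01-01", "2024-01-02"], "value": [1, 2]})

# Consume a PyArrow table as an Ibis table expression.
t = ibis.memtable(pa_table)
filtered = t.filter(t.value > 1)

# Produce PyArrow again once the expression is executed.
result = filtered.to_pyarrow()

# Or read Parquet files directly into an Ibis table expression.
t2 = ibis.read_parquet("some/path/*.parquet")
```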
i
Would you go the PyArrow -> Ibis route or directly query hive-partitioned parquet files using Ibis? The files are on abfs and the final target is to filter partitions on a range of dates. Ideally I would eventually use something like the DuckDB azure extension once it becomes usable from Ibis, which should give better performance by allowing filter pushdown on partitions https://github.com/ibis-project/ibis/issues/7493
Polars (I think) has native Azure Blob filter-pushdown capabilities in its scan_parquet function, but ideally I would interface through Ibis since that gives us a single API for our SQL and file-based queries.
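To make the two routes concrete, a sketch on a local hive-partitioned copy; the abfs/azure-extension piece is left out since that was the open question, the column names are hypothetical, and it assumes the DuckDB backend forwards extra keyword arguments to DuckDB's read_parquet:

```python
# Two routes to a date-range filter over a hive-partitioned dataset.
# Paths and the "date" partition column are illustrative.
import ibis
import pyarrow.dataset as ds

# Route 1: prune partitions with a PyArrow dataset, then hand the result to Ibis.
dataset = ds.dataset("data/", format="parquet", partitioning="hive")
pruned = dataset.to_table(filter=ds.field("date") >= "2024-01-01")
t = ibis.memtable(pruned)

# Route 2: let the DuckDB backend read the partitioned files directly.
# Assumption: hive_partitioning is passed through to DuckDB's read_parquet.
con = ibis.duckdb.connect()
t2 = con.read_parquet("data/**/*.parquet", hive_partitioning=True)
t2 = t2.filter(t2.date >= "2024-01-01")
```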
c
I would recommend reading the Parquet files; PyArrow may be used behind the scenes. You could use the Polars backend to read in the data and also process it, I believe it uses the native scan_parquet function. You could also use the Polars backend to read and then pass the data off to a DuckDB connection for further processing via PyArrow objects
🎉 2
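A sketch of that hand-off pattern, with illustrative paths, table names, and columns:

```python
# Read with the Polars backend, materialize as PyArrow, continue in DuckDB.
# Paths, table names, and the "date" column are illustrative.
import ibis

polars_con = ibis.polars.connect()
raw = polars_con.read_parquet("data/*.parquet")  # uses Polars' native parquet scan

# Materialize the result as a PyArrow table.
pa_table = raw.to_pyarrow()

# Register it with a DuckDB backend for further processing.
duck_con = ibis.duckdb.connect()
t = duck_con.create_table("raw_data", pa_table)
summary = t.group_by("date").aggregate(n=t.count())
```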
i
Ibis does a flawless passthrough to the native scan_parquet functionality, wow
🚀 2
Just following up on this: we ended up building a dataset which directly uses pyarrow's parquet reader and writer. It solved a lot of the pain points around partitioning and updating data that we had with the standard Kedro partitioned datasets. The idea is to eventually move on to Ibis, which will give us a unified data processing API regardless of whether the data comes from SQL or a file. But for now we implemented a basic Arrow dataset which can then be passed on to any library that accepts one as input.
❤️ 2
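The actual implementation isn't shown in the thread, but a rough sketch of the shape such a Kedro dataset can take, assuming a recent Kedro where AbstractDataset lives in kedro.io; class and argument names are illustrative:

```python
# Rough sketch of a Kedro dataset wrapping pyarrow's parquet reader/writer.
# Not the implementation from the thread; names are illustrative and it assumes
# a recent Kedro (AbstractDataset in kedro.io).
from typing import Any

import pyarrow as pa
import pyarrow.dataset as ds
from kedro.io import AbstractDataset


class ArrowPartitionedDataset(AbstractDataset[pa.Table, pa.Table]):
    def __init__(self, path: str, partition_cols: list[str]):
        self._path = path
        self._partition_cols = partition_cols

    def _load(self) -> pa.Table:
        # Hive-partitioned read; partition columns come back as regular columns.
        return ds.dataset(self._path, format="parquet", partitioning="hive").to_table()

    def _save(self, data: pa.Table) -> None:
        # Write hive-style partitions, leaving other partitions in place.
        ds.write_dataset(
            data,
            self._path,
            format="parquet",
            partitioning=self._partition_cols,
            partitioning_flavor="hive",
            existing_data_behavior="overwrite_or_ignore",
        )

    def _describe(self) -> dict[str, Any]:
        return {"path": self._path, "partition_cols": self._partition_cols}
```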