# questions
Hi all, does anybody have any experience implementing a custom PyArrow dataset? We're thinking of replacing our partitioned pandas datasets with PyArrow's hive partitioning implementation and dropping vanilla partitioned datasets altogether
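For anyone reading along, a minimal sketch of what the hive-partitioning route looks like with PyArrow alone; the path and the partition columns (`year`, `month`) are made up for illustration:

```python
import pyarrow.dataset as ds

# Discover a hive-partitioned layout, e.g. data/sales/year=2024/month=6/part-0.parquet;
# the partition directories become regular columns on the resulting table.
dataset = ds.dataset("data/sales", format="parquet", partitioning="hive")

# Filters on partition columns prune whole directories before any file is read.
table = dataset.to_table(
    filter=(ds.field("year") == 2024) & (ds.field("month") >= 6)
)
```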
The very early versions of the parquet dataset worked like this, because they predated pandas adding support for it
I think approaches have moved on since then, so it's best to start from a modern baseline. I do wonder if Ibis may also help here…
Yep, arrow -> duckdb -> ibis is part of why I'm looking into it
potentially directly arrow -> ibis
but I think the underlying dataset implementation for a partitioned parquet dataset should be an Arrow definition which you then filter using Ibis (@Cody Peterson)
❤️ 1
Ibis can consume and produce PyArrow objects or Parquet files
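As a quick illustration of that round trip (table contents are invented, recent Ibis API assumed):

```python
import pyarrow as pa
import ibis

# Consume: wrap an in-memory PyArrow table as an Ibis expression.
events = ibis.memtable(pa.table({"id": [1, 2, 3], "value": [10.0, 20.0, 30.0]}))

# Produce: execute the expression and get a PyArrow table back out.
result = events.filter(events.value > 15).to_pyarrow()
```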
Would you go the PyArrow -> Ibis route or directly query hive-partitioned parquet files using Ibis? Files are on abfs and the end goal is to filter partitions based on a range of dates. Ideally I would eventually use something like the DuckDB Azure extension once it becomes usable from Ibis, which should give better performance by allowing filter pushdown on partitions: https://github.com/ibis-project/ibis/issues/7493
Polars (I think) natively has Azure Blob filter-pushdown capabilities in its scan_parquet function, but ideally I would interface through Ibis since that gives a single API for our SQL and file-based queries.
I'd recommend reading the Parquet files directly; PyArrow may be used behind the scenes. You could use the Polars backend to read in the data and also process it, which I believe uses the native scan_parquet function. You could also use the Polars backend to read and then pass the data off to a DuckDB connection for further processing via PyArrow objects, as sketched below.
🎉 2
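A rough sketch of that read-with-Polars, process-with-DuckDB hand-off; the file path and table name are invented, and whether the Polars scan can hit abfs directly depends on your Polars build:

```python
import ibis

# Read with the Polars backend (this should dispatch to Polars' native scan_parquet).
pl_con = ibis.polars.connect()
orders = pl_con.read_parquet("orders/*.parquet")  # swap in your abfs:// URI here

# Hand the result to DuckDB as a PyArrow object for further processing.
duck_con = ibis.duckdb.connect()
orders_duck = duck_con.create_table("orders", orders.to_pyarrow())
```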
Ibis does a flawless passthrough to scan_parquet's native functionality, wow
🚀 2
Just following up on this: we ended up building a dataset which directly uses PyArrow's Parquet reader and writer. It solved a lot of pain points around partitioning and updating data that we had with the standard Kedro partitioned datasets. The idea is to eventually move on to Ibis, which will give us a unified data processing API regardless of whether the data comes from SQL or a file. But for now we implemented a basic Arrow dataset, which can then be passed on to any library that accepts one as input.
❤️ 2
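For anyone curious, a hedged sketch of what such a dataset could look like as a Kedro custom dataset wrapping PyArrow's Parquet reader and writer; this is not the poster's actual implementation, and the class, path, and partition-column handling are illustrative (Kedro 0.19+ naming assumed):

```python
from __future__ import annotations

from typing import Any

import pyarrow as pa
import pyarrow.dataset as ds
import pyarrow.parquet as pq
from kedro.io import AbstractDataset  # AbstractDataSet in older Kedro releases


class ArrowHiveParquetDataset(AbstractDataset[pa.Table, pa.Table]):
    """Illustrative dataset that reads/writes hive-partitioned Parquet via PyArrow."""

    def __init__(self, path: str, partition_cols: list[str] | None = None) -> None:
        self._path = path
        self._partition_cols = partition_cols or []

    def _load(self) -> pa.Table:
        # Partition columns are recovered from the key=value directory names on load.
        return ds.dataset(self._path, format="parquet", partitioning="hive").to_table()

    def _save(self, data: pa.Table) -> None:
        # write_to_dataset lays the table out as a hive-partitioned directory tree.
        pq.write_to_dataset(data, root_path=self._path, partition_cols=self._partition_cols)

    def _describe(self) -> dict[str, Any]:
        return {"path": self._path, "partition_cols": self._partition_cols}
```

The loaded `pa.Table` can then be handed straight to DuckDB, Polars, or Ibis, which matches the "pass it on to any library that accepts one" idea above.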