# questions
i
Hi all, does anybody have any experience implementing a custom PyArrow dataset? We're thinking of replacing our partitioned pandas datasets with PyArrow's hive partitioning implementation and dropping the vanilla partitioned datasets
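For reference, a minimal sketch of what PyArrow's hive partitioning gives out of the box; the paths and partition column names here are illustrative, not from an actual dataset:

```python
# Minimal sketch of reading/writing a hive-partitioned Parquet dataset with
# pyarrow.dataset. Paths and column names are made up for illustration.
import pyarrow.dataset as ds

# e.g. a layout like data/year=2024/month=1/part-0.parquet
dataset = ds.dataset("data/", format="parquet", partitioning="hive")

# Partition columns show up as regular columns; filters on them are pushed
# down so only the matching directories/files are read.
table = dataset.to_table(filter=(ds.field("year") == 2024) & (ds.field("month") == 1))

# Writing back out with hive-style partitioning on chosen columns.
ds.write_dataset(
    table,
    "out/",
    format="parquet",
    partitioning=["year", "month"],
    partitioning_flavor="hive",
)
```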
d
The very early versions of the parquet dataset worked this way because they predated pandas adding partitioning support
I think approaches have moved on since then, so it's best to start from a modern starting point. I do wonder if Ibis may also help here…
i
Yep, arrow -> duckdb -> ibis is part of why I'm looking into it
potentially directly arrow -> ibis
but I think the underlying dataset implementation for a partitioned Parquet dataset should be an Arrow definition, which you then filter using Ibis (@Cody Peterson)
❤️ 1
c
Ibis can consume and produce PyArrow objects or Parquet files
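A small sketch of that round trip, with made-up column names and paths; execution here happens on whatever Ibis backend is the default:

```python
# Round-tripping between PyArrow and Ibis. Table contents and paths are
# illustrative only.
import ibis
import pyarrow as pa

pa_table = pa.table({"date": ["2024-01-01", "2024-01-02"], "value": [1, 2]})

# Consume a PyArrow table as an Ibis table expression.
t = ibis.memtable(pa_table)
filtered = t.filter(t.value > 1)

# Produce PyArrow again once the expression is executed.
result = filtered.to_pyarrow()

# Or read Parquet files directly into an Ibis table expression.
t2 = ibis.read_parquet("some/path/*.parquet")
```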
i
Would you go the PyArrow -> Ibis route or directly query hive-partitioned parquet files using Ibis? The files are on abfs and the final target is to filter partitions on a range of dates. Ideally I would eventually use something like the DuckDB azure extension once it becomes usable from Ibis, which should give better performance by allowing filter pushdown on partitions https://github.com/ibis-project/ibis/issues/7493
Polars (I think) has native Azure Blob filter-pushdown capabilities in its scan_parquet function, but ideally I would interface through Ibis since that gives us a single API for our SQL and file-based queries.
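To make the two routes concrete, a sketch on a local hive-partitioned copy; the abfs/azure-extension piece is left out since that was the open question, the column names are hypothetical, and it assumes the DuckDB backend forwards extra keyword arguments to DuckDB's read_parquet:

```python
# Two routes to a date-range filter over a hive-partitioned dataset.
# Paths and the "date" partition column are illustrative.
import ibis
import pyarrow.dataset as ds

# Route 1: prune partitions with a PyArrow dataset, then hand the result to Ibis.
dataset = ds.dataset("data/", format="parquet", partitioning="hive")
pruned = dataset.to_table(filter=ds.field("date") >= "2024-01-01")
t = ibis.memtable(pruned)

# Route 2: let the DuckDB backend read the partitioned files directly.
# Assumption: hive_partitioning is passed through to DuckDB's read_parquet.
con = ibis.duckdb.connect()
t2 = con.read_parquet("data/**/*.parquet", hive_partitioning=True)
t2 = t2.filter(t2.date >= "2024-01-01")
```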
c
I would recommend reading the Parquet files; PyArrow may be used behind the scenes. You could use the Polars backend to read in the data and also process it, I believe it uses the native scan_parquet function. You could also use the Polars backend to read and then pass the data off to a DuckDB connection for further processing via PyArrow objects
🎉 2
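A sketch of that hand-off pattern, with illustrative paths, table names, and columns:

```python
# Read with the Polars backend, materialize as PyArrow, continue in DuckDB.
# Paths, table names, and the "date" column are illustrative.
import ibis

polars_con = ibis.polars.connect()
raw = polars_con.read_parquet("data/*.parquet")  # uses Polars' native parquet scan

# Materialize the result as a PyArrow table.
pa_table = raw.to_pyarrow()

# Register it with a DuckDB backend for further processing.
duck_con = ibis.duckdb.connect()
t = duck_con.create_table("raw_data", pa_table)
summary = t.group_by("date").aggregate(n=t.count())
```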
i
Ibis does a flawless passthrough to the native scan_parquet functionality, wow
🚀 2
Just following up on this: we ended up building a dataset which directly uses pyarrow's parquet reader and writer. It solved a lot of the pain points around partitioning and updating data that we had with the standard Kedro partitioned datasets. The idea is to eventually move on to Ibis, which will give us a unified data processing API regardless of whether the data comes from SQL or a file. But for now we implemented a basic Arrow dataset which can then be passed on to any library that accepts one as input.
❤️ 2
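The actual implementation isn't shown in the thread, but a rough sketch of the shape such a Kedro dataset can take, assuming a recent Kedro where AbstractDataset lives in kedro.io; class and argument names are illustrative:

```python
# Rough sketch of a Kedro dataset wrapping pyarrow's parquet reader/writer.
# Not the implementation from the thread; names are illustrative and it assumes
# a recent Kedro (AbstractDataset in kedro.io).
from typing import Any

import pyarrow as pa
import pyarrow.dataset as ds
from kedro.io import AbstractDataset


class ArrowPartitionedDataset(AbstractDataset[pa.Table, pa.Table]):
    def __init__(self, path: str, partition_cols: list[str]):
        self._path = path
        self._partition_cols = partition_cols

    def _load(self) -> pa.Table:
        # Hive-partitioned read; partition columns come back as regular columns.
        return ds.dataset(self._path, format="parquet", partitioning="hive").to_table()

    def _save(self, data: pa.Table) -> None:
        # Write hive-style partitions, leaving other partitions in place.
        ds.write_dataset(
            data,
            self._path,
            format="parquet",
            partitioning=self._partition_cols,
            partitioning_flavor="hive",
            existing_data_behavior="overwrite_or_ignore",
        )

    def _describe(self) -> dict[str, Any]:
        return {"path": self._path, "partition_cols": self._partition_cols}
```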