# questions
j
Wondered if anyone else has come across this, or perhaps I'm doing something wrong. I'm reading from/writing to a Hive-partitioned set of Parquet files using Ibis with the DuckDB backend (`ibis.FileDataset`, `kedro-datasets>=7.0.0`). Kedro seems to make an assumption with the `filepath` catalog key of a dataset: that the dataset can be read from and written to that same path. However, `Backend.read_parquet` and `Backend.to_parquet` behave differently when `load_args={'hive_partitioning': True}`, as the corresponding DuckDB functions require a directory argument when writing, but a nested glob when reading: https://duckdb.org/docs/stable/data/partitioning/hive_partitioning.html This is reflected at the Ibis level as well: https://github.com/ibis-project/ibis/issues/10939

Things still work if you have a catalog entry like this:
```yaml
my_hive:
  type: ibis.FileDataset
  filepath: data/01_raw/my_hive/first_col=*/second_col=*/*.parquet
  table_name: my_hive
  file_format: parquet
  connection: ${_duckdb}
  load_args:
    hive_partitioning: true
  save_args:
    partition_by: ${tuple:first_col,second_col}
```
But the write operation treats the entire filepath as a directory path, and you end up with something like:
```
my_hive
└── first_col=*
    └── second_col=*
        └── *.parquet
            ├── first_col=val_1
            │   ├── second_col=cat_1
            │   │   └── data_0.parquet
            │   └── second_col=cat_2
            │       └── data_0.parquet
            └── ...
```
This isn't really a Kedro design problem – perhaps the DuckDB API should be more symmetric. Has anyone else overcome this at the Kedro level? Thanks.
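For concreteness, outside Kedro the asymmetry looks roughly like this (a sketch against an in-memory DuckDB connection; the table contents and paths just mirror the catalog entry above):

```python
import ibis

# Sketch of the read/write asymmetry, assuming an in-memory DuckDB connection.
con = ibis.duckdb.connect()

t = ibis.memtable(
    {"first_col": ["val_1", "val_1"], "second_col": ["cat_1", "cat_2"], "x": [1, 2]}
)

# Writing wants a *directory*: DuckDB creates
# first_col=.../second_col=.../data_0.parquet underneath it.
con.to_parquet(t, "data/01_raw/my_hive", partition_by=("first_col", "second_col"))

# Reading wants a *nested glob* over the files inside those directories.
my_hive = con.read_parquet(
    "data/01_raw/my_hive/first_col=*/second_col=*/*.parquet",
    hive_partitioning=True,
)
```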
d
Hmm, this is a little annoying. I agree DuckDB should be more symmetric. :) I think you could work around this with transcoding: https://docs.kedro.org/en/stable/data/data_catalog_yaml_examples.html#read-the-same-file-using-different-datasets-with-transcoding
(I don't think the statement that they share the same `filepath` is correct; you should be able to have different filepaths as you need.)
j
Thank you, that was exactly what I needed. I just split the catalog entry into two variations:
```yaml
my_hive@write:
  type: ibis.FileDataset
  filepath: data/01_raw/my_hive
  table_name: my_hive
  file_format: parquet
  connection: ${_duckdb}
  save_args:
    partition_by: ${tuple:first_col,second_col}

my_hive@read:
  type: ibis.FileDataset
  filepath: data/01_raw/my_hive/first_col=*/second_col=*/*.parquet
  table_name: my_hive
  file_format: parquet
  connection: ${_duckdb}
  load_args:
    hive_partitioning: true
```
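For anyone who finds this later: the pipeline references the transcoded names directly, and Kedro resolves both to the same dataset when building the DAG, so the writer runs before the reader. A minimal sketch (the node functions and the `raw_df`/`result` datasets here are made up):

```python
from kedro.pipeline import Pipeline, node, pipeline


def build_hive(raw_df):
    # Hypothetical upstream function producing the partitioned table.
    return raw_df


def consume_hive(my_hive):
    # Hypothetical downstream function reading the partitioned table back.
    return my_hive


def create_pipeline() -> Pipeline:
    return pipeline(
        [
            # Saved via my_hive@write (directory path + partition_by).
            node(build_hive, inputs="raw_df", outputs="my_hive@write"),
            # Loaded via my_hive@read (nested glob + hive_partitioning).
            node(consume_hive, inputs="my_hive@read", outputs="result"),
        ]
    )
```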
🙌 1