Jordan Barlow
04/28/2025, 4:51 PM

(Using ibis.FileDataset, kedro-datasets>=7.0.0.)
Kedro seems to assume, via a dataset's filepath catalog key, that the dataset can be read from and written to the same path. However, Backend.to_parquet and Backend.read_parquet are asymmetric when load_args={'hive_partitioning': True}: the corresponding DuckDB functions require a directory argument when writing, but a nested glob when reading:
https://duckdb.org/docs/stable/data/partitioning/hive_partitioning.html
This is reflected at the Ibis level as well:
https://github.com/ibis-project/ibis/issues/10939
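To make the asymmetry concrete, here is a rough sketch of the round trip at the Ibis level (in-memory DuckDB; the data and paths are made up, and I'm assuming to_parquet forwards partition_by to DuckDB's COPY options):

```python
import ibis

con = ibis.duckdb.connect()  # in-memory DuckDB

# Made-up sample data with the two partition columns.
t = ibis.memtable(
    {
        "first_col": ["val_1", "val_1"],
        "second_col": ["cat_1", "cat_2"],
        "x": [1, 2],
    }
)

# Writing wants a *directory*; DuckDB lays the hive partitions out beneath it.
con.to_parquet(t, "data/01_raw/my_hive", partition_by=["first_col", "second_col"])

# Reading the same data back wants a *nested glob* plus hive_partitioning=True.
t2 = con.read_parquet(
    "data/01_raw/my_hive/first_col=*/second_col=*/*.parquet",
    hive_partitioning=True,
)
```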
Things still work if you have a catalog entry like this:
```yaml
my_hive:
  type: ibis.FileDataset
  filepath: data/01_raw/my_hive/first_col=*/second_col=*/*.parquet
  table_name: my_hive
  file_format: parquet
  connection: ${_duckdb}
  load_args:
    hive_partitioning: true
  save_args:
    partition_by: ${tuple:first_col,second_col}
```
But the write operation treats the entire filepath as a directory path, and you end up with something like:
```
my_hive
└── first_col=*
    └── second_col=*
        └── *.parquet
            ├── first_col=val_1
            │   ├── second_col=cat_1
            │   │   └── data_0.parquet
            │   └── second_col=cat_2
            │       └── data_0.parquet
            └── ...
```
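For what it's worth, the same mangling reproduces with plain DuckDB, since COPY takes the whole target as a directory name, glob characters and all (a rough sketch with made-up data, on a filesystem that permits * in names):

```python
import duckdb

con = duckdb.connect()
con.sql(
    "CREATE TABLE t AS SELECT 'val_1' AS first_col, 'cat_1' AS second_col, 42 AS x"
)

# The glob characters are taken literally: DuckDB creates directories named
# first_col=*, second_col=*, and *.parquet, then writes the real hive
# partitions beneath them.
con.sql("""
    COPY t TO 'data/01_raw/my_hive/first_col=*/second_col=*/*.parquet'
    (FORMAT parquet, PARTITION_BY (first_col, second_col))
""")
```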
This isn't really a Kedro design problem – perhaps the DuckDB API should be more symmetric. Has anyone else overcome this at the Kedro level?
Thanks.

Deepyaman Datta
04/28/2025, 5:48 PM

(If filepath is correct, you should be able to have the different filepaths as you need.)

Jordan Barlow
04/28/2025, 7:48 PM

```yaml
my_hive@write:
  type: ibis.FileDataset
  filepath: data/01_raw/my_hive
  table_name: my_hive
  file_format: parquet
  connection: ${_duckdb}
  save_args:
    partition_by: ${tuple:first_col,second_col}

my_hive@read:
  type: ibis.FileDataset
  filepath: data/01_raw/my_hive/first_col=*/second_col=*/*.parquet
  table_name: my_hive
  file_format: parquet
  connection: ${_duckdb}
  load_args:
    hive_partitioning: true
```
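In case it helps anyone later: the two entries pair up in a pipeline roughly like this (a minimal sketch; the node functions and the other dataset names are made up). Kedro strips the @write/@read suffix when resolving dependencies, so both entries refer to the same logical my_hive dataset and the writing node is ordered before the reading one:

```python
from kedro.pipeline import node, pipeline

def build_table(raw):
    # Placeholder: produce the Ibis table to be written out, partitioned.
    return raw

def summarize(my_hive):
    # Placeholder: consume the hive-partitioned data loaded back via the glob.
    return my_hive.count().to_pandas()

hive_pipeline = pipeline(
    [
        # Saved through my_hive@write -> to_parquet(directory, partition_by=...)
        node(build_table, inputs="raw_table", outputs="my_hive@write"),
        # Loaded through my_hive@read -> read_parquet(glob, hive_partitioning=True)
        node(summarize, inputs="my_hive@read", outputs="hive_summary"),
    ]
)
```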