# questions
j
Wondered if anyone else has come across this, or perhaps I'm doing something wrong. I'm reading from/writing to a Hive-partitioned set of Parquet files using Ibis with the DuckDB backend (`ibis.FileDataset`, `kedro-datasets>=7.0.0`). Kedro seems to make an assumption with the `filepath` catalog key of a dataset: that the dataset can be read from and written to that same path. However, `Backend.read_parquet` and `Backend.to_parquet` behave differently when `load_args={'hive_partitioning': True}`, as the corresponding DuckDB functions require a directory argument when writing, but a nested glob when reading: https://duckdb.org/docs/stable/data/partitioning/hive_partitioning.html This is reflected at the Ibis level as well: https://github.com/ibis-project/ibis/issues/10939

Things still work if you have a catalog entry like this:
```yaml
my_hive:
  type: ibis.FileDataset
  filepath: data/01_raw/my_hive/first_col=*/second_col=*/*.parquet
  table_name: my_hive
  file_format: parquet
  connection: ${_duckdb}
  load_args:
    hive_partitioning: true
  save_args:
    partition_by: ${tuple:first_col,second_col}
```
But the write operation treats the entire filepath as a directory path, and you end up with something like:
```
my_hive
└── first_col=*
    └── second_col=*
        └── *.parquet
            ├── first_col=val_1
            │   ├── second_col=cat_1
            │   │   └── data_0.parquet
            │   └── second_col=cat_2
            │       └── data_0.parquet
            └── ...
```
This isn't really a Kedro design problem – perhaps the DuckDB API should be more symmetric. Has anyone else overcome this at the Kedro level? Thanks.
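For concreteness, outside Kedro the asymmetry looks roughly like this (a sketch against an in-memory DuckDB connection; the table contents and paths just mirror the catalog entry above):

```python
import ibis

# Sketch of the read/write asymmetry, assuming an in-memory DuckDB connection.
con = ibis.duckdb.connect()

t = ibis.memtable(
    {"first_col": ["val_1", "val_1"], "second_col": ["cat_1", "cat_2"], "x": [1, 2]}
)

# Writing wants a *directory*: DuckDB creates
# first_col=.../second_col=.../data_0.parquet underneath it.
con.to_parquet(t, "data/01_raw/my_hive", partition_by=("first_col", "second_col"))

# Reading wants a *nested glob* over the files inside those directories.
my_hive = con.read_parquet(
    "data/01_raw/my_hive/first_col=*/second_col=*/*.parquet",
    hive_partitioning=True,
)
```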
d
Hmm, this is a little annoying. I agree DuckDB should be more symmetric. :) I think you could work around this with transcoding: https://docs.kedro.org/en/stable/data/data_catalog_yaml_examples.html#read-the-same-file-using-different-datasets-with-transcoding
(I don't think the statement that they share the same `filepath` is correct; you should be able to have different filepaths as you need.)
j
Thank you, that was exactly what I needed. I just split the catalog entry into two variations:
```yaml
my_hive@write:
  type: ibis.FileDataset
  filepath: data/01_raw/my_hive
  table_name: my_hive
  file_format: parquet
  connection: ${_duckdb}
  save_args:
    partition_by: ${tuple:first_col,second_col}

my_hive@read:
  type: ibis.FileDataset
  filepath: data/01_raw/my_hive/first_col=*/second_col=*/*.parquet
  table_name: my_hive
  file_format: parquet
  connection: ${_duckdb}
  load_args:
    hive_partitioning: true
```
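For anyone who finds this later: the pipeline references the transcoded names directly, and Kedro resolves both to the same dataset when building the DAG, so the writer runs before the reader. A minimal sketch (the node functions and the `raw_df`/`result` datasets here are made up):

```python
from kedro.pipeline import Pipeline, node, pipeline


def build_hive(raw_df):
    # Hypothetical upstream function producing the partitioned table.
    return raw_df


def consume_hive(my_hive):
    # Hypothetical downstream function reading the partitioned table back.
    return my_hive


def create_pipeline() -> Pipeline:
    return pipeline(
        [
            # Saved via my_hive@write (directory path + partition_by).
            node(build_hive, inputs="raw_df", outputs="my_hive@write"),
            # Loaded via my_hive@read (nested glob + hive_partitioning).
            node(consume_hive, inputs="my_hive@read", outputs="result"),
        ]
    )
```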
🙌 1