# questions
h
I'm trying to read a CSV file in chunks and then save the result as partitioned Parquet files. The following catalog entry raises a DatasetError:
"{company}.{layer}.transactions":
  type: pandas.ParquetDataset
  filepath: data/{company}/{layer}/transactions
  save_args:
    partition_cols: [year, month]
The error:
DatasetError: ParquetDataset does not support save argument 'partition_cols'. Please use 'kedro.io.PartitionedDataset' instead.
How am I supposed to do it using PartitionedDataset, and what is the reason behind blocking partition_cols in pandas.ParquetDataset? (I'm asking because I could just override it with a custom Dataset.)
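For context, the kind of override I have in mind is a thin custom dataset along these lines (untested sketch, assuming a recent Kedro where kedro.io.AbstractDataset exists; the class name is made up, and versioning/fsspec handling is omitted):

import pandas as pd
from kedro.io import AbstractDataset


class HivePartitionedParquetDataset(AbstractDataset):
    """Hypothetical dataset that lets pandas/pyarrow do the partitioning."""

    def __init__(self, filepath: str, save_args: dict = None):
        self._filepath = filepath
        self._save_args = save_args or {}

    def _load(self) -> pd.DataFrame:
        # pyarrow reads a partitioned directory back as a single DataFrame
        return pd.read_parquet(self._filepath)

    def _save(self, data: pd.DataFrame) -> None:
        # pandas forwards partition_cols to pyarrow, which writes
        # <filepath>/year=YYYY/month=M/*.parquet
        data.to_parquet(self._filepath, **self._save_args)

    def _describe(self) -> dict:
        return {"filepath": self._filepath, "save_args": self._save_args}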
h
Someone will reply to you shortly. In the meantime, this might help:
r
Hi @Hugo Barreto, I am not exactly sure about the rationale for why partition_cols is not supported. Maybe @Nok Lam Chan or someone else has a better idea, as this has been around from the start. You can do this using PartitionedDataset as mentioned here, with dataset: pandas.ParquetDataset as the underlying dataset (see the example below). Thank you
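For example, a catalog entry along these lines (untested sketch; the exact type path may differ across Kedro versions):

"{company}.{layer}.transactions":
  type: kedro.io.PartitionedDataset
  path: data/{company}/{layer}/transactions
  dataset: pandas.ParquetDataset
  filename_suffix: ".parquet"

Note that PartitionedDataset writes one file per dictionary key, so the node feeding it would return a dict mapping partition paths to DataFrames, e.g. {"year=2024/month=01": df_jan, ...}; the year/month split happens in the node rather than via partition_cols.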
h
Thanks for pointing me in the right direction
n
Please feel free to raise an issue/PR to fix this.
I am pretty sure this was mentioned recently (maybe it's an internal conversation I have seen elsewhere). This code is quite old, which is why the error is explicitly raised for that argument. AFAIK, in the old days partitioning with Parquet didn't work very well, which may be related to the fastparquet engine. These days the default is pyarrow, and it is supported by pandas out of the box.
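For instance, with pyarrow installed this works out of the box (quick illustrative sketch; file names are made up):

import pandas as pd

df = pd.DataFrame(
    {"year": [2023, 2023, 2024], "month": [1, 2, 1], "amount": [10.0, 20.0, 30.0]}
)
# With the default pyarrow engine, pandas writes a hive-style tree:
# transactions/year=2023/month=1/<part>.parquet, and so on
df.to_parquet("transactions", partition_cols=["year", "month"])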
TL;DR, I think it's an outdated guardrail that should be removed