# questions
h
I'm trying to read a CSV file in chunks and then save the result as partitioned Parquet files. The following catalog entry raises a DatasetError:
"{company}.{layer}.transactions":
  type: pandas.ParquetDataset
  filepath: data/{company}/{layer}/transactions
  save_args:
    partition_cols: [year, month]
The error:
DatasetError: ParquetDataset does not support save argument 'partition_cols'. Please use 'kedro.io.PartitionedDataset' instead.
How am I supposed to do it using PartitionedDataset, and what is the reason behind blocking partition_cols in pandas.ParquetDataset? (I'm asking because I could just override it with a custom Dataset.)
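For context, the kind of override I have in mind is a thin custom dataset along these lines (untested sketch, assuming a recent Kedro where kedro.io.AbstractDataset exists; the class name is made up, and versioning/fsspec handling is omitted):

import pandas as pd
from kedro.io import AbstractDataset


class HivePartitionedParquetDataset(AbstractDataset):
    """Hypothetical dataset that lets pandas/pyarrow do the partitioning."""

    def __init__(self, filepath: str, save_args: dict = None):
        self._filepath = filepath
        self._save_args = save_args or {}

    def _load(self) -> pd.DataFrame:
        # pyarrow reads a partitioned directory back as a single DataFrame
        return pd.read_parquet(self._filepath)

    def _save(self, data: pd.DataFrame) -> None:
        # pandas forwards partition_cols to pyarrow, which writes
        # <filepath>/year=YYYY/month=M/*.parquet
        data.to_parquet(self._filepath, **self._save_args)

    def _describe(self) -> dict:
        return {"filepath": self._filepath, "save_args": self._save_args}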
h
Someone will reply to you shortly. In the meantime, this might help:
r
Hi @Hugo Barreto, I am not exactly sure about the rationale for why partition_cols is not supported. Maybe @Nok Lam Chan or someone else has a better idea, as this has been around from the start. You can do this using PartitionedDataset as mentioned here, with dataset: pandas.ParquetDataset as the underlying dataset (see the example below). Thank you
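For example, a catalog entry along these lines (untested sketch; the exact type path may differ across Kedro versions):

"{company}.{layer}.transactions":
  type: kedro.io.PartitionedDataset
  path: data/{company}/{layer}/transactions
  dataset: pandas.ParquetDataset
  filename_suffix: ".parquet"

Note that PartitionedDataset writes one file per dictionary key, so the node feeding it would return a dict mapping partition paths to DataFrames, e.g. {"year=2024/month=01": df_jan, ...}; the year/month split happens in the node rather than via partition_cols.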
h
Thanks for pointing me in the right direction
n
Please feel free to raise an issue/PR to fix this.
I am pretty sure this was mentioned recently (maybe it's an internal conversation I have seen elsewhere). This code is quite old, which is why the error is explicitly raised for that argument. AFAIK, in the old days partitioning with Parquet didn't work very well, which may be related to the fastparquet engine. These days the default is pyarrow, and it is supported by pandas out of the box.
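For instance, with pyarrow installed this works out of the box (quick illustrative sketch; file names are made up):

import pandas as pd

df = pd.DataFrame(
    {"year": [2023, 2023, 2024], "month": [1, 2, 1], "amount": [10.0, 20.0, 30.0]}
)
# With the default pyarrow engine, pandas writes a hive-style tree:
# transactions/year=2023/month=1/<part>.parquet, and so on
df.to_parquet("transactions", partition_cols=["year", "month"])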
TL;DR, I think it's an outdated guardrail that should be removed