hi all what s the best way of working with partitioned datas Kedro #questions


Livia Biggi

08/21/2023, 5:55 PM

hi all! what's the best way of working with partitioned datasets? I have created a catalog entry:


spine_inference:
  type: PartitionedDataSet
  path: s3://staging/proj/spines/inference/spine/
  dataset: pandas.ParquetDataSet
  filename_suffix: ".parquet"
  overwrite: True

which is created in


def create_partitioned_data(df: pd.DataFrame, *args):
    # ... performing operations that produce spine_inference ...
    # partitioning by province
    parts = {}

    for province in spine_inference["location_province_code"].unique():
        parts[f"spine_inference_{province}"] = spine_inference[
            spine_inference["location_province_code"] == province
        ]
    return parts

I then want to run another node taking all partitions of spine_inference as input, and saving a pre-processed version of them as output. The function I wrote takes a single pandas DataFrame as input and performs operations on it (I naively assumed that Kedro would automatically apply the function to each partition), but I get the error

TypeError: unhashable type: 'list'

What is the correct way of working with partitioned datasets? My partitions sum up to ~20M rows, so there's no way we can load the whole dataset at once.
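For context on the question above: as far as I know, Kedro does not map a node function over partitions. A node that takes a PartitionedDataSet as input receives a dict mapping each partition id to a load function, and it can return a dict keyed by partition id to save into another PartitionedDataSet. The sketch below shows that shape with plain pandas and no Kedro import. The names preprocess, preprocess_partitions and fake_partitions are hypothetical, and the per-partition step is a placeholder, not the real pre-processing. Returning callables instead of DataFrames relies on the lazy saving described in the Kedro docs, so treat that part as an assumption about your Kedro version.

```python
from typing import Callable, Dict

import pandas as pd


def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical per-partition step; stands in for the real pre-processing."""
    return df.assign(row_count=len(df))


def preprocess_partitions(
    partitions: Dict[str, Callable[[], pd.DataFrame]],
) -> Dict[str, Callable[[], pd.DataFrame]]:
    """Node function: map each partition id to a deferred pre-processing call."""

    def defer(load: Callable[[], pd.DataFrame]) -> Callable[[], pd.DataFrame]:
        # Bind `load` per partition so each callable loads only its own data.
        return lambda: preprocess(load())

    return {partition_id: defer(load) for partition_id, load in partitions.items()}


# Stand-in for the dict of load functions the node would receive (assumption).
fake_partitions = {
    "spine_inference_A": lambda: pd.DataFrame({"location_province_code": ["A", "A"]}),
    "spine_inference_B": lambda: pd.DataFrame({"location_province_code": ["B"]}),
}

result = preprocess_partitions(fake_partitions)
```

Nothing is loaded until one of the returned callables is invoked, so only one partition needs to be in memory at a time. If lazy saving is not available in your version, the same function can return preprocess(load()) directly, at the cost of holding every processed partition in memory before the save.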

Deepyaman Datta

08/21/2023, 6:23 PM

Can you show where the error is coming from? I assume it's from the loading node, and you were able to save properly?

Deepyaman Datta

08/21/2023, 6:26 PM

Also, not that you asked, but I'm guessing your loop is inefficient, since you're indexing the dataframe multiple times. You can time the results (would be happy to hear what you find), but I would find the following more idiomatic (and perhaps performant):


    parts = {}

    for province, data in spine_inference.groupby("location_province_code"):
        parts[f"spine_inference_{province}"] = data
    return parts

or, even more simply:


return {f"spine_inference_{province}": data for province, data in spine_inference.groupby("location_province_code")}
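To check the suggestion above on your own data, a rough sketch like the following compares the boolean-mask loop with the groupby version and times both. The sample DataFrame and the function names partition_with_masks and partition_with_groupby are made up for illustration; the real spine_inference is not shown in this thread, so the timings will differ.

```python
import timeit

import pandas as pd

# Hypothetical sample data standing in for the real spine_inference.
spine_inference = pd.DataFrame(
    {
        "location_province_code": ["A", "B", "A", "C", "B"] * 1000,
        "value": range(5000),
    }
)


def partition_with_masks(df: pd.DataFrame) -> dict:
    """Original approach: one boolean mask over the full frame per province."""
    parts = {}
    for province in df["location_province_code"].unique():
        parts[f"spine_inference_{province}"] = df[df["location_province_code"] == province]
    return parts


def partition_with_groupby(df: pd.DataFrame) -> dict:
    """Suggested approach: let groupby split the frame."""
    return {
        f"spine_inference_{province}": data
        for province, data in df.groupby("location_province_code")
    }


masks = partition_with_masks(spine_inference)
grouped = partition_with_groupby(spine_inference)

# Both approaches should produce the same partitions for data without missing codes.
same = masks.keys() == grouped.keys() and all(masks[k].equals(grouped[k]) for k in masks)

mask_seconds = timeit.timeit(lambda: partition_with_masks(spine_inference), number=20)
groupby_seconds = timeit.timeit(lambda: partition_with_groupby(spine_inference), number=20)
print(f"same partitions: {same}")
print(f"masks: {mask_seconds:.4f}s, groupby: {groupby_seconds:.4f}s")
```

One difference to keep in mind: groupby drops rows whose province code is missing by default, while the loop over unique() produces an empty partition for a missing code, so the two are only equivalent when the column has no NaN values.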
