# questions
l
hi all! what's the best way of working with partitioned datasets? I have created a catalog entry:
spine_inference:
  type: PartitionedDataSet
  path: s3://staging/proj/spines/inference/spine/
  dataset: pandas.ParquetDataSet
  filename_suffix: ".parquet"
  overwrite: True
which is created in
def create_partitioned_data(df: pd.DataFrame, *args):
    # ... performing operations ...
    # partitioning by province
    parts = {}

    for province in spine_inference["location_province_code"].unique():
        parts[f"spine_inference_{province}"] = spine_inference[
            spine_inference["location_province_code"] == province
        ]
    return parts
I then want to run another node taking all partitions of spine_inference as input, and saving a pre-processed version of them as output. The function I wrote takes a single pandas dataframe as input and performs operations on it (I naively assumed that kedro would automatically apply the function to each partition), however I get the error
TypeError: unhashable type: 'list'
What is the correct way of working with partitioned datasets? My partitions sum up to ~20M rows, so there's no way we can load the whole dataset at once
d
Can you show where the error is coming from? I assume it's from the loading node, and you were able to save properly?
Also, not that you asked, but I'm guessing your loop is inefficient, since you're filtering the whole dataframe once per province. You can time the results (would be happy to hear what you find), but I would find the following more idiomatic (and perhaps more performant):
    parts = {}

    for province, data in spine_inference.groupby("location_province_code"):
        parts[f"spine_inference_{province}"] = data
    return parts
or, even more simply:
return {f"spine_inference_{province}": data for province, data in spine_inference.groupby("location_province_code")}
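On the original question of processing partitions one at a time: as far as I know, kedro does not apply a node function to each partition for you. A node that takes a PartitionedDataSet as input receives a dict mapping partition ids to load functions, and a node that outputs one returns a dict mapping partition ids to data. Below is a minimal sketch of a wrapper node under those assumptions. The names preprocess and preprocess_partitions are hypothetical, and returning callables instead of dataframes relies on the lazy-saving behaviour of PartitionedDataSet, which may not be available in older kedro versions.

```python
from typing import Any, Callable, Dict


def preprocess(partition: Any) -> Any:
    """Hypothetical stand-in for the existing single-dataframe function."""
    return partition


def preprocess_partitions(
    partitions: Dict[str, Callable[[], Any]],
) -> Dict[str, Callable[[], Any]]:
    """Wrap each partition's load function so it is loaded and processed lazily.

    Assumption: `partitions` maps partition ids to zero-argument load
    functions, which is what a PartitionedDataSet input is expected to provide.
    """

    def make_lazy(load: Callable[[], Any]) -> Callable[[], Any]:
        # Bind `load` through a helper so each callable keeps its own loader
        # (a bare lambda in the comprehension would capture the last one).
        return lambda: preprocess(load())

    # Nothing is loaded here; each partition is only read when its callable
    # is invoked, so at most one partition needs to be in memory at a time.
    return {
        partition_id: make_lazy(load)
        for partition_id, load in sorted(partitions.items())
    }
```

If lazy saving is not available in your version, the same loop can call preprocess(load()) directly and return the results, at the cost of holding all processed partitions in memory until the node finishes.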