Livia Biggi
08/21/2023, 5:55 PM
spine_inference:
  type: PartitionedDataSet
  path: s3://staging/proj/spines/inference/spine/
  dataset: pandas.ParquetDataSet
  filename_suffix: ".parquet"
  overwrite: true
which is created in
def create_partitioned_data(spine_inference: pd.DataFrame, *args):
    # ... performing operations ...
    # partitioning by province
    parts = {}
    for province in spine_inference["location_province_code"].unique():
        parts[f"spine_inference_{province}"] = spine_inference[
            spine_inference["location_province_code"] == province
        ]
    return parts
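(For reference, a minimal sketch of how this node and catalog entry might be wired together; the pipeline function and the upstream dataset name "spine_raw" are assumptions, not from the thread:)

from kedro.pipeline import node, pipeline

def create_pipeline(**kwargs):
    return pipeline(
        [
            # Each key of the returned dict becomes one partition,
            # saved as <key>.parquet under the catalog path above.
            node(
                func=create_partitioned_data,
                inputs="spine_raw",  # hypothetical upstream dataset
                outputs="spine_inference",
            ),
        ]
    )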
I then want to run another node that takes all partitions of spine_inference as input and saves a pre-processed version of them as output. The function I wrote takes a single pandas DataFrame as input and performs operations on it (I naively assumed that Kedro would automatically apply the function to each partition), but I get the error TypeError: unhashable type: 'list'. What is the correct way of working with partitioned datasets? My partitions sum to ~20M rows, so there's no way we can load the whole dataset at once.
Deepyaman Datta
08/21/2023, 6:23 PM
parts = {}
for province, data in spine_inference.groupby("location_province_code"):
    parts[f"spine_inference_{province}"] = data
return parts
or, even more simply:
return {
    f"spine_inference_{province}": data
    for province, data in spine_inference.groupby("location_province_code")
}
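To answer the original question directly: when a node takes a PartitionedDataSet as input, Kedro passes it a dictionary mapping each partition id to a load function, and if the node returns a dictionary whose values are callables, each output partition is materialized lazily at save time, so only one partition needs to be in memory at a time. A minimal sketch of that pattern, with preprocess standing in for the existing single-DataFrame function:

from typing import Callable, Dict

import pandas as pd

def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    # stand-in for the actual per-partition transformation
    return df

def preprocess_partitions(
    partitions: Dict[str, Callable[[], pd.DataFrame]]
) -> Dict[str, Callable[[], pd.DataFrame]]:
    result = {}
    for partition_id, load_partition in partitions.items():
        # Bind the loader as a default argument so each closure
        # captures its own partition; nothing is read until Kedro
        # saves the corresponding output partition (lazy saving).
        def _process(load=load_partition) -> pd.DataFrame:
            return preprocess(load())
        result[partition_id] = _process
    return result

The output dataset would be declared as a second PartitionedDataSet in the catalog, so each processed partition is written out and released one at a time.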