Abhishek Bhatia
06/19/2023, 7:46 AM
I have a `PartitionedDataSet` laid out like this:
```
scenario_x/
├── iter_1/
│   ├── run_1.csv
│   ├── run_2.csv
│   └── run_3.csv
└── iter_2/
    ├── run_1.csv
    ├── run_2.csv
    └── run_3.csv
scenario_y/
├── iter_1/
│   ├── run_1.csv
│   ├── run_2.csv
│   └── run_3.csv
└── iter_2/
    ├── run_1.csv
    ├── run_2.csv
    └── run_3.csv
```
The catalog entries look like this:
```yaml
_partitioned_csvs: &_partitioned_csvs
  type: PartitionedDataSet
  dataset:
    type: pandas.CSVDataSet
    load_args:
      index_col: 0
    save_args:
      index: true
  overwrite: true
  filename_suffix: ".csv"

_partitioned_jsons: &_partitioned_jsons
  type: PartitionedDataSet
  dataset:
    type: json.JSONDataSet
  filename_suffix: ".json"

my_csv_part_ds:
  path: data/07_model_output/my_csv_part_ds
  <<: *_partitioned_csvs

my_json_part_ds:
  path: data/07_model_output/my_json_part_ds
  <<: *_partitioned_jsons
```
When I run the pipeline, the CSV partitioned dataset gets deleted first and then rewritten from scratch (as expected with `overwrite: true`), while the JSON partitioned dataset keeps its existing partitions and the new ones get added alongside them.
What I need is a custom behaviour in between: the second level of the partitioning (`iter_*`) should get overwritten, but not the first level (`scenario_*`). For example, in the node that produces the partitioned CSVs, the return value looks like this:
```python
def node_that_generates_part_ds(scenario, **kwargs):
    res = {'scenario_x/iter_1/run_1': df1, 'scenario_x/iter_1/run_2': df2, ...}  # and so on
    return res
```
So when the keys of the returned `res` all fall under `scenario_x`, the existing `scenario_y` partitions should NOT get deleted.
Can anyone guide me on how can I achieve this?
Thanks! 🙂
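One possible direction (a sketch only, not a confirmed Kedro API usage): subclass `PartitionedDataSet` and override `_save` so that, instead of wiping the whole `path`, it removes only the first-level directories named by the outgoing keys. The class names and the `_filesystem`/`_normalized_path` attributes below are assumptions based on recent Kedro 0.18.x internals and should be checked against your installed version; the prefix-extraction logic itself is plain Python:

```python
from pathlib import PurePosixPath


def first_level_prefixes(partition_keys):
    """Collect the first path component of each partition key,
    e.g. {'scenario_x/iter_1/run_1': df, ...} -> {'scenario_x'}."""
    return {PurePosixPath(key).parts[0] for key in partition_keys}


# Hypothetical subclass sketch. Assumes PartitionedDataSet exposes
# _save, _overwrite, _filesystem and _normalized_path -- verify
# against the Kedro version you are running before relying on this:
#
# class TopLevelOverwritePartitionedDataSet(PartitionedDataSet):
#     def _save(self, data):
#         if self._overwrite:
#             # delete only the scenarios being re-written, e.g. scenario_x
#             for prefix in first_level_prefixes(data):
#                 target = f"{self._normalized_path}/{prefix}"
#                 if self._filesystem.exists(target):
#                     self._filesystem.rm(target, recursive=True)
#         # temporarily disable the built-in full-path wipe, then save
#         saved, self._overwrite = self._overwrite, False
#         try:
#             super()._save(data)
#         finally:
#             self._overwrite = saved

keys = {"scenario_x/iter_1/run_1": None, "scenario_x/iter_1/run_2": None}
print(first_level_prefixes(keys))  # -> {'scenario_x'}
```

With this sketch, a save whose keys all start with `scenario_x` would remove and rewrite only `scenario_x/`, leaving `scenario_y/` untouched; you would then register the custom class as the `type` in the catalog entry.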