Abhishek Bhatia
06/19/2023, 7:46 AM
I have a `PartitionedDataSet` laid out like this:
```
scenario_x/
├── iter_1/
│   ├── run_1.csv
│   ├── run_2.csv
│   └── run_3.csv
└── iter_2/
    ├── run_1.csv
    ├── run_2.csv
    └── run_3.csv
scenario_y/
├── iter_1/
│   ├── run_1.csv
│   ├── run_2.csv
│   └── run_3.csv
└── iter_2/
    ├── run_1.csv
    ├── run_2.csv
    └── run_3.csv
```
The catalog entries look like this:
```yaml
_partitioned_csvs: &_partitioned_csvs
  type: PartitionedDataSet
  dataset:
    type: pandas.CSVDataSet
    load_args:
      index_col: 0
    save_args:
      index: true
  overwrite: true
  filename_suffix: ".csv"

_partitioned_jsons: &_partitioned_jsons
  type: PartitionedDataSet
  dataset:
    type: json.JSONDataSet
  filename_suffix: ".json"

my_csv_part_ds:
  path: data/07_model_output/my_csv_part_ds
  <<: *_partitioned_csvs

my_json_part_ds:
  path: data/07_model_output/my_json_part_ds
  <<: *_partitioned_jsons
```
When I run the pipeline, the CSV partitioned dataset gets deleted first and then rewritten from scratch (as expected with `overwrite: true`), while the JSON partitioned dataset keeps its existing partitions and the new ones get added alongside them.
What I need is a custom behaviour in between: the second level of the partitioning (`iter_*`) should get overwritten, but not the first level (`scenario_*`). For example, in the node that produces the partitioned CSVs, the return value looks like this:
```python
def node_that_generates_part_ds(scenario, **kwargs):
    res = {'scenario_x/iter_1/run_1': df1, 'scenario_x/iter_1/run_2': df2, ...}  # and so on
    return res
```
So when the keys of the returned `res` all fall under `scenario_x`, the existing `scenario_y` partitions should NOT get deleted.
Can anyone guide me on how can I achieve this?
Thanks! 🙂
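One possible direction (a sketch only, not a confirmed Kedro API usage): subclass `PartitionedDataSet` and override `_save` so that, instead of wiping the whole `path`, it removes only the first-level directories named by the outgoing keys. The class names and the `_filesystem`/`_normalized_path` attributes below are assumptions based on recent Kedro 0.18.x internals and should be checked against your installed version; the prefix-extraction logic itself is plain Python:

```python
from pathlib import PurePosixPath


def first_level_prefixes(partition_keys):
    """Collect the first path component of each partition key,
    e.g. {'scenario_x/iter_1/run_1': df, ...} -> {'scenario_x'}."""
    return {PurePosixPath(key).parts[0] for key in partition_keys}


# Hypothetical subclass sketch. Assumes PartitionedDataSet exposes
# _save, _overwrite, _filesystem and _normalized_path -- verify
# against the Kedro version you are running before relying on this:
#
# class TopLevelOverwritePartitionedDataSet(PartitionedDataSet):
#     def _save(self, data):
#         if self._overwrite:
#             # delete only the scenarios being re-written, e.g. scenario_x
#             for prefix in first_level_prefixes(data):
#                 target = f"{self._normalized_path}/{prefix}"
#                 if self._filesystem.exists(target):
#                     self._filesystem.rm(target, recursive=True)
#         # temporarily disable the built-in full-path wipe, then save
#         saved, self._overwrite = self._overwrite, False
#         try:
#             super()._save(data)
#         finally:
#             self._overwrite = saved

keys = {"scenario_x/iter_1/run_1": None, "scenario_x/iter_1/run_2": None}
print(first_level_prefixes(keys))  # -> {'scenario_x'}
```

With this sketch, a save whose keys all start with `scenario_x` would remove and rewrite only `scenario_x/`, leaving `scenario_y/` untouched; you would then register the custom class as the `type` in the catalog entry.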