Abhishek Bhatia
07/10/2023, 11:33 AM

my_multi_format_part_dataset:
  type: PartitionedDataSet
  path: "data/path/to/dataset"
  dataset:
    - type: pandas.CSVDataSet
      load_args:
        index_col: 0
      save_args:
        index: false
      filename_suffix: ".csv"
    - type: json.JSONDataSet
      filename_suffix: ".json"
datajoely
07/10/2023, 11:41 AM

Abhishek Bhatia
07/10/2023, 11:56 AM

data/
└── 07_model_output/
└── experiments/
├── experiment_1/
│ ├── run_1/
│ │ ├── logs/
│ │ │ ├── replication_1.csv
│ │ │ └── replication_2.csv
│ │ └── common/
│ │ ├── parameters.json
│ │ └── run_details.csv
│ └── run_2/
│ ├── logs/
│ │ ├── replication_1.csv
│ │ └── replication_2.csv
│ └── common/
│ ├── parameters.json
│ └── run_details.csv
└── experiment_2/
└── run_1/
├── logs/
│ ├── replication_1.csv
│ ├── replication_2.csv
│ └── replication_3.csv
└── common/
├── parameters.json
└── run_details.csv
• Each run of the pipeline produces a different output, depending on the experiment name given and the parameters specified.
• If a folder with the same experiment name already exists, it should be overwritten; otherwise a new folder is created and the existing folders are left untouched.
I handle the above behaviour by subclassing PartitionedDataSet and creating a HeirarchicalDataSet, which can overwrite data at the directory level. Something like:
from kedro.io import PartitionedDataSet


class HeirarchicalDataSet(PartitionedDataSet):
    def __init__(
        self,
        path,
        dataset,
        filepath_arg: str = "filepath",
        filename_suffix: str = "",
        credentials=None,
        load_args=None,
        fs_args=None,
        base_level=1,
    ):
        # Pass the standard arguments through to PartitionedDataSet and keep
        # base_level to know at which directory depth to overwrite.
        super().__init__(path=path, dataset=dataset, filepath_arg=filepath_arg, filename_suffix=filename_suffix, credentials=credentials, load_args=load_args, fs_args=fs_args)
        self._base_level = base_level
The catalog entry looks like this:

my_hrchical_ds:
  type: my_kedro_project.datasets.HeirarchicalDataSet
  path: data/07_model_output/experiments
  dataset:
    type: pandas.CSVDataSet
    load_args:
      index_col: 0
    save_args:
      index: false
  filename_suffix: ".csv"
  base_level: 1
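With that entry, a node writing to my_hrchical_ds would return a dictionary whose keys mirror the directory hierarchy (a minimal sketch; the node and experiment names are illustrative):

import pandas as pd


def produce_experiment_output(results: pd.DataFrame) -> dict:
    # Partition ids become relative paths under data/07_model_output/experiments;
    # the configured filename_suffix ".csv" is appended on save.
    return {
        "experiment_1/run_1/logs/replication_1": results,
        "experiment_1/run_1/common/run_details": results.describe(),
    }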
I want to be able to handle datasets of any format inside this, and load each one appropriately with its base dataset.
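One way that could work (a sketch only, not an existing Kedro feature): map filename suffixes to base dataset classes and pick the class per partition. The mapping, the helper name and the kedro.extras.datasets import paths are assumptions and depend on the Kedro version:

from pathlib import PurePosixPath

from kedro.extras.datasets.json import JSONDataSet
from kedro.extras.datasets.pandas import CSVDataSet

SUFFIX_TO_DATASET = {
    ".csv": CSVDataSet,
    ".json": JSONDataSet,
}


def dataset_for_partition(filepath: str):
    # Pick the base dataset class from the partition's file suffix.
    dataset_class = SUFFIX_TO_DATASET[PurePosixPath(filepath).suffix]
    return dataset_class(filepath=filepath)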
Sorry for the overly loooong post! 🙂

datajoely
07/10/2023, 12:15 PM