# questions
a
Hi folks! Is it possible to define multiple types of base datasets for PartitionedDataSet? I have a use case where my node may return both pandas DataFrames and dictionaries in its output. Having a separate catalog entry for each doesn't suit my needs either. I was hoping something like the following might be possible:
my_multi_format_part_dataset:
  type: PartitionedDataSet
  path: "data/path/to/dataset"
  dataset:
    - type: pandas.CSVDataSet
      load_args:
        index_col: 0
      save_args:
        index: false
      filename_suffix: ".csv"
    - type: json.JSONDataSet
      filename_suffix: ".json"
d
So you can definitely subclass things to work this way, but I'd suggest that two separate catalog entries would be more readable and maintainable
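For reference, the "two separate entries" suggestion amounts to two PartitionedDataSet entries in catalog.yml that share the same path but use different dataset types and filename suffixes. A rough programmatic equivalent, with invented entry names, would be:
from kedro.extras.datasets.json import JSONDataSet
from kedro.extras.datasets.pandas import CSVDataSet
from kedro.io import DataCatalog, PartitionedDataSet

# one partitioned entry per file format, both pointing at the same folder
catalog = DataCatalog(
    {
        "my_part_dataset_csv": PartitionedDataSet(
            path="data/path/to/dataset",
            dataset={
                "type": CSVDataSet,
                "load_args": {"index_col": 0},
                "save_args": {"index": False},
            },
            filename_suffix=".csv",
        ),
        "my_part_dataset_json": PartitionedDataSet(
            path="data/path/to/dataset",
            dataset=JSONDataSet,
            filename_suffix=".json",
        ),
    }
)
Each entry only lists the files matching its own filename_suffix, so the two formats don't collide; the trade-off is that the node then has to return two separate dictionaries, one per catalog entry.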
a
The reason I don't have separate catalog entries is that I have a generic pipeline node that can return outputs for multiple groups (the names vary per run), i.e. the node returns a dictionary in which each key can be composed of multiple groups. The output structure looks like the following; here, the common folder contains datasets of both formats:
data/
└── 07_model_output/
    └── experiments/
        ├── experiment_1/
        │   ├── run_1/
        │   │   ├── logs/
        │   │   │   ├── replication_1.csv
        │   │   │   └── replication_2.csv
        │   │   └── common/
        │   │       ├── parameters.json
        │   │       └── run_details.csv
        │   └── run_2/
        │       ├── logs/
        │       │   ├── replication_1.csv
        │       │   └── replication_2.csv
        │       └── common/
        │           ├── parameters.json
        │           └── run_details.csv
        └── experiment_2/
            └── run_1/
                ├── logs/
                │   ├── replication_1.csv
                │   ├── replication_2.csv
                │   └── replication_3.csv
                └── common/
                    ├── parameters.json
                    └── run_details.csv
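For concreteness, a hypothetical node output that would produce the experiment_1/run_1 part of this tree through a partitioned save might look like the following; the keys are partition ids (relative paths under the dataset's path, with the filename_suffix appended on save) and the values are placeholders:
import pandas as pd


def run_experiment() -> dict:
    # keys are partition ids relative to the PartitionedDataSet path
    return {
        "experiment_1/run_1/logs/replication_1": pd.DataFrame({"metric": [0.1, 0.2]}),
        "experiment_1/run_1/logs/replication_2": pd.DataFrame({"metric": [0.3, 0.4]}),
        "experiment_1/run_1/common/run_details": pd.DataFrame({"duration_s": [12.3]}),
        # this entry is a plain dict, which is exactly what a single-format
        # (e.g. pandas.CSVDataSet) PartitionedDataSet cannot serialise
        "experiment_1/run_1/common/parameters": {"seed": 42},
    }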
• Each run of the pipeline produces a different output, depending on the experiment name given and the parameters specified.
• If a folder for the same experiment name already exists it should be overwritten; otherwise a new folder is created and the existing ones are left untouched.
I handle the above behaviour by subclassing PartitionedDataSet and creating a HeirarchicalDataSet, which can overwrite data at the directory level. Something like:
from kedro.io import PartitionedDataSet


class HeirarchicalDataSet(PartitionedDataSet):

    def __init__(
        self,
        path: str,
        dataset,
        filepath_arg: str = "filepath",
        filename_suffix: str = "",
        credentials=None,
        load_args=None,
        fs_args=None,
        base_level: int = 1,
    ):
        # forward the standard arguments to PartitionedDataSet; keep base_level for the overwrite logic
        super().__init__(
            path, dataset, filepath_arg, filename_suffix, credentials, load_args, fs_args
        )
        self._base_level = base_level
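For context, a minimal sketch of how the directory-level overwrite described above could work: override _save to delete the folders base_level directories below path before writing. This is an illustration rather than the actual HeirarchicalDataSet code; the class name is invented and it leans on PartitionedDataSet internals (_filesystem, _normalized_path, _save) that are not a public API and may differ between Kedro versions:
from kedro.io import PartitionedDataSet


class DirectoryOverwritePartitionedDataSet(PartitionedDataSet):  # hypothetical name
    def __init__(self, *args, base_level: int = 1, **kwargs):
        super().__init__(*args, **kwargs)
        self._base_level = base_level

    def _save(self, data) -> None:
        # work out which directories base_level levels below path this save touches,
        # e.g. "experiment_1" for partition id "experiment_1/run_1/logs/replication_1"
        roots = {
            "/".join(partition_id.split("/")[: self._base_level])
            for partition_id in data
        }
        # remove those directories if they already exist, then let
        # PartitionedDataSet write the individual partitions as usual
        for root in roots:
            full_path = f"{self._normalized_path}/{root}"
            if self._filesystem.exists(full_path):
                self._filesystem.rm(full_path, recursive=True)
        super()._save(data)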
The catalog entry looks like this:
my_hrchical_ds:
  type: my_kedro_project.datasets.HeirarchicalDataSet
  path: data/07_model_output/experiments
  dataset:
    type: pandas.CSVDataSet
    load_args:
      index_col: 0
    save_args:
      index: false
  filename_suffix: ".csv"
  base_level: 1
I want to be able to handle datasets of any format inside this, and load each one appropriately according to its base dataset. Sorry for the overly loooong post! 🙂
NOTE: The real context is NOT exactly a clone of experiment tracking, but this was as close an analogy as I could think of.
d
So the short answer is: what you're doing feels technically possible, but it's pretty hard to support since it's quite an unusual approach to Kedro.
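For what it's worth, a minimal sketch of that "technically possible" route: override _load so that the underlying dataset class is chosen per partition from the file extension. The class name and dispatch table below are invented for illustration, and the sketch relies on non-public PartitionedDataSet internals (_list_partitions, _join_protocol, _path_to_partition, _filepath_arg), so it may need adjusting across Kedro versions:
from copy import deepcopy
from typing import Any, Callable, Dict

from kedro.extras.datasets.json import JSONDataSet
from kedro.extras.datasets.pandas import CSVDataSet
from kedro.io import PartitionedDataSet


class MultiFormatPartitionedDataSet(PartitionedDataSet):  # hypothetical name
    # suffix -> (dataset class, per-format constructor kwargs)
    _DISPATCH = {
        ".csv": (CSVDataSet, {"load_args": {"index_col": 0}, "save_args": {"index": False}}),
        ".json": (JSONDataSet, {}),
    }

    def _load(self) -> Dict[str, Callable[[], Any]]:
        partitions = {}
        for partition in self._list_partitions():
            for suffix, (dataset_class, dataset_kwargs) in self._DISPATCH.items():
                if not partition.endswith(suffix):
                    continue
                kwargs = deepcopy(dataset_kwargs)
                # point the per-format dataset at this particular file
                kwargs[self._filepath_arg] = self._join_protocol(partition)
                partitions[self._path_to_partition(partition)] = dataset_class(**kwargs).load
                break
        return partitions
Saving could be dispatched the same way by overriding _save, but anything along these lines sits outside what Kedro officially supports.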