# questions
a
Hi folks! Is it possible to define multiple types of base datasets for PartitionedDataSet? I have a use case where my node may return both pandas DataFrames and dictionaries in its output. Having a separate catalog entry for each doesn't suit my needs either. I was hoping something like the following might be possible:
my_multi_format_part_dataset:
  type: PartitionedDataSet
  path: "data/path/to/dataset"
  dataset:
    - type: pandas.CSVDataSet
      load_args:
        index_col: 0
      save_args:
        index: false
      filename_suffix: ".csv"
    - type: json.JSONDataSet
      filename_suffix: ".json"
d
So you can definitely subclass things to work this way, but I'd suggest that two separate catalog entries would be more readable and maintainable
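For reference, the "two separate entries" suggestion amounts to two PartitionedDataSet entries in catalog.yml that share the same path but use different dataset types and filename suffixes. A rough programmatic equivalent, with invented entry names, would be:
from kedro.extras.datasets.json import JSONDataSet
from kedro.extras.datasets.pandas import CSVDataSet
from kedro.io import DataCatalog, PartitionedDataSet

# one partitioned entry per file format, both pointing at the same folder
catalog = DataCatalog(
    {
        "my_part_dataset_csv": PartitionedDataSet(
            path="data/path/to/dataset",
            dataset={
                "type": CSVDataSet,
                "load_args": {"index_col": 0},
                "save_args": {"index": False},
            },
            filename_suffix=".csv",
        ),
        "my_part_dataset_json": PartitionedDataSet(
            path="data/path/to/dataset",
            dataset=JSONDataSet,
            filename_suffix=".json",
        ),
    }
)
Each entry only lists the files matching its own filename_suffix, so the two formats don't collide; the trade-off is that the node then has to return two separate dictionaries, one per catalog entry.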
a
The reason I don't have separate catalog entries is that I have a generic pipeline node that can return outputs for multiple groups (the names vary per run), i.e. the node returns a dictionary in which each key can be composed of multiple groups. The output structure looks like the following; here, the common folder contains datasets of both formats:
data/
└── 07_model_output/
    └── experiments/
        ├── experiment_1/
        │   ├── run_1/
        │   │   ├── logs/
        │   │   │   ├── replication_1.csv
        │   │   │   └── replication_2.csv
        │   │   └── common/
        │   │       ├── parameters.json
        │   │       └── run_details.csv
        │   └── run_2/
        │       ├── logs/
        │       │   ├── replication_1.csv
        │       │   └── replication_2.csv
        │       └── common/
        │           ├── parameters.json
        │           └── run_details.csv
        └── experiment_2/
            └── run_1/
                ├── logs/
                │   ├── replication_1.csv
                │   ├── replication_2.csv
                │   └── replication_3.csv
                └── common/
                    ├── parameters.json
                    └── run_details.csv
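For concreteness, a hypothetical node output that would produce the experiment_1/run_1 part of this tree through a partitioned save might look like the following; the keys are partition ids (relative paths under the dataset's path, with the filename_suffix appended on save) and the values are placeholders:
import pandas as pd


def run_experiment() -> dict:
    # keys are partition ids relative to the PartitionedDataSet path
    return {
        "experiment_1/run_1/logs/replication_1": pd.DataFrame({"metric": [0.1, 0.2]}),
        "experiment_1/run_1/logs/replication_2": pd.DataFrame({"metric": [0.3, 0.4]}),
        "experiment_1/run_1/common/run_details": pd.DataFrame({"duration_s": [12.3]}),
        # this entry is a plain dict, which is exactly what a single-format
        # (e.g. pandas.CSVDataSet) PartitionedDataSet cannot serialise
        "experiment_1/run_1/common/parameters": {"seed": 42},
    }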
• Each run of the pipeline produces a different output, depending on the experiment name given and the parameters specified.
• If a folder for the same experiment name already exists it should be overwritten; otherwise a new folder is created and the existing ones are left untouched.
I handle the above behaviour by subclassing PartitionedDataSet and creating a HeirarchicalDataSet, which can overwrite data at the directory level. Something like:
from kedro.io import PartitionedDataSet


class HeirarchicalDataSet(PartitionedDataSet):

    def __init__(
        self,
        path: str,
        dataset,
        filepath_arg: str = "filepath",
        filename_suffix: str = "",
        credentials=None,
        load_args=None,
        fs_args=None,
        base_level: int = 1,
    ):
        # forward the standard arguments to PartitionedDataSet; keep base_level for the overwrite logic
        super().__init__(
            path, dataset, filepath_arg, filename_suffix, credentials, load_args, fs_args
        )
        self._base_level = base_level
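For context, a minimal sketch of how the directory-level overwrite described above could work: override _save to delete the folders base_level directories below path before writing. This is an illustration rather than the actual HeirarchicalDataSet code; the class name is invented and it leans on PartitionedDataSet internals (_filesystem, _normalized_path, _save) that are not a public API and may differ between Kedro versions:
from kedro.io import PartitionedDataSet


class DirectoryOverwritePartitionedDataSet(PartitionedDataSet):  # hypothetical name
    def __init__(self, *args, base_level: int = 1, **kwargs):
        super().__init__(*args, **kwargs)
        self._base_level = base_level

    def _save(self, data) -> None:
        # work out which directories base_level levels below path this save touches,
        # e.g. "experiment_1" for partition id "experiment_1/run_1/logs/replication_1"
        roots = {
            "/".join(partition_id.split("/")[: self._base_level])
            for partition_id in data
        }
        # remove those directories if they already exist, then let
        # PartitionedDataSet write the individual partitions as usual
        for root in roots:
            full_path = f"{self._normalized_path}/{root}"
            if self._filesystem.exists(full_path):
                self._filesystem.rm(full_path, recursive=True)
        super()._save(data)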
The catalog entry looks like this:
my_hrchical_ds:
  type: my_kedro_project.datasets.HeirarchicalDataSet
  path: data/07_model_output/experiments
  dataset:
    type: pandas.CSVDataSet
    load_args:
      index_col: 0
    save_args:
      index: false
  filename_suffix: ".csv"
  base_level: 1
I want to be able to handle datasets of any format inside this, and load each one appropriately according to its base dataset. Sorry for the overly loooong post! 🙂
NOTE: The real context is NOT exactly a clone of experiment tracking, but this was as close an analogy as I could think of.
d
So the short answer is: what you're doing feels technically possible, but it's pretty hard to support since it's quite an unusual approach to Kedro.
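For what it's worth, a minimal sketch of that "technically possible" route: override _load so that the underlying dataset class is chosen per partition from the file extension. The class name and dispatch table below are invented for illustration, and the sketch relies on non-public PartitionedDataSet internals (_list_partitions, _join_protocol, _path_to_partition, _filepath_arg), so it may need adjusting across Kedro versions:
from copy import deepcopy
from typing import Any, Callable, Dict

from kedro.extras.datasets.json import JSONDataSet
from kedro.extras.datasets.pandas import CSVDataSet
from kedro.io import PartitionedDataSet


class MultiFormatPartitionedDataSet(PartitionedDataSet):  # hypothetical name
    # suffix -> (dataset class, per-format constructor kwargs)
    _DISPATCH = {
        ".csv": (CSVDataSet, {"load_args": {"index_col": 0}, "save_args": {"index": False}}),
        ".json": (JSONDataSet, {}),
    }

    def _load(self) -> Dict[str, Callable[[], Any]]:
        partitions = {}
        for partition in self._list_partitions():
            for suffix, (dataset_class, dataset_kwargs) in self._DISPATCH.items():
                if not partition.endswith(suffix):
                    continue
                kwargs = deepcopy(dataset_kwargs)
                # point the per-format dataset at this particular file
                kwargs[self._filepath_arg] = self._join_protocol(partition)
                partitions[self._path_to_partition(partition)] = dataset_class(**kwargs).load
                break
        return partitions
Saving could be dispatched the same way by overriding _save, but anything along these lines sits outside what Kedro officially supports.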