# questions
s
Hello, in the `catalog.yml` file, I would like to specify `load_args` to load a dataset where the `load_args` are taken from a YAML file written during the execution of a different node, e.g.
```yaml
dummy_dataset:
  filepath: /path/to/dataset
  type: some_spatial_dataset
  load_args:
    dummy_arg: <4 element tuple from dummy_arg.yml>
```
My understanding was that I could do this using Hooks, specifying that this should happen prior to the loading of `dummy_dataset` via `before_dataset_loaded` (example seen here: https://docs.kedro.org/en/stable/hooks/examples.html); see the sketch after this message.
Considerations: the dataset type is an abstract dataset for geospatial datasets/vector files.
Questions for the Kedro team:
• Is this possible to execute?
• Are we approaching this correctly, or have we missed something?
m
It's hacky but it's possible: https://linen-slack.kedro.org/t/15740279/hi-everyone-could-a-dataset-be-an-output-to-another-dataset-#67f6fa34-cb91-4cc0-afb2-f7a839f5c152 Excerpt:
```python
from kedro.io import AbstractDataSet
from kedro.extras.datasets.pickle import PickleDataSet


class LazyDataSet(AbstractDataSet):
    # constructors and other stuff

    def _load(self):
        # Return a callable, so a downstream node can supply the path at runtime.
        def lazy_loader(path):
            return PickleDataSet(path).load()
        return lazy_loader
```
And then you do this in 2 nodes:
1. `node(inputs="from_sql_query", func=<extract the path you need>, outputs="path_you_need")`
2. `node(inputs=["path_you_need", "lazy_dataset"], func=lambda path, lazy: lazy(path), outputs="loaded_data")`
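(For context, a minimal self-contained sketch of how the two nodes above might be wired into a pipeline. The `extract_path` function, the node names, and the `loaded_data` output name are illustrative assumptions, not from the thread; `lazy_dataset` is assumed to be registered in `catalog.yml` as a `LazyDataSet`.)

```python
from kedro.pipeline import node, pipeline


def extract_path(query_result) -> str:
    # Hypothetical: pull the file path out of the upstream query result.
    return query_result["path"]


lazy_loading_pipeline = pipeline(
    [
        # Node 1: derive the path the lazy loader will need.
        node(
            func=extract_path,
            inputs="from_sql_query",
            outputs="path_you_need",
            name="extract_path_node",
        ),
        # Node 2: `lazy_dataset` loads as a callable (see LazyDataSet._load),
        # which we invoke with the freshly extracted path.
        node(
            func=lambda path, lazy: lazy(path),
            inputs=["path_you_need", "lazy_dataset"],
            outputs="loaded_data",
            name="lazy_load_node",
        ),
    ]
)
```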
s
Hi, thank you for sharing the proposed solution. It looks like this will require implementing 2 nodes every time I want this dataset loaded: did I understand that correctly? The above solution would effectively increase the size of my project 2x, as 98% of my nodes need this dataset type as input 😅 Wondering if we can still explore further solutions here ...
m
You could potentially use the hooks too, as you've described, but IMHO it will be more error-prone
🙏 1
e
Hello Swamini, am I right that your goal is to change the `load_args` of `dummy_dataset` at runtime, i.e. you want to redefine the parameters set in the `catalog.yml`?
s
yes, exactly!
e
We normally do not encourage users to modify the catalog at runtime and inject dynamic behaviour, though it’s still possible. For that, you can use the `after_node_run` hook, which receives the `catalog` as an input. You will then need to access the target dataset's parameters via the catalog's private `_datasets` property and modify them. Since we do not recommend doing this, there are no public methods for it; you will need to look at the `DataCatalog` implementation to be able to do that. Here is an example of injecting the dynamic behaviour for the mlflow plugin.
s
Thank you, Elena. Noted on the warning; will dig into your suggestion. Otherwise, our next best solution currently is bypassing Kedro I/O when loading these datasets, something we want to avoid since we would miss the point of Kedro altogether 😞 😞
e
That doesn’t sound nice, indeed 🫤 Feel free to reach out if you need any tips on the implementation of the above.
🙏 1