# questions
s
Hello, in the `catalog.yml` file, I would like to specify `load_args` to load a dataset where the `load_args` are taken from a YAML file written during the execution of a different node, e.g.
```yaml
dummy_dataset:
  filepath: /path/to/dataset
  type: some_spatial_dataset
  load_args:
    dummy_arg: <4 element tuple from dummy_arg.yml>
```
My understanding was that I could do this using Hooks, specifying that this should happen prior to the loading of `dummy_dataset` via `before_dataset_loaded` (example seen here: https://docs.kedro.org/en/stable/hooks/examples.html); see the sketch after this message.
Considerations: the dataset type is an abstract dataset for geospatial datasets/vector files.
Questions for the Kedro team:
• Is this possible to execute?
• Are we approaching this correctly, or have we missed something?
m
It's hacky but it's possible: https://linen-slack.kedro.org/t/15740279/hi-everyone-could-a-dataset-be-an-output-to-another-dataset-#67f6fa34-cb91-4cc0-afb2-f7a839f5c152 Excerpt:
```python
from kedro.io import AbstractDataSet
from kedro.extras.datasets.pickle import PickleDataSet


class LazyDataSet(AbstractDataSet):
    # constructors and other stuff

    def _load(self):
        # Return a callable, so a downstream node can supply the path at runtime.
        def lazy_loader(path):
            return PickleDataSet(path).load()
        return lazy_loader
```
And then you do this in 2 nodes:
1. `node(inputs="from_sql_query", func=<extract the path you need>, outputs="path_you_need")`
2. `node(inputs=["path_you_need", "lazy_dataset"], func=lambda path, lazy: lazy(path), outputs="loaded_data")`
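(For context, a minimal self-contained sketch of how the two nodes above might be wired into a pipeline. The `extract_path` function, the node names, and the `loaded_data` output name are illustrative assumptions, not from the thread; `lazy_dataset` is assumed to be registered in `catalog.yml` as a `LazyDataSet`.)

```python
from kedro.pipeline import node, pipeline


def extract_path(query_result) -> str:
    # Hypothetical: pull the file path out of the upstream query result.
    return query_result["path"]


lazy_loading_pipeline = pipeline(
    [
        # Node 1: derive the path the lazy loader will need.
        node(
            func=extract_path,
            inputs="from_sql_query",
            outputs="path_you_need",
            name="extract_path_node",
        ),
        # Node 2: `lazy_dataset` loads as a callable (see LazyDataSet._load),
        # which we invoke with the freshly extracted path.
        node(
            func=lambda path, lazy: lazy(path),
            inputs=["path_you_need", "lazy_dataset"],
            outputs="loaded_data",
            name="lazy_load_node",
        ),
    ]
)
```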
s
Hi, thank you for sharing the proposed solution. It looks like this will require implementing 2 nodes every time I want this dataset loaded: did I understand that correctly? The above solution would effectively increase the size of my project 2x, as 98% of my nodes need this dataset type as input 😅 Wondering if we can still explore further solutions here ...
m
You could potentially use the hooks too, as you've described, but IMHO it will be more error-prone
🙏 1
e
Hello Swamini, am I right that your goal is to change the `load_args` of `dummy_dataset` at runtime, i.e. you want to redefine the parameters set in the `catalog.yml`?
s
yes, exactly!
e
We normally do not encourage users to modify the catalog at runtime and inject dynamic behaviour, though it’s still possible. For that, you can use the `after_node_run` hook, which receives the `catalog` as an input. You will then need to access the target dataset's parameters via the catalog's private `_datasets` property and modify them. Since we do not recommend doing this, there are no public methods for it; you will need to look at the `DataCatalog` implementation to be able to do that. Here is an example of injecting the dynamic behaviour for the mlflow plugin.
s
Thank you, Elena. Noted on the warning; will dig into your suggestion. Otherwise, our next best solution currently is bypassing Kedro I/O when loading these datasets, something we want to avoid since we would miss the point of Kedro altogether 😞 😞
e
That doesn’t sound nice, indeed 🫤 Feel free to reach out if you need any tips on the implementation of the above.
🙏 1