#questions

Elior Cohen

03/24/2024, 11:27 AM
Is there any workaround to enable a node to have the same dataset as input and output?

Nok Lam Chan

03/24/2024, 11:34 AM
Can you elaborate why you want to do that? This is not possible by design to prevent circular dependencies.

Elior Cohen

03/24/2024, 11:45 AM
Yep, sure.
I have a use case where I edit a large JSON file, basically adding fields in multiple nested places.
The next time I run the pipeline, I need to run it on the altered JSON; the original one is no longer relevant.
So now I find myself editing the catalog again and again after each change.

Nok Lam Chan

03/24/2024, 12:43 PM
I see. Then your pipeline should start with the altered JSON; it doesn’t have to be the same dataset, from what I understand here. What’s the problem with keeping them as separate datasets? If you don’t need to trigger the JSON-editing steps, you can simply use a tag, or separate them out from the default pipeline and only run them when necessary.

Iñigo Hidalgo

03/24/2024, 2:40 PM
Hi Elior, this is a common need for me in some pipelines. Kedro only cares that you do not have the same dataset name in a circular dependency, but there is nothing preventing you from having

dataset_a:
  filepath: same_path.json
dataset_b:
  filepath: same_path.json

Kedro will treat those as two separate datasets. But you should be aware that this can lead to unexpected behavior.
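(To make that runnable as an actual catalog, each entry also needs a `type`; a minimal sketch, where the dataset names, the `json.JSONDataset` type, and the path are illustrative placeholders:)

```yaml
# catalog.yml: two dataset names backed by the same file on disk.
# The node reads dataset_a and writes dataset_b, so Kedro sees no cycle.
dataset_a:
  type: json.JSONDataset
  filepath: data/01_raw/same_path.json

dataset_b:
  type: json.JSONDataset
  filepath: data/01_raw/same_path.json
```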

Elior Cohen

03/24/2024, 2:41 PM
God, that's so straightforward, I don't know how I couldn't think of this myself 🤦

Iñigo Hidalgo

03/24/2024, 2:43 PM
I had a similar lightbulb moment when I first encountered this need hahaha. To avoid repeating yourself in YAML config, you can use YAML anchoring:
_common_definition: &common_definition
  type: json.JSONDataset
  filepath: some_json.json

dataset_a:
  <<: *common_definition

dataset_b:
  <<: *common_definition
Quoting from memory, but if you search the Kedro docs for “anchor” you should find this.

Matthias Roels

03/25/2024, 6:28 PM
Just to add to the discussion: although the proposed workarounds are solid, be aware that when you overwrite an existing dataset, your pipelines are no longer idempotent (i.e. running the same pipeline twice no longer produces the same result). You could achieve a similar result by versioning your dataset somehow (input v, output v+1), where the version number v is used in the filepath and supplied at runtime (e.g. as an env var). This way, your pipelines remain idempotent.
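(One way to sketch this versioning idea in the catalog is with runtime-parameter interpolation, which Kedro's OmegaConfigLoader supports; the dataset names, paths, and parameter names below are illustrative assumptions, not anything from the thread:)

```yaml
# catalog.yml: version numbers supplied at run time, e.g.:
#   kedro run --params "in_version=1,out_version=2"
json_input:
  type: json.JSONDataset
  filepath: data/01_raw/big_v${runtime_params:in_version}.json

json_output:
  type: json.JSONDataset
  filepath: data/01_raw/big_v${runtime_params:out_version}.json
```

Each run reads one versioned file and writes a new one, so re-running with the same parameters reproduces the same output without clobbering its own input.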

Iñigo Hidalgo

03/25/2024, 6:29 PM
That is a very, very good point. In our use case these datasets are append-only tables, which helps preserve idempotency.