#questions

Elior Cohen

03/24/2024, 11:27 AM
Is there any workaround to enable a node to have the same dataset as input and output?

Nok Lam Chan

03/24/2024, 11:34 AM
Can you elaborate why you want to do that? This is not possible by design to prevent circular dependencies.

Elior Cohen

03/24/2024, 11:45 AM
Yep, sure.
I have a use case where I edit a large JSON file, basically adding fields in multiple nested places.
The next time I run the pipeline, I need to run it on the altered JSON; the original one is no longer relevant.
So now I find myself editing the catalog again and again after each change.

Nok Lam Chan

03/24/2024, 12:43 PM
I see. Then your pipeline should start with the altered JSON; it doesn’t have to be the same dataset, from what I understand here. What’s the problem with keeping them as separate datasets? If you don’t need to trigger the JSON-editing steps, you can simply use a tag, or separate them out from the default pipeline and only run them when necessary.

Iñigo Hidalgo

03/24/2024, 2:40 PM
Hi Elior, this is a common need for me in some pipelines. Kedro only cares that you do not have the same dataset name in a circular dependency, but there is nothing preventing you from having

dataset_a:
  filepath: same_path.json
dataset_b:
  filepath: same_path.json

Kedro will treat those as two separate datasets. But you should be aware that this can lead to unexpected behavior.
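(To make that runnable as an actual catalog, each entry also needs a `type`; a minimal sketch, where the dataset names, the `json.JSONDataset` type, and the path are illustrative placeholders:)

```yaml
# catalog.yml: two dataset names backed by the same file on disk.
# The node reads dataset_a and writes dataset_b, so Kedro sees no cycle.
dataset_a:
  type: json.JSONDataset
  filepath: data/01_raw/same_path.json

dataset_b:
  type: json.JSONDataset
  filepath: data/01_raw/same_path.json
```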

Elior Cohen

03/24/2024, 2:41 PM
God, that's so straightforward, I don't know how I couldn't think of this myself 🤦

Iñigo Hidalgo

03/24/2024, 2:43 PM
I had a similar lightbulb moment when I first encountered this need hahaha. To avoid repeating yourself in YAML config, you can use YAML anchoring:
_common_definition: &common_definition
  type: json.JSONDataset
  filepath: some_json.json

dataset_a:
  <<: *common_definition

dataset_b:
  <<: *common_definition
Quoting from memory, but if you search the Kedro docs for “anchor” you should find this.

Matthias Roels

03/25/2024, 6:28 PM
Just to add to the discussion: although the proposed workarounds are solid, be aware that when you overwrite an existing dataset, your pipelines are no longer idempotent (i.e. running the same pipeline twice no longer produces the same result). You could achieve a similar result by versioning your dataset somehow (input v, output v+1), where the version number v is used in the filepath and supplied at runtime (e.g. as an env var). This way, your pipelines remain idempotent.
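(One way to sketch this versioning idea in the catalog is with runtime-parameter interpolation, which Kedro's OmegaConfigLoader supports; the dataset names, paths, and parameter names below are illustrative assumptions, not anything from the thread:)

```yaml
# catalog.yml: version numbers supplied at run time, e.g.:
#   kedro run --params "in_version=1,out_version=2"
json_input:
  type: json.JSONDataset
  filepath: data/01_raw/big_v${runtime_params:in_version}.json

json_output:
  type: json.JSONDataset
  filepath: data/01_raw/big_v${runtime_params:out_version}.json
```

Each run reads one versioned file and writes a new one, so re-running with the same parameters reproduces the same output without clobbering its own input.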

Iñigo Hidalgo

03/25/2024, 6:29 PM
That is a very, very good point. In our use case these datasets are append-only tables, which helps preserve idempotency.