#questions

Iñigo Hidalgo

01/16/2024, 7:47 PM
Has anybody tested something along these lines? I try to keep Kedro nodes as atomic as possible: single nodes do single things. This is nice because it gives me a lot of traceability in Viz, but sometimes it’s a bit of a hassle. I’ll have a list of nodes where each one’s output feeds directly into the next, but the intermediate outputs are wholly unnecessary to keep track of. I am forced to give them names, and to write each name twice: once as an output and once as the input to the next node. This is especially bad because if I want to reorganize some nodes, I have to change things in multiple places. I am debating building a function which takes a list of functions and returns a list of nodes, with the intermediate datasets having randomly generated names. Does this sound like a good idea?
First question on my mind is: what about additional parameters for intermediate functions?
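A minimal sketch of what such a helper could look like, not a tested Kedro feature. The helper name `chain_specs`, the `_tmp_` naming scheme, and the `(function, extra_inputs)` tuple convention for additional parameters are all assumptions made for illustration. To keep the sketch runnable without Kedro installed, it returns plain dicts with `func`, `inputs` and `outputs` keys, which match the first three arguments of `kedro.pipeline.node`.

```python
import uuid
from typing import Callable, Sequence, Union

# A step is either a bare function or (function, extra_input_names).
# The tuple form is a hypothetical convention for passing parameters.
Step = Union[Callable, tuple[Callable, Sequence[str]]]


def chain_specs(steps: Sequence[Step], first_input: str, final_output: str) -> list[dict]:
    """Wire functions into a linear chain with randomly named intermediates."""
    specs = []
    current = first_input
    for i, step in enumerate(steps):
        func, extra = step if isinstance(step, tuple) else (step, ())
        is_last = i == len(steps) - 1
        # Only the final output keeps a user-chosen name.
        out = final_output if is_last else f"_tmp_{func.__name__}_{uuid.uuid4().hex[:8]}"
        # The previous output is always the first positional input.
        specs.append({"func": func, "inputs": [current, *extra], "outputs": out})
        current = out
    return specs


# Placeholder functions, only here to show the wiring.
def clean(df): return df
def add_features(df, window): return df
def summarise(df): return df


specs = chain_specs(
    [clean, (add_features, ["params:window"]), summarise],
    first_input="raw_data",
    final_output="summary",
)
# With Kedro installed, the specs could be turned into nodes like this:
# from kedro.pipeline import node
# nodes = [node(**spec) for spec in specs]
```

Because the names are regenerated on every call, the intermediate datasets would show up under different names on each run, which may matter for Viz or for anything that refers to them by name.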

Nok Lam Chan

01/16/2024, 11:26 PM
Definitely thought about this, having some higher-level nodes or equivalent. It also reminds me a bit of the layer API in Keras, where you only specify the layers you need and the computation graph is taken care of by the library.
This also fits the use case where you really just want a few functions to run sequentially, and there is no way to enforce the order without dummy inputs.

Rashida Kanchwala

01/17/2024, 10:15 AM
@Iñigo Hidalgo - I would love to understand more. How do we define intermediary datasets? Is it all datasets except the ones that are free inputs (not produced as an output of a node) and free outputs (not used as inputs to any node)? Also, yesterday I created two quick experiments to see how we could show/hide or fade MemoryDatasets, so that we can distinguish between persistent and non-persistent data. Let me know your thoughts.

Iñigo Hidalgo

01/17/2024, 10:46 AM
@Rashida Kanchwala within the scope of this feature, it would just be a dataset which is the direct output of one of my nodes, goes directly into another node, and is never consumed again. Your definition, on the other hand, would include datasets which I wouldn't want to give random names: those which are inputs to more than one node. In the context of dataframes it would be a similar concept to the `pipe` method.
👍 1
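The two definitions discussed above can be compared with a small sketch. It is illustrative only: the function name `classify_datasets` and the example dataset names are made up, and it assumes each node is described by a dict whose `inputs` and `outputs` are a string or a list of strings (Kedro also accepts dicts and `None`, which this sketch ignores).

```python
from collections import Counter


def _as_list(names):
    """Normalise a single dataset name or a list of names to a list."""
    return [names] if isinstance(names, str) else list(names)


def classify_datasets(specs):
    """Return (free_inputs, free_outputs, chain_only) dataset name sets."""
    produced = set()
    consumers = Counter()  # dataset name -> number of nodes consuming it
    for spec in specs:
        produced.update(_as_list(spec["outputs"]))
        for name in set(_as_list(spec["inputs"])):
            consumers[name] += 1
    free_inputs = set(consumers) - produced    # never produced by a node
    free_outputs = produced - set(consumers)   # never consumed by a node
    # Narrower definition: produced by a node and consumed by exactly one node.
    chain_only = {name for name in produced if consumers[name] == 1}
    return free_inputs, free_outputs, chain_only


example = [
    {"inputs": "raw", "outputs": "cleaned"},
    {"inputs": "cleaned", "outputs": "features"},
    {"inputs": ["features", "params:alpha"], "outputs": "model"},
    {"inputs": "features", "outputs": "report"},
]
free_inputs, free_outputs, chain_only = classify_datasets(example)
```

In this example `features` is neither a free input nor a free output, so the broader definition would count it as intermediary, but it feeds two nodes and therefore falls outside the narrower "safe to name randomly" set, which contains only `cleaned`.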

datajoely

01/17/2024, 11:59 AM
The “n-plus” operator dbt has would be useful here right? https://docs.getdbt.com/reference/node-selection/graph-operators#the-n-plus-operator

Iñigo Hidalgo

01/17/2024, 12:32 PM
I've never used dbt, but from what I can see that's more of a way of slicing "pipelines", which would probably be useful functionality, but it's only tangentially related to what I would propose. My proposal is more on the pipeline creation side than on slicing already created pipelines.
👍 1