Has anybody tested something along these lines? (Kedro #questions)


Iñigo Hidalgo

01/16/2024, 7:47 PM

Has anybody tested something along these lines? I try to keep kedro nodes as atomic as possible: single nodes do single things. This is nice because it gives me a lot of traceability from Viz, but sometimes it’s a bit of a hassle. I’ll have a chain of nodes where each one outputs directly into the next, but the intermediate outputs are wholly unnecessary to keep track of. I am forced to give them names, and to write each name twice over: once as an output and once as the input to the next node. This is especially bad because if for some reason I want to reorganize some nodes, I’ll have to change things in multiple places. I am debating building a function which takes a list of functions and returns a list of nodes, with the intermediate datasets having randomly generated names. Does this sound like a good idea?
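A minimal sketch of the helper described above, not an existing Kedro feature. The name `chain_specs`, the `_tmp_` prefix, and the example functions are hypothetical. To stay runnable without Kedro installed, it returns plain dicts with `func`, `inputs`, and `outputs` keys, which match the keyword arguments of `kedro.pipeline.node`.

```python
import uuid
from typing import Callable, Dict, List


def chain_specs(funcs: List[Callable], first_input: str, last_output: str) -> List[Dict]:
    """Build one node spec per function, linked by generated dataset names."""
    specs = []
    current_input = first_input
    for i, func in enumerate(funcs):
        is_last = i == len(funcs) - 1
        # Only the final output keeps a user-chosen name; the rest are generated.
        output = last_output if is_last else f"_tmp_{func.__name__}_{uuid.uuid4().hex[:8]}"
        specs.append({"func": func, "inputs": current_input, "outputs": output})
        current_input = output
    return specs


# Hypothetical single-purpose functions standing in for real node functions.
def clean(data):
    return data


def enrich(data):
    return data


def score(data):
    return data


specs = chain_specs([clean, enrich, score], "raw_data", "scores")

# With Kedro installed, the specs could be turned into nodes like this:
#   from kedro.pipeline import node, pipeline
#   pipe = pipeline([node(**spec) for spec in specs])
```

Reordering the chain then means reordering one list. One trade-off to weigh: random names change every time the pipeline is built, so names derived deterministically from the function names may be easier to follow in Viz.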

Iñigo Hidalgo

01/16/2024, 7:49 PM

First question on my mind is: what about additional parameters for intermediate functions?
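One possible answer, sketched under assumptions: let each step be either a bare function or a `(function, extra_inputs)` pair, and put the chained dataset first in the node's input list. The helper name `chain_specs_with_extras` and the example functions are hypothetical. The `params:` prefix is Kedro's usual way of referring to a parameter as a node input.

```python
import uuid
from typing import Callable, Dict, List, Sequence, Tuple, Union

# A step is a function, or a function plus the names of its extra inputs.
Step = Union[Callable, Tuple[Callable, Sequence[str]]]


def chain_specs_with_extras(steps: List[Step], first_input: str, last_output: str) -> List[Dict]:
    """Chain steps through generated dataset names, keeping per-step extra inputs."""
    specs = []
    current = first_input
    for i, step in enumerate(steps):
        func, extras = step if isinstance(step, tuple) else (step, ())
        is_last = i == len(steps) - 1
        output = last_output if is_last else f"_tmp_{uuid.uuid4().hex[:8]}"
        # The chained dataset is always the first positional argument.
        specs.append({"func": func, "inputs": [current, *extras], "outputs": output})
        current = output
    return specs


# Hypothetical node functions: the second one needs an extra parameter.
def drop_nulls(rows):
    return [r for r in rows if r is not None]


def filter_rows(rows, threshold):
    return [r for r in rows if r >= threshold]


param_specs = chain_specs_with_extras(
    [drop_nulls, (filter_rows, ["params:threshold"])],
    first_input="raw_rows",
    last_output="filtered_rows",
)
```

This assumes every function takes the chained dataset as its first argument, which is the same constraint a dataframe `pipe`-style chain imposes.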

Nok Lam Chan

01/16/2024, 11:26 PM

Definitely thought about this, having some higher-level nodes or equivalent. It also reminds me a bit of the layer API in Keras, where you only specify the layers you need and the computation graph is taken care of by the library.

Nok Lam Chan

01/16/2024, 11:27 PM

This also goes with the use case where you really just want a few functions to run sequentially, and there is no way to enforce the order without dummy inputs.

Rashida Kanchwala

01/17/2024, 10:15 AM

@Iñigo Hidalgo - I would love to understand more. How do we define intermediary datasets? Is it all datasets except the ones that are free inputs (not produced as an output of any node) and free outputs (not used as inputs to any node)? Also, yesterday I created two quick experiments to see how we could show/hide or fade MemoryDatasets so that we can distinguish between persistent and non-persistent data. Let me know your thoughts.

Iñigo Hidalgo

01/17/2024, 10:46 AM

@Rashida Kanchwala within the scope of this feature, it would just be a dataset which is the direct output of one of my nodes, goes directly into another node, and is never consumed again. On the other hand, under your definition there are datasets which I wouldn't want to give random names: those which are inputs to more than one node. In the context of dataframes it would be a similar concept to the pipe method.
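The two definitions in this exchange can be compared with a small sketch. It assumes a pipeline is represented as a list of `(inputs, outputs)` name pairs. The function `classify_datasets`, the category labels, and the example dataset names are all hypothetical.

```python
from collections import Counter
from typing import Dict, List, Set, Tuple

NodeIO = Tuple[List[str], List[str]]  # (input names, output names) of one node


def classify_datasets(nodes: List[NodeIO]) -> Dict[str, Set[str]]:
    """Split dataset names into free inputs, free outputs, chain-only and shared."""
    # Count how many nodes produce and how many nodes consume each dataset.
    produced = Counter(name for _, outputs in nodes for name in set(outputs))
    consumed = Counter(name for inputs, _ in nodes for name in set(inputs))
    names = set(produced) | set(consumed)
    return {
        # Not produced by any node.
        "free_inputs": {n for n in names if n not in produced},
        # Not used as an input by any node.
        "free_outputs": {n for n in names if n not in consumed},
        # Produced once and consumed by exactly one node: safe to auto-name.
        "chain_only": {n for n in names if produced[n] == 1 and consumed[n] == 1},
        # Produced by a node and consumed by several: should keep a real name.
        "shared": {n for n in names if n in produced and consumed[n] > 1},
    }


example_nodes = [
    (["raw"], ["cleaned"]),
    (["cleaned"], ["features"]),
    (["features"], ["scores"]),
    (["features"], ["report"]),
]
groups = classify_datasets(example_nodes)
```

In this example `cleaned` is the only candidate for a generated name, while `features` feeds two nodes and so falls outside the narrower definition.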

👍 1

datajoely

01/17/2024, 11:59 AM

The “n-plus” operator dbt has would be useful here right? https://docs.getdbt.com/reference/node-selection/graph-operators#the-n-plus-operator

Iñigo Hidalgo

01/17/2024, 12:32 PM

I've never used dbt, but from what I can see that's more of a way of slicing "pipelines", which would probably be useful functionality, but it's only tangentially related to what I would propose. My proposal is more on the pipeline creation side than on slicing already created pipelines.

👍 1
