Hi fellows I would like your insights on designing pipelines Kedro #questions


Flavien

10/17/2023, 7:42 AM

Hi fellows, I would like your insights on designing pipelines with common nodes. Let's imagine that two different pipelines need, as a first node, the parsing of a parameter (the same function). Both pipelines are independent. I imagine two ways of doing that:
• have a common local library with utility functions and create a node in each pipeline which uses the dedicated function;
• create a modular pipeline with a single node and build the full pipelines at the registry level.
My team has used the second approach so far, but having several single-node pipelines feels like overkill to me. How would you approach the design?

👍 1

Iñigo Hidalgo

10/17/2023, 8:32 AM

Definitely the first approach, that's part of the beauty of kedro nodes. What pushed your team towards the second solution, did you have issues just calling the same function in different nodes?

Flavien

10/17/2023, 9:04 AM

No, kedro is a new tool for the team, and a node seemed to be the smallest unit to be reused. I think that we need to find the right spot between one large pipeline containing everything and every single node being its own pipeline. In my opinion, a pipeline, if it needs to be reused, should encapsulate some kind of business or implementation logic. But it is a struggle. 😅

Iñigo Hidalgo

10/17/2023, 9:21 AM

Okay, yeah for me the smallest pipelines I define are 2 nodes, since otherwise you don't gain anything vs just reusing the underlying function and applying it to different nodes in different pipelines.

👍 1

Flavien

10/17/2023, 9:25 AM

It seems like a good criterion. Thanks for the replies. 👍

Nok Lam Chan

10/17/2023, 9:37 AM

A single node per modular pipeline doesn't seem ideal. At that level you probably have big scripts stitched together, and it wouldn't be easy to reuse functions among pipelines.

Nok Lam Chan

10/17/2023, 9:38 AM

For the pipeline registry, you can also make use of the pipeline API to gain some flexibility instead of stitching pipelines together manually, for example by using tags or some node arithmetic.

Iñigo Hidalgo

10/17/2023, 9:49 AM

@Nok Lam Chan could you explain a bit more what you mean by 1. pipeline API vs manually stitching 2. tags and nodes arithmetic? related to this https://docs.kedro.org/en/stable/nodes_and_pipelines/slice_a_pipeline.html ?

Nok Lam Chan

10/17/2023, 10:26 AM

They are the same thing.
1. Manual: pipeline_a + pipeline_b + pipeline_c + pipeline_d + …
2. Slicing pipelines: https://docs.kedro.org/en/stable/nodes_and_pipelines/slice_a_pipeline.html

have a common local library with utility functions and create a node in each pipeline which uses the dedicated function;

It may help to shift the focus of the conversation towards these dimensions rather than how big a pipeline/node should be:
• Reusability
• Ease of testing
• Performance (more opportunity to parallelise at the node level)
• Ease of debugging (i.e. if something goes wrong in a node, how would you debug it? Is there a checkpoint where you can immediately do catalog.load to recover the data, or do you have to rerun a 3-hour pipeline?)

Flavien

10/18/2023, 7:25 AM

Thanks @Nok Lam Chan. 👍

