# questions
f
Hi fellows, I would like your insights on designing pipelines with common nodes. Let's imagine that two different pipelines both need, as their first node, the parsing of a parameter (the same function). The two pipelines are independent. I can imagine two ways of doing that:
• have a common local library of utility functions and create a node in each pipeline which uses the dedicated function;
• create a modular pipeline with a single node and build the full pipelines at the registry level.
My team has used the second approach so far, but it feels like overkill to me to have several single-node pipelines. How would you approach the design?
👍 1
i
Definitely the first approach; that's part of the beauty of Kedro nodes. What pushed your team towards the second solution? Did you have issues just calling the same function in different nodes?
f
No, kedro is a new tool for the team and a node seemed to be the smallest unit that could be reused. I think we need to find the right spot between one large pipeline containing everything and making every single node a pipeline. In my opinion, a pipeline, if it needs to be reused, should encapsulate some kind of business or implementation logic. But it is a struggle. 😅
i
Okay, yeah for me the smallest pipelines I define are 2 nodes, since otherwise you don't gain anything vs just reusing the underlying function and applying it to different nodes in different pipelines.
👍 1
f
That seems like a good criterion. Thanks for the replies. 👍
n
A single node per modular pipeline doesn't seem ideal; at that level you probably have big scripts stitched together, and it wouldn't be easy to reuse functions among pipelines.
For the pipeline registry, you can also make use of the pipeline API to gain some flexibility instead of stitching pipelines together manually, for example by using tags or some node arithmetic.
i
@Nok Lam Chan could you explain a bit more what you mean by 1. pipeline API vs manually stitching 2. tags and nodes arithmetic? related to this https://docs.kedro.org/en/stable/nodes_and_pipelines/slice_a_pipeline.html ?
n
They are the same thing.
1. Manual - pipeline_a + pipeline_b + pipeline_c + pipeline_d + …
2. Slicing pipelines - https://docs.kedro.org/en/stable/nodes_and_pipelines/slice_a_pipeline.html
Re: "have a common local library of utility functions and create a node in each pipeline which uses the dedicated function"
It may help to shift the focus of the conversation towards these dimensions rather than how big a pipeline/node should be:
• Reusability
• Ease of testing
• Performance (more opportunities to parallelise at the node level)
• Ease of debugging (i.e. if something goes wrong in a node, how would you debug it? Is there a checkpoint where you can immediately call catalog.load to recover the data, or do you have to rerun a 3-hour pipeline?)
f
Thanks @Nok Lam Chan. 👍