# questions
i
Bit of a non-technical question, more on the pipeline design side: discussing with a coworker, we realized we have very different approaches to designing pipelines.

I tend to reduce individual nodes to the smallest logical increment. For example, in my feature engineering pipeline I have one node that generates time-based features (feature engineering on the datetime index), another node that generates synthetic variables, another that generates lags, another for feature aggregations, etc., and I pass the datasets between these nodes as memory datasets.

My coworker tends to make these nodes larger, basically encompassing an entire step: for example, intermediate (or raw) to primary in one node, so all the different cleaning steps are performed in one node, then all the feature engineering I described above in another node. He doesn't use memory datasets frequently, as usually the inputs and outputs of each node are in a form he wants to persist.

In my view, each of these approaches has different benefits and drawbacks.

Pros:
- Mine: it is easier to see from kedro-viz what is going on and how different nodes depend on other steps.
- His: the pipeline view is a lot cleaner, and if you want to dig, you have the code available.

Cons:
- Mine: my pipeline definitions start to become a bit unwieldy as their size grows, and refactoring them becomes more difficult, since not having these dependencies defined "in code" means IDEs and linters can't help me spot issues.
- His: there is much less visibility into the different steps performed, and if, for example, down the line we want to persist an intermediate dataset, it's basically impossible.

I was wondering if there's been any prior discussion regarding this, either internal within QB or some article or documentation I could refer to, and I was curious to hear the thoughts of other people who work with Kedro daily, particularly with pipelines on the bigger side.
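To illustrate (all function and dataset names below are made up, not our actual code), the two styles look roughly like this in the pipeline definition; in the first one, `with_time_features` isn't registered in the catalog, so it's passed between nodes in memory:

```python
import pandas as pd
from kedro.pipeline import Pipeline, node


# Hypothetical feature-engineering functions, only here to make the sketch self-contained.
def add_time_features(df: pd.DataFrame) -> pd.DataFrame:
    return df.assign(month=df.index.month, dayofweek=df.index.dayofweek)


def add_lags(df: pd.DataFrame) -> pd.DataFrame:
    return df.assign(lag_1=df["value"].shift(1))


# Style 1 (mine): small nodes; "with_time_features" is not in the catalog,
# so Kedro passes it between the nodes as a memory dataset.
fine_grained = Pipeline(
    [
        node(add_time_features, inputs="model_input", outputs="with_time_features"),
        node(add_lags, inputs="with_time_features", outputs="feature_table"),
    ]
)


# Style 2 (his): one coarse node covering the whole feature-engineering step;
# its input and output are datasets you would register and persist anyway.
def engineer_features(df: pd.DataFrame) -> pd.DataFrame:
    return add_lags(add_time_features(df))


coarse_grained = Pipeline(
    [
        node(engineer_features, inputs="model_input", outputs="feature_table"),
    ]
)
```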
d
I'm not aware of any documented best practices, and I think it does boil down to personal preference at a point (Kedro isn't that prescriptive). My preference/past use is more aligned with your approach; I was never a fan of projects (e.g. from some verticals) that handle a lot of steps in a node. I think an important point is to reduce duplication: if you have less granular nodes and find yourself copy-pasting code between them, that could be an indication to use smaller nodes. However, it's also fair to use a helper function rather than an explicit node. Last but not least, I don't think Kedro pipeline design should be determined by how it will be deployed in production; e.g. I've heard arguments for not using `MemoryDataSet`s and having larger nodes because of how they map to workflow orchestrators, but I think that's more a problem of the way the mapping/deployment is done, and you should maintain the "optimal" logical pipeline design independent of that.
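To make the duplication point concrete (hypothetical names, just a sketch): if two coarse nodes need the same transformation, a plain helper keeps things DRY without forcing that logic to become a node of its own:

```python
import pandas as pd


# Plain helper: shared cleaning logic, deliberately not a node in its own right.
def standardise_column_names(df: pd.DataFrame) -> pd.DataFrame:
    return df.rename(columns=lambda c: c.strip().lower().replace(" ", "_"))


# Two coarse nodes reuse the helper instead of copy-pasting the logic
# or promoting it to a separate node.
def clean_orders(orders: pd.DataFrame) -> pd.DataFrame:
    return standardise_column_names(orders).drop_duplicates()


def clean_customers(customers: pd.DataFrame) -> pd.DataFrame:
    return standardise_column_names(customers).dropna(subset=["customer_id"])
```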
j
Totally agree; also take into account that extremes are undesirable. My two cents:
• If you have very big data and/or slow storage, having lots of nodes can increase the time spent on I/O operations a lot.
• In the catalog you can create layers to organize nodes, but with too many layers kedro-viz won't show them.
d
> If you have very big data and/or slow storage, having lots of nodes can increase the time spent on I/O operations a lot.
This isn't an issue if using `MemoryDataSet`s, though. Would never recommend breaking something down into a lot of nodes and persisting each output.
j
*In the case you need/want for some reason to store the node outputs 🙃 or use `CachedDataSet`.
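For example, something like this (a sketch against the Kedro 0.18 API, dataset name made up): the output still gets written to disk, but downstream nodes in the same run read it from memory instead of loading it again:

```python
from kedro.extras.datasets.pandas import ParquetDataSet
from kedro.io import CachedDataSet, DataCatalog

# The wrapped dataset is saved to disk as usual, but also kept in memory for the
# rest of the run, so downstream nodes don't pay the load cost a second time.
catalog = DataCatalog(
    {
        "feature_table": CachedDataSet(
            dataset=ParquetDataSet(filepath="data/04_feature/feature_table.parquet")
        )
    }
)
```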
As one of my teachers said: "Imagine that the person who will maintain your code is a very angry person who knows where you live" hehehe
I tend to avoid nodes with lots of inputs and nodes that do similar things
n
@Stephanie Kaiser
i
The workflow orchestrator question and its partial incompatibility with `MemoryDataSet`s is what got us talking about this. The bigger nodes with persisted inputs and outputs match the job-per-node deployment pattern. A pattern I personally prefer is single pipelines as jobs: within the orchestrator you stitch together different pipelines, and the only data passed between them is the persisted data at the end/start of each sub-pipeline.
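Roughly what I mean, as a sketch (hypothetical project and pipeline names): each registered modular pipeline becomes one orchestrator job, e.g. the orchestrator just calls `kedro run --pipeline=feature_engineering` for that step, and only the datasets at the pipeline boundaries are persisted:

```python
# pipeline_registry.py (sketch)
from typing import Dict

from kedro.pipeline import Pipeline

from my_project.pipelines import data_cleaning, feature_engineering, modelling


def register_pipelines() -> Dict[str, Pipeline]:
    cleaning = data_cleaning.create_pipeline()
    features = feature_engineering.create_pipeline()
    model = modelling.create_pipeline()

    return {
        # One entry per orchestrator job; intermediate datasets inside each
        # pipeline can stay in memory, only the boundary datasets are persisted.
        "data_cleaning": cleaning,
        "feature_engineering": features,
        "modelling": model,
        "__default__": cleaning + features + model,
    }
```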
> I don't think Kedro pipeline design should be determined by how it will be deployed in production
For us this is 95% of our Kedro usage, so it's an important consideration. At the moment our pipelines run as a single container scheduled through cron, but we've been evaluating orchestrators, and some of our design patterns clash with the way orchestrators do things.
d
> The workflow orchestrator question and its partial incompatibility with `MemoryDataSet`s is what got us talking about this.
I don't think the node-to-task Kedro-to-orchestrator mapping is reasonable, and it makes more sense to map modular pipelines to orchestrator tasks. I assume you have this question because you're doing node-to-task?
i
We're still in the evaluation phase, so we aren't doing either. But if I'm not mistaken, in the Prefect deployment documentation, each individual node is a task.
d
> But if I'm not mistaken, in the Prefect deployment documentation, each individual node is a task.
You're not mistaken, and most of the deployment docs do suggest a per-node deployment; however, they're also rather outdated (the Prefect deployment doc doesn't cover Prefect 2.0, and in general a major pain point for the deployment docs is keeping them up to date and in line with the latest thinking). If you're specifically interested in Prefect, here are some of my thoughts: https://kedro-org.slack.com/archives/C03RKP2LW64/p1678116063310559?thread_ts=1678056372.557949&cid=C03RKP2LW64

Take them with a grain of salt, because I look into some of these things without having time to prototype them (but, in my defense, I have spent more time than 99% of people thinking about deployment of Kedro to orchestrators :P). In general, I'd say the GetInData team is currently a better authority for deployment to workflow orchestrators than the official Kedro docs, and we hope to collaborate with them further to have better answers. (I believe, in our last conversation with them, they were looking to add support for mapping modular pipelines to tasks, but I could be mistaken and haven't checked their plugins lately to see if that is the case.)