# user-research
I don’t really know if this is the right place to post this (happy to move it to a different channel), but here goes… We have been working with Kedro for 2+ years now (mainly 0.16.6 and 0.17.7) in a highly customised setup to support ML pipelines for multiple product x country combinations, so I thought it would be a good idea to share some feedback. To be honest, I have always had a love-hate relationship with Kedro (and especially our setup!), but lately I have started to like Kedro more. Here’s why: I have always compared Kedro to Python-based orchestration engines such as Prefect or Dagster. In reality, this is an unfair comparison, and it makes much more sense to view Kedro as the dbt (https://www.getdbt.com) for ML. In that regard, Kedro is easy to use, enforces SWE best practices and plays nicely with DS tooling, which makes it an excellent package to use in an ML platform. However, I do think Kedro can still pick up a few ideas from e.g. Dagster/Prefect, and that brings me to the stuff I don’t like:
• Running nodes conditionally based on e.g. the outcome of an upstream node or a config param.
• Having an easy way to create a map-reduce type pattern in a pipeline, where you use one function multiple times in parallel with different inputs.
• Without Kedro-Viz, it is really hard to see the dependencies between nodes in a pipeline due to the design of the pipeline API, and that’s annoying if you just want to do a quick lookup or review a PR introducing a new pipeline.
• Talking about Kedro-Viz, it would be nice if you could use it to track runtime stats of your pipeline (and nodes), just like the Prefect API does.
• One of the most impressive features in Kedro is definitely its I/O abstraction with the catalog. However, in many big projects the catalog becomes huge.
Drawing inspiration from Dagster’s software-defined assets, I think we should strive to reduce the catalog to only the relevant datasets (your data assets from your data product, if you think in terms of “data mesh”). I am just wondering if there is a way to enforce this more in a Kedro project… Hope the feedback is useful! Apologies for the long post 🙈
👍🏾 1
👍 10
👍🏽 1
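To make the map-reduce bullet above concrete: the pattern being asked for is one function fanned out over several inputs in parallel and then reduced. A plain-Python sketch of it (deliberately outside any Kedro API, since that’s exactly the missing piece; all function and field names here are invented for illustration):

```python
# Fan-out / fan-in sketch: run one function over many inputs in
# parallel (the "map"), then combine the results (the "reduce").
from concurrent.futures import ThreadPoolExecutor

def score_country(country_data):
    # Stand-in for a per-country node function (e.g. model training).
    return {"country": country_data["country"], "score": len(country_data["rows"])}

def reduce_results(results):
    # Reduce step: collect the per-country outputs into one report.
    return {r["country"]: r["score"] for r in results}

inputs = [
    {"country": "NL", "rows": [1, 2, 3]},
    {"country": "DE", "rows": [1, 2]},
]

with ThreadPoolExecutor() as pool:
    results = list(pool.map(score_country, inputs))

report = reduce_results(results)
```

In Kedro terms, this would mean instantiating the same node function once per product x country combination and wiring the outputs into a single reducing node, which today forces one catalog entry per branch.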
Oh, and one more thing: we use Argo Workflows (on k8s) as our orchestration engine, and I really like it for several reasons I won’t dive into here. Argo Workflows simply runs a DAG of pods. To use it efficiently, the tasks running inside one pod cannot be too big (otherwise you might as well use a k8s Job to run your whole flow), but they cannot be too atomic either (otherwise you waste a lot of time on pod-setup overhead). Hence, we define a task to be a (modular) Kedro pipeline, and our workflow runs a couple of Kedro pipelines one after another (or in parallel where possible). Browsing through the GitHub issues, I found this one: https://github.com/kedro-org/kedro/issues/770, which contains a really nice idea for a feature: “_After having your modular pipelines structured analogous to folders and nodes to files, we can provide a uniform deployment plugin where users can decide the level at which their nodes will be run in the orchestrator, e.g. imagine something like kedro deploy airflow --level 2 which will make sure that the output configuration will run each node separately, but collapse the nodes at level 3 as singular tasks in the orchestrator._” I have thought about creating a Python package that does just that, because it would make the UX for Argo Workflows so much better! Any ideas on how we can move this forward? Happy to contribute in any way possible!
🎉 1
👍 2
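A rough sketch of how the quoted `--level` idea might work (plain Python; this is not an actual Kedro or plugin API, and the node names are invented): group a pipeline’s namespaced node names by their first `level` namespace segments, then emit one orchestrator task per group, e.g. one Argo pod running the corresponding `kedro run` selection.

```python
from collections import defaultdict

def collapse_nodes(node_names, level):
    """Group dotted node names by their first `level` namespace
    segments; each resulting group would become one task (pod)
    in the orchestrator's DAG."""
    groups = defaultdict(list)
    for name in node_names:
        parts = name.split(".")
        key = ".".join(parts[:level]) if len(parts) > level else name
        groups[key].append(name)
    return dict(groups)

nodes = [
    "ingest.download",
    "ingest.validate",
    "train.features.build",
    "train.features.select",
    "train.fit",
]

# level=1 -> one orchestrator task per top-level modular pipeline
tasks = collapse_nodes(nodes, level=1)
```

With `level=1` this yields two tasks (`ingest` and `train`); a higher level splits pods more finely, which is exactly the "not too big, not too atomic" trade-off described above.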
I think @Ivan Danov has ideas for this and for enhancing Kedro’s deployment model - not sure where it currently stands in terms of priorities though.
Also thank you very much for all the feedback! It’s super helpful, and several of the things you’ve mentioned are already on our radar so we’re definitely thinking along similar lines to you!
👍 1
I strongly agree with the first two points
@Matthias Roels Thank you for sharing your ideas! We really love seeing community members share their ideas with us, and we take inspiration from the community very often. If you are interested, here are some comments on some of them:
1. *Running nodes conditionally* - we’ve consciously not included conditional node runs. Adding conditional runs makes pipelines much harder to reason about, and it can quickly spiral out of control: with `n` 2-way conditions you can generate up to `2^n` possible run graphs, so it takes only 3 conditionals to make a pipeline incomprehensible and intractable to debug in anger. Kedro pipelines are in a way “*What You See Is What You Get*” in terms of execution, and that gives you the comfort that when something goes wrong, you won’t be chasing why a certain subgraph did not run at all. If you need conditionals, you could still achieve the same thing by using an orchestrator that allows them and running your different Kedro pipelines depending on the outcome of something. However, that should only be your last resort, since it will make finding problems harder when running in production for months and years to come.
2. *Easy way to create a map-reduce type pattern* - this is something we’ve been thinking about and it’s on our radar. Currently the main blocker for such a construct in Kedro is the way we define catalogs (as pointed out by you in 5.), but we believe that by refactoring our catalog loading and showing a few ways to simplify your catalogs, we can achieve that soon enough. In fact, it is already possible, but not widely known and not easy to get to by yourself, especially if you’d like to save data to disk in the intermediary stages.
3. *Seeing the dependencies between nodes* - we and most users like the pipeline definition API, but it is true that it can sometimes become a bit difficult to follow. We’ve seen some users define alternative ways of composing pipelines, including a syntax like `output1, output2 = node(func)(input1, input2)`. All of those have their own drawbacks though, so when pipelines get too big, the best way to see the dependencies is through Kedro-Viz.
4. *Runtime stats of your pipeline (and nodes) in Kedro-Viz* - this is a common request, and we are still unsure how to feel about it, because comparing Kedro with Prefect is not entirely accurate: Prefect is not just a pipeline definition framework, but also an execution platform (orchestrator). Kedro aims to be just the framework for building your pipelines, so you can deploy them to any other platform/orchestrator. As such, Kedro cannot report those stats, because execution environments vary wildly from one user to another. On the other hand, we can definitely provide those stats for local runs, making Kedro-Viz into something more than just a pipeline visualiser, more akin to a pipeline-development IDE. We are evaluating different futures for Kedro-Viz, and if one weighs more than another, we might consider adding this.
5. *The catalog should only contain the relevant datasets* - technically there’s no need to have anything more than the first inputs and final outputs of your pipeline in your catalog. However, we’ve seen that lots of people define many intermediate datasets, which causes an explosion of the catalog files. This is currently being addressed by our redesign of the configuration, and the idea is to enable users to define intermediate datasets with fewer entries than now. Hopefully this will address the problem.
6. *Deploying at different pipeline granularity* - this one is a major focus of the team this year, so stay tuned: we’ll definitely work on it, since it’s becoming more and more needed by everyone.
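To illustrate the “last resort” in point 1, branching pushed up to the orchestrator rather than into Kedro: a hypothetical DAG step could inspect an upstream node’s output and decide which Kedro pipeline the next task runs. The pipeline names and threshold here are invented for the example:

```python
def next_pipeline(validation_score, threshold=0.8):
    """Orchestrator-side branching: choose which Kedro pipeline the
    next DAG task should invoke, based on an upstream node's output.
    Both pipeline names are hypothetical."""
    if validation_score >= threshold:
        return "kedro run --pipeline=deploy_model"
    return "kedro run --pipeline=retrain_model"
```

Each Kedro pipeline itself stays condition-free and “What You See Is What You Get”; only the orchestrator decides which of the two commands to execute.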
Nice! I look forward to seeing how this project evolves! Especially number 6 is important for us, but I am also interested in how 3-5 will play out. One additional point though: suppose you deploy your Kedro pipeline to an orchestrator whose job it is to run a DAG of the Kedro pipeline’s sub-pipelines (with a kedro run cmd in every step of the DAG). I think it is still valuable to have runtime stats of individual Kedro nodes (and maybe sub-pipelines), as this is something that cannot be exposed in the orchestrator (simply because it isn’t aware of what’s going on in every step, and nor should it be). This way, it is easy to spot bottlenecks.
👍 1
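On the runtime-stats point: Kedro’s hook mechanism (`before_node_run`/`after_node_run`) is one place such per-node timings could be collected project-side today. A framework-agnostic sketch of the bookkeeping (the actual hook wiring is omitted; the class and method names are illustrative, not a Kedro API):

```python
import time
from collections import defaultdict

class RuntimeStats:
    """Collect per-node wall-clock times. In a real project this
    logic would live in a Kedro hook class whose before_node_run /
    after_node_run implementations delegate to these methods."""

    def __init__(self):
        self._starts = {}
        self.durations = defaultdict(list)  # node name -> list of runtimes (s)

    def before_node_run(self, node_name):
        self._starts[node_name] = time.perf_counter()

    def after_node_run(self, node_name):
        elapsed = time.perf_counter() - self._starts.pop(node_name)
        self.durations[node_name].append(elapsed)

stats = RuntimeStats()
stats.before_node_run("train_model")
time.sleep(0.01)  # stand-in for the node's actual work
stats.after_node_run("train_model")
```

Because this runs inside the `kedro run` process, it sees node-level timings the orchestrator never can; the collected durations could then be persisted and surfaced, e.g. by Kedro-Viz.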