Matthias Roels
01/14/2023, 9:04 PM

Antony Milne
01/16/2023, 12:47 PM

Elior Cohen
01/22/2023, 11:48 AM

Ivan Danov
01/24/2023, 12:11 PM
With n
2-way conditions, you can generate up to 2^n possible run graphs, so it takes only three conditionals to make a pipeline incomprehensible and intractable to debug in anger. Kedro pipelines are, in a way, “*What You See Is What You Get*” in terms of execution, and that gives you the comfort that when something goes wrong, you won’t be chasing down why a certain subgraph did not run at all. If you really need conditionals, you can still achieve the same thing by using an orchestrator that allows them and running different Kedro pipelines depending on the outcome of some check. However, that should be your last resort, since it will make problems harder to diagnose when running in production for months and years to come.
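To make the orchestrator-level branching concrete, here is a minimal sketch - the pipeline names, drift metric and threshold are all made up for illustration; only the `kedro run --pipeline=<name>` CLI flag is real Kedro:

```python
# Hypothetical: keep each branch as its own registered Kedro pipeline and let
# the orchestrator (Airflow, Prefect, etc.) decide which one to trigger.

def choose_pipeline(drift_score: float, threshold: float = 0.1) -> str:
    """Return the name of the registered Kedro pipeline to run for this branch."""
    return "full_retrain" if drift_score > threshold else "score_only"

# The orchestrator task would then invoke the Kedro CLI, e.g.:
#   kedro run --pipeline=full_retrain
print(choose_pipeline(0.25))  # -> full_retrain
```

This keeps each Kedro pipeline itself branch-free, so every individual run stays “what you see is what you get”.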
2. easy way to create a map-reduce type pattern - this is something we’ve been thinking about and it’s on our radar. Currently the main blocker for such a construct in Kedro is the way we define catalogs (as you pointed out in 5.), but we believe that by refactoring our catalog loading and showing a few ways to simplify your catalogs, we can achieve it soon enough. In fact, it is already possible, but not widely known and not easy to discover by yourself, especially if you’d like to save data to disk at the intermediate stages.
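Here is a rough sketch of the map-reduce idea in plain Python - no Kedro imports; the tuples just stand in for `node(...)` definitions generated in a loop, and a dict stands in for the catalog:

```python
# Sketch only: (func, input_names, output_name) tuples mimic Kedro nodes.

def process(partition):
    return sum(partition)   # "map": one node per partition

def combine(*partials):
    return sum(partials)    # "reduce": fan the partials back in

N = 3
map_nodes = [(process, [f"shard_{i}"], f"partial_{i}") for i in range(N)]
reduce_node = (combine, [f"partial_{i}" for i in range(N)], "total")

# A toy "catalog" and runner, just to show the wiring executes end to end.
catalog = {f"shard_{i}": [i, i + 1] for i in range(N)}
for func, inputs, output in map_nodes + [reduce_node]:
    catalog[output] = func(*(catalog[name] for name in inputs))

print(catalog["total"])  # -> 9
```

In real Kedro the loop would emit `node(process, f"shard_{i}", f"partial_{i}")` entries instead, and the catalog is where persisting the intermediate `partial_*` datasets currently gets verbose.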
3. see the dependencies between nodes - we and most users like the pipeline definition API, but it is true that it can sometimes become difficult to follow. We’ve seen some users define alternative syntaxes, including YAML pipelines and a syntax like `output1, output2 = node(func)(input1, input2)`. All of those have their own drawbacks though, so when pipelines get too big, the best way to see the dependencies is through Kedro Viz.
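For the curious, a toy version of that alternative syntax fits in a few lines - to be clear, this is not Kedro’s API, just an illustration of how such a wrapper could record dependencies as you write them:

```python
# Toy sketch: node(func)(inputs) returns the output dataset name and logs edges.

EDGES = []  # (upstream dataset, consuming function) pairs

def node(func):
    def bind(*input_names):
        EDGES.extend((name, func.__name__) for name in input_names)
        return f"{func.__name__}_output"  # dataset name this node emits
    return bind

def clean(raw): ...
def train(table): ...

table = node(clean)("raw_data")
model = node(train)(table)
print(EDGES)  # -> [('raw_data', 'clean'), ('clean_output', 'train')]
```

The appeal is that dataflow reads like ordinary assignments; the drawback is that the graph is now built by side effects at import time, which is one reason we stuck with the explicit `node(func, inputs, outputs)` form.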
4. runtime stats of your pipeline (and nodes) in Kedro Viz - this is a common request and we are still unsure how to feel about it, because comparing Kedro with Prefect is not entirely accurate - Prefect is not just a pipeline definition framework, but also an execution platform (orchestrator). Kedro aims to be just the framework for building your pipelines, so you can deploy them to any other platform / orchestrator. As such, Kedro cannot report those stats, because execution environments vary wildly from one user to another. On the other hand, we could definitely provide those stats for local runs, making Kedro Viz something more than a pipeline visualiser - more akin to a pipeline development IDE. We are evaluating different futures for Kedro Viz, and if one outweighs the others, we might consider adding this.
5. the catalog to only contain the relevant datasets - technically there’s no need to have anything more than the first inputs and final outputs of your pipeline in the catalog. However, we’ve seen that lots of people define many intermediate datasets, which causes an explosion in the size of the catalog files. This is currently being addressed by our redesign of the configuration, and the idea is to enable users to define intermediate datasets with fewer entries than now. Hopefully that will solve this problem.
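As a stopgap today, YAML anchors can already cut the repetition for intermediate datasets - a sketch with illustrative dataset names and paths (Kedro skips top-level catalog keys starting with `_`, which is what makes the anchor entry safe):

```yaml
# Illustrative catalog.yml: one anchor shared by two intermediate datasets.
_parquet: &parquet
  type: pandas.ParquetDataSet

preprocessed_companies:
  <<: *parquet
  filepath: data/02_intermediate/preprocessed_companies.pq

preprocessed_shuttles:
  <<: *parquet
  filepath: data/02_intermediate/preprocessed_shuttles.pq
```

Each dataset still needs its own entry, but shared settings (type, load/save args, credentials) only have to be written once.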
6. deploying at different pipeline granularity - this one is a major focus of the team this year, so stay tuned - we’ll definitely work on it, since more and more users need it.

Matthias Roels
01/25/2023, 2:06 PM