# questions
f
Hi! I come from R where, for my data science pipelines, I use a package named "targets". As in Kedro, it is possible to look at the directed acyclic graph (DAG). However, in contrast with Kedro, targets shows which nodes/datasets are up to date and which ones are outdated. The package marks a node as outdated if one of its upstream nodes has changed. For instance, if node "A" changes, then all of the nodes downstream of it become outdated. See this link for an example: https://www.google.com/search?q=targets+package&sca_esv=597009428&rlz=1C1GCEB_enCA10[…]D1QQ_AUoAnoECAIQBA&biw=1920&bih=929&dpr=1&safe=active&ssui=on As can be seen, the targets package detected that the function "test_results" has changed and told us that everything downstream is outdated. I was wondering if there's similar functionality in Kedro. It would be quite useful because some nodes in Kedro can take a long time to run. Let's say that node "X" takes a day to run. If you change the code in another node, you don't automatically know which nodes it impacts, and whether or not you need to rerun node "X". In other words, targets will skip all nodes that are already up to date when running the whole pipeline.
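For readers unfamiliar with targets, its staleness check can be sketched in plain Python: fingerprint each node from its own code plus the fingerprints of its inputs, so any upstream change cascades downstream. This is only an illustration of the idea, not targets' or Kedro's actual implementation, and all names here are made up:

```python
import hashlib

# Tiny pipeline: A -> B -> C (illustrative node code and dependencies)
code = {"A": "def a(): ...", "B": "def b(x): ...", "C": "def c(x): ..."}
deps = {"A": [], "B": ["A"], "C": ["B"]}

def fingerprint(node_code: str, input_fingerprints: list[str]) -> str:
    """Hash a node's source code together with its inputs' fingerprints,
    so any upstream change produces a new fingerprint downstream."""
    h = hashlib.sha256(node_code.encode())
    for fp in input_fingerprints:
        h.update(fp.encode())
    return h.hexdigest()

def all_fingerprints(code):
    fps = {}
    for node in ("A", "B", "C"):  # topological order
        fps[node] = fingerprint(code[node], [fps[d] for d in deps[node]])
    return fps

old = all_fingerprints(code)
code["A"] = "def a(): return 1"   # edit node A
new = all_fingerprints(code)

# Nodes whose fingerprint changed are outdated and must be rerun.
outdated = [n for n in new if new[n] != old[n]]
print(outdated)  # → ['A', 'B', 'C']
```

Because B's fingerprint includes A's, and C's includes B's, editing A alone marks the whole downstream chain as outdated, which is exactly the behaviour described above.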
m
Hi @Francis Duval, are you talking about some way of seeing this in Kedro Viz? Or just in the terminal where your run is running?
f
@Merel, when you run `kedro run --pipeline pipeline_name`, the results in your pipeline are up to date since you just ran it. But then you edit your pipeline, and some parts of it (those downstream of the node/dataset you modified) become outdated. For your pipeline to be up to date again, you have to rerun it. However, if you run `kedro run --pipeline pipeline_name` again, it will run everything, including nodes that are not impacted (not downstream) by the node/dataset you modified. This can be a problem if your pipeline contains nodes that take a long time to compute. I would like a tool that tells me which parts of my pipeline are up to date and which ones are outdated. Would be nice to have this in Kedro-Viz indeed! A command like `kedro run --only_outdated` that would only run outdated nodes would be great too. 🙂
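The selection logic such a hypothetical `--only_outdated` flag would need is essentially a downstream traversal of the DAG from the changed nodes. A minimal sketch (none of this is Kedro API; the node names and edges are illustrative):

```python
from collections import deque

def downstream_of(changed, edges):
    """Return the changed nodes plus everything reachable from them,
    i.e. the set of nodes that must be rerun."""
    children = {}
    for src, dst in edges:
        children.setdefault(src, []).append(dst)
    to_run, queue = set(changed), deque(changed)
    while queue:
        node = queue.popleft()
        for child in children.get(node, []):
            if child not in to_run:
                to_run.add(child)
                queue.append(child)
    return to_run

# X is the expensive node. It is not downstream of B, so editing B
# would not require rerunning X.
edges = [("A", "B"), ("B", "C"), ("A", "X")]
print(sorted(downstream_of({"B"}, edges)))  # → ['B', 'C']
```

In this example, only B and C need to rerun after an edit to B, while the day-long node X is safely skipped.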
m
Thanks for clarifying. We don't have functionality like this at the moment, but we have been talking about features like this (cc: @Rashida Kanchwala). We're currently exploring what such features would mean for Kedro and whether they would take us too far into orchestrator territory. But it's always good to hear what kinds of features are of interest.
f
Thanks for your answer, Merel! Kedro works a bit differently from targets in R, so the way of interacting with it may differ. Maybe caching datasets that take a long time to compute could be a way to do it.
m
Yes, maybe `CachedDataset` (https://docs.kedro.org/en/stable/_modules/kedro/io/cached_dataset.html#) can help. But the automatic detection of "outdated" nodes isn't something we support right now. `kedro run` does offer several options to run a pipeline only from a certain node; `kedro run --help` shows all of them.
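For what it's worth, a `CachedDataset` can be declared in the catalog by wrapping an underlying dataset, so the data is loaded from disk at most once per run (the dataset name and filepath below are illustrative, and the wrapped dataset type is an assumption):

```yaml
# conf/base/catalog.yml
expensive_table:
  type: CachedDataset
  dataset:
    type: pandas.CSVDataset
    filepath: data/02_intermediate/expensive_table.csv
```

Note this caches within a single run; it does not persist "up to date" state across runs, which is what the targets-style behaviour would need.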
f
I'm pretty sure that in targets, all datasets, or "targets" as they call them, are cached automatically. I am wondering why we wouldn't want this in Kedro. Maybe there are some drawbacks...
Oh, I think this plugin does exactly what I want. It seems to cache all datasets that are in the catalog.yml: https://pypi.org/project/kedro-cache/