# questions
c
Hello Kedro Community, I am wondering if there is a way to provide a custom function for checking whether the output(s) of a node need to be rebuilt, and thus allow skipping certain transformation steps? Clearly there should also be a way to "force" building such nodes even if the function reports that the outputs do not need to be rebuilt.
n
This is interesting. It is very much possible, but at the moment it needs some manual effort to track the pipeline. It only works if you have persisted the data; this could be built with a custom runner, I think.
To add more: kedro run does not have "memory", i.e. it is a stateless pipeline. It doesn't know what existed before the run, and it doesn't remember which outputs the last run produced. It can be made stateful, but there are lots of edge cases that are hard to deal with, such as code changes and data changes.
If I remember correctly, @David Stanley had a similar question a few weeks ago?
c
Hi Nok, thanks for referencing David's question. It is indeed the same use-case. The solution proposed by @datajoely to create a plugin seems like a possible way forward. I was looking at the available hooks and don't see a clear location for one:
• before_node_run - could perform the necessary "up-to-date" checks, but it does not really have the power to skip node execution. Clearly it could modify the inputs in some unholy manner, but that's not exactly clean.
• before_pipeline_run - does not seem to be able to influence the running process either.
before_node_run seems to me like the best candidate, but it needs to be able to signal to the runner that there is no need to run the node (see the sketch below).
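Roughly, the most a hook can do today looks something like this; a minimal sketch (not an official Kedro feature) that only reports staleness via DataCatalog.exists(), since the hook spec has no return value that tells the runner to skip the node:

```python
import logging

from kedro.framework.hooks import hook_impl

logger = logging.getLogger(__name__)


class UpToDateCheckHooks:
    @hook_impl
    def before_node_run(self, node, catalog):
        # catalog.exists() is False for non-persisted datasets (e.g. in-memory ones),
        # so purely in-memory nodes are never reported as "up to date".
        if node.outputs and all(catalog.exists(name) for name in node.outputs):
            # The check works, but there is no mechanism here to actually skip the node.
            logger.info("All outputs of %s already exist; it could be skipped.", node.name)
```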
n
Just thinking about this quickly, I don't think it's possible with hooks alone. If there is interest, I suggest creating a GitHub issue to lay out the design. You can create a runner plugin; I have made a custom runner plugin before, see https://kedro.org/blog/build-a-custom-kedro-runner
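For illustration, the runner-plugin route could look something like the sketch below. The exact _run signature differs between Kedro versions, and the skip logic (drop nodes whose persisted outputs already exist, keep everything downstream of the rest) is an assumption, not tested code:

```python
from kedro.pipeline import Pipeline
from kedro.runner import SequentialRunner


class SkipUpToDateRunner(SequentialRunner):
    def _run(self, pipeline, catalog, hook_manager=None, session_id=None):
        # Nodes whose outputs are not all persisted-and-present still need to run.
        stale = [
            node.name
            for node in pipeline.nodes
            if not node.outputs or not all(catalog.exists(o) for o in node.outputs)
        ]
        # Re-run the stale nodes plus everything downstream of them, so consumers
        # of regenerated data are refreshed too; otherwise run nothing.
        filtered = pipeline.from_nodes(*stale) if stale else Pipeline([])
        return super()._run(filtered, catalog, hook_manager, session_id)
```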
f
I thought about this for some time too but never got around to spending much time on it. Some years ago we had a similar internal tool and we used hashes of both the (node) code and the dataset to keep track of which parts of our pipelines needed execution. Especially when not all data is persisted, this can help for big pipelines, but it's not trivial to implement. One thing we were on the fence about was whether or not to include versions of the libraries (e.g. pandas) in the hash. We were doing some Monte Carlo simulations, so these were important when it came to some of the randomisation in our code (might be just our use cases in the past, but just adding it here). Would be interested in contributing 🙂
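For concreteness, the fingerprinting looked roughly like this; a from-memory sketch where the state-file location, the helper names, and hashing raw file bytes are all illustrative assumptions, not an existing Kedro feature:

```python
import hashlib
import inspect
import json
from pathlib import Path

STATE_FILE = Path(".kedro_fingerprints.json")  # hypothetical location for previous hashes


def node_fingerprint(node, input_paths):
    """Hash the node's function source together with the bytes of its input files."""
    digest = hashlib.sha256(inspect.getsource(node.func).encode())
    for path in sorted(input_paths):
        digest.update(Path(path).read_bytes())
    # Library versions (e.g. pandas.__version__) could be mixed in here too,
    # which was the open question mentioned above.
    return digest.hexdigest()


def is_up_to_date(node, input_paths):
    previous = json.loads(STATE_FILE.read_text()) if STATE_FILE.exists() else {}
    return previous.get(node.name) == node_fingerprint(node, input_paths)
```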
d
🚀
n
@Florian d I think that should work, though it may be challenging to ensure the data is not changed externally (i.e. in a database). The logic of checking which nodes affect which datasets is relatively straightforward; you can probably get it with the Pipeline API already.
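e.g. something like this (dataset names here are just illustrative):

```python
# All nodes required to (re)build a given output dataset:
producers = pipeline.to_outputs("model_input_table")

# All nodes affected by a change in a given input dataset:
consumers = pipeline.from_inputs("companies")

print([n.name for n in producers.nodes], [n.name for n in consumers.nodes])
```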
f
Agreed, I think there might be some limitations in terms of which dataset types support this functionality. I have already tried the node hashing and it worked quite smoothly.
d
My previous thoughts were to riff off of --to-nodes: after the sub-DAG from that is created (I am assuming that's what it does), work either backwards or forwards to check for missing data and run the nodes needed to create it. Given that we can have virtual catalog entries (non-saved files), we might have to work forward and just check whether an output already exists, skipping the node if it does. Any nodes reliant on new data generated through this process would also be rerun, even if they have pre-existing outputs. Maybe we keep an initially empty set of regenerated catalog entries to append to and check against (see the sketch below).

Maybe a more clearly defined use case would help guide us here, though. For my part, I am thinking of when new sub-pipelines and extra nodes have been added on top of the existing pipeline by others. I have numerous tables and pipelines already run and outputs saved. I want to locally create the missing output tables for the new overall pipeline, and I don't want to have to figure out what could be a very long list of input nodes to specify with --from-nodes for the different elements that have been added. So I want to do kedro run or --to-nodes '[target-end-node]', but not run the whole pipeline: just run what is needed to fill the gaps, because my pipeline is very big and would take a long time to run end-to-end.
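As a rough, untested sketch of that "fill the gaps" run, assuming the relevant datasets are persisted (the node and dataset names are illustrative):

```python
# Same sub-DAG as `kedro run --to-nodes target-end-node`
target = pipeline.to_nodes("target-end-node")

# Persisted outputs that do not exist yet are the gaps to fill.
missing = {
    name
    for node in target.nodes
    for name in node.outputs
    if not catalog.exists(name)
}

if missing:
    # Rebuild the producers of the missing datasets, plus anything downstream of
    # them inside the target sub-DAG, so nodes fed by regenerated data re-run too.
    to_run = target.only_nodes_with_outputs(*missing) + target.from_inputs(*missing)
else:
    to_run = None  # everything already exists, nothing to do
```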
n
What would be the definition of checking whether the dataset exists, in a way that guarantees it won't accidentally run on outdated data?
The feature kind of exists already: when the pipeline fails, it produces a log that suggests how you can recover the run. That already takes care of persisted data vs MemoryDataSet, so we can pretty much reuse it.
d
What would be the definition of checking whether the dataset exists, in a way that guarantees it won't accidentally run on outdated data?
No guarantee - I'm not bothered about running on outdated data in that use-case (for my actual use-case, it was not possible for the raw data to change, and if there were also changes to the existing pipeline, errors would flag it). Perhaps it would just check whether a file exists at the catalog entry's filepath; that would do as starting-point functionality, I should think.
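That starting-point check already exists on the catalog, e.g. in a kedro ipython session where catalog is pre-loaded (the entry name is illustrative):

```python
catalog.exists("model_input_table")  # True if the dataset behind the entry exists on disk
```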
The feature kind of exists already: when the pipeline fails, it produces a log that suggests how you can recover the run.
Although one improvement that would be nice is to make that suggestion more directly copy-pasteable. Recent experience has been that it does not copy-paste well into a terminal, meaning you have to manually go and fix all the lines.
👍 1
n
No guarantee - I'm not bothered about running on outdated data in that use-case (for my actual use-case, it was not possible for the raw data to change, and if there were also changes to the existing pipeline, errors would flag it). Perhaps it would just check whether a file exists at the catalog entry's filepath; that would do as starting-point functionality, I should think.
Yep, that would be a good head start. Let me know if you ever start working on this or need some extra help.
K 1
@David Stanley Valid comment, I think I have encountered this but haven't heard too many complaints. I created an issue https://github.com/kedro-org/kedro/issues/3276, let's see if this is a more common problem and whether we can prioritise it.
d
Yep, that would be a good head start. Let me know if you ever start working on this or need some extra help.
Not started working on it yet, sorry, a bit busy with things; will let you know if/when I do though. My initial thinking is to wrap the node functions and use their input and output catalog entries: check the inputs against a set of re-run node outputs (initially empty); if any match, add the node's outputs to that set and run the node as normal; otherwise check whether a file exists at the output, and if so skip the node, else run it as normal (roughly as sketched below).
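Something like the following, as a very rough sketch; it assumes nodes are visited in topological order (Pipeline.nodes already is) and that the skippable outputs are persisted, and it uses kedro.runner.run_node just to keep the example short:

```python
from kedro.runner import run_node

regenerated = set()  # catalog entries produced during this run

for node in pipeline.nodes:  # Pipeline.nodes is topologically sorted
    needs_run = (
        any(inp in regenerated for inp in node.inputs)             # upstream data was rebuilt
        or not node.outputs                                        # nothing to check against
        or not all(catalog.exists(out) for out in node.outputs)    # a gap to fill
    )
    if needs_run:
        run_node(node, catalog, hook_manager)  # hook_manager comes from the session
        regenerated.update(node.outputs)
    else:
        print(f"Skipping {node.name}: outputs already exist")
```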
🥳 1
Valid comment, I think I have encountered this but haven't heard too many complaints. I created an issue https://github.com/kedro-org/kedro/issues/3276, let's see if this is a more common problem and whether we can prioritise it.
Awesome, thanks.