# questions
n
Hi everyone! I am new to Kedro and I am wondering how we can run pipelines conditionally in a scalable way. For example, let's say I have 4 pipelines a, b, c, d. First we run a, then depending on its outputs we either run b or c->d. Related question: can we end the run early based on the output of a pipeline?
n
Hey, welcome!
Do you have a specific example, or are you just playing around? The quick answer is you cannot do this. It may sound disappointing, but Kedro was built with a focus on robust pipelines. Dynamic pipelines are generally unpredictable in nature, and we advise against them unless really necessary; usually there are better ways to make the pipeline deterministic instead.
There are workarounds, but they are basically 2-step (or multi-step) approaches (https://demo.kedro.org/). In general, Kedro expects an execution graph that is determined before any node gets executed.
n
For now I am learning, but the example I am working on requires me to: 1. load the data, 2. classify the data, 3. depending on the classification, apply post-processing a or b. I know I could create a single pipeline for the two post-processings, but it doesn't sound scalable, e.g. if after the classification I want to apply 10 nodes.
j
you can trigger the Kedro pipelines conditionally though! pseudocode:
```python
import json
import subprocess

import boto3

# run the first pipeline, then branch on a result it persisted (S3 location is illustrative)
subprocess.run(["kedro", "run", "--pipeline", "load_and_classify"], check=True)
body = boto3.client("s3").get_object(Bucket="results", Key="result.json")["Body"]
if json.load(body)["post_processing"] == "a":
    subprocess.run(["kedro", "run", "--pipeline", "post_processing_a"], check=True)
else:
    subprocess.run(["kedro", "run", "--pipeline", "post_processing_b"], check=True)
```
👍🏼 1
👍 1
in other words, the lack of conditional nodes in Kedro should not stop you from applying conditional logic outside of Kedro
n
Interesting! So for example I could change the run.py script?
n
What is the run.py script?
n
kedro run
n
I think the idea is that you don't need to modify Kedro to do conditionals. Instead, you apply the conditional logic outside of Kedro. In fact this is very reasonable: as you scale out, it's likely you will split your pipeline into multiple parts and expose them through an API or services. It could be as simple as a script like this:
```python
# trigger_kedro_pipeline_a/b are placeholders for whatever launches each run
if result == "a":
    trigger_kedro_pipeline_a()
elif result == "b":
    trigger_kedro_pipeline_b()
```
Within each Kedro pipeline, it is deterministic.
👍 2
n
what if I want to visualize it though with Kedro-Viz? and visualize all the output data from the nodes that were triggered
n
You can visualise part of the pipeline, but you won't have a single view that stitches up the entire run, as you trigger multiple kedro runs.
n
very clear response, thank you so much guys, can't wait to work on real projects with it
🥳 2
To come back on your answer @Nok Lam Chan, would you have a script example that does this, to experiment a bit with this solution? I am not sure I get how to do it without creating a service, just with a Python script. Should I just run the CLI, store the results and load them? It doesn't seem efficient.
n
So there are two main ways of running Kedro: 1. the CLI, i.e. `kedro run`; 2. the Python API, where you usually create a `session` first and then do a `session.run()` (it's what `kedro run` does behind the scenes anyway)
👍 1
`session.run()` returns a dictionary of "free outputs" (sorry for the bad terminology, it's hard to explain this precisely): outputs that no downstream node consumes. But you can basically take this dictionary and do your Python conditions.
n
Sorry to ask so much, but this seems very interesting: would you have an example script? You can use some dummy variable names like pipeline_a, I will understand.
n
let me create an example, probably useful for others later too.
n
you're very nice, thank you so much
n
Sorry got distracted for a bit. You can find it here now, see the README instruction. https://github.com/noklam/kedro-example/tree/master/conditional-kedro-runs
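The gist is roughly this; a minimal sketch assuming pipelines named load_and_classify / post_processing_a / post_processing_b and a free output called "classification" (the repo has the full version):

```python
from pathlib import Path

from kedro.framework.session import KedroSession
from kedro.framework.startup import bootstrap_project

bootstrap_project(Path.cwd())

# First run: session.run() returns the "free" outputs, i.e. datasets
# that no downstream node consumes, as a dictionary
with KedroSession.create() as session:
    outputs = session.run(pipeline_name="load_and_classify")

# Plain Python condition on one of those outputs (dataset name is hypothetical)
if outputs["classification"] == "a":
    next_pipeline = "post_processing_a"
else:
    next_pipeline = "post_processing_b"

# Second run: a fresh session for the chosen pipeline
with KedroSession.create() as session:
    session.run(pipeline_name=next_pipeline)
```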
n
Thank you so much very helpful!
m
There is an alternative approach that allows you to do everything in Kedro, with a custom hook, but it only works in specific cases. We have a very specific use-case where we have to skip some nodes. In the end, we created a dedicated `kedro_env` with a specific global arg that is used in a before-pipeline-run hook that replaces the nodes to skip with a dummy node. It's not really conditional based on node output like your example, but I just wanted to share what's possible.
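A rough sketch of that idea, assuming single-output nodes and a hypothetical skip_nodes extra param; it also touches a private attribute, so treat it as illustrative rather than a supported API:

```python
from kedro.framework.hooks import hook_impl


def _noop(*args, **kwargs):
    # dummy node body: a skipped single-output node just produces None
    return None


class SkipNodesHook:
    @hook_impl
    def before_pipeline_run(self, run_params, pipeline, catalog):
        # "skip_nodes" is a made-up extra param, e.g. passed via
        # KedroSession.create(extra_params={"skip_nodes": ["node_a"]})
        to_skip = set((run_params.get("extra_params") or {}).get("skip_nodes", []))
        for node in pipeline.nodes:
            if node.name in to_skip:
                node._func = _noop  # private attribute: sketch only
```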
👀 1
n
could you elaborate on the method/use-case?
@Nok Lam Chan following your example, if the output of a pipeline is in the catalog, it will be written to disk, but then we don't get it in the results dictionary of the run. When I remove it from the catalog, I can actually see the result and check the condition. Is there a way to do both?
Also, is it possible to access the results of all the nodes?
n
@Noah Sarfati If datasets are not in the result, you need to load them with `catalog.load(<dataset_name>)`. This is because during a kedro run we try to optimise memory and throw datasets away as soon as they are not needed.
👍 1
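For example, a small sketch (pipeline and dataset names are illustrative):

```python
from pathlib import Path

from kedro.framework.session import KedroSession
from kedro.framework.startup import bootstrap_project

bootstrap_project(Path.cwd())
with KedroSession.create() as session:
    session.run(pipeline_name="load_and_classify")
    # datasets persisted through the catalog can be reloaded after the run
    catalog = session.load_context().catalog
    classification = catalog.load("classification")  # hypothetical dataset name
```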
You can do that with a custom version of the Runner, but there is no out-of-the-box solution.
o
In another entry I asked about a workaround for an issue: I'm facing trouble because the Kedro pipeline is running in Google App Engine, and `kedro run` tries to read pyproject.toml, but in App Engine the filesystem is in read-only mode. So, I was perusing the documentation and it looks like creating the session with the argument save_on_close=False could solve it. Does that make sense?
n
Are you using the session at all? You can check your settings.py. By default, if you are not using viz, it shouldn't write anything.
o
I've just tried locally with:

```python
from pathlib import Path

from kedro.framework.session import KedroSession
from kedro.framework.startup import bootstrap_project

bootstrap_project(Path.cwd())
with KedroSession.create(save_on_close=False) as session:
    session.run()
```

following this link: https://docs.kedro.org/en/stable/kedro_project_setup/session.html Thanks for the hint. I'm not using viz, because I'm just running a pipeline in Google App Engine. Even if the error ("ERROR Failed to read the file: plugin.py:111 /workspace/pyproject.toml. [Errno 30] Read-only file system: '/workspace/pyproject.toml'") doesn't stop the execution, I want to fix it, and overall to understand it well.
Also, I have to check if packaging the project as a .whl solves the issue. So far I have been just cloning the code to the container (Google App Engine).
j
@Oscar Villa what versions of the kedro and kedro-telemetry packages do you have? (if kedro-telemetry is not installed in GAE, please say so too)
o
Hi, @Juan Luis. Thanks for taking the time. I deleted the App Engine service completely and now I don't have a way to be sure of the version I used that time. I'm going to try again and be back with the packages versions and detailed errors.
🙏🏼 1
Hi, @Juan Luis. I'm back with the data: kedro~=0.19.6 and kedro-telemetry>=0.3.1. Both are in the requirements.txt used for the App Engine deployment. Regarding what @Nok Lam Chan said [here](https://kedro-org.slack.com/archives/C03RKP2LW64/p1721523395326809?thread_ts=1721307721.801689&cid=C03RKP2LW64), I was installing kedro-viz>=6.7.0. I'm waiting for Google App Engine to propagate the app to try again.
j
thanks @Oscar Villa , I suppose you don't have a full traceback? also, Kedro-telemetry is not installed?
o
I was editing/updating the message while App Engine loads...
j
okay, I suspect that kedro-telemetry is trying to modify your pyproject.toml and then you observe that "Read-only file system" error. cc @Elena Khaustova
👀 1
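If it is indeed kedro-telemetry, one thing worth trying before uninstalling it: the default project template ships a commented-out toggle in settings.py to disable a plugin's hooks, though I'm not certain it covers every write the plugin performs:

```python
# settings.py of the Kedro project
# Disable auto-registered hooks from the kedro-telemetry plugin
# (this setting exists, commented out, in the default project template)
DISABLE_HOOKS_FOR_PLUGINS = ("kedro-telemetry",)
```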
o
Yes, it was the issue: if I remove telemetry from the requirements.txt, the error disappears. Thanks a lot, @Juan Luis, for your support. I can go ahead with Kedro as our DS framework in a clean fashion 🚀.
🥳 1
j
opened https://github.com/kedro-org/kedro-plugins/issues/781 to track this, thanks @Oscar Villa!
✅ 1