# questions
n
Hi everyone! I am new to Kedro and I am wondering how we can run pipelines conditionally in a scalable way. For example, let's say I have 4 pipelines a, b, c, d. First we run a, then depending on its outputs we either run b or c->d. Related question: can we end the run early based on the output of a pipeline?
n
Hey, welcome!
Do you have a specific example, or are you just playing around? The quick answer is you cannot do this. It may sound disappointing, but Kedro was built with a focus on robust pipelines. Dynamic pipelines are generally unpredictable in nature, and we advise against them unless really necessary; usually there are better ways to make the pipeline deterministic instead.
There are workarounds, but they are basically 2-step (or multi-step) approaches (https://demo.kedro.org/). In general, Kedro expects an execution graph that is determined before any node gets executed.
n
For now I am learning, but the example I am working on requires me to: 1. load the data, 2. classify the data, 3. depending on the classification, apply post-processing a or b. I know I could create a single pipeline for the two post-processings, but it doesn't sound scalable, e.g. if after the classification I want to apply 10 nodes.
j
you can trigger the Kedro pipelines conditionally though! pseudocode:
```python
import json
import subprocess

import boto3

# run the first pipeline, then branch on a result it persisted (S3 location is illustrative)
subprocess.run(["kedro", "run", "--pipeline", "load_and_classify"], check=True)
body = boto3.client("s3").get_object(Bucket="results", Key="result.json")["Body"]
if json.load(body)["post_processing"] == "a":
    subprocess.run(["kedro", "run", "--pipeline", "post_processing_a"], check=True)
else:
    subprocess.run(["kedro", "run", "--pipeline", "post_processing_b"], check=True)
```
👍🏼 1
👍 1
in other words, the lack of conditional nodes in Kedro should not stop you from applying conditional logic outside of Kedro
n
Interesting! So for example I could change the run.py script?
n
What is the run.py script?
n
kedro run
n
I think the idea is that you don't need to modify Kedro to do conditionals. Instead, you apply the conditional logic outside of Kedro. In fact this is very reasonable: as you scale out, it's likely you will split your pipeline into multiple parts and expose them through an API or services. It could be as simple as a script like this:
```python
# trigger_kedro_pipeline_a/b are placeholders for whatever launches each run
if result == "a":
    trigger_kedro_pipeline_a()
elif result == "b":
    trigger_kedro_pipeline_b()
```
Within each Kedro pipeline, it is deterministic.
👍 2
n
what if I want to visualize it though with Kedro-Viz? and visualize all the output data from the nodes that were triggered
n
You can visualise part of the pipeline, but you won't have a single view that stitches up the entire run, as you trigger multiple kedro runs.
n
very clear response, thank you so much guys, can't wait to work on real projects with it
🥳 2
To come back on your answer @Nok Lam Chan, would you have a script example that does this, to experiment a bit with this solution? I am not sure I get how to do it without creating a service, just with a Python script. Should I just run the CLI, store the results and load them? It doesn't seem efficient.
n
So there are two main ways of running Kedro: 1. the CLI, i.e. `kedro run`; 2. the Python API, where you usually create a `session` first and then do a `session.run()` (it's what `kedro run` does behind the scenes anyway)
👍 1
`session.run()` returns a dictionary of "free outputs" (sorry for the bad terminology, it's hard to explain this precisely): outputs that no downstream node consumes. But you can basically take this dictionary and do your Python conditions.
n
Sorry to ask so much, but this seems very interesting: would you have an example script? You can use some dummy variable names like pipeline_a, I will understand.
n
let me create an example, probably useful for others later too.
n
you're very nice, thank you so much
n
Sorry got distracted for a bit. You can find it here now, see the README instruction. https://github.com/noklam/kedro-example/tree/master/conditional-kedro-runs
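The gist is roughly this; a minimal sketch assuming pipelines named load_and_classify / post_processing_a / post_processing_b and a free output called "classification" (the repo has the full version):

```python
from pathlib import Path

from kedro.framework.session import KedroSession
from kedro.framework.startup import bootstrap_project

bootstrap_project(Path.cwd())

# First run: session.run() returns the "free" outputs, i.e. datasets
# that no downstream node consumes, as a dictionary
with KedroSession.create() as session:
    outputs = session.run(pipeline_name="load_and_classify")

# Plain Python condition on one of those outputs (dataset name is hypothetical)
if outputs["classification"] == "a":
    next_pipeline = "post_processing_a"
else:
    next_pipeline = "post_processing_b"

# Second run: a fresh session for the chosen pipeline
with KedroSession.create() as session:
    session.run(pipeline_name=next_pipeline)
```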
n
Thank you so much very helpful!
m
There is an alternative approach that allows you to do everything in Kedro, with a custom hook, but it only works in specific cases. We have a very specific use-case where we have to skip some nodes. In the end, we created a dedicated `kedro_env` with a specific global arg that is used in a before-pipeline-run hook that replaces the nodes to skip with a dummy node. It's not really conditional based on node output like your example, but I just wanted to share what's possible.
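A rough sketch of that idea, assuming single-output nodes and a hypothetical skip_nodes extra param; it also touches a private attribute, so treat it as illustrative rather than a supported API:

```python
from kedro.framework.hooks import hook_impl


def _noop(*args, **kwargs):
    # dummy node body: a skipped single-output node just produces None
    return None


class SkipNodesHook:
    @hook_impl
    def before_pipeline_run(self, run_params, pipeline, catalog):
        # "skip_nodes" is a made-up extra param, e.g. passed via
        # KedroSession.create(extra_params={"skip_nodes": ["node_a"]})
        to_skip = set((run_params.get("extra_params") or {}).get("skip_nodes", []))
        for node in pipeline.nodes:
            if node.name in to_skip:
                node._func = _noop  # private attribute: sketch only
```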
👀 1
n
could you elaborate on the method/use-case?
@Nok Lam Chan following your example, if the output of a pipeline is in the catalog, it will be written to disk, but then we don't get it in the results dictionary of the run. When I remove it from the catalog, I can actually see the result and check the condition. Is there a way to do both?
Also, is it possible to access the results of all the nodes?
n
@Noah Sarfati If datasets are not in the result, you need to load them with `catalog.load(<dataset_name>)`. This is because during a kedro run we try to optimise memory and throw datasets away as soon as they are not needed.
👍 1
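For example, a small sketch (pipeline and dataset names are illustrative):

```python
from pathlib import Path

from kedro.framework.session import KedroSession
from kedro.framework.startup import bootstrap_project

bootstrap_project(Path.cwd())
with KedroSession.create() as session:
    session.run(pipeline_name="load_and_classify")
    # datasets persisted through the catalog can be reloaded after the run
    catalog = session.load_context().catalog
    classification = catalog.load("classification")  # hypothetical dataset name
```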
You can do that with a custom version of the Runner, but there is no out-of-the-box solution.
o
In another entry I asked about a workaround for an issue: I'm facing trouble because the Kedro pipeline is running in Google App Engine, and `kedro run` tries to read pyproject.toml, but in App Engine the filesystem is in read-only mode. So, I was perusing the documentation and it looks like creating the session with the argument save_on_close=False could solve it. Does that make sense?
n
Are you using the session at all? You can check your settings.py. By default, if you are not using viz, it shouldn't write anything.
o
I've just tried locally with:

```python
from pathlib import Path

from kedro.framework.session import KedroSession
from kedro.framework.startup import bootstrap_project

bootstrap_project(Path.cwd())
with KedroSession.create(save_on_close=False) as session:
    session.run()
```

following this link: https://docs.kedro.org/en/stable/kedro_project_setup/session.html Thanks for the hint. I'm not using viz, because I'm just running a pipeline in Google App Engine. Even if the error ("ERROR Failed to read the file: plugin.py:111 /workspace/pyproject.toml. [Errno 30] Read-only file system: '/workspace/pyproject.toml'") doesn't stop the execution, I want to fix it, and overall to understand it well.
Also, I have to check if packaging the project as a .whl solves the issue. So far I have been just cloning the code to the container (Google App Engine).
j
@Oscar Villa what versions of the kedro and kedro-telemetry packages do you have? (if kedro-telemetry is not installed in GAE, please say so too)
o
Hi, @Juan Luis. Thanks for taking the time. I deleted the App Engine service completely and now I don't have a way to be sure of the version I used that time. I'm going to try again and be back with the packages versions and detailed errors.
🙏🏼 1
Hi, @Juan Luis. I'm back with the data: kedro~=0.19.6 and kedro-telemetry>=0.3.1. Both are in the requirements.txt used for the App Engine deployment. Regarding what @Nok Lam Chan said [here](https://kedro-org.slack.com/archives/C03RKP2LW64/p1721523395326809?thread_ts=1721307721.801689&cid=C03RKP2LW64), I was installing kedro-viz>=6.7.0. I'm waiting for Google App Engine to propagate the app to try again.
j
thanks @Oscar Villa , I suppose you don't have a full traceback? also, Kedro-telemetry is not installed?
o
I was editing/updating the message while App Engine loads...
j
okay, I suspect that kedro-telemetry is trying to modify your pyproject.toml and then you observe that "Read-only file system" error. cc @Elena Khaustova
👀 1
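If it is indeed kedro-telemetry, one thing worth trying before uninstalling it: the default project template ships a commented-out toggle in settings.py to disable a plugin's hooks, though I'm not certain it covers every write the plugin performs:

```python
# settings.py of the Kedro project
# Disable auto-registered hooks from the kedro-telemetry plugin
# (this setting exists, commented out, in the default project template)
DISABLE_HOOKS_FOR_PLUGINS = ("kedro-telemetry",)
```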
o
Yes, it was the issue: if I remove telemetry from the requirements.txt, the error disappears. Thanks a lot, @Juan Luis, for your support. I can go ahead with Kedro as our DS framework in a clean fashion 🚀.
🥳 1
j
opened https://github.com/kedro-org/kedro-plugins/issues/781 to track this, thanks @Oscar Villa!
✅ 1