# plugins-integrations

Till

06/21/2023, 6:25 AM
Hi Kedro Community 🙂 I have a question regarding the Kedro pipeline plugins, especially https://github.com/getindata/kedro-azureml. We tried to apply the kedro-vertexai plugin but ran into the following bottlenecks:
1. There was a vast IO overhead that our large pipelines were going to introduce. Essentially, the performance of our pipelines suffers if every node needs to read from and write to disk over the network.
2. The other thing is that we're using a Kedro extension called `multi-runner`, which wraps around the catalog, and I'm not sure how compatible it is. I can skim through the plugin implementation to see how it works.
Now I am assuming that these issues might be similar for all getindata plugins. So here are my three questions: Is there any way to work around these issues? Is it maybe possible to read and write at the IO level only at the Kedro pipeline level, not the Kedro node level? If it's not suitable to use the plugin, can you give someone who is new to Kedro some guidance on how to interface with Azure ML instead? Thanks in advance for any help! 🙂
K 1

marrrcin

06/21/2023, 7:33 AM
Hi, thanks for reaching out!
Is there any way to work around these issues? Is it maybe possible to read and write at the IO level only at the Kedro pipeline level, not the Kedro node level?
In kedro-airflow-k8s we introduced a concept of groups for the nodes that were using Spark, so that all nodes processing Spark DataFrames could still be separate at the Kedro level, but at run time they would be merged, allowing data to be passed in memory, both with MemoryDataSets and with lazy Spark DataFrames. We were thinking about introducing the same idea of "groups" based on node tags in our other Kedro plugins, to effectively bring this feature to all of them. As of today we haven't implemented that yet, but we're keen to accept contributions on it, and we can also assist you on that path.

As for the IO between nodes: this happens in all "modern" ML stacks (Vertex AI, Azure ML, SageMaker), because the nodes in those systems are usually separate Docker containers, meaning you have to materialize the data between them. One workaround would be to squash everything into one node, but then you lose the main benefit of those managed ML tools: the ability to run large pipelines in parallel.
🙏 1
As for `multi-runner` - I know it is proprietary and I don't have access to its code base, so I cannot help you with that.
🙏 1
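The "groups" idea mentioned above could be sketched roughly like this. This is a toy illustration only (not the plugin's actual implementation, and the node names are hypothetical): nodes sharing a `group:<name>` tag are merged into one execution unit at deployment time, so data between them can stay in memory.

```python
from collections import defaultdict

# Toy model of tag-based grouping. Nodes sharing a "group:<name>" tag
# are merged into one execution unit; ungrouped nodes stay on their own.
nodes = [
    {"name": "clean_df", "tags": {"group:spark"}},   # hypothetical node names
    {"name": "join_df", "tags": {"group:spark"}},
    {"name": "train_model", "tags": set()},
]

def group_nodes(nodes):
    groups = defaultdict(list)
    for node in nodes:
        group_tags = sorted(t for t in node["tags"] if t.startswith("group:"))
        # an ungrouped node becomes its own single-node group
        key = group_tags[0] if group_tags else node["name"]
        groups[key].append(node["name"])
    return dict(groups)

print(group_nodes(nodes))
# {'group:spark': ['clean_df', 'join_df'], 'train_model': ['train_model']}
```

Both Spark nodes end up in one group (one orchestrator task), while the ungrouped node remains a separate task.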

Till

06/21/2023, 10:10 AM
What we did ourselves on Vertex AI was put whole Kedro pipelines in separate containers. Do you think this can be done in a few days with the plugin? We would also need support for `multi-runner`. (At the moment I don't have enough knowledge, so I still have to do some research on that.)

marrrcin

06/21/2023, 10:49 AM
Do you think this can be done in a few days with the plugin?
What do you mean by that?

Till

06/21/2023, 10:50 AM
I am just asking how "difficult" these changes would be to make to get it running with our Kedro setup.
👍 1
If it would take several weeks, then it would be out of scope for us. We basically need to deliver something this week.

marrrcin

06/21/2023, 10:52 AM
But what do you want to achieve with Vertex AI in your use case? Parallelization, or just running it in the cloud for the sake of running it in the cloud?

Till

06/21/2023, 10:55 AM
We already have a more hacky way (without the plugin) to do it in Vertex AI. Right now we are focusing on Azure ML. And yes, both: it needs to run in the cloud and should be parallelizable (`multi-runner` is a requirement).

Nok Lam Chan

06/21/2023, 11:09 AM
Is there any way to work around these issues? Is it maybe possible to read and write at the IO level only at the Kedro pipeline level, not the Kedro node level?
For this we usually don't advise a 1:1 mapping between Kedro nodes and orchestrator nodes. The main reason is that Kedro nodes are usually smaller. It's more reasonable to deploy a Kedro modular pipeline as a "node" on whatever compute platform you are using. This way all the intermediate datasets can be passed in memory and you don't have the I/O issue.
😮 1
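The mapping described above can be sketched as follows. This is an illustrative sketch, not a plugin's real deployment code: the sub-pipeline names are hypothetical, and each modular pipeline (not each node) becomes one orchestrator task launched with the standard `kedro run --pipeline=<name>` CLI form.

```python
# Hypothetical sub-pipeline names; each one maps to a single
# orchestrator task (container), so its intermediate datasets
# can stay in memory inside that container.
sub_pipelines = ["data_processing", "feature_engineering", "training"]

def to_tasks(pipeline_names):
    # one task per modular pipeline, rather than one task per Kedro node
    return [
        {"task": name, "command": f"kedro run --pipeline={name}"}
        for name in pipeline_names
    ]

for task in to_tasks(sub_pipelines):
    print(task["command"])
```

A pipeline with 1000 nodes would then produce only three cloud tasks, with data serialization happening only at the boundaries between them.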

marrrcin

06/21/2023, 11:12 AM
Not sure how `multi-runner` "unrolls" the pipeline, because if it's static at pipeline generation time, then the orchestrator will be able to digest that. If it somehow happens at runtime, then this is uncharted territory.
How is the pipeline being launched? Is it a simple `kedro run`?

Nok Lam Chan

06/21/2023, 11:15 AM
I am not a user of multi-runner, but I've just checked with the author. Essentially it makes some updates to the catalog and constructs a larger DAG automatically (think of copying the same pipeline for different runs and doing the namespacing automatically for you).
In that sense it is still one Kedro run, and the local parallelisation bit is delegated to ParallelRunner.

marrrcin

06/21/2023, 11:26 AM
I wonder where it does this update to "construct a larger DAG" - because if it happens somewhere in the hooks, then it's too late for the plugins.

Till

06/21/2023, 1:09 PM
I am checking internally if `multi-runner` is compatible 🙂 @Nok Lam Chan I am not sure I understand your message. What do you mean by "orchestrator node"? Can you explain a bit more what you mean by "deploy a Kedro modular pipeline as a node on whatever compute platform you are using"? Do you mean deploying the pipeline in a separate container job without the plugin?
So here you go: @marrrcin "Multi-Runner does 2 main things: 1. It non-destructively updates the data catalogue to add data for your custom runs. 2. During registry initialisation, it updates the pipeline you are using with MR by duplicating and namespacing some bits (using the Kedro modular pipelines API). This means that it is compatible with any Kedro extension (like ParallelRunner) that does not do static code analysis."
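The "duplicating and namespacing" behaviour described above can be illustrated conceptually like this. This is a toy model only (not the real multi-runner code or the Kedro modular-pipelines API; dataset and function names are made up): duplicating a pipeline under different namespaces renames its datasets so the copies can coexist in one static DAG.

```python
# Toy pipeline: each entry is (input dataset, function, output dataset).
base_pipeline = [
    ("raw_data", "clean", "cleaned_data"),
    ("cleaned_data", "train", "model"),
]

def namespaced(pipeline, namespace):
    # prefix every dataset name so duplicated copies don't collide
    return [
        (f"{namespace}.{inp}", func, f"{namespace}.{out}")
        for inp, func, out in pipeline
    ]

# one static DAG containing both namespaced copies, built before the run
full_dag = namespaced(base_pipeline, "run_a") + namespaced(base_pipeline, "run_b")
print(full_dag[0])
# ('run_a.raw_data', 'clean', 'run_a.cleaned_data')
```

Because the expansion happens before execution, the resulting DAG is static and an orchestrator (or `kedro viz`) can see all the copies.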

Nok Lam Chan

06/21/2023, 1:12 PM
@Till Sorry for the confusion. What I mean is: to reduce the I/O, it's recommended to map a modular pipeline to a "task" (whatever the platform calls it). You may have a Kedro pipeline that has 1000 nodes, but you can break it down into a few sub-pipelines.
🙏 1
Each sub-pipeline will be a "task" on these platforms: Azure ML, Databricks, AWS.
🙏 1

Till

06/21/2023, 1:17 PM
And yes, @marrrcin it is launched with `kedro run`.

marrrcin

06/21/2023, 1:18 PM
And when you look at `kedro viz` - is the pipeline "static"? What I mean by that is whether all of the "multi-runner" expansions are actually materialized and visible, as if you had defined everything by hand.

Nok Lam Chan

06/21/2023, 1:19 PM
I think the DAG construction still happens before the pipeline run

marrrcin

06/21/2023, 1:19 PM
@Nok Lam Chan you mean in hooks?

Nok Lam Chan

06/21/2023, 1:19 PM
To your question: you should be able to see it in `kedro-viz`.
👍 1

marrrcin

06/21/2023, 1:21 PM
OK, so if this generation follows standard Kedro logic (the pipeline is defined fully before execution), then our plugins should be able to handle it, with the note that every Kedro node = a separate task in the target cloud's orchestration tool, which means data serialization.
K 1
As for the "groups" idea (squashing multiple Kedro nodes inside a single orchestrator node), it's here: https://github.com/getindata/kedro-airflow-k8s/blob/develop/kedro_airflow_k8s/task_group.py It's for Airflow on k8s, and so far only for Kedro <0.18.

Till

06/21/2023, 1:27 PM
So, assuming it is not in my power to change the Kedro codebase, and I unfortunately don't have the time to create a PR - do you still have some guidance on how to work around this (basically without the plugin)? I just need a solution at the moment and would love to contribute to the plugin at a later point (regarding the groups feature).

marrrcin

06/21/2023, 1:30 PM
Is it possible to launch only a single specific run of `multi-runner` from the CLI?
As a general approach, what you could do is:
1. Package your Kedro project in Docker.
2. Use the target cloud / orchestrator to launch this Docker image.
Now, depending on the granularity of parallelization that you want to achieve, you can just set different entrypoints to launch the Docker image. In our plugins, the entrypoints are usually `kedro run --node=<name of the node to run>`. So if you can run `kedro run <params for the multi-runner to launch single job>`, you will be able to parallelize on that level.
❤️ 2
Hope that makes sense
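The entrypoint approach above could be sketched like this. This is a hypothetical sketch: the image name and node names are placeholders (not from this thread), and the code only builds the `docker run` commands that an orchestrator would dispatch in parallel, one container per Kedro node.

```python
# Placeholder values for illustration only.
image = "my-registry/my-kedro-project:latest"
node_names = ["preprocess_node", "train_node"]

def launch_commands(image, node_names):
    # one container per Kedro node, using the plugin-style entrypoint
    # `kedro run --node=<name>`; coarser parallelism would swap the
    # per-node flag for a per-pipeline or multi-runner invocation
    return [
        ["docker", "run", image, "kedro", "run", f"--node={name}"]
        for name in node_names
    ]

for cmd in launch_commands(image, node_names):
    print(" ".join(cmd))
```

The granularity of parallelization is then controlled entirely by what each entrypoint runs: a single node, a sub-pipeline, or a single multi-runner job.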

Till

06/21/2023, 2:39 PM
Thanks @marrrcin this is helpful! 🙂 At the moment I am not focusing on `multi-runner` and am just trying to verify that we have a working sequence of Kedro pipelines that can also run Spark jobs.

marrrcin

06/22/2023, 7:21 AM
Feel free to reach out if you need any additional assistance
🙏 1