# plugins-integrations
t
Hi Kedro Community šŸ™‚ I have a question regarding the Kedro pipeline plugins, especially https://github.com/getindata/kedro-azureml. We tried to apply the kedro-vertexai plugin but ran into the following bottlenecks:
1. There was a vast IO overhead that our large pipelines were going to introduce. Essentially the performance of our pipelines suffers if every node needs to read from and write to disk over the network.
2. We are also using a Kedro extension called multi-runner, which wraps around the catalog, and I'm not sure how compatible it is. I can skim through the plugin implementation to see how it works.
Now I am assuming that these issues might be similar for all getindata plugins. So here are my three questions: Is there any way to work around these issues? Is it maybe possible to read & write on the IO level only at the Kedro pipeline level, not at the Kedro node level? If it's not suitable to use the plugin, can you give someone who is new to Kedro some guidance on how to interface with Azure ML instead? Thanks in advance for any help! šŸ™‚
K 1
m
Hi, thanks for reaching out!
Is there any way to work around these issues? Is it maybe possible to read & write on the IO level only at the Kedro pipeline level, not at the Kedro node level?
In kedro-airflow-k8s we introduced a concept of groups for the nodes that were using Spark, so that all nodes processing Spark DataFrames could still be separate at the Kedro level, but at run time they would be merged, allowing data to be passed in memory - both with MemoryDataSets and with lazy Spark DataFrames. We were thinking about introducing the same idea of ā€œgroupsā€ based on node tags in our other Kedro plugins, to effectively bring this feature to all of them. As of today we havenā€™t implemented that yet, but weā€™re keen to accept contributions on that - we can also assist you on that path.
As for the IO between nodes - this happens in all ā€œmodernā€ ML stacks - Vertex AI, Azure ML, SageMaker - as the nodes in those systems are usually separate Docker containers, meaning you have to materialize the data between them. One workaround would be to squish everything into one node - but then you lose the main benefit of those managed ML tools: the ability to execute large pipelines in parallel.
šŸ™ 1
As for multi-runner - I know it is proprietary and I donā€™t have access to its code base, so I cannot help you with that
šŸ™ 1
t
What we were doing ourselves on Vertex AI is that we put whole Kedro pipelines into separate containers. Do you think this can be done in a few days with the plugin? Also, we would need support for multi-runner. (At the moment I don't have enough knowledge, so I still have to do some research on that.)
m
Do you think this can be done in a few days with the plugin?
What do you mean by that?
t
I am just asking how "difficult" these changes would be to make, to get it running with our Kedro setup.
šŸ‘ 1
If it would take several weeks of time, then it would be out of scope for us. We basically need to deliver something this week
m
But what do you want to achieve with Vertex AI in your use case? Parallelization, or just running it in the cloud for the sake of running it in the cloud?
t
We already have a more hacky way (without the plugin) to do it in Vertex. Right now we are focusing on Azure ML. Yes, both: it needs to run in the cloud and should be able to be parallelized (multi-runner is a requirement)
n
Is there any way to work around these issues? Is it maybe possible to read & write on the IO level only at the Kedro pipeline level, not at the Kedro node level?
For this we usually donā€™t advise a 1:1 mapping between Kedro nodes and orchestrator nodes. The main reason is that Kedro nodes are usually smaller. Itā€™s more reasonable to deploy a Kedro modular pipeline as a ā€œnodeā€ on whatever compute platform you are using. This way all the intermediate datasets can be passed in memory and you donā€™t have the I/O issue.
šŸ˜® 1
m
Not sure how multi-runner ā€œunrollsā€ the pipeline, because if itā€™s static at pipeline generation time, then the orchestrator will be able to digest that. If itā€™s somehow done at runtime, then this is uncharted territory.
How is the pipeline being launched? Is it a simple ā€œkedro runā€?
n
I am not a user of multi-runner, but Iā€™ve just checked with the author. Essentially it does some updates to the catalog and constructs a larger DAG automatically (think of copying the same pipeline several times and doing the namespacing automatically for you)
In that sense it is still one Kedro run, and the local parallelisation bit is delegated to ParallelRunner
m
I wonder where it does this update to ā€œconstruct a larger DAGā€ - because if itā€™s somewhere in the hooks, then itā€™s too late for the plugins.
t
I am checking internally if multi-runner is compatible šŸ™‚ @Nok Lam Chan I am not sure if I understand your message. What do you mean by "Orchestrator node"? Can you specify a bit more what you mean by "deploy a Kedroā€™s modular pipeline as a Node on whatever compute platform you are using"? Do you mean deploying the pipeline in a separate container job without the plugin?
So here you go, @marrrcin: "Multi-Runner does 2 main things: 1. It non-destructively updates the data catalogue to add data for your custom runs. 2. During registry initialisation, it updates the pipeline you are using with MR by duplicating and namespacing some bits (using the Kedro modular pipelines API). This means that it is compatible with any Kedro extension (like ParallelRunner) that does not do static code analysis."
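For context, the ā€œduplicating and namespacingā€ it mentions maps onto Kedroā€™s modular pipelines API roughly like this - a hedged sketch with invented names, not multi-runnerā€™s actual code (it assumes Kedro >= 0.18, where the pipeline() helper is exposed from kedro.pipeline):
```python
from kedro.pipeline import Pipeline, node, pipeline

def train(features):
    # Placeholder training function for the sketch.
    return {"model": f"trained on {len(features)} rows"}

# One "template" pipeline...
base = Pipeline([node(train, inputs="features", outputs="model")])

# ...duplicated under different namespaces, which is roughly what multi-runner
# automates. Namespacing prefixes the dataset names, so "features" becomes
# "run_a.features", "run_b.features", and so on.
run_a = pipeline(base, namespace="run_a")
run_b = pipeline(base, namespace="run_b")

# The result is one larger DAG that is fully defined before execution, which is
# what lets orchestrator plugins (and kedro-viz) digest it.
combined = run_a + run_b
```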
n
@Till Sorry for the confusion. What I mean is that to reduce the I/O, itā€™s recommended to map a modular pipeline to a ā€œTaskā€ (whatever the platform calls it). You may have a Kedro pipeline that has 1000 nodes, but you can break it down into a few sub-pipelines
šŸ™ 1
Each sub-pipeline will be a ā€œTaskā€ on these platforms - Azure/Databricks/AWS
šŸ™ 1
t
And yes, @marrrcin it is launched with ā€œkedro runā€
m
And when you look at kedro viz - is the pipeline ā€œstaticā€? What I mean by that is whether all of the ā€œmulti-runnerā€ expansions are actually materialized and visible, as if you had defined everything by hand?
n
I think the DAG construction still happens before the pipeline run
m
@Nok Lam Chan you mean in hooks?
n
To your question, you should be able to see it in kedro-viz
šŸ‘ 1
m
OK, so if this generation follows standard Kedro logic (pipeline is fully defined before execution), then our plugins should be able to handle it, with the note that every Kedro node = a separate Task in the target cloudā€™s orchestration tool, which means data serialization between them.
K 1
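To illustrate what that serialization means in practice: any dataset crossing a Task boundary has to live somewhere both containers can reach. A minimal sketch, assuming the kedro.extras.datasets import path (Kedro < 0.19) and with the blob path made up:
```python
from kedro.io import DataCatalog, MemoryDataSet
from kedro.extras.datasets.pandas import ParquetDataSet  # Kedro < 0.19 import path

# When a whole (sub-)pipeline runs inside one container, intermediates can
# stay in memory and never touch storage:
in_process_catalog = DataCatalog({"clean_events": MemoryDataSet()})

# When each node is its own Task/container, the same dataset has to be
# materialized in shared storage between the producing and the consuming Task
# (the abfs path below is purely illustrative):
cross_task_catalog = DataCatalog(
    {
        "clean_events": ParquetDataSet(
            filepath="abfs://my-container/intermediate/clean_events.parquet"
        )
    }
)
```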
As for the ā€œgroupsā€ idea (squashing multiple Kedro nodes inside of a single orchestrator node) itā€™s here https://github.com/getindata/kedro-airflow-k8s/blob/develop/kedro_airflow_k8s/task_group.py Itā€™s for Airflow on k8s, so far only for Kedro <0.18.
t
So, assuming that it is not in my power to change the Kedro codebase, and I think I do not have the time to create a PR, unfortunately - do you still have some guidance on how to work around this (basically without the plugin)? I just need a solution at the moment and would love to contribute to the plugin at a later point (regarding the groups feature).
m
Is it possible to launch only a single, specific run of ā€œmulti-runnerā€ from the CLI?
In general, what you could do is:
1. Package your Kedro project in Docker.
2. Use the target cloud / orchestrator to launch this Docker image.
Now, depending on the granularity of ā€œparallelizationā€ that you want to achieve, you can just set different entrypoints to launch the Docker image. In our plugins, the entrypoints are usually ā€œkedro run --node=<name of the node to run>ā€. So if you can run ā€œkedro run <params for the multi-runner to launch single job>ā€, you will be able to parallelize on that level.
ā¤ļø 2
Hope that makes sense
t
Thanks @marrrcin, this is helpful! šŸ™‚ At the moment I am not focusing on multi-runner, and am just trying to verify that we have a working sequence of Kedro pipelines that are also able to run Spark jobs
m
Feel free to reach out if you need any additional assistance
šŸ™ 1