Till 06/21/2023, 6:25 AM
which wraps around the catalog, and I'm not sure how compatible it is. I can skim through the plugin implementation to see how it works. Now I am assuming that these issues might be similar for all getindata plugins. So here are my three questions: Is there any way to work around these issues? Is it maybe possible to read & write on IO level only at the Kedro pipeline level, not the Kedro node level? If it's not suitable to use the plugin, could you give someone who is new to Kedro some guidance on how to interface with Azure ML instead? Thanks in advance for any help! 🙂
marrrcin 06/21/2023, 7:33 AM
> Is there any way to work around these issues? Is it maybe possible to read & write on IO level only on the Kedro Pipeline, not the Kedro node level?

In kedro-airflow-k8s we introduced a concept of groups for the nodes that were using Spark, so that all nodes processing Spark DataFrames could still be separate on the Kedro level, but at run time they would be merged, allowing data to be passed in-memory, both with MemoryDataSets and with lazy Spark DataFrames. We were thinking about introducing the same idea of "groups" based on node tags in other Kedro plugins, to effectively bring this feature to all of our plugins. As of today we haven't implemented that yet, but we're keen to accept contributions on that, and we can also assist you on that path.

As for the IO between nodes: this happens in all "modern" ML stacks (Vertex, Azure ML, SageMaker), because the nodes in those systems are usually separate Docker containers, meaning you have to materialize the data between them. One workaround would be to squash everything into one node, but then you lose the main benefit of those managed ML tools: the ability to execute large pipelines in parallel.
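The tag-based grouping marrrcin describes could be sketched roughly like this. This is a plain-Python illustration of the idea only, not the plugin's actual API; the `Node` stand-in class and the `group.` tag prefix are hypothetical:

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    # Hypothetical stand-in for a Kedro node: just a name plus a set of tags.
    name: str
    tags: set = field(default_factory=set)

def group_nodes_by_tag(nodes, prefix="group."):
    """Merge nodes that share a 'group.<name>' tag into one execution unit;
    untagged nodes become single-node groups. Nodes in the same group could
    then run in one container and pass data in-memory."""
    groups = {}
    for n in nodes:
        group_tags = sorted(t for t in n.tags if t.startswith(prefix))
        key = group_tags[0] if group_tags else n.name
        groups.setdefault(key, []).append(n.name)
    return groups

nodes = [
    Node("load", {"group.spark"}),
    Node("transform", {"group.spark"}),
    Node("report"),
]
print(group_nodes_by_tag(nodes))
# {'group.spark': ['load', 'transform'], 'report': ['report']}
```

A real implementation would also have to check that merging tagged nodes does not break the topological order of the pipeline.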
- I know it is proprietary and I don’t have access to its code base - so I cannot help you with that
Till 06/21/2023, 10:10 AM
marrrcin 06/21/2023, 10:49 AM
> Do you think this can be done in a few days with the plugin?

What do you mean by that?
Till 06/21/2023, 10:50 AM
marrrcin 06/21/2023, 10:52 AM
Till 06/21/2023, 10:55 AM
Nok Lam Chan 06/21/2023, 11:09 AM
> Is there any way to work around these issues? Is it maybe possible to read & write on IO level only on the Kedro Pipeline, not the Kedro node level?

For this we usually don't advise a 1:1 mapping between a Kedro node and an orchestrator node. The main reason is that Kedro nodes are usually smaller. It's more reasonable to deploy a Kedro modular pipeline as a "node" on whatever compute platform you are using. This way all the intermediate datasets can be passed in memory and you don't have the I/O issue.
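The in-memory hand-off Nok describes can be illustrated with a minimal sketch (plain Python, not Kedro's actual runner; the step functions are made up): when a whole modular pipeline runs as one orchestrator node, the intermediate results never touch storage.

```python
def run_as_one_node(steps, initial):
    """Run a chain of node functions in a single process: each output is
    handed directly to the next step, so intermediates stay in memory.
    Only `initial` and the final return value would be materialized by
    the orchestrator."""
    data = initial
    for step in steps:
        data = step(data)
    return data

# Hypothetical steps standing in for Kedro nodes inside one modular pipeline.
clean = lambda rows: [r for r in rows if r is not None]
double = lambda rows: [r * 2 for r in rows]

print(run_as_one_node([clean, double], [1, None, 3]))
# [2, 6]
```

If `clean` and `double` were instead mapped 1:1 to separate orchestrator nodes, the list between them would have to be written to and read back from shared storage.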
marrrcin 06/21/2023, 11:12 AM
"unrolls" the pipeline, because if it's static at pipeline-generation time, then the orchestrator will be able to digest that. If it somehow happens at run time, then this is uncharted territory.
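The "static" case marrrcin means could look like this sketch: the multi-runner variants are enumerated at pipeline-generation time, so every expanded node has a concrete name the orchestrator can see up front (the naming scheme here is hypothetical):

```python
def expand_statically(base_name, param_grid):
    """Enumerate every variant up front, as if each node had been defined
    by hand. A 'dynamic' pipeline would instead create these at run time,
    which orchestrators generally cannot digest."""
    return [
        {"name": f"{base_name}_{i}", "params": params}
        for i, params in enumerate(param_grid)
    ]

variants = expand_statically("train", [{"lr": 0.1}, {"lr": 0.01}])
print([v["name"] for v in variants])
# ['train_0', 'train_1']
```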
Nok Lam Chan 06/21/2023, 11:15 AM
marrrcin 06/21/2023, 11:26 AM
Till 06/21/2023, 1:09 PM
is compatible 🙂 @Nok Lam Chan I am not sure I understand your message. What do you mean by "orchestrator node"? Can you specify a bit more what you mean by "deploy a Kedro modular pipeline as a node on whatever compute platform you are using"? Do you mean deploying the pipeline in a separate container job, without the plugin?
Nok Lam Chan 06/21/2023, 1:12 PM
Till 06/21/2023, 1:17 PM
marrrcin 06/21/2023, 1:18 PM
- Is the pipeline "static"? What I mean by that is: are all of the "multi-runner" expansions actually materialized and visible, as if you had defined everything by hand?
Nok Lam Chan 06/21/2023, 1:19 PM
marrrcin 06/21/2023, 1:19 PM
Nok Lam Chan 06/21/2023, 1:19 PM
marrrcin 06/21/2023, 1:21 PM
Till 06/21/2023, 1:27 PM
marrrcin 06/21/2023, 1:30 PM
So if you can run
kedro run --node=<name of the node to run>
you will be able to parallelize on that level.
kedro run <params for the multi-runner to launch single job>
Till 06/21/2023, 2:39 PM
and just trying to verify that we have a working sequence of Kedro pipelines that are also able to run Spark jobs
marrrcin 06/22/2023, 7:21 AM