Till 06/21/2023, 6:25 AM
which wraps around the catalog, and I'm not sure how compatible it is. I can skim through the plugin implementation to see how it works. Now I am assuming that these issues might be similar for all getindata plugins. So here are my three questions: Is there any way to work around these issues? Is it maybe possible to read & write on IO level only at the Kedro pipeline level, not the Kedro node level? If it's not suitable to use the plugin, could you give someone who is new to Kedro some guidance on how to interface with Azure ML instead? Thanks in advance for any help! 🙂
marrrcin 06/21/2023, 7:33 AM
> Is there any way to work around these issues? Is it maybe possible to read & write on IO level only on the Kedro Pipeline, not the Kedro node level?

In kedro-airflow-k8s we introduced a concept of groups for the nodes that were using Spark, so that all nodes processing Spark DataFrames could still be separate on the Kedro level, but at run time they would be merged, allowing data to be passed in-memory, both with MemoryDataSets and with lazy Spark DataFrames. We were thinking about introducing the same idea of "groups" based on node tags in other Kedro plugins, to effectively bring this feature to all of our plugins. As of today we haven't implemented that yet, but we're keen to accept contributions on that, and we can also assist you on that path.

As for the IO between nodes: this happens in all "modern" ML stacks (Vertex, Azure ML, SageMaker), because the nodes in those systems are usually separate Docker containers, meaning you have to materialize the data between them. One workaround would be to squash everything into one node, but then you lose the main benefit of those managed ML tools: the ability to execute large pipelines in parallel.
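The tag-based grouping marrrcin describes could be sketched roughly like this. This is a plain-Python illustration of the idea only, not the plugin's actual API; the `Node` stand-in class and the `group.` tag prefix are hypothetical:

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    # Hypothetical stand-in for a Kedro node: just a name plus a set of tags.
    name: str
    tags: set = field(default_factory=set)

def group_nodes_by_tag(nodes, prefix="group."):
    """Merge nodes that share a 'group.<name>' tag into one execution unit;
    untagged nodes become single-node groups. Nodes in the same group could
    then run in one container and pass data in-memory."""
    groups = {}
    for n in nodes:
        group_tags = sorted(t for t in n.tags if t.startswith(prefix))
        key = group_tags[0] if group_tags else n.name
        groups.setdefault(key, []).append(n.name)
    return groups

nodes = [
    Node("load", {"group.spark"}),
    Node("transform", {"group.spark"}),
    Node("report"),
]
print(group_nodes_by_tag(nodes))
# {'group.spark': ['load', 'transform'], 'report': ['report']}
```

A real implementation would also have to check that merging tagged nodes does not break the topological order of the pipeline.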
- I know it is proprietary and I don’t have access to its code base - so I cannot help you with that
Till 06/21/2023, 10:10 AM
marrrcin 06/21/2023, 10:49 AM
> Do you think this can be done in a few days with the plugin?

What do you mean by that?
Till 06/21/2023, 10:50 AM
marrrcin 06/21/2023, 10:52 AM
Till 06/21/2023, 10:55 AM
Nok Lam Chan 06/21/2023, 11:09 AM
> Is there any way to work around these issues? Is it maybe possible to read & write on IO level only on the Kedro Pipeline, not the Kedro node level?

For this we usually don't advise a 1:1 mapping between a Kedro node and an orchestrator node. The main reason is that Kedro nodes are usually smaller. It's more reasonable to deploy a Kedro modular pipeline as a "node" on whatever compute platform you are using. This way all the intermediate datasets can be passed in memory and you don't have the I/O issue.
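The in-memory hand-off Nok describes can be illustrated with a minimal sketch (plain Python, not Kedro's actual runner; the step functions are made up): when a whole modular pipeline runs as one orchestrator node, the intermediate results never touch storage.

```python
def run_as_one_node(steps, initial):
    """Run a chain of node functions in a single process: each output is
    handed directly to the next step, so intermediates stay in memory.
    Only `initial` and the final return value would be materialized by
    the orchestrator."""
    data = initial
    for step in steps:
        data = step(data)
    return data

# Hypothetical steps standing in for Kedro nodes inside one modular pipeline.
clean = lambda rows: [r for r in rows if r is not None]
double = lambda rows: [r * 2 for r in rows]

print(run_as_one_node([clean, double], [1, None, 3]))
# [2, 6]
```

If `clean` and `double` were instead mapped 1:1 to separate orchestrator nodes, the list between them would have to be written to and read back from shared storage.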
marrrcin 06/21/2023, 11:12 AM
"unrolls" the pipeline, because if it's static at pipeline-generation time, then the orchestrator will be able to digest that. If it somehow happens at run time, then this is uncharted territory.
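The "static" case marrrcin means could look like this sketch: the multi-runner variants are enumerated at pipeline-generation time, so every expanded node has a concrete name the orchestrator can see up front (the naming scheme here is hypothetical):

```python
def expand_statically(base_name, param_grid):
    """Enumerate every variant up front, as if each node had been defined
    by hand. A 'dynamic' pipeline would instead create these at run time,
    which orchestrators generally cannot digest."""
    return [
        {"name": f"{base_name}_{i}", "params": params}
        for i, params in enumerate(param_grid)
    ]

variants = expand_statically("train", [{"lr": 0.1}, {"lr": 0.01}])
print([v["name"] for v in variants])
# ['train_0', 'train_1']
```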
Nok Lam Chan 06/21/2023, 11:15 AM
marrrcin 06/21/2023, 11:26 AM
Till 06/21/2023, 1:09 PM
is compatible 🙂 @Nok Lam Chan I am not sure I understand your message. What do you mean by "orchestrator node"? Can you specify a bit more what you mean by "deploy a Kedro modular pipeline as a node on whatever compute platform you are using"? Do you mean deploying the pipeline in a separate container job, without the plugin?
Nok Lam Chan 06/21/2023, 1:12 PM
Till 06/21/2023, 1:17 PM
marrrcin 06/21/2023, 1:18 PM
- Is the pipeline "static"? What I mean by that is: are all of the "multi-runner" expansions actually materialized and visible, as if you had defined everything by hand?
Nok Lam Chan 06/21/2023, 1:19 PM
marrrcin 06/21/2023, 1:19 PM
Nok Lam Chan 06/21/2023, 1:19 PM
marrrcin 06/21/2023, 1:21 PM
Till 06/21/2023, 1:27 PM
marrrcin 06/21/2023, 1:30 PM
So if you can run
kedro run --node=<name of the node to run>
you will be able to parallelize on that level.
kedro run <params for the multi-runner to launch single job>
Till 06/21/2023, 2:39 PM
and just trying to verify that we have a working sequence of Kedro pipelines that are also able to run Spark jobs
marrrcin 06/22/2023, 7:21 AM