# questions
r
Hi All. I'd like to run in parallel on, let's say, 100 nodes. I need to use a grid system (such as LSF, UGE, or SGE) or SSH. It must be 100% on-prem (there won't be a connection to the internet after install). I don't even understand the basics. I assume Kedro launches workers. Would that be done using a custom runner? I hope I'm clear; I know it's an uncommon use case.
d
A custom runner should work. You can take inspiration from https://docs.kedro.org/en/stable/deployment/dask.html (or the `ThreadRunner` or `ParallelRunner` implementations; they're all rather similar). I don't have any familiarity with these grid systems, but it seems you can even run Dask Distributed code on them, so using `DaskRunner` on top might also be a possibility.
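For intuition, here is a framework-free sketch of what such a runner boils down to: repeatedly find nodes whose inputs are available and hand them to some executor. The `ThreadPoolExecutor` below is a stand-in for the distributed backend; in a real grid deployment, the `submit` call is roughly where an LSF/SGE `bsub`/`qsub` or SSH invocation would go. The `(name, func, inputs, outputs)` tuples are placeholders, not Kedro's actual `Node` API.

```python
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait

# A "node" here is just (name, func, inputs, outputs) -- a stand-in for
# kedro.pipeline.node, not the real thing.
NODES = [
    ("load", lambda data: {"raw": [1, 2, 3]}, [], ["raw"]),
    ("clean", lambda data: {"clean": [x * 2 for x in data["raw"]]}, ["raw"], ["clean"]),
    ("sum", lambda data: {"total": sum(data["clean"])}, ["clean"], ["total"]),
]

def run(nodes, max_workers=4):
    catalog = {}          # in-memory stand-in for a data catalog
    done_outputs = set()  # datasets that have been produced so far
    pending = list(nodes)
    futures = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        while pending or futures:
            # Submit every node whose inputs are all available ("ready" nodes).
            ready = [n for n in pending if set(n[2]) <= done_outputs]
            for n in ready:
                pending.remove(n)
                # In a grid deployment, this submit would instead shell out
                # to e.g. `bsub`/`qsub`, or ssh to a worker machine.
                futures[pool.submit(n[1], dict(catalog))] = n
            done, _ = wait(futures, return_when=FIRST_COMPLETED)
            for fut in done:
                n = futures.pop(fut)
                catalog.update(fut.result())
                done_outputs.update(n[3])
    return catalog

print(run(NODES))  # includes "total": 12
```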
r
That is helpful. I'm afraid now I don't understand a runner. If I have a pipeline, which is a chain of nodes to execute, and then I `kedro run` it, what is running where, and how often? 🙂 I believe the specified runner is executed once on the local machine and passed an in-memory `Pipeline` object. That runner then loops through `pipeline.nodes` and then does "something". What is that something? Would it be starting a process on a remote machine for each node? Presumably a pipeline has dependent nodes and isn't a "straight line" from start to finish. Given that, why does the runner do a simple loop through the nodes? Thanks for any guidance. I know I'm not explaining it very well.
Ah, the dependencies can't be handled by `pipeline.nodes` by itself. It looks like you also need to add logic similar to: https://github.com/kedro-org/kedro/blob/62431bb5b0e5e097a53bfb338c689bedfe285408/kedro/runner/parallel_runner.py#L268C14-L284 I still don't understand what is run on the worker.
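The linked code essentially builds a node-to-parent-nodes map from each node's declared inputs and outputs. A minimal reconstruction of that bookkeeping (plain tuples instead of Kedro's `Node` objects, which expose `.inputs` and `.outputs` similarly):

```python
def node_dependencies(nodes):
    """Map each node name to the set of node names that must finish first.

    `nodes` is a list of (name, inputs, outputs) tuples -- a stand-in
    for Kedro's Node objects, not the real API.
    """
    producer = {}  # dataset name -> name of the node that produces it
    for name, _inputs, outputs in nodes:
        for out in outputs:
            producer[out] = name
    deps = {}
    for name, inputs, _outputs in nodes:
        # Free inputs (no producer) come from outside the pipeline.
        deps[name] = {producer[i] for i in inputs if i in producer}
    return deps

nodes = [
    ("load", [], ["raw"]),
    ("clean", ["raw"], ["clean"]),
    ("train", ["clean"], ["model"]),
    ("eval", ["clean", "model"], ["report"]),
]
print(node_dependencies(nodes))
# {'load': set(), 'clean': {'load'}, 'train': {'clean'}, 'eval': {'clean', 'train'}}
```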
l
d
> That runner then loops through pipeline.nodes and then does "something". What is that something? Would it be starting a process on a remote machine for each node?
It depends on the runner implementation, but often this amounts to submitting a task in the distributed environment.
> Presumably a pipeline has dependent nodes and isn't a "straight line" from start to finish. Given that, why does the runner do a simple loop through the nodes?
I can try to answer better later today, but if you look at most runner code (e.g. see https://docs.kedro.org/en/stable/_modules/kedro/runner/thread_runner.html#ThreadRunner), you'll find that it processes a list of nodes with dependencies, and chooses the next node from a set of "ready" nodes.
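That "set of ready nodes" pattern is exactly what the standard library's `graphlib.TopologicalSorter` provides out of the box (using a made-up four-node DAG below; this is an illustration of the scheduling idea, not Kedro's internals):

```python
from graphlib import TopologicalSorter

# deps: node -> set of nodes it depends on (a hypothetical example DAG)
deps = {"clean": {"load"}, "train": {"clean"}, "eval": {"clean", "train"}}

ts = TopologicalSorter(deps)
ts.prepare()
order = []
while ts.is_active():
    ready = ts.get_ready()  # all nodes whose dependencies have completed
    # A parallel runner would submit every node in `ready` at once;
    # here we just record and complete them one by one.
    for node in ready:
        order.append(node)
        ts.done(node)
print(order)  # ['load', 'clean', 'train', 'eval']
```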
r
@Laurens Vijnck No, not an option. This is internal-only for now, but ultimately I'd like it to be sellable, and we have little control over end-user systems. Plus, corporate infrastructure has most machines on LSF.
@Deepyaman Datta I got it. Yes, I see that it only loops through the nodes that are ready. Now it makes sense. `SequentialRunner` is still a bit confusing: it looks like it does just loop through the nodes, although maybe I need to study the code a bit more: kedro/runner/sequential_runner.py
d
`pipeline.nodes` is toposorted, which is why you can just iterate through it
👍 1
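To see why plain iteration is safe once the list is toposorted, here is a toy sequential runner over placeholder `(name, func, inputs, outputs)` tuples (not Kedro's implementation): by the time each node runs, everything it consumes has already been produced.

```python
def sequential_run(toposorted_nodes):
    """Run nodes in list order; correct only because the list is toposorted.

    Each node is (name, func, inputs, outputs) -- placeholder tuples,
    not Kedro's Node API.
    """
    catalog = {}  # in-memory stand-in for a data catalog
    for name, func, inputs, outputs in toposorted_nodes:
        missing = [i for i in inputs if i not in catalog]
        # Toposorting guarantees this never fires.
        assert not missing, f"{name} ran before its inputs {missing}"
        catalog.update(func(catalog))
    return catalog

nodes = [
    ("load", lambda c: {"raw": [3, 4]}, [], ["raw"]),
    ("square", lambda c: {"sq": [x * x for x in c["raw"]]}, ["raw"], ["sq"]),
    ("total", lambda c: {"total": sum(c["sq"])}, ["sq"], ["total"]),
]
print(sequential_run(nodes)["total"])  # 25
```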
n
To be clear, `pipeline.nodes` defines the toposorted order. You can then try to parallelise it with different strategies via the runner. With `ParallelRunner`, a subprocess processes each node while the main process keeps track of the overall DAG.
👍 1