# questions
r
Hi All. I'd like to run in parallel on, let's say, 100 nodes. I need to use a grid system (such as LSF, UGE, or SGE) or SSH. It must be 100% on-prem (there won't be a connection to the internet after install). I don't even understand the basics. I assume Kedro launches workers. Would that be done using a custom runner? I hope I'm clear; I know it's an uncommon use case.
d
A custom runner should work. You can take inspiration from https://docs.kedro.org/en/stable/deployment/dask.html (or the `ThreadRunner` or `ParallelRunner` implementations; they're all rather similar). I don't have any familiarity with these grid systems, but it seems you can even run Dask Distributed code on them, so using `DaskRunner` on top might also be a possibility.
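For intuition, here is a framework-free sketch of what such a runner boils down to: repeatedly find nodes whose inputs are available and hand them to some executor. The `ThreadPoolExecutor` below is a stand-in for the distributed backend; in a real grid deployment, the `submit` call is roughly where an LSF/SGE `bsub`/`qsub` or SSH invocation would go. The `(name, func, inputs, outputs)` tuples are placeholders, not Kedro's actual `Node` API.

```python
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait

# A "node" here is just (name, func, inputs, outputs) -- a stand-in for
# kedro.pipeline.node, not the real thing.
NODES = [
    ("load", lambda data: {"raw": [1, 2, 3]}, [], ["raw"]),
    ("clean", lambda data: {"clean": [x * 2 for x in data["raw"]]}, ["raw"], ["clean"]),
    ("sum", lambda data: {"total": sum(data["clean"])}, ["clean"], ["total"]),
]

def run(nodes, max_workers=4):
    catalog = {}          # in-memory stand-in for a data catalog
    done_outputs = set()  # datasets that have been produced so far
    pending = list(nodes)
    futures = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        while pending or futures:
            # Submit every node whose inputs are all available ("ready" nodes).
            ready = [n for n in pending if set(n[2]) <= done_outputs]
            for n in ready:
                pending.remove(n)
                # In a grid deployment, this submit would instead shell out
                # to e.g. `bsub`/`qsub`, or ssh to a worker machine.
                futures[pool.submit(n[1], dict(catalog))] = n
            done, _ = wait(futures, return_when=FIRST_COMPLETED)
            for fut in done:
                n = futures.pop(fut)
                catalog.update(fut.result())
                done_outputs.update(n[3])
    return catalog

print(run(NODES))  # includes "total": 12
```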
r
That is helpful. I'm afraid now I don't understand a runner. If I have a pipeline, which is a chain of nodes to execute, and then I `kedro run` it, what is running where, and how often? 🙂 I believe the specified runner is executed once on the local machine and passed an in-memory `Pipeline` object. That runner then loops through `pipeline.nodes` and then does "something". What is that something? Would it be starting a process on a remote machine for each node? Presumably a pipeline has dependent nodes and isn't a "straight line" from start to finish. Given that, why does the runner do a simple loop through the nodes? Thanks for any guidance. I know I'm not explaining it very well.
Ah, the dependencies can't be handled by `pipeline.nodes` by itself. It looks like you also need to add logic similar to: https://github.com/kedro-org/kedro/blob/62431bb5b0e5e097a53bfb338c689bedfe285408/kedro/runner/parallel_runner.py#L268C14-L284 I still don't understand what is run on the worker.
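The linked code essentially builds a node-to-parent-nodes map from each node's declared inputs and outputs. A minimal reconstruction of that bookkeeping (plain tuples instead of Kedro's `Node` objects, which expose `.inputs` and `.outputs` similarly):

```python
def node_dependencies(nodes):
    """Map each node name to the set of node names that must finish first.

    `nodes` is a list of (name, inputs, outputs) tuples -- a stand-in
    for Kedro's Node objects, not the real API.
    """
    producer = {}  # dataset name -> name of the node that produces it
    for name, _inputs, outputs in nodes:
        for out in outputs:
            producer[out] = name
    deps = {}
    for name, inputs, _outputs in nodes:
        # Free inputs (no producer) come from outside the pipeline.
        deps[name] = {producer[i] for i in inputs if i in producer}
    return deps

nodes = [
    ("load", [], ["raw"]),
    ("clean", ["raw"], ["clean"]),
    ("train", ["clean"], ["model"]),
    ("eval", ["clean", "model"], ["report"]),
]
print(node_dependencies(nodes))
# {'load': set(), 'clean': {'load'}, 'train': {'clean'}, 'eval': {'clean', 'train'}}
```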
l
d
> That runner then loops through pipeline.nodes and then does "something". What is that something? Would it be starting a process on a remote machine for each node?
It depends on the runner implementation, but often this amounts to submitting a task in the distributed environment.
> Presumably a pipeline has dependent nodes and isn't a "straight line" from start to finish. Given that, why does the runner do a simple loop through the nodes?
I can try to answer better later today, but if you look at most runner code (e.g. see https://docs.kedro.org/en/stable/_modules/kedro/runner/thread_runner.html#ThreadRunner), you'll find that it processes a list of nodes with dependencies, and chooses the next node from a set of "ready" nodes.
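That "set of ready nodes" pattern is exactly what the standard library's `graphlib.TopologicalSorter` provides out of the box (using a made-up four-node DAG below; this is an illustration of the scheduling idea, not Kedro's internals):

```python
from graphlib import TopologicalSorter

# deps: node -> set of nodes it depends on (a hypothetical example DAG)
deps = {"clean": {"load"}, "train": {"clean"}, "eval": {"clean", "train"}}

ts = TopologicalSorter(deps)
ts.prepare()
order = []
while ts.is_active():
    ready = ts.get_ready()  # all nodes whose dependencies have completed
    # A parallel runner would submit every node in `ready` at once;
    # here we just record and complete them one by one.
    for node in ready:
        order.append(node)
        ts.done(node)
print(order)  # ['load', 'clean', 'train', 'eval']
```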
r
@Laurens Vijnck No, not an option. This is internal-only for now, but ultimately I'd like it to be sellable, and we have little control over end-user systems. Plus, corporate infrastructure has most machines on LSF.
@Deepyaman Datta I got it. Yes, I see that it only loops through the nodes that are ready. Now it makes sense. `SequentialRunner` is still a bit confusing: it looks like it does just loop through the nodes, although maybe I need to study the code a bit more: kedro/runner/sequential_runner.py
d
`pipeline.nodes` is toposorted, which is why you can just iterate through it
👍 1
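To see why plain iteration is safe once the list is toposorted, here is a toy sequential runner over placeholder `(name, func, inputs, outputs)` tuples (not Kedro's implementation): by the time each node runs, everything it consumes has already been produced.

```python
def sequential_run(toposorted_nodes):
    """Run nodes in list order; correct only because the list is toposorted.

    Each node is (name, func, inputs, outputs) -- placeholder tuples,
    not Kedro's Node API.
    """
    catalog = {}  # in-memory stand-in for a data catalog
    for name, func, inputs, outputs in toposorted_nodes:
        missing = [i for i in inputs if i not in catalog]
        # Toposorting guarantees this never fires.
        assert not missing, f"{name} ran before its inputs {missing}"
        catalog.update(func(catalog))
    return catalog

nodes = [
    ("load", lambda c: {"raw": [3, 4]}, [], ["raw"]),
    ("square", lambda c: {"sq": [x * x for x in c["raw"]]}, ["raw"], ["sq"]),
    ("total", lambda c: {"total": sum(c["sq"])}, ["sq"], ["total"]),
]
print(sequential_run(nodes)["total"])  # 25
```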
n
To be clear, `pipeline.nodes` defines the toposorted order. You can then try to parallelise it with different strategies via the runner. With `ParallelRunner`, a subprocess processes each node while the main process keeps track of the overall DAG.
👍 1