Rob
09/25/2023, 2:40 PM
I'm using Kedro 0.17.7.
Context:
- I have 10 nodes that execute tasks which depend on each other, and I want to parallelize the process without having to refactor the PySpark code.
- I want to parallelize each of the processes running on the nodes using tuples (category_id, scenario_id), running one node in parallel, then the next one, and so on.
Proposal:
- The only idea I have in mind is to use a multiprocessing pool that maps over the (category_id, scenario_id) indexes, just like Python's multiprocessing Pool + functools.partial does (a sketch follows below).
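For reference, a minimal sketch of that Pool + functools.partial idea; run_task, model_version, and the example pairs are hypothetical placeholders, not code from the actual project:
```python
# A minimal sketch of the multiprocessing Pool + functools.partial approach.
# `run_task` and `model_version` are made-up placeholders.
from functools import partial
from multiprocessing import Pool

def run_task(pair, model_version):
    category_id, scenario_id = pair
    # the real per-(category_id, scenario_id) work would go here
    return f"category={category_id}, scenario={scenario_id}, model={model_version}"

if __name__ == "__main__":
    pairs = [(1, "a"), (1, "b"), (2, "a"), (2, "b")]
    worker = partial(run_task, model_version="v1")  # fix the shared argument
    with Pool(processes=4) as pool:
        results = pool.map(worker, pairs)  # fan out over the index tuples
    print(results)
```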
I'm exploring Kedro's own solutions, but so far I've only found the ParallelRunner, and I'm not sure how it works: whether it only parallelizes the execution of independent nodes, or whether indexes can be defined to run multiple executions of the same node.
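(To make that last question concrete, here is a hedged sketch of what "multiple executions of the same node" could look like in Kedro: one node instance per (category_id, scenario_id) pair, generated when the pipeline is assembled. All dataset, node, and function names below are made up.)
```python
# A hedged sketch: build one node instance per (category_id, scenario_id)
# pair so that a runner can schedule the independent instances.
# Dataset and node names here are hypothetical.
from kedro.pipeline import Pipeline, node

def make_score_node(category_id, scenario_id):
    def score(df):
        # placeholder task logic for a single (category_id, scenario_id) pair
        return df
    return node(
        score,
        inputs="input_table",
        outputs=f"scores_{category_id}_{scenario_id}",
        name=f"score_{category_id}_{scenario_id}",
    )

pairs = [(1, "a"), (1, "b"), (2, "a")]
pipeline = Pipeline([make_score_node(c, s) for c, s in pairs])
print(pipeline.describe())
```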
Thanks for your help in advance!
Nok Lam Chan
09/25/2023, 2:57 PM
You probably want the ThreadRunner, because Spark doesn't do anything on your local computer (where Kedro gets executed); the computation happens in the Spark cluster itself.
"I have 10 nodes that execute tasks, which depend on each other"
I am not sure about this: if there are dependencies between the nodes, then you cannot parallelize them. Maybe I misunderstood the question.
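A tiny self-contained sketch (made-up nodes, no Spark) of what ThreadRunner does with independent nodes; in a project the CLI equivalent is `kedro run --runner=ThreadRunner`:
```python
# ThreadRunner executing two independent nodes concurrently (Kedro 0.17/0.18-
# style imports). On a real PySpark project the threads mostly wait on the
# cluster, which is why ThreadRunner is usually enough. Names are made up.
from kedro.io import DataCatalog, MemoryDataSet
from kedro.pipeline import Pipeline, node
from kedro.runner import ThreadRunner

def double(x):
    return x * 2

def triple(x):
    return x * 3

pipeline = Pipeline(
    [
        node(double, inputs="x", outputs="doubled", name="double_node"),
        node(triple, inputs="x", outputs="tripled", name="triple_node"),
    ]
)

catalog = DataCatalog({"x": MemoryDataSet(10)})
print(ThreadRunner(max_workers=2).run(pipeline, catalog))
```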
Rob
09/25/2023, 3:02 PM
Nok Lam Chan
09/25/2023, 3:11 PM