Rob
09/25/2023, 2:40 PM
Hi everyone! I have a question about parallelizing node execution (I'm on Kedro 0.17.7).
Context:
- I have 10 nodes that execute tasks which depend on each other, and I want to parallelize the process without having to refactor anything in PySpark.
- I want to parallelize each of the processes running on the nodes using (category_id, scenario_id) tuples, running one node in parallel, then the next one, and so on.
Proposal:
- The only idea I have in mind is to use a multiprocessing pool that maps over (category_id, scenario_id) tuples, just like Python's multiprocessing Pool + functools.partial does (see the sketch below).
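Roughly this (a minimal sketch; run_task, the shared config, and the ID values are placeholders, not my real code):
```python
# Minimal sketch: fan one task out over (category_id, scenario_id)
# tuples with a process pool. All names here are placeholders.
from functools import partial
from multiprocessing import Pool

def run_task(shared_config: dict, category_id: int, scenario_id: int) -> str:
    # Stand-in for the real work a node would do for one (category, scenario) pair.
    return f"done: category={category_id}, scenario={scenario_id}"

if __name__ == "__main__":
    pairs = [(1, 10), (1, 11), (2, 10), (2, 11)]  # (category_id, scenario_id)
    task = partial(run_task, {"env": "dev"})      # bind shared config up front
    with Pool(processes=4) as pool:
        results = pool.starmap(task, pairs)       # one call per tuple, in parallel
    print(results)
```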
I'm exploring Kedro's solutions, but so far I've only found ParallelRunner, and I'm not sure how it works: whether it only parallelizes the execution of nodes, or whether indexes can be defined to run multiple executions of the same node.
Thanks for your help in advance!
Nok Lam Chan
09/25/2023, 2:57 PM
You should use ThreadRunner, because Spark doesn't do anything on your local computer (where Kedro gets executed); the computation happens in the Spark cluster itself.
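A minimal sketch of switching to ThreadRunner (assuming a standard Kedro project; "my_package" is a placeholder for your project's package name):
```python
# Run the pipeline with ThreadRunner instead of the default SequentialRunner.
# "my_package" is a placeholder for your project's package name.
from kedro.framework.session import KedroSession
from kedro.runner import ThreadRunner

with KedroSession.create("my_package") as session:
    session.run(runner=ThreadRunner())
```
Or simply `kedro run --runner=ThreadRunner` from the CLI.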
Nok Lam Chan
09/25/2023, 2:58 PM
> I have 10 nodes that execute tasks, which depend on each other
I am not sure about this; if there are dependencies then you cannot parallelize them. Maybe I misunderstood the question.
Rob
09/25/2023, 3:02 PM
Nok Lam Chan
09/25/2023, 3:11 PM