Rob 09/25/2023, 2:40 PM
Context:
- I have 10 nodes that execute tasks which depend on each other, and I want to parallelize the process without having to refactor the code in PySpark.
- I want to parallelize each of the processes running on the nodes using tuples, running one node in parallel, then the next one, and so on.

Proposal:
- The only idea I have in mind is to use a multiprocessing pool that maps over indexes, just like the approach linked below does. I'm exploring Kedro's solutions, but so far I've only found ParallelRunner, and I'm not sure how it works: whether it only parallelizes the execution of nodes, or whether indexes can be defined to run multiple executions of the same node. Thanks for your help in advance!
Python Multiprocessing Pool + functools (partial)
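(A minimal sketch of the multiprocessing-pool idea above; `run_task`, its config, and the index range are hypothetical placeholders, not Kedro or Spark API:)

```python
from functools import partial
from multiprocessing import Pool

def run_task(index: int, config: dict) -> int:
    """Hypothetical task: process the partition identified by `index`."""
    return index * config["multiplier"]

if __name__ == "__main__":
    config = {"multiplier": 10}
    # partial() freezes the shared config so Pool.map only has to
    # supply the varying index for each call.
    worker = partial(run_task, config=config)
    with Pool(processes=4) as pool:
        # Executes run_task(0, config=...), run_task(1, config=...), ...
        # across the worker processes.
        results = pool.map(worker, range(10))
    print(results)  # [0, 10, 20, ..., 90]
```

(This runs the same function many times over different indexes, which is the "multiple executions of the same node" behaviour being asked about.)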
Nok Lam Chan 09/25/2023, 2:57 PM
Spark doesn't do anything on your local computer (where Kedro gets executed); the computation happens in the Spark cluster itself.
"I have 10 nodes that execute tasks, which depend on each other"
I'm not sure about this: if there are dependencies between the nodes, then you cannot parallelize them. Maybe I misunderstood the question.
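(For reference, an editor's sketch, not from the thread: ParallelRunner can be selected with `kedro run --runner=ParallelRunner` or programmatically, and it only runs nodes concurrently when neither depends on the other's outputs; dependent nodes still execute in order. A minimal programmatic sketch, assuming it is run from the root of an existing Kedro project:)

```python
from pathlib import Path

from kedro.framework.session import KedroSession
from kedro.framework.startup import bootstrap_project
from kedro.runner import ParallelRunner

# Assumes the current working directory is a Kedro project root.
bootstrap_project(Path.cwd())

with KedroSession.create(project_path=Path.cwd()) as session:
    # ParallelRunner runs nodes in separate processes as soon as their
    # inputs are ready; a strictly sequential chain of 10 dependent
    # nodes gains nothing from it.
    session.run(runner=ParallelRunner())
```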
Rob 09/25/2023, 3:02 PM
Nok Lam Chan 09/25/2023, 3:11 PM