# questions
r
Hi everyone, I was wondering if you could help me choose the best way to structure my Kedro pipeline (`0.17.7`).

Context:
- I have 10 nodes that execute tasks which depend on each other, and I want to parallelize the process without having to refactor into PySpark.
- I want to parallelize the work inside each node over tuples of (`category_id`, `scenario_id`), running one node in parallel, then the next one, and so on.

Proposal:
- The only idea I have in mind is to use a multiprocessing pool that maps over the (`category_id`, `scenario_id`) indexes, just like a Python `multiprocessing.Pool` + `functools.partial` does (see the sketch below).

I'm exploring Kedro's own solutions, but so far I have only found the ParallelRunner, and I'm not sure how it works: whether it only parallelizes the execution of nodes, or whether indexes can be defined to run multiple executions of the same node. Thanks for your help in advance!
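Roughly, this is the kind of pattern I have in mind (just a minimal sketch; `process_one`, `run_node` and the column names are placeholders, not actual code from my project):

```python
from functools import partial
from multiprocessing import Pool

import pandas as pd


def process_one(index, data):
    """Run the task for a single (category_id, scenario_id) pair."""
    category_id, scenario_id = index
    subset = data[
        (data["category_id"] == category_id) & (data["scenario_id"] == scenario_id)
    ]
    # ... the actual transformation for this slice would go here ...
    return subset


def run_node(data: pd.DataFrame) -> pd.DataFrame:
    """One Kedro node: fan out over all (category_id, scenario_id) pairs."""
    # name=None makes itertuples yield plain tuples, which pickle cleanly
    # when they are sent to the worker processes.
    indexes = data[["category_id", "scenario_id"]].drop_duplicates().itertuples(
        index=False, name=None
    )
    with Pool() as pool:
        results = pool.map(partial(process_one, data=data), list(indexes))
    return pd.concat(results, ignore_index=True)
```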
n
For a Spark workflow you need to use the `ThreadRunner`, because Spark doesn't do the computation on your local machine (where Kedro is executed); the computation happens in the Spark cluster itself.
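For reference, you pick the runner when launching the run, e.g.:

```
kedro run --runner=ThreadRunner
```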
> I have 10 nodes that execute tasks, which depend on each other

I'm not sure about this: if there are dependencies between the nodes, you can't parallelize them against each other. Maybe I misunderstood the question.
r
Currently all of my code is in plain Python, which is why I thought of using a multiprocessing pool and avoiding Spark. And you are right, there are dependencies: node 2 uses the outputs of node 1, and so on. So for a situation like this, ParallelRunner is not an option, right?
n
Indeed, ParallelRunner works for independent tasks (nodes with no data dependencies between them), which isn't applicable to your case.
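If it helps, one common compromise is to keep the ten nodes sequential in the pipeline and let each node fan out over the (`category_id`, `scenario_id`) pairs internally with a multiprocessing pool, as in your sketch above. The wiring would look roughly like this (dataset and function names are placeholders):

```python
from kedro.pipeline import Pipeline, node


# Placeholder node functions; in practice each one would fan out over the
# (category_id, scenario_id) pairs with a multiprocessing.Pool internally.
def step_1(raw_data):
    return raw_data


def step_2(step_1_output):
    return step_1_output


def create_pipeline(**kwargs) -> Pipeline:
    # Kedro infers the execution order from the input/output dependencies,
    # so the nodes run one after another (with any runner), while the work
    # inside each node is parallelized by the pool.
    return Pipeline(
        [
            node(step_1, inputs="raw_data", outputs="step_1_output", name="step_1"),
            node(step_2, inputs="step_1_output", outputs="step_2_output", name="step_2"),
            # ... the remaining nodes follow the same pattern ...
        ]
    )
```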
thankyou 1