# questions
r
Hi everyone, I was wondering if you could help me choose the best way to structure my Kedro pipeline (`0.17.7`).

Context:
- I have 10 nodes that execute tasks which depend on each other, and I want to parallelize the process without having to refactor into PySpark.
- I want to parallelize the work inside each node over tuples of (`category_id`, `scenario_id`), running one node in parallel, then the next one, and so on.

Proposal:
- The only idea I have in mind is to use a multiprocessing pool that maps over the (`category_id`, `scenario_id`) indexes, just like a Python `multiprocessing.Pool` + `functools.partial` does (see the sketch below).

I'm exploring Kedro's own solutions, but so far I have only found the ParallelRunner, and I'm not sure how it works: whether it only parallelizes the execution of nodes, or whether indexes can be defined to run multiple executions of the same node. Thanks for your help in advance!
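Roughly, this is the kind of pattern I have in mind (just a minimal sketch; `process_one`, `run_node` and the column names are placeholders, not actual code from my project):

```python
from functools import partial
from multiprocessing import Pool

import pandas as pd


def process_one(index, data):
    """Run the task for a single (category_id, scenario_id) pair."""
    category_id, scenario_id = index
    subset = data[
        (data["category_id"] == category_id) & (data["scenario_id"] == scenario_id)
    ]
    # ... the actual transformation for this slice would go here ...
    return subset


def run_node(data: pd.DataFrame) -> pd.DataFrame:
    """One Kedro node: fan out over all (category_id, scenario_id) pairs."""
    # name=None makes itertuples yield plain tuples, which pickle cleanly
    # when they are sent to the worker processes.
    indexes = data[["category_id", "scenario_id"]].drop_duplicates().itertuples(
        index=False, name=None
    )
    with Pool() as pool:
        results = pool.map(partial(process_one, data=data), list(indexes))
    return pd.concat(results, ignore_index=True)
```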
n
For a Spark workflow you need to use the `ThreadRunner`, because Spark doesn't do the computation on your local machine (where Kedro is executed); the computation happens in the Spark cluster itself.
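For reference, you pick the runner when launching the run, e.g.:

```
kedro run --runner=ThreadRunner
```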
> I have 10 nodes that execute tasks, which depend on each other

I'm not sure about this: if there are dependencies between the nodes, you can't parallelize them against each other. Maybe I misunderstood the question.
r
Currently all of my code is in plain Python, which is why I thought of using a multiprocessing pool and avoiding Spark. And you are right, there are dependencies: node 2 uses the outputs of node 1, and so on. So for a situation like this, ParallelRunner is not an option, right?
n
Indeed, ParallelRunner works for independent tasks (nodes with no data dependencies between them), which isn't applicable to your case.
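If it helps, one common compromise is to keep the ten nodes sequential in the pipeline and let each node fan out over the (`category_id`, `scenario_id`) pairs internally with a multiprocessing pool, as in your sketch above. The wiring would look roughly like this (dataset and function names are placeholders):

```python
from kedro.pipeline import Pipeline, node


# Placeholder node functions; in practice each one would fan out over the
# (category_id, scenario_id) pairs with a multiprocessing.Pool internally.
def step_1(raw_data):
    return raw_data


def step_2(step_1_output):
    return step_1_output


def create_pipeline(**kwargs) -> Pipeline:
    # Kedro infers the execution order from the input/output dependencies,
    # so the nodes run one after another (with any runner), while the work
    # inside each node is parallelized by the pool.
    return Pipeline(
        [
            node(step_1, inputs="raw_data", outputs="step_1_output", name="step_1"),
            node(step_2, inputs="step_1_output", outputs="step_2_output", name="step_2"),
            # ... the remaining nodes follow the same pattern ...
        ]
    )
```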
thankyou 1