Kedro is an open-source Python framework for creating maintainable and modular data science code.

Hey,
I'm trying to better understand when it's reasonable / feasible to use *`ParallelRunner`* instead of the default *`SequentialRunner`*.

*Are those conclusions correct?*
1. Worst case scenario, *`ParallelRunner`* would just yield the same speed as *`SequentialRunner`*. It can't produce different results, and it manages the execution order so that if a node expects outputs from several other nodes, it waits until all of them have been generated.
2. *`ParallelRunner`* shines when a pipeline does many similar operations on some already-available input, and it's just a matter of compute time to do each of those operations. In other words, those operations do not sequentially depend on each other. Likely, a pipeline consisting of a few namespace pipelines is a good candidate for that runner.
*And a question:*
3. When would you avoid using *`ParallelRunner`* ?

I think technically the worst-case scenario of ParallelRunner is ever so slightly worse than Sequential, since there is overhead from pooling, splitting and reconciling the processes, but it shouldn't be noticeable.
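
To make the ordering guarantee from point 1 concrete, here is a toy sketch of dependency-aware parallel execution. It is not Kedro's implementation: `run_in_dependency_order` and the `nodes` dictionary layout are made up for illustration, and it uses a thread pool from the standard library rather than the separate processes that `ParallelRunner` uses, just to keep the example short.

```python
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait


def run_in_dependency_order(nodes, max_workers=4):
    """Run a toy pipeline, starting each node only once all its inputs exist.

    `nodes` maps a node name to (function, list of input node names).
    Returns (results by node name, list of names in completion order).
    Hypothetical helper for illustration only; not part of Kedro.
    """
    results = {}
    completion_order = []
    pending = dict(nodes)
    running = {}  # future -> node name

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        while pending or running:
            # A node is ready when every one of its inputs has been produced.
            ready = [
                name
                for name, (_, inputs) in pending.items()
                if all(dep in results for dep in inputs)
            ]
            for name in ready:
                func, inputs = pending.pop(name)
                future = pool.submit(func, *[results[dep] for dep in inputs])
                running[future] = name

            if not running:
                raise ValueError("Some nodes have inputs that are never produced")

            # Block until at least one running node finishes, then record it.
            done, _ = wait(list(running), return_when=FIRST_COMPLETED)
            for future in done:
                name = running.pop(future)
                results[name] = future.result()
                completion_order.append(name)

    return results, completion_order


# "a" and "b" are independent and may run at the same time;
# "c" needs both, so it is only submitted after both have finished.
toy_nodes = {
    "a": (lambda: 1, []),
    "b": (lambda: 2, []),
    "c": (lambda a, b: a + b, ["a", "b"]),
}
toy_results, toy_order = run_in_dependency_order(toy_nodes)
```

The results are the same as a sequential run would give; only independent nodes overlap in time. The bookkeeping in the loop (submitting, waiting, collecting) is also where the small extra overhead comes from.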

On #3: in distributed execution contexts like Spark, Snowpark or Dask, you should use ThreadRunner instead.
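
The rules of thumb from this thread could be written down roughly like this. `suggest_runner` and its parameters are hypothetical and not part of Kedro; the function only returns the runner's name, which can then be passed on the command line, e.g. `kedro run --runner=ThreadRunner`.

```python
def suggest_runner(uses_distributed_engine: bool, independent_branches: int) -> str:
    """Return a runner name following the rules of thumb from this thread.

    Hypothetical helper for illustration; not part of Kedro.
    """
    if uses_distributed_engine:
        # Spark, Snowpark, Dask, etc.: the advice above is to use ThreadRunner.
        return "ThreadRunner"
    if independent_branches > 1:
        # Several branches that don't depend on each other can overlap.
        return "ParallelRunner"
    # A purely linear pipeline gains nothing from parallelism.
    return "SequentialRunner"
```

Treat this as a starting point rather than a rule: whether `ParallelRunner` actually pays off still depends on how heavy each node is compared with the process overhead mentioned above.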