# user-research

Nok Lam Chan

03/13/2024, 5:59 PM
Do you use
`kedro run --runner ParallelRunner`
to speed up your pipeline? If not, why not? (Other than Spark not working with multiprocessing.)
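For reference, a rough sketch of the programmatic equivalent of that command, assuming Kedro's standard session API (the worker count is illustrative):

```python
# Sketch only: roughly equivalent to `kedro run --runner ParallelRunner`.
# ParallelRunner executes independent nodes in separate processes, so datasets
# passed between them need to be multiprocessing-friendly (picklable).
from kedro.framework.session import KedroSession
from kedro.framework.startup import bootstrap_project
from kedro.runner import ParallelRunner

bootstrap_project(".")  # assumes we are inside a Kedro project root
with KedroSession.create() as session:
    session.run(runner=ParallelRunner(max_workers=4))  # worker count is arbitrary
```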

Matthias Roels

03/13/2024, 7:28 PM
Because libraries like xgboost, scikit-learn + joblib, polars, … already use parallel processing… I do use the async loading of catalog entries often, though!
👍 2
👀 2
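A rough sketch of the async loading mentioned here, assuming the `kedro run --async` flag and the runners' `is_async` argument available in recent Kedro versions:

```python
# Sketch only: programmatic equivalent of `kedro run --async`, assuming recent
# Kedro versions. is_async=True loads node inputs and saves node outputs in
# background threads while the nodes themselves still run one at a time.
from kedro.framework.session import KedroSession
from kedro.framework.startup import bootstrap_project
from kedro.runner import SequentialRunner

bootstrap_project(".")  # assumes we are inside a Kedro project root
with KedroSession.create() as session:
    session.run(runner=SequentialRunner(is_async=True))
```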

marrrcin

03/14/2024, 8:59 AM
Because multiprocessing is 💩
😂 3

Piotr Grabowski

03/14/2024, 10:38 AM
Because when working on a typical single-GPU machine you don't want multiple nodes trying to access the GPU at the same time, which leads to CUDA crashes. This is specific to GPU-heavy workflows, though.

Nok Lam Chan

03/14/2024, 11:34 AM
@Piotr Grabowski good point about the GPU, and there is no finer-grained control over which nodes should access it. @Matthias Roels good point about async, and libraries are already handling parallelism themselves. Though I remember pandas was pretty bad at using all your cores.
It sounds like most people don't really need `ParallelRunner` and rely on the libraries themselves instead (more flexible, I guess). My feeling is that `async`/`CachedDataset`/`kedro-accelerator` are more likely to bring performance gains. Cc @Deepyaman Datta
👍 2
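For context, a rough sketch of the caching idea, assuming the `CachedDataset` wrapper in `kedro.io` (dataset name and file path are made up; older releases spell it `CachedDataSet`):

```python
# Sketch only: CachedDataset keeps a dataset in memory after the first load,
# so repeated catalog.load() calls within a run skip the disk round-trip.
from kedro.io import CachedDataset, DataCatalog
from kedro_datasets.pandas import CSVDataset  # import path varies by version

catalog = DataCatalog(
    {
        "companies": CachedDataset(
            dataset=CSVDataset(filepath="data/01_raw/companies.csv"),  # illustrative path
        ),
    }
)

df = catalog.load("companies")  # first load reads the CSV from disk
df = catalog.load("companies")  # second load is served from the in-memory cache
```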

Janick Spirig

03/14/2024, 12:30 PM
For partitioned datasets I often use async file loading from the catalog too, which works very well.
💯 1
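A rough sketch of that pattern, assuming `PartitionedDataset` from `kedro-datasets` (path, suffix and per-partition dataset type are illustrative; older releases expose it as `kedro.io.PartitionedDataSet`):

```python
# Sketch only: a PartitionedDataset exposes one lazy load callable per file,
# so each partition is read only when a node actually needs it; combining this
# with `kedro run --async` lets loading overlap with computation.
from kedro_datasets.partitions import PartitionedDataset

parts = PartitionedDataset(
    path="data/01_raw/daily_exports",  # one file per partition (illustrative path)
    dataset="pandas.CSVDataset",       # dataset type used to read each partition
    filename_suffix=".csv",
)

for partition_id, load_partition in parts.load().items():
    df = load_partition()  # the file is read here, not at load() time
    print(partition_id, len(df))
```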