# user-research

Nok Lam Chan

03/13/2024, 5:59 PM
Do you use
`kedro run --runner ParallelRunner`
to speed up your pipeline? If not, why not? (Other than Spark not working with multiprocessing.)
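For reference, a rough sketch of the programmatic equivalent of that command, assuming Kedro's standard session API (the worker count is illustrative):

```python
# Sketch only: roughly equivalent to `kedro run --runner ParallelRunner`.
# ParallelRunner executes independent nodes in separate processes, so datasets
# passed between them need to be multiprocessing-friendly (picklable).
from kedro.framework.session import KedroSession
from kedro.framework.startup import bootstrap_project
from kedro.runner import ParallelRunner

bootstrap_project(".")  # assumes we are inside a Kedro project root
with KedroSession.create() as session:
    session.run(runner=ParallelRunner(max_workers=4))  # worker count is arbitrary
```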

Matthias Roels

03/13/2024, 7:28 PM
Because libraries like xgboost, scikit-learn + joblib, polars, … already use parallel processing… I do use the async loading of catalog entries often, though!
👍 2
👀 2
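A rough sketch of the async loading mentioned here, assuming the `kedro run --async` flag and the runners' `is_async` argument available in recent Kedro versions:

```python
# Sketch only: programmatic equivalent of `kedro run --async`, assuming recent
# Kedro versions. is_async=True loads node inputs and saves node outputs in
# background threads while the nodes themselves still run one at a time.
from kedro.framework.session import KedroSession
from kedro.framework.startup import bootstrap_project
from kedro.runner import SequentialRunner

bootstrap_project(".")  # assumes we are inside a Kedro project root
with KedroSession.create() as session:
    session.run(runner=SequentialRunner(is_async=True))
```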

marrrcin

03/14/2024, 8:59 AM
Because multiprocessing is 💩
😂 3

Piotr Grabowski

03/14/2024, 10:38 AM
Because when working on a typical single-GPU machine you don't want multiple nodes trying to access the GPU at the same time, which leads to CUDA crashes. This is specific to GPU-heavy workflows, though.

Nok Lam Chan

03/14/2024, 11:34 AM
@Piotr Grabowski good point about the GPU, and there is no finer-grained control over which nodes should access it. @Matthias Roels good point about async, and libraries are already handling parallelism themselves. Though I remember pandas was pretty bad at using all your cores.
It sounds like most people don't really need `ParallelRunner` and rely on the libraries themselves instead (more flexible, I guess). My feeling is that `async`/`CachedDataset`/`kedro-accelerator` are more likely to bring performance gains. Cc @Deepyaman Datta
👍 2
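For context, a rough sketch of the caching idea, assuming the `CachedDataset` wrapper in `kedro.io` (dataset name and file path are made up; older releases spell it `CachedDataSet`):

```python
# Sketch only: CachedDataset keeps a dataset in memory after the first load,
# so repeated catalog.load() calls within a run skip the disk round-trip.
from kedro.io import CachedDataset, DataCatalog
from kedro_datasets.pandas import CSVDataset  # import path varies by version

catalog = DataCatalog(
    {
        "companies": CachedDataset(
            dataset=CSVDataset(filepath="data/01_raw/companies.csv"),  # illustrative path
        ),
    }
)

df = catalog.load("companies")  # first load reads the CSV from disk
df = catalog.load("companies")  # second load is served from the in-memory cache
```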

Janick Spirig

03/14/2024, 12:30 PM
For partitioned datasets I often use async file loading from the catalog too, which works very well.
💯 1
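A rough sketch of that pattern, assuming `PartitionedDataset` from `kedro-datasets` (path, suffix and per-partition dataset type are illustrative; older releases expose it as `kedro.io.PartitionedDataSet`):

```python
# Sketch only: a PartitionedDataset exposes one lazy load callable per file,
# so each partition is read only when a node actually needs it; combining this
# with `kedro run --async` lets loading overlap with computation.
from kedro_datasets.partitions import PartitionedDataset

parts = PartitionedDataset(
    path="data/01_raw/daily_exports",  # one file per partition (illustrative path)
    dataset="pandas.CSVDataset",       # dataset type used to read each partition
    filename_suffix=".csv",
)

for partition_id, load_partition in parts.load().items():
    df = load_partition()  # the file is read here, not at load() time
    print(partition_id, len(df))
```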