Clement
01/10/2024, 10:19 AM
Merel
01/10/2024, 2:41 PM
ParallelRunner? https://docs.kedro.org/en/stable/nodes_and_pipelines/run_a_pipeline.html#parallelrunner
Clement
01/10/2024, 2:49 PM
Merel
01/12/2024, 4:29 PM
Deepyaman Datta
01/12/2024, 4:39 PM
"Is there a way to parallelize a node inside kedro and somehow scale this node in the cloud?"
I think I get what you're looking for (a lot of orchestrators let you provide per-node resourcing), but there's no direct way to do something like this in Kedro.
"I have thought of parallelization inside the node but it doesn't seem best practice."
I don't think it's that bad an idea. If you're using PartitionedDataset, your code inside the node is already tailored towards handling partitions, so modifying it further to load them in parallel is a reasonable step.
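To make that concrete, here is a minimal sketch of loading partitions in parallel inside a node. It relies on PartitionedDataset handing the node a dictionary that maps each partition ID to a load function. The function name load_partitions_in_parallel and the max_workers parameter are made up for illustration, and threads only help when loading is I/O-bound (for example, reading from cloud storage).

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Any, Callable, Dict


def load_partitions_in_parallel(
    partitions: Dict[str, Callable[[], Any]],
    max_workers: int = 4,  # hypothetical default; tune for your storage backend
) -> Dict[str, Any]:
    """Call every partition's load function in a thread pool.

    `partitions` is assumed to have the shape a node receives from a
    PartitionedDataset input: partition ID -> zero-argument load function.
    """
    partition_ids = list(partitions)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so results line up with partition_ids.
        loaded = list(pool.map(lambda pid: partitions[pid](), partition_ids))
    return dict(zip(partition_ids, loaded))
```

The node can then process the returned dictionary as usual. Swapping in a process pool would be the analogous change if the per-partition work is CPU-bound, though the load functions would then need to be picklable.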
There's also the --async flag for kedro run, which loads and saves multiple datasets in parallel. Unfortunately, it doesn't apply to the individual partitions inside a PartitionedDataset. However, we could consider extending it to process partitions in parallel?
Last but not least, not exactly using PartitionedDataset, but you could consider using something like Dask (or another distributed processing framework). Depends on whether you need to process these partitions in parallel, too. In Dask, each partition is a pandas DataFrame.