Clement
01/10/2024, 10:19 AM
Merel
01/10/2024, 2:41 PM
ParallelRunner? https://docs.kedro.org/en/stable/nodes_and_pipelines/run_a_pipeline.html#parallelrunner
Clement
01/10/2024, 2:49 PM
Merel
01/12/2024, 4:29 PM
Deepyaman Datta
01/12/2024, 4:39 PM
> Is there a way to parallelize a node inside kedro and somehow scale this node in the cloud?
I feel like I kind of get what you're looking for (a lot of orchestrators let you provide per-node resourcing), but there's no direct way to do something like this in Kedro.
> I have thought of parallelization inside the node but it doesn't seem best practice.
I think it's not that bad an idea. If you're using PartitionedDataset, your code inside the node is already tailored towards handling partitions, so further modifying it to load in parallel isn't a bad idea.
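To make the "load in parallel inside the node" idea concrete, here is a minimal sketch. It assumes the node receives what PartitionedDataset passes on load: a dictionary mapping partition IDs to zero-argument load functions. The function name `process_partitions`, the `max_workers` parameter, and the fake partition names are hypothetical, chosen only for illustration. A thread pool is one option among several and mostly helps when loading is I/O-bound.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Any, Callable, Dict


def process_partitions(
    partitions: Dict[str, Callable[[], Any]],
    max_workers: int = 4,
) -> Dict[str, Any]:
    """Load every partition concurrently and return the data keyed by partition ID."""
    partition_ids = sorted(partitions)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Each value is a zero-argument load function; calling it triggers the read.
        loaded = list(pool.map(lambda pid: partitions[pid](), partition_ids))
    return dict(zip(partition_ids, loaded))


# Stand-in for the input a node would receive (hypothetical partition names and data).
fake_partitions = {
    "2024-01-10": lambda: [1, 2, 3],
    "2024-01-12": lambda: [4, 5],
}
result = process_partitions(fake_partitions)
```

In a real pipeline the function would be registered as a node whose input is the partitioned dataset. The threads only parallelize work inside that one node; they do not change how Kedro schedules the node itself.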
There's also the --async flag for kedro run, which loads and saves multiple datasets in parallel. Unfortunately, it doesn't apply to the individual partitions inside a PartitionedDataset. However, we could consider extending it to process partitions in parallel?
Last but not least, and this isn't exactly using PartitionedDataset, you could consider something like Dask (or another distributed processing framework). It depends on whether you need to process these partitions in parallel, too. In a Dask DataFrame, each partition is a pandas DataFrame.