# questions
c
Hello, do you have any best practices concerning partitioned datasets? Is there a way to optimize the loading of many files? (It takes hours sequentially.) I have seen many plugins and helper libraries, but they seem deprecated (not updated for years). I have thought of parallelizing inside the node, but it doesn't seem like best practice. Is there a way to parallelize a node inside Kedro and somehow scale this node in the cloud? Thanks
m
c
Hello, if I understand correctly, `ParallelRunner` is used to run multiple nodes at once; the problem is that I only have one node that loads the `PartitionedDataset`. Is there a way to parallelize a specific node, or to load a `PartitionedDataset` from multiple nodes?
m
Ah I see. I'm not quite sure. Maybe @Deepyaman Datta has some ideas for this?
d
> Is there a way to parallelize a node inside Kedro and somehow scale this node in the cloud?
I think I get what you're looking for (a lot of orchestrators let you provide per-node resourcing), but there's no direct way to do something like this in Kedro.
> I have thought of parallelizing inside the node, but it doesn't seem like best practice.
I think it's not that bad an idea. If you're using `PartitionedDataset`, your code inside the node is already tailored towards handling partitions, so further modifying it to load in parallel isn't a bad idea. There's `--async` to load/save multiple datasets in parallel. Unfortunately, this won't do for partitions in a `PartitionedDataset`; however, we could consider extending it to process partitions in parallel? Last but not least, not exactly using `PartitionedDataset`, but you could consider using something like Dask (or another distributed processing framework). Depends on whether you need to process these partitions in parallel, too. In Dask, each partition is a pandas DataFrame.
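For the in-node approach, here is a minimal sketch of a node that loads the partitions concurrently. It assumes the partitions are pandas DataFrames; the function name `concat_partitions` and `max_workers=8` are illustrative choices, not part of any Kedro API:

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable

import pandas as pd


def concat_partitions(
    partitions: dict[str, Callable[[], pd.DataFrame]]
) -> pd.DataFrame:
    """Load every partition of a PartitionedDataset in parallel, then concatenate.

    Kedro passes a PartitionedDataset input to the node as a mapping of
    partition id -> load callable, so we just invoke the callables
    from a thread pool instead of one by one.
    """
    with ThreadPoolExecutor(max_workers=8) as pool:  # tune to your I/O
        frames = list(pool.map(lambda load: load(), partitions.values()))
    return pd.concat(frames, ignore_index=True)
```

Threads fit well here because loading files is mostly I/O-bound, so the workers spend their time waiting on disk or network rather than fighting over the GIL.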
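And for the Dask route, a sketch of what swapping `PartitionedDataset` for `dask.dataframe` could look like; the glob path and column names are invented for illustration:

```python
import dask.dataframe as dd

# Each matched CSV becomes one or more lazy pandas partitions.
ddf = dd.read_csv("data/01_raw/sales/*.csv")

# The graph is only planned here; compute() executes it in parallel
# (locally by default, or on a cluster via a distributed scheduler).
result = ddf.groupby("store_id")["amount"].sum().compute()
```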