# questions
c
Hello, do you have any best practices concerning partitioned datasets? Is there a way to optimize the loading of many files? (It takes hours sequentially.) I have seen many plugins and helper libraries, but they seem deprecated (not updated for years). I have thought of parallelizing inside the node, but it doesn't seem like best practice. Is there a way to parallelize a node inside Kedro and somehow scale this node in the cloud? Thanks
m
c
Hello, if I understand correctly, `ParallelRunner` is used to run multiple nodes at once; the problem is that I only have one node that loads the `PartitionedDataset`. Is there a way to parallelize a specific node, or to load a `PartitionedDataset` from multiple nodes?
m
Ah I see. I'm not quite sure. Maybe @Deepyaman Datta has some ideas for this?
d
> Is there a way to parallelize a node inside Kedro and somehow scale this node in the cloud?
I think I get what you're looking for (a lot of orchestrators let you provide per-node resourcing), but there's no direct way to do something like this in Kedro.
> I have thought of parallelizing inside the node, but it doesn't seem like best practice.
I think it's not that bad an idea. If you're using `PartitionedDataset`, your code inside the node is already tailored towards handling partitions, so further modifying it to load in parallel isn't a bad idea. There's `--async` to load/save multiple datasets in parallel. Unfortunately, this won't do for partitions in a `PartitionedDataset`; however, we could consider extending it to process partitions in parallel? Last but not least, not exactly using `PartitionedDataset`, but you could consider using something like Dask (or another distributed processing framework). Depends on whether you need to process these partitions in parallel, too. In Dask, each partition is a pandas DataFrame.
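For the in-node approach, here is a minimal sketch of a node that loads the partitions concurrently. It assumes the partitions are pandas DataFrames; the function name `concat_partitions` and `max_workers=8` are illustrative choices, not part of any Kedro API:

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable

import pandas as pd


def concat_partitions(
    partitions: dict[str, Callable[[], pd.DataFrame]]
) -> pd.DataFrame:
    """Load every partition of a PartitionedDataset in parallel, then concatenate.

    Kedro passes a PartitionedDataset input to the node as a mapping of
    partition id -> load callable, so we just invoke the callables
    from a thread pool instead of one by one.
    """
    with ThreadPoolExecutor(max_workers=8) as pool:  # tune to your I/O
        frames = list(pool.map(lambda load: load(), partitions.values()))
    return pd.concat(frames, ignore_index=True)
```

Threads fit well here because loading files is mostly I/O-bound, so the workers spend their time waiting on disk or network rather than fighting over the GIL.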
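And for the Dask route, a sketch of what swapping `PartitionedDataset` for `dask.dataframe` could look like; the glob path and column names are invented for illustration:

```python
import dask.dataframe as dd

# Each matched CSV becomes one or more lazy pandas partitions.
ddf = dd.read_csv("data/01_raw/sales/*.csv")

# The graph is only planned here; compute() executes it in parallel
# (locally by default, or on a cluster via a distributed scheduler).
result = ddf.groupby("store_id")["amount"].sum().compute()
```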