Hello, sorry if this sounds like a very beginner-level question, but even though I've read the documentation and searched for keywords here in Slack, I haven't found an answer yet.
I have a big dataset (e.g., in CSV format, but the same problem occurs with other formats and data types) that doesn't fit in memory, so I need to save it in chunks. To do that I have to run a for loop, which means my function (node) won't return until the loop has finished. As a result, I end up saving the CSV chunks inside the for loop and my node doesn't output anything. When the next node in the pipeline needs access to the group of CSV files, I can very conveniently use a PartitionedDataset as input (the folder path is listed in the catalog YAML, of course).
As you can see, the end result is that my pipeline has two consecutive nodes that are not connected, because the first node has no output (Kedro doesn't handle its output by looking up the folder in the catalog).
How can I deal with big datasets that don't fit in memory and need to be written with a for loop? Why can I read a PartitionedDataset but not write one? Something tells me I'm missing some major Python or Kedro capability, so please enlighten me 😇
Nok Lam Chan
08/25/2023, 5:22 PM
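A PartitionedDataSet can be both read and written.
I suppose you're saving the chunks inside the for loop? PartitionedDataSet supports lazy saving: the node returns a dictionary that maps partition IDs to callables, and each partition's data is only materialized when it is written.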
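Here is a minimal sketch of what a lazily saving node could look like. The names `build_chunk` and `make_partitions` and the chunk contents are made up for illustration. It assumes the node's output is registered in the catalog as a PartitionedDataSet pointing at your folder, so Kedro does the writing instead of your for loop. To keep the sketch dependency-free, each chunk is a plain list of dicts; with a pandas CSV dataset as the underlying dataset you would return a DataFrame from the callable instead.

```python
from functools import partial
from typing import Any, Callable, Dict, List


def build_chunk(chunk_index: int, rows_per_chunk: int) -> List[Dict[str, Any]]:
    """Hypothetical stand-in for the expensive work that produces one chunk.

    In a real pipeline this would read or compute one slice of the big
    dataset and return it (e.g. as a pandas DataFrame for a CSV dataset).
    """
    start = chunk_index * rows_per_chunk
    return [{"id": i, "value": i * i} for i in range(start, start + rows_per_chunk)]


def make_partitions(n_chunks: int, rows_per_chunk: int) -> Dict[str, Callable[[], Any]]:
    """Node function: map each partition ID to a zero-argument callable.

    Nothing is computed here. With lazy saving, each callable is invoked
    only when its partition is written, so only one chunk needs to be in
    memory at a time. `partial` binds the loop variable per partition and
    avoids the late-binding pitfall of lambdas defined in a loop.
    """
    return {
        f"chunk_{i:04d}": partial(build_chunk, i, rows_per_chunk)
        for i in range(n_chunks)
    }
```

Because the node now returns something, the next node can take the same catalog entry as input and the two nodes end up connected in the pipeline.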