Hello sorry for the question which might sound very beginner Kedro #questions

Claire M.

08/25/2023, 3:55 PM

Hello, sorry for the question, which might sound very beginner level, but even though I read the documentation and searched for keywords here in Slack, I haven't found an answer yet. I have a big dataset (e.g., in csv format, but the same problem happens with other formats or data types) that can't fit in memory, so I need to save it in chunks. To save it in chunks I need to run a for loop, which means that my function (node) won't return until the for loop is finished. As a result, I end up saving csv chunks inside the for loop and my node doesn't output anything. When the next node in the pipeline needs access to the group of csv files, I can use PartitionedDatasets very conveniently as input (the folder path is listed in the catalog yaml, of course). As you can see, the final result is that my pipeline has two consecutive nodes which are not connected, because the first node doesn't have an output (its output is not handled by kedro looking up the folder in the catalog). How can I deal with big datasets that don't fit in memory and need to be written with a for loop? Why can I read a PartitionedDataset but not write one? Something tells me that I'm missing some major python or kedro capability, so please enlighten me 😇

Nok Lam Chan

08/25/2023, 5:22 PM

A PartitionedDataSet can be both read and written.

Nok Lam Chan

08/25/2023, 5:23 PM

I suppose you do the chunk save inside the for-loop? PartitionedDataSet supports lazy saving.

Nok Lam Chan

08/25/2023, 5:23 PM

https://docs.kedro.org/en/stable/kedro.io.PartitionedDataset.html#kedro.io.PartitionedDataset

Nok Lam Chan

08/25/2023, 5:24 PM

Lazy Saving: https://docs.kedro.org/en/stable/data/kedro_io.html#partitioned-dataset-lazy-saving
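
For anyone following the link, here is a minimal sketch of the lazy-saving pattern, assuming the node's output name is mapped to a PartitionedDataSet entry in the catalog. Instead of writing files inside the loop, the node returns a dict that maps partition names to callables, and each callable is only invoked when that partition is saved. The names `load_chunk`, `make_partitions`, and `CHUNK_SIZE` are hypothetical, and plain lists of dicts stand in for whatever the underlying dataset type expects (for example a pandas DataFrame for a csv dataset).

```python
from functools import partial
from typing import Callable, Dict, List

CHUNK_SIZE = 3  # hypothetical chunk size, kept tiny for illustration


def load_chunk(start: int, stop: int) -> List[dict]:
    """Hypothetical stand-in for producing one chunk of the big dataset."""
    return [{"row": i, "value": i * i} for i in range(start, stop)]


def make_partitions(n_rows: int = 8) -> Dict[str, Callable[[], List[dict]]]:
    """Node function: returns partition name -> callable, not the data itself."""
    partitions: Dict[str, Callable[[], List[dict]]] = {}
    for start in range(0, n_rows, CHUNK_SIZE):
        stop = min(start + CHUNK_SIZE, n_rows)
        # partial binds start/stop now; a bare `lambda: load_chunk(start, stop)`
        # would see only the values from the last loop iteration.
        partitions[f"chunk_{start:05d}"] = partial(load_chunk, start, stop)
    return partitions


# Stand-in for save time (an assumption for this demo, not Kedro code):
# each callable is invoked one at a time, so only one chunk is materialised.
parts = make_partitions()
sizes = {name: len(fn()) for name, fn in parts.items()}
```

With this shape the first node has a real output, so the next node can take the same catalog entry as its input and the two nodes are connected in the pipeline.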

Claire M.

08/28/2023, 5:52 PM

Thank you! I'll look into the proper implementation of lazy saving
