# questions
g
I have a question about large data sets. I have a collection of shapefiles that together take up more space than I have RAM, though no single file is larger than my RAM. My question isn't about shapefiles per se, as I have followed [this workaround](https://github.com/kedro-org/kedro/issues/695#issuecomment-1188291881) and it seems to work fine. Rather, it's the fact that I won't be able to hold them all in memory at once that concerns me. The project is essentially a feature extraction tool: each data set needs to be loaded and processed. My non-Kedro version of this processes each file one at a time to avoid memory constraints. But if I make distinct nodes to process this GIS data, I'm wondering whether Kedro will detect that it cannot do all of the jobs at once. Are there any best practices for handling these large files?
i
So they're individual files which you need to process individually? What will each of those "processes" return? Another individual file/dataframe? [Partitioned datasets](https://docs.kedro.org/en/stable/data/kedro_io.html#partitioned-dataset) might be useful for you
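A minimal sketch of how a node could consume one, assuming a `PartitionedDataset` catalog entry pointing at your shapefile folder (the `build_graph` helper here is hypothetical, standing in for your feature-extraction step):

```python
from typing import Any, Callable

import networkx as nx


def build_graph(data: Any) -> nx.Graph:
    """Stand-in for the real feature-extraction step."""
    return nx.Graph()


def extract_graphs(
    partitions: dict[str, Callable[[], Any]],
) -> dict[str, nx.Graph]:
    """Process a PartitionedDataset input one partition at a time.

    Kedro passes the input as a dict mapping partition IDs to load
    callables, so a file is only read when its callable is invoked.
    """
    results = {}
    for partition_id, load_partition in partitions.items():
        data = load_partition()  # load just this one shapefile
        results[partition_id] = build_graph(data)
        # `data` goes out of scope each iteration, keeping memory flat
    return results
```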
šŸ‘ 2
g
@Iñigo Hidalgo Yeah, they're individual files that need to be processed individually. Each process would return a `networkx.Graph` object, and then I would save it in `graphml` format. Since the output is a weighted graph, I could switch to a different format, including a dataframe representing the weighted edge list. I will look at the partitioned data set.
i
Without being familiar with the specifics of networkx, it seems like a good use case for the partitioned dataset
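For the output side, a second `PartitionedDataset` entry can write one file per partition (I believe kedro-datasets ships a networkx GraphML dataset you could use as the underlying type, but check what your version offers). And in recent Kedro versions, if your node returns callables instead of graphs, the partitions are saved lazily, so each graph is only built right before it's written out. Rough sketch, with `build_graph` again a hypothetical stand-in:

```python
from typing import Any, Callable

import networkx as nx


def build_graph(data: Any) -> nx.Graph:
    """Stand-in for the real feature-extraction step."""
    return nx.Graph()


def extract_graphs_lazily(
    partitions: dict[str, Callable[[], Any]],
) -> dict[str, Callable[[], nx.Graph]]:
    """Return callables so Kedro saves each partition lazily.

    When the returned dict values are callables, Kedro invokes each
    one at save time, so a graph is only materialised right before
    its partition is written.
    """

    def make_builder(load_partition: Callable[[], Any]) -> Callable[[], nx.Graph]:
        # factory function avoids the late-binding closure pitfall
        return lambda: build_graph(load_partition())

    return {pid: make_builder(load) for pid, load in partitions.items()}
```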
šŸ‘ 1
n
šŸ‘ 1