# questions
g
I have a question about large data sets. I have a collection of shapefiles that together take up more space than I have RAM, though no single file is larger than my RAM. My question isn't about shapefiles per se, as I have followed [this workaround](https://github.com/kedro-org/kedro/issues/695#issuecomment-1188291881) and it seems to work fine. Rather, it's the fact that I won't be able to hold them all in memory at once that concerns me. The project is essentially a feature extraction tool: each data set needs to be loaded and processed. My non-Kedro version of this processes each file one at a time to avoid memory constraints. But if I make distinct nodes to process this GIS data, I'm wondering whether Kedro will detect that it cannot do all of the jobs at once. Are there any best practices for handling these large files?
i
So they're individual files which you need to process individually? What will each of those "processes" return? Another individual file/dataframe? [Partitioned datasets](https://docs.kedro.org/en/stable/data/kedro_io.html#partitioned-dataset) might be useful for you
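A minimal sketch of how a node could consume one, assuming a `PartitionedDataset` catalog entry pointing at your shapefile folder (the `build_graph` helper here is hypothetical, standing in for your feature-extraction step):

```python
from typing import Any, Callable

import networkx as nx


def build_graph(data: Any) -> nx.Graph:
    """Stand-in for the real feature-extraction step."""
    return nx.Graph()


def extract_graphs(
    partitions: dict[str, Callable[[], Any]],
) -> dict[str, nx.Graph]:
    """Process a PartitionedDataset input one partition at a time.

    Kedro passes the input as a dict mapping partition IDs to load
    callables, so a file is only read when its callable is invoked.
    """
    results = {}
    for partition_id, load_partition in partitions.items():
        data = load_partition()  # load just this one shapefile
        results[partition_id] = build_graph(data)
        # `data` goes out of scope each iteration, keeping memory flat
    return results
```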
šŸ‘ 2
g
@Iñigo Hidalgo Yeah, they're individual files that need to be processed individually. Each process would return a `networkx.Graph` object, and then I would save it in `graphml` format. Since the output is a weighted graph, I could switch to a different format, including a dataframe representing the weighted edge list. I will look at the partitioned data set.
i
Without being familiar with the specifics of networkx, it seems like a good use case for the partitioned dataset
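For the output side, a second `PartitionedDataset` entry can write one file per partition (I believe kedro-datasets ships a networkx GraphML dataset you could use as the underlying type, but check what your version offers). And in recent Kedro versions, if your node returns callables instead of graphs, the partitions are saved lazily, so each graph is only built right before it's written out. Rough sketch, with `build_graph` again a hypothetical stand-in:

```python
from typing import Any, Callable

import networkx as nx


def build_graph(data: Any) -> nx.Graph:
    """Stand-in for the real feature-extraction step."""
    return nx.Graph()


def extract_graphs_lazily(
    partitions: dict[str, Callable[[], Any]],
) -> dict[str, Callable[[], nx.Graph]]:
    """Return callables so Kedro saves each partition lazily.

    When the returned dict values are callables, Kedro invokes each
    one at save time, so a graph is only materialised right before
    its partition is written.
    """

    def make_builder(load_partition: Callable[[], Any]) -> Callable[[], nx.Graph]:
        # factory function avoids the late-binding closure pitfall
        return lambda: build_graph(load_partition())

    return {pid: make_builder(load) for pid, load in partitions.items()}
```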
šŸ‘ 1
n
šŸ‘ 1