# questions
I have a question about large data sets. I have a collection of shapefiles that together take up more space than I have RAM, although no single file is larger than my RAM. My question isn't about shapefiles per se, as I have followed [this workaround](https://github.com/kedro-org/kedro/issues/695#issuecomment-1188291881) and it seems to work fine. Rather, what concerns me is that I won't be able to hold them all in memory at once. The project is essentially a feature extraction tool. Each data set needs to be loaded and processed. My non-Kedro version does this one file at a time to avoid memory constraints. But if I make distinct nodes for processing this GIS data, I am wondering whether Kedro will detect that it cannot do all of the jobs at once. Are there any best practices for handling these large files?
So they're individual files which you need to process individually? What will each of those "processes" return? Another individual file/dataframe? [Partitioned datasets](https://docs.kedro.org/en/stable/data/kedro_io.html#partitioned-dataset) might be useful for you
@Iñigo Hidalgo Yeah, they're individual files that need to be processed individually. Each process would return a graph object and then I would save it in a graph format. Since the output is a weighted graph, I could change to a different format, including a dataframe representing the weighted edge list. I will look at the partitioned data set.
Without being familiar with specifics of networkx, it seems like a good use for the partitioned dataset
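For what it's worth, here is a minimal sketch of how a single node could walk the partitions one at a time. It relies on two behaviours of Kedro's partitioned dataset as I understand them: on load, the node receives a dict mapping each partition id to a load function rather than the data itself, and on save, the dataset accepts a dict whose values are callables and only calls each one when that partition is written. The names `extract_edges` and `process_partitions`, and the record layout with `src`, `dst` and `weight` keys, are hypothetical placeholders for the real feature extraction.

```python
from typing import Any, Callable, Dict, List, Tuple

# A weighted edge as (source, target, weight); a hypothetical output shape.
Edge = Tuple[str, str, float]


def extract_edges(records: List[dict]) -> List[Edge]:
    """Hypothetical stand-in for the real feature extraction step.

    Turns the records of one loaded file into a weighted edge list.
    """
    return [(r["src"], r["dst"], float(r["weight"])) for r in records]


def process_partitions(
    partitions: Dict[str, Callable[[], Any]],
) -> Dict[str, Callable[[], List[Edge]]]:
    """Node function: map each partition id to a deferred processing task.

    Nothing is loaded here. Each returned callable loads exactly one
    partition and processes it when it is eventually called.
    """

    def make_task(load: Callable[[], Any]) -> Callable[[], List[Edge]]:
        # A factory function binds `load` per partition, avoiding the
        # late-binding pitfall of closures created inside a loop.
        def task() -> List[Edge]:
            return extract_edges(load())

        return task

    # Sorted only to make the processing order deterministic.
    return {pid: make_task(load) for pid, load in sorted(partitions.items())}
```

If the assumptions above hold, only one file's data needs to be in memory at a time, because each file is loaded inside its own task and can be released once that partition is written. The input and output of this node would both be declared as partitioned datasets in the catalog, with the underlying dataset type set to whatever reads the shapefiles and writes the graph or edge list.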