# questions
d
Hi all, I'm trying to optimize the memory footprint of a Kedro pipeline. Is there a way to drop MemoryDatasets after the execution of some node?
👍 1
j
hi @Diego Lira! in principle such `MemoryDataset`s are discarded after nodes are done with them, are you observing evidence to the contrary? also, are you running the nodes sequentially or using one of the parallel runners?
👍 1
d
(Just to add, Kedro does not optimize for any of this in determining what order to run nodes in.)
d
@Juan Luis So after the last node using a memory dataset runs, it's automatically released instead of lingering until the end of the pipeline? Interesting... I'll see if shifting a few things solves my problem
👍 1
d
@Diego Lira Yes. If you're curious, it basically maintains a count of how many times each dataset should be loaded, and decrements it each time it actually is loaded, so it knows when it won't be loaded again: https://github.com/kedro-org/kedro/blob/cb51a8a725dc415fd9f397726011fdf0a5175a9c/kedro/runner/sequential_runner.py#L76-L83 If it's a "free output", it will never be discarded and will always be kept in memory (since it's returned at the end).
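To make the idea concrete, here is a minimal sketch of that reference-counting scheme. This is not Kedro's actual code (the linked `sequential_runner.py` is the real thing); the `run`, `nodes`, and `catalog` names here are hypothetical, just illustrating the "decrement on load, release at zero" pattern:

```python
# Sketch of load-count-based early release (NOT Kedro's real implementation).
# Each node is (input_names, output_names, func); catalog maps name -> data.
from collections import Counter

def run(nodes, catalog):
    # Count how many times each dataset will still be loaded as an input.
    load_counts = Counter(name for inputs, _, _ in nodes for name in inputs)
    for inputs, outputs, func in nodes:
        args = [catalog[name] for name in inputs]
        results = func(*args)
        catalog.update(zip(outputs, results))
        for name in inputs:
            load_counts[name] -= 1
            if load_counts[name] == 0:
                # No remaining node loads this dataset: free it now
                # instead of keeping it until the end of the pipeline.
                del catalog[name]
    # "Free outputs" (never loaded by any node) keep a count of 0 and
    # are never deleted, so they survive to be returned here.
    return catalog
```

Note that node order matters for peak memory: the sooner the last consumer of a dataset runs, the sooner it is freed, which is why reordering nodes (as suggested above) can shrink the footprint.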
👍 3
👍🏼 1
n
Seconding Deepyaman, it uses a simple form of reference counting. `CachedDataset` may help in some cases, because loading data (especially with pandas) bumps memory up for a short time, which matters if you run into OOM issues. https://noklam.github.io/blog/posts/2021-07-02-kedro-datacatalog.html
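For reference, a catalog entry using `CachedDataset` looks roughly like this. This is a hedged sketch assuming a recent Kedro version (where the class is spelled `CachedDataset` and the pandas dataset lives under `pandas.CSVDataset`); the dataset name and filepath are made up:

```yaml
# conf/base/catalog.yml (hypothetical entry)
shuttles:
  type: CachedDataset
  dataset:
    type: pandas.CSVDataset
    filepath: data/01_raw/shuttles.csv
```

The wrapped dataset is loaded from disk once and then served from the in-memory cache, so repeated loads by downstream nodes don't redo the (memory-spiky) pandas parse.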
👍 1