#questions

Diego Lira

09/20/2023, 9:47 PM
Hi all, I'm trying to optimize the memory footprint of a Kedro pipeline. Is there a way to drop MemoryDatasets after the execution of some node?
👍 1

Juan Luis

09/20/2023, 10:04 PM
hi @Diego Lira! in principle such MemoryDatasets are discarded after nodes are done with them, are you observing evidence of the contrary? also, are you running the nodes sequentially or using some of the parallel runners?
👍 1

Deepyaman Datta

09/20/2023, 10:25 PM
(Just to add, Kedro does not optimize for any of this in determining what order to run nodes in.)

Diego Lira

09/20/2023, 11:14 PM
@Juan Luis So after the last node using a memory dataset runs, they're automatically released instead of lingering until the end of the pipeline? Interesting...I'll see if shifting a few things solves my problem
👍 1

Deepyaman Datta

09/21/2023, 4:44 AM
@Diego Lira Yes. If you're curious, the runner basically maintains a count of how many times each dataset still needs to be loaded, and decrements it each time the dataset actually is loaded, so it knows when a dataset won't be loaded again: https://github.com/kedro-org/kedro/blob/cb51a8a725dc415fd9f397726011fdf0a5175a9c/kedro/runner/sequential_runner.py#L76-L83 If it's a "free output", it will never be discarded and will always be kept in memory (since it's returned at the end).
👍 3
👍🏼 1
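
A minimal sketch of the counting scheme described above, in plain Python rather than Kedro's actual runner code. The toy `NODES` list, the `run_sequentially` helper, and the dict-based `store` are all hypothetical names invented for illustration. The assumption is that each node has one output and that nodes are already in run order.

```python
from collections import Counter
from itertools import chain

# Hypothetical toy pipeline: (function, input names, output name), in run order.
NODES = [
    (lambda raw: [x * 2 for x in raw], ["raw"], "doubled"),
    (lambda doubled: sum(doubled), ["doubled"], "total"),
    (lambda total: total + 1, ["total"], "result"),
]


def run_sequentially(nodes, store):
    """Run nodes in order and drop intermediate data once no later node loads it."""
    all_inputs = set(chain.from_iterable(ins for _, ins, _ in nodes))
    all_outputs = {out for _, _, out in nodes}
    # Datasets supplied from outside the pipeline are never dropped here.
    pipeline_inputs = all_inputs - all_outputs
    # How many times each dataset still has to be loaded.
    load_counts = Counter(chain.from_iterable(ins for _, ins, _ in nodes))
    released = []
    for func, ins, out in nodes:
        store[out] = func(*(store[name] for name in ins))
        for name in ins:
            load_counts[name] -= 1
            # Nobody will load this dataset again, so it can be released.
            if load_counts[name] < 1 and name not in pipeline_inputs:
                del store[name]
                released.append(name)
    return released


store = {"raw": [1, 2, 3]}
released = run_sequentially(NODES, store)
print(released)       # intermediate datasets dropped along the way
print(sorted(store))  # what is still held at the end of the run
```

In this sketch `result` is never consumed by another node, so it plays the role of a "free output" and stays in `store` until the end, while `doubled` and `total` are dropped as soon as their last consumer has run.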

Nok Lam Chan

09/21/2023, 8:32 AM
Seconding Deepyaman, it uses a simple form of reference counting. CachedDataset may help in some cases if you run into OOM issues, because loading data, especially with pandas, bumps up memory for a short time. https://noklam.github.io/blog/posts/2021-07-02-kedro-datacatalog.html
👍 1
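
To illustrate the general idea behind a caching wrapper like the one mentioned above, here is a toy sketch. It is not Kedro's implementation: `CountingSource` and `ToyCachedDataset` are hypothetical classes written for this example, and the source simply counts how often it is asked to load.

```python
class CountingSource:
    """Hypothetical stand-in for a dataset that is expensive to load."""

    def __init__(self, data):
        self._data = data
        self.load_calls = 0

    def load(self):
        self.load_calls += 1
        return list(self._data)  # fresh copy, similar to re-reading a file


class ToyCachedDataset:
    """Wraps a source and keeps the first loaded copy in memory."""

    def __init__(self, source):
        self._source = source
        self._cache = None
        self._has_cache = False

    def load(self):
        # Only hit the underlying source when nothing is cached yet.
        if not self._has_cache:
            self._cache = self._source.load()
            self._has_cache = True
        return self._cache

    def release(self):
        # Drop the in-memory copy so it can be garbage collected.
        self._cache = None
        self._has_cache = False


source = CountingSource([1, 2, 3])
cached = ToyCachedDataset(source)
cached.load()
cached.load()
calls_before_release = source.load_calls
cached.release()
cached.load()
calls_after_release = source.load_calls
```

The trade-off in this sketch is that repeated loads reuse one in-memory copy instead of going back to the source each time, at the cost of holding that copy until `release()` is called.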