Hello Everyone I have a question about memory management whi Kedro #questions

Jonathan Dekermanjian

05/15/2025, 7:36 PM

Hello Everyone, I have a question about memory management while using Kedro. I have a Kedro project that consists of 2 pipelines (data_processing_pipeline & ML_pipeline). My data processing is done using Spark, which gets initialized with Kedro hooks. At the end of my data_processing pipeline, the results are written to disk as a SparkDataset. Now, my issue is that when I execute a kedro run, and Kedro is done with the data_processing pipeline and is executing the ML pipeline, the Spark session is still holding on to the memory it utilized during the processing. I know this because 20 minutes into the ML portion I can kill the Spark worker from the Spark UI, and this releases a significant amount of memory. My question is this: how do I tell Kedro to release objects that are no longer needed (the dataset is not used beyond the data_processing step) from memory?

Matthias Roels

05/15/2025, 8:11 PM

I think the cleanest way to do this is to split the project into 2 Kedro projects: a Spark one and a non-Spark one. Alternatively, you can use an after_node_run hook to stop the Spark session once the last node requiring Spark has completed.
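A minimal sketch of the hook approach, under some assumptions that are not from the thread: the Spark nodes carry a hypothetical "spark" tag, the class name and the stop_active_spark_session helper are made up for illustration, and the session is reachable through SparkSession.getActiveSession() (PySpark 3.0 or later). The hook counts the tagged nodes before the run starts and stops the session after the last one finishes.

```python
try:
    from kedro.framework.hooks import hook_impl
except ImportError:
    # Fallback so the sketch can be exercised without Kedro installed.
    def hook_impl(func):
        return func


def stop_active_spark_session() -> None:
    """Stop the Spark session attached to this process, if there is one."""
    # Imported lazily so the module loads even where pyspark is absent.
    from pyspark.sql import SparkSession

    session = SparkSession.getActiveSession()
    if session is not None:
        session.stop()


class StopSparkAfterLastSparkNode:
    """Hypothetical hook: stop Spark once every Spark-tagged node has run."""

    SPARK_TAG = "spark"  # assumed tag; use whatever marks your Spark nodes

    def __init__(self, stop_fn=stop_active_spark_session):
        self._stop_fn = stop_fn
        self._remaining = set()

    @hook_impl
    def before_pipeline_run(self, run_params, pipeline, catalog):
        # Record the names of all nodes in this run that need Spark.
        self._remaining = {
            node.name for node in pipeline.nodes if self.SPARK_TAG in node.tags
        }

    @hook_impl
    def after_node_run(self, node):
        # Only act when a tracked Spark node finishes.
        if node.name in self._remaining:
            self._remaining.discard(node.name)
            if not self._remaining:
                self._stop_fn()
```

To try it, the class would be registered in the project's settings.py, for example HOOKS = (StopSparkAfterLastSparkNode(),). This assumes a sequential run; with a parallel runner the bookkeeping would need more care.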

Jonathan Dekermanjian

05/15/2025, 8:30 PM

@Matthias Roels thank you for your response. I didn’t think about a hook to stop the spark session. I think I will give that a try!

Nok Lam Chan

05/19/2025, 3:33 AM

> Now, my issue is that when I execute a kedro run, and Kedro is done with the data_processing pipeline and is executing the ML pipeline, the Spark session is still holding on to the memory it utilized during the processing. I know this because 20 minutes into the ML portion I can kill the Spark worker from the Spark UI, and this releases a significant amount of memory.

Does the Spark session hold memory as long as the session is still alive?

> My question is this: how do I tell Kedro to release objects that are no longer needed (the dataset is not used beyond the data_processing step) from memory?

A Kedro node normally does not hold on to unnecessary objects. As long as there are no references left, cleaning up is left to Python's garbage collection.
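To illustrate the garbage-collection point, here is a small sketch in plain Python (no Kedro or Spark involved; the class and function names are hypothetical). It uses a weak reference to check that an object is released once the last strong reference to it is dropped. Note that this only concerns Python objects: Spark itself runs in JVM processes, so memory held there is presumably not affected by Python's collector.

```python
import gc
import weakref


class IntermediateResult:
    """Stand-in for a large object passed between nodes."""

    def __init__(self, rows):
        self.rows = rows


def released_after_last_reference_dropped() -> bool:
    result = IntermediateResult(list(range(1000)))
    probe = weakref.ref(result)  # weak reference: does not keep the object alive
    del result                   # drop the only strong reference
    gc.collect()                 # also sweep any reference cycles
    return probe() is None       # True when the object has been freed


if __name__ == "__main__":
    print(released_after_last_reference_dropped())
```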
