# questions
j
Hi everyone, I have a question about the ParallelRunner. I have a modular pipeline setup with lots of pipelines that I want to run in parallel. At the end of each of the modular pipelines, the computations are saved to a PostgreSQL database. What I am noticing is that even though some pipelines are completing, the memory utilization never goes down. A little more info on my setup: I am running 32 workers on a 72-core machine and I have thousands of modular pipelines that I want to run in parallel. My question is this: does the ParallelRunner hold on to the Python objects until ALL of the modular pipelines are complete?
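For reference, a run like this could be kicked off as follows (a minimal sketch using the KedroSession API; the project path is a placeholder, the worker count just mirrors the setup described above, and the exact session arguments vary a bit between Kedro versions):

```python
from kedro.framework.session import KedroSession
from kedro.framework.startup import bootstrap_project
from kedro.runner import ParallelRunner

# Placeholder project path; 32 worker processes on the 72-core machine.
bootstrap_project("/path/to/project")
with KedroSession.create(project_path="/path/to/project") as session:
    session.run(runner=ParallelRunner(max_workers=32))
```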
h
Someone will reply to you shortly. In the meantime, this might help:
d
You can see the implementation here: https://docs.kedro.org/en/stable/_modules/kedro/runner/parallel_runner.html#ParallelRunner It's also possible to subclass this and use your own runner
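For example, a minimal subclass could just change the defaults (a sketch; the module path and class name are made up, and the `ParallelRunner` constructor arguments may differ slightly between Kedro versions):

```python
# src/my_project/runners.py  (hypothetical module path)
from kedro.runner import ParallelRunner

class CappedParallelRunner(ParallelRunner):
    """ParallelRunner with a lower default worker count."""

    def __init__(self, max_workers=8, is_async=False):
        super().__init__(max_workers=max_workers, is_async=is_async)
```

Then something like `kedro run --runner=my_project.runners.CappedParallelRunner` should pick it up, since the `--runner` option accepts a dotted path to a runner class.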
can you try the `--async` flag?
how are you connecting to Postgres? Is it pandas or Ibis?
j
Thank you @datajoely, I will try the async flag and see if that does anything. I am using a custom dataset that subclasses SQLTableDataset, and I just override the save method to use Polars instead of pandas.
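Not the actual implementation, but a standalone sketch of the same idea built directly on `AbstractDataset` (class, attribute, and argument names here are made up; in Kedro 0.18.x the base class is spelled `AbstractDataSet`):

```python
import polars as pl
from kedro.io import AbstractDataset

class PolarsSQLTableDataset(AbstractDataset[pl.DataFrame, pl.DataFrame]):
    """Reads/writes a Postgres table with Polars instead of pandas."""

    def __init__(self, table_name: str, connection_uri: str):
        self._table_name = table_name
        self._connection_uri = connection_uri

    def _load(self) -> pl.DataFrame:
        # Uses connectorx/ADBC under the hood to pull the table into a Polars frame.
        return pl.read_database_uri(
            f"SELECT * FROM {self._table_name}", self._connection_uri
        )

    def _save(self, data: pl.DataFrame) -> None:
        # Writes via SQLAlchemy/ADBC; appends to the existing table.
        data.write_database(
            self._table_name, self._connection_uri, if_table_exists="append"
        )

    def _describe(self) -> dict:
        return {"table_name": self._table_name}
```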
d
oh interesting
I think if you use Ibis you can use the ThreadRunner, which will definitely be faster
n
just thinking quickly, I don't think the ParallelRunner tries to hold on to any data. In a `kedro run`, once you start the pipeline, what Kedro sees is just a bunch of nodes (there is no longer a concept of a pipeline), and the execution order is determined by solving the dependencies between nodes. If the data is a persisted type (e.g. CSVDataset), the memory for the data is released immediately after the node runs. If it's a memory type, it is released once the data has no more downstream dependencies.
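A toy illustration of that release behaviour (function and dataset names are made up):

```python
from kedro.pipeline import node, pipeline

def make_features(raw):
    return raw * 2          # placeholder transformation

def score(features):
    return features + 1     # placeholder transformation

demo = pipeline([
    # "features" is not declared in the catalog, so it becomes a MemoryDataset
    # and can be released as soon as score() (its only consumer) has run.
    node(make_features, inputs="raw", outputs="features", name="make_features"),
    # If "scores" is declared as e.g. a CSVDataset in the catalog, the data is
    # persisted to disk and the in-memory object can be dropped right after the node.
    node(score, inputs="features", outputs="scores", name="score"),
])
```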
j
Oh! @datajoely I will try that. I didn't know that the ThreadRunner is faster than the ParallelRunner. @Nok Lam Chan interesting, but then I don't know how to explain why the memory utilization never goes down. I am trying to run it now with async to see if that makes a difference. I would be happy to share my implementation, but unfortunately I work in the healthcare field, so I can't really share anything related to patient information.
d
sorry, it's only faster in certain conditions (because of Python's limitations with concurrency more generally)
j
Ah I see, well I will give it a try and see if it speeds up my use case. Thank you!
d
Because Spark and Ibis delegate their processing to other execution engines, you can use threads; in the pandas and Polars case the work happens in-process, so you need to work with the ParallelRunner on your machine
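In code terms, the choice looks roughly like this (a sketch; the worker counts are placeholders):

```python
from kedro.runner import ParallelRunner, ThreadRunner

# Nodes that only build Ibis/Spark expressions mostly wait on an external
# engine, so threads are enough and nothing has to be pickled between processes.
runner_for_delegated_work = ThreadRunner(max_workers=32)

# Nodes doing pandas/Polars work inside the Python process generally need
# separate worker processes to make use of all the cores.
runner_for_in_process_work = ParallelRunner(max_workers=32)
```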
j
I see, currently I am running on a single-node server, though eventually I think I will need to deploy this on a Kubernetes cluster. Maybe then the switch to Spark or Ibis will give me a more meaningful boost in performance.
It would be really cool if Kedro had a kedro-kubernetes plugin!
d
I think our Airflow plugin is the best way of achieving that today
but also, using Ibis makes Postgres your workhorse and delegates execution directly to the SQL engine, which handles scaling etc. there
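Something along these lines (a sketch with placeholder connection details, table, and column names; credentials would normally come from config):

```python
import os
import ibis

# Placeholder connection details.
con = ibis.postgres.connect(
    host="localhost",
    user="kedro",
    password=os.environ["POSTGRES_PASSWORD"],
    database="warehouse",
)

events = con.table("events")  # lazy table expression, no data pulled into Python
daily = events.group_by("event_date").aggregate(n=events.count())

# Executes as SQL inside Postgres; nothing is materialised in the Python process.
con.create_table("daily_counts", daily, overwrite=True)
```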
j
Yes, I think so too. I have played around with some plugins and find that to be the most reliable. However, it seems tied to a platform; I forget if it was called Astro or something. We are a Microsoft shop, so we use Azure, and all of our resources and Data Lake are on Azure. I need to find out if the plugin can play nicely with our current infrastructure.
Oh, that would actually be quite a performance boost. I will go ahead and replace the Polars SQL implementation with Ibis then. Thank you for all your help! I am still running with async and will report back on whether there are any improvements.
d
💪
n
> @Nok Lam Chan interesting, but then I don't know how to explain why the memory utilization never goes down.
does it come down when you are using the SequentialRunner (the default)? Does this problem only appear with the ParallelRunner?
j
Hey @Nok Lam Chan, thank you for that tip. It looks like even with the SequentialRunner the memory was steadily increasing. I narrowed the issue down to a potential memory leak related to JAX, or JAX might just be caching things to reduce compilation time. I need to dig into it more, but your advice certainly helped me find that issue. @datajoely the --async flag didn't change anything, but then again, if this is a memory leak it isn't expected to.
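Two quick checks I plan to use to confirm whether it's JAX holding on to the memory (assuming a reasonably recent JAX version, where both calls are available):

```python
import jax

# How many device buffers are still alive after a modular pipeline finishes;
# if this keeps growing, something is holding references to arrays.
print(len(jax.live_arrays()))

# Drop compiled-function caches between runs.
jax.clear_caches()
```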
n
Good progress on tracking down the source!