# questions
j
Hi everyone, I have a question about the ParallelRunner. I have a modular pipeline setup with lots of pipelines that I want to run in parallel. At the end of each of the modular pipelines, the computations are saved to a PostgreSQL database. What I am noticing is that even though some pipelines are completing, the memory utilization never goes down. A little more info on my setup: I am running 32 workers on a 72-core machine and I have thousands of modular pipelines that I want to run in parallel. My question is this: does the ParallelRunner hold on to the Python objects until ALL of the modular pipelines are complete?
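For reference, a run like this could be kicked off as follows (a minimal sketch using the KedroSession API; the project path is a placeholder, the worker count just mirrors the setup described above, and the exact session arguments vary a bit between Kedro versions):

```python
from kedro.framework.session import KedroSession
from kedro.framework.startup import bootstrap_project
from kedro.runner import ParallelRunner

# Placeholder project path; 32 worker processes on the 72-core machine.
bootstrap_project("/path/to/project")
with KedroSession.create(project_path="/path/to/project") as session:
    session.run(runner=ParallelRunner(max_workers=32))
```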
h
Someone will reply to you shortly. In the meantime, this might help:
d
You can see the implementation here: https://docs.kedro.org/en/stable/_modules/kedro/runner/parallel_runner.html#ParallelRunner It's also possible to subclass this and use your own runner
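For example, a minimal subclass could just change the defaults (a sketch; the module path and class name are made up, and the `ParallelRunner` constructor arguments may differ slightly between Kedro versions):

```python
# src/my_project/runners.py  (hypothetical module path)
from kedro.runner import ParallelRunner

class CappedParallelRunner(ParallelRunner):
    """ParallelRunner with a lower default worker count."""

    def __init__(self, max_workers=8, is_async=False):
        super().__init__(max_workers=max_workers, is_async=is_async)
```

Then something like `kedro run --runner=my_project.runners.CappedParallelRunner` should pick it up, since the `--runner` option accepts a dotted path to a runner class.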
can you try the `--async` flag?
how are you connecting to Postgres? Is it pandas or Ibis?
j
Thank you @datajoely, I will try the async flag and see if that does anything. I am using a custom dataset that subclasses SQLTableDataset, and I just override the save method to use Polars instead of pandas.
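Not the actual implementation, but a standalone sketch of the same idea built directly on `AbstractDataset` (class, attribute, and argument names here are made up; in Kedro 0.18.x the base class is spelled `AbstractDataSet`):

```python
import polars as pl
from kedro.io import AbstractDataset

class PolarsSQLTableDataset(AbstractDataset[pl.DataFrame, pl.DataFrame]):
    """Reads/writes a Postgres table with Polars instead of pandas."""

    def __init__(self, table_name: str, connection_uri: str):
        self._table_name = table_name
        self._connection_uri = connection_uri

    def _load(self) -> pl.DataFrame:
        # Uses connectorx/ADBC under the hood to pull the table into a Polars frame.
        return pl.read_database_uri(
            f"SELECT * FROM {self._table_name}", self._connection_uri
        )

    def _save(self, data: pl.DataFrame) -> None:
        # Writes via SQLAlchemy/ADBC; appends to the existing table.
        data.write_database(
            self._table_name, self._connection_uri, if_table_exists="append"
        )

    def _describe(self) -> dict:
        return {"table_name": self._table_name}
```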
d
oh interesting
I think if you use Ibis you can use the ThreadRunner, which will definitely be faster
n
just thinking quickly, I don't think the ParallelRunner tries to hold on to any data. In a `kedro run`, once you start the pipeline, what Kedro sees is just a bunch of nodes (there is no longer a concept of a pipeline), and the execution order is determined by solving the dependencies between nodes. If the data is a persisted type (e.g. CSVDataset), the memory for the data is released immediately after the node runs. If it's a memory type, it is released once the data has no more downstream dependencies.
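A toy illustration of that release behaviour (function and dataset names are made up):

```python
from kedro.pipeline import node, pipeline

def make_features(raw):
    return raw * 2          # placeholder transformation

def score(features):
    return features + 1     # placeholder transformation

demo = pipeline([
    # "features" is not declared in the catalog, so it becomes a MemoryDataset
    # and can be released as soon as score() (its only consumer) has run.
    node(make_features, inputs="raw", outputs="features", name="make_features"),
    # If "scores" is declared as e.g. a CSVDataset in the catalog, the data is
    # persisted to disk and the in-memory object can be dropped right after the node.
    node(score, inputs="features", outputs="scores", name="score"),
])
```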
j
Oh! @datajoely I will try that. I didn't know that the ThreadRunner is faster than the ParallelRunner. @Nok Lam Chan interesting, but then I don't know how to explain why the memory utilization never goes down. I am trying to run it now with async to see if that makes a difference. I would be happy to share my implementation, but unfortunately I work in the healthcare field, so I can't really share anything related to patient information.
d
sorry, it's only faster in certain conditions (because of Python's limitations with concurrency more generally)
j
Ah I see, well I will give it a try and see if it speeds up my use case. Thank you!
d
Because Spark and Ibis delegate their processing to other execution engines, you can use threads; in the pandas and Polars case the work happens in-process, so you need to work with the ParallelRunner on your machine
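In code terms, the choice looks roughly like this (a sketch; the worker counts are placeholders):

```python
from kedro.runner import ParallelRunner, ThreadRunner

# Nodes that only build Ibis/Spark expressions mostly wait on an external
# engine, so threads are enough and nothing has to be pickled between processes.
runner_for_delegated_work = ThreadRunner(max_workers=32)

# Nodes doing pandas/Polars work inside the Python process generally need
# separate worker processes to make use of all the cores.
runner_for_in_process_work = ParallelRunner(max_workers=32)
```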
j
I see, currently I am running on a single-node server, though eventually I think I will need to deploy this on a Kubernetes cluster. Maybe then the switch to Spark or Ibis will give me a more meaningful boost in performance.
It would be really cool if Kedro had a kedro-kubernetes plugin!
d
I think our Airflow plugin is the best way of achieving that today
but also, using Ibis makes Postgres your workhorse and delegates execution directly to the SQL engine, which handles scaling etc. there
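Something along these lines (a sketch with placeholder connection details, table, and column names; credentials would normally come from config):

```python
import os
import ibis

# Placeholder connection details.
con = ibis.postgres.connect(
    host="localhost",
    user="kedro",
    password=os.environ["POSTGRES_PASSWORD"],
    database="warehouse",
)

events = con.table("events")  # lazy table expression, no data pulled into Python
daily = events.group_by("event_date").aggregate(n=events.count())

# Executes as SQL inside Postgres; nothing is materialised in the Python process.
con.create_table("daily_counts", daily, overwrite=True)
```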
j
Yes, I think so too. I have played around with some plugins and find that to be the most reliable. However, it seems tied to a platform; I forget if it was called Astro or something. We are a Microsoft shop, so we use Azure, and all of our resources and Data Lake are on Azure. I need to find out if the plugin can play nicely with our current infrastructure.
Oh, that would actually be quite a performance boost. I will go ahead and replace the Polars SQL implementation with Ibis then. Thank you for all your help! I am still running with async and will report back on whether there are any improvements.
d
💪
n
> @Nok Lam Chan interesting, but then I don't know how to explain why the memory utilization never goes down.
does it come down when you are using the SequentialRunner (the default)? Does this problem only appear with the ParallelRunner?
j
Hey @Nok Lam Chan, thank you for that tip. It looks like even with the SequentialRunner the memory was steadily increasing. I narrowed the issue down to a potential memory leak related to JAX, or JAX might just be caching things to reduce compilation time. I need to dig into it more, but your advice certainly helped me find that issue. @datajoely the --async flag didn't change anything, but then again, if this is a memory leak it isn't expected to.
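Two quick checks I plan to use to confirm whether it's JAX holding on to the memory (assuming a reasonably recent JAX version, where both calls are available):

```python
import jax

# How many device buffers are still alive after a modular pipeline finishes;
# if this keeps growing, something is holding references to arrays.
print(len(jax.live_arrays()))

# Drop compiled-function caches between runs.
jax.clear_caches()
```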
n
Good progress on tracking down the source!