# questions
m
Hi everyone, I hope you all had a nice weekend 🙂 This is a subject I have already brought up, and which I really think deserves another shot: dependency isolation (nodes, pipelines, namespaces…). (I'm dropping this here so as not to "clutter" the feature requests on GitHub "for nothing". If you consider it a reasonable / sensible request, I'll happily put it there.) Let's start with an obvious & bitter truth: Python is wonderful… but like the snake in the Garden of Eden, the fruit it offers comes at a high cost: 🔥 😈 Dependency Hell 😈 🔥 If it were only a deployment / production issue, I would happily surrender to the answer "just use Airflow's PythonVirtualenvOperator / ExternalPythonOperator"… But I hope most of you will agree that it is also a nightmare in development… Granted, a workaround is always possible: create multiple venvs in the repo and then manually switch between interpreters in the shell (thx @Iñigo Hidalgo for the tip):
`.venv_model_a/bin/python -m kedro run --tags=model_a`
then
`.venv_model_b/bin/python -m kedro run --tags=model_b`
etc. But this, IMHO, is really far from an optimal, "dev-comfort-centric" workflow… Hence my initial request / question: would there be some mechanism that could allow passing a path to a venv when creating a node / pipeline? (I must confess that, in my naïveté, I thought this would be quite easy using a `before_node_run` hook… but I quickly had to admit that my skills were not up to the task 😅) Many thanks in advance for taking the time to consider this suggestion / request. Regards, Marc
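To make the idea concrete, what I was fumbling towards looked roughly like the sketch below — the `venv:<path>` tag convention is something I made up, and the comments explain where it falls apart:

```python
# Rough sketch only: the "venv:<path>" tag convention is hypothetical and this
# does NOT achieve isolation — it just illustrates why the hook route is hard.
from kedro.framework.hooks import hook_impl


class VenvPerNodeHooks:
    @hook_impl
    def before_node_run(self, node, catalog, inputs, is_async, session_id):
        # Look for a hypothetical "venv:/path/to/.venv" tag on the node.
        venv_tags = [tag for tag in node.tags if tag.startswith("venv:")]
        if not venv_tags:
            return  # no isolated venv requested for this node
        venv_path = venv_tags[0].split(":", 1)[1]
        # Dead end: by the time this hook fires we are already inside one
        # interpreter, and Kedro will execute the node function right here
        # regardless of what we do with `venv_path`. The best a hook could do
        # is shell out to f"{venv_path}/bin/python -m kedro run --tags=…",
        # which is just the manual workaround above, automated.
        ...
```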
d
In my ideal world view, everybody (who uses Kedro) would define dependencies for modular pipelines inside each pipeline micropackage, run in an environment (dynamically) created from those dependencies (or, if so desired / there's no dependency hell, I suppose installing everything in a shared environment works…), and relish the fact that their stuff works the same locally and on an orchestrator that creates containers based on those requirements. But, in reality, most users put all the requirements in a central `src/requirements.txt`, pin them, and struggle when they can't resolve the dependencies for a massive pipeline in a single environment. The impression I get is that 80-90% of DS users also don't want to define requirements at the modular-pipeline level, or manage multiple environments. So far, the discussions I've seen around micropackaging, etc. have talked about trying to extract the necessary requirements from the central requirements file in order to avoid this. So I think that by wanting to run in isolated environments like this, you're technically doing the right thing, but it's likely not something Kedro itself caters to at this time (i.e. you're a more advanced user). So, what's the right way to do this? I guess a Kedro plugin or, as you've mentioned, an orchestrator that can already handle spinning up environments locally, be it Airflow, Prefect, etc., is the right thing to do. I think this could also be informed by the "right"/standardized way to deploy Kedro to orchestrators, making a plugin/runner do more-or-less the same thing locally. But I'd be curious to see what @Juan Luis @Nok Lam Chan @marrrcin think.
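For concreteness, "dependencies inside each pipeline micropackage" means something like the layout below (a sketch — whether a per-pipeline `requirements.txt` exists out of the box depends on your Kedro version, but `kedro micropkg package` can bundle it with the micro-package):

```
src/my_project/pipelines/model_a/
├── __init__.py
├── nodes.py
├── pipeline.py
└── requirements.txt   # deps for this pipeline only, travels with the micro-package
```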
👍 1
m
My 2 cents on this: I find that this is a limitation of Python itself, tbh. To make an analogy with a mono-repo where, e.g., you have many micro services whose code base all lives in one central repo, potentially with a couple of libraries shared across micro services: ideally, every micro service would declare its own set of dependencies, but in such a way that there is consistency across all micro services. What I mean by that is that if micro services A and B both use package X, they both use the same version! As far as I am aware, there is no way to enforce this (pip, poetry, …). It would be similar for different Kedro pipelines, each having its own set of dependencies. But as mentioned before, there is just a lack of tooling in the Python ecosystem to enable this. FYI: my idea is inspired by the concept of workspaces in Rust https://doc.rust-lang.org/book/ch14-03-cargo-workspaces.html
👀 2
i
My thinking is that if dependencies are so unresolvable that you need to have separate environments to live in, these pipelines really should be thought of as totally independent entities. Trying to manage them from within the same running environment is going to be a hassle regardless of the solution you find. I like the other 2 responses’ points of view regarding looking at pipelines as microservices/micropackages, but I don’t think that is going to get around your developer experience/comfort question. Kedro needs to run within whichever environment you’re running your pipeline in, so it’s not gonna be able to manage that for you.
👍 1
m
I agree with @Iñigo Hidalgo:
if dependencies are so unresolvable that you need to have separate environments to live in, these pipelines really should be thought of as totally independent entities
and I also agree with @Matthias Roels that it's a similar case to running something from a monorepo. IMHO, Kedro is not a tool to handle (or work around) the limitations of Python in this area. At this level of complexity, you should probably have an orchestrator on top that connects multiple Kedro pipelines (= separate projects with isolated requirements) at the "business logic" level. I would definitely go with containerization + orchestration with something like Airflow / Argo or even Kubeflow.
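Something like the sketch below (just an illustration — the image names are made up, and the same idea works with KubernetesPodOperator, Argo Workflows or Kubeflow instead of DockerOperator):

```python
# Sketch only: two independently containerized Kedro projects, each built from
# its own repo/venv; the image names below are made up.
from datetime import datetime

from airflow import DAG
from airflow.providers.docker.operators.docker import DockerOperator

with DAG(
    dag_id="kedro_isolated_pipelines",
    start_date=datetime(2024, 1, 1),
    schedule=None,  # trigger manually for the sake of the example
) as dag:
    model_a = DockerOperator(
        task_id="model_a",
        image="registry.example.com/kedro-model-a:latest",  # made-up image
        command="kedro run --pipeline model_a",
    )
    model_b = DockerOperator(
        task_id="model_b",
        image="registry.example.com/kedro-model-b:latest",  # made-up image
        command="kedro run --pipeline model_b",
    )
    # The "business logic" dependency lives at the orchestrator level,
    # not inside either Kedro project.
    model_a >> model_b
```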
👍 1
m
Hi everyone, Thanks for all those thoughtful responses. I’ll write back “properly” tomorrow 🙂