# questions
m
Hi everyone, I hope you all had a nice weekend 🙂 This is a subject I have already brought up, and which I really think deserves another shot: dependency isolation (nodes, pipelines, namespaces…). (I'm dropping this here so as not to "clutter" the feature requests on GitHub "for nothing". If you consider it a reasonable / sensible request, I'll happily put it there.) Let's start with an obvious & bitter truth: Python is wonderful… but like the snake in the Garden of Eden, the fruit it offers comes at a high cost: 🔥 😈 Dependency Hell 😈 🔥 If it were only a deployment / production issue, I would happily surrender to the answer "just use Airflow's PythonVirtualenvOperator / ExternalPythonOperator"… But I hope most of you will agree that it is also a nightmare in development… Granted, a workaround is always possible: create multiple venvs in the repo and then manually switch between interpreters in the shell (thx @Iñigo Hidalgo for the tip):
`.venv_model_a/bin/python -m kedro run --tags=model_a`
then
`.venv_model_b/bin/python -m kedro run --tags=model_b`
etc. But this, IMHO, is really far from an optimal, "dev-comfort-centric" workflow… Hence my initial request / question: would there be some mechanism that could allow passing a path to a venv when creating a node / pipeline? (I must confess that, in my naïveté, I thought this would be quite easy using a `before_node_run` hook… but I quickly had to admit that my skills were not up to the task 😅) Many thanks in advance for taking the time to consider this suggestion / request. Regards, Marc
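To make the idea concrete, what I was fumbling towards looked roughly like the sketch below — the `venv:<path>` tag convention is something I made up, and the comments explain where it falls apart:

```python
# Rough sketch only: the "venv:<path>" tag convention is hypothetical and this
# does NOT achieve isolation — it just illustrates why the hook route is hard.
from kedro.framework.hooks import hook_impl


class VenvPerNodeHooks:
    @hook_impl
    def before_node_run(self, node, catalog, inputs, is_async, session_id):
        # Look for a hypothetical "venv:/path/to/.venv" tag on the node.
        venv_tags = [tag for tag in node.tags if tag.startswith("venv:")]
        if not venv_tags:
            return  # no isolated venv requested for this node
        venv_path = venv_tags[0].split(":", 1)[1]
        # Dead end: by the time this hook fires we are already inside one
        # interpreter, and Kedro will execute the node function right here
        # regardless of what we do with `venv_path`. The best a hook could do
        # is shell out to f"{venv_path}/bin/python -m kedro run --tags=…",
        # which is just the manual workaround above, automated.
        ...
```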
d
In my ideal world view, everybody (who uses Kedro) would define dependencies for modular pipelines inside each pipeline micropackage, run in an environment (dynamically) created from those dependencies (or, if so desired / there's no dependency hell, I suppose installing everything in a shared environment works…), and relish the fact that their stuff works the same locally and on an orchestrator that creates containers based on those requirements. But, in reality, most users put all the requirements in a central `src/requirements.txt`, pin them, and struggle when they can't resolve the dependencies for a massive pipeline in a single environment. The impression I get is that 80-90% of DS users also don't want to define requirements at the modular-pipeline level, or manage multiple environments. So far, the discussions I've seen around micropackaging, etc. have talked about trying to extract the necessary requirements from the central requirements file in order to avoid this. So I think that by wanting to run in isolated environments like this, you're technically doing the right thing, but it's likely not something Kedro itself caters to at this time (i.e. you're a more advanced user). So, what's the right way to do this? I guess a Kedro plugin or, as you've mentioned, an orchestrator that can already handle spinning up environments locally, be it Airflow, Prefect, etc., is the right thing to do. I think this could also be informed by the "right"/standardized way to deploy Kedro to orchestrators, making a plugin/runner do more-or-less the same thing locally. But I'd be curious to see what @Juan Luis @Nok Lam Chan @marrrcin think.
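For concreteness, "dependencies inside each pipeline micropackage" means something like the layout below (a sketch — whether a per-pipeline `requirements.txt` exists out of the box depends on your Kedro version, but `kedro micropkg package` can bundle it with the micro-package):

```
src/my_project/pipelines/model_a/
├── __init__.py
├── nodes.py
├── pipeline.py
└── requirements.txt   # deps for this pipeline only, travels with the micro-package
```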
👍 1
m
My 2 cents on this: I find that this is a limitation of Python itself, tbh. To make an analogy with a mono-repo where, e.g., you have many micro services whose code base all lives in one central repo, potentially with a couple of libraries shared across micro services: ideally, every micro service would declare its own set of dependencies, but in such a way that there is consistency across all micro services. What I mean by that is that if micro services A and B both use package X, they both use the same version! As far as I am aware, there is no way to enforce this (pip, poetry, …). It would be similar for different Kedro pipelines, each having its own set of dependencies. But as mentioned before, there is just a lack of tooling in the Python ecosystem to enable this. FYI: my idea is inspired by the concept of workspaces in Rust https://doc.rust-lang.org/book/ch14-03-cargo-workspaces.html
👀 2
i
My thinking is that if dependencies are so unresolvable that you need to have separate environments to live in, these pipelines really should be thought of as totally independent entities. Trying to manage them from within the same running environment is going to be a hassle regardless of the solution you find. I like the other 2 responses’ points of view regarding looking at pipelines as microservices/micropackages, but I don’t think that is going to get around your developer experience/comfort question. Kedro needs to run within whichever environment you’re running your pipeline in, so it’s not gonna be able to manage that for you.
👍 1
m
I agree with @Iñigo Hidalgo:
if dependencies are so unresolvable that you need to have separate environments to live in, these pipelines really should be thought of as totally independent entities
and I also agree with @Matthias Roels that it's a similar case to running something from a monorepo. IMHO, Kedro is not a tool to handle (or work around) the limitations of Python in this area. At this level of complexity, you should probably have an orchestrator on top that connects multiple Kedro pipelines (= separate projects with isolated requirements) at the "business logic" level. I would definitely go with containerization + orchestration with something like Airflow / Argo or even Kubeflow.
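Something like the sketch below (just an illustration — the image names are made up, and the same idea works with KubernetesPodOperator, Argo Workflows or Kubeflow instead of DockerOperator):

```python
# Sketch only: two independently containerized Kedro projects, each built from
# its own repo/venv; the image names below are made up.
from datetime import datetime

from airflow import DAG
from airflow.providers.docker.operators.docker import DockerOperator

with DAG(
    dag_id="kedro_isolated_pipelines",
    start_date=datetime(2024, 1, 1),
    schedule=None,  # trigger manually for the sake of the example
) as dag:
    model_a = DockerOperator(
        task_id="model_a",
        image="registry.example.com/kedro-model-a:latest",  # made-up image
        command="kedro run --pipeline model_a",
    )
    model_b = DockerOperator(
        task_id="model_b",
        image="registry.example.com/kedro-model-b:latest",  # made-up image
        command="kedro run --pipeline model_b",
    )
    # The "business logic" dependency lives at the orchestrator level,
    # not inside either Kedro project.
    model_a >> model_b
```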
👍 1
m
Hi everyone, Thanks for all those thoughtful responses. I’ll write back “properly” tomorrow 🙂