# questions
m
DEPENDENCIES ISOLATION
Hi everyone, let's assume that my data_processing_node and my model_training_node have conflicting dependencies. How would you handle such an (unfortunately common) situation? I know that in MLflow it is possible to have task-specific venvs… Does Kedro offer such a possibility? If not, how could one circumvent the issue? 🙂 Many thanks in advance, M.
i
How are you running your Kedro pipelines? Kedro doesn't exactly offer this functionality, but it is "environment agnostic": you could set up one venv to run your data_processing pipeline and another venv to run your training pipeline.
n
Kedro doesn't offer this functionality out of the box. Which libraries have the conflicting dependencies? For development you can do as @Iñigo Hidalgo said. For deployment, most orchestrators offer this functionality and you can use them with Kedro: https://docs.kedro.org/en/stable/deployment/index.html
m
Thx @Iñigo Hidalgo & @Nok Lam Chan! So far I'm still very much at the bottom of the "Kedro learning curve", slowly experimenting and re-writing one of our pipelines in / with Kedro… I'm running things locally, on my machine, via Kedro's CLI commands. When Iñigo says _"you could set up one venv to run your data_processing pipeline and another venv to run your training pipeline"_, does that mean this can be done in a single project, with modular pipelines and kedro run? If so, how? Or do you suggest having separate (i.e. non-modular) projects / repos? Regarding deployment: I know that, for example, Airflow offers PythonVirtualEnvOperator etc… But if I read the packaging section of the doc correctly, once a Kedro pipeline has been packaged, it must be installed in the same venv as Airflow, right? Therefore, if we develop separate Kedro pipelines with conflicting dependencies, we will run into trouble when we try to install their respective packages into Airflow's venv, won't we? Many thanks in advance for your help / suggestions, M. P.S: @Nok => regarding our conflicting dependencies, as an example, some of the libraries we work with require older versions of numpy while some other libraries require more recent versions of numpy etc…
i
I reread your original message and saw you referred to each step as _node, not _pipeline. Are they different nodes in the same pipeline? If that's the case I can't think of a way you could easily do that. If they have totally different dependencies I wouldn't consider them as parts of the same pipeline tbh. If on the other hand they're separate pipelines, you could do
Copy code
# one virtual environment per (sub-)pipeline, each with its own pinned dependencies
python -m venv .venv_data_processing
python -m venv .venv_model
.venv_data_processing/bin/python -m pip install .  # (and the specific data-processing requirements)
.venv_model/bin/python -m pip install .  # (and the specific model requirements)

# run each pipeline with the interpreter from its own venv
.venv_data_processing/bin/python -m kedro run --pipeline data_processing
.venv_model/bin/python -m kedro run --pipeline model
But this doesn't seem very scalable tbh, and just seeing the snippet is making my eyes bleed 😅
m
Thx a lot Iñigo 🙂 (and sorry for your eyes! 😜) You're totally correct: I was being inconsistent 😅 🤦🏻‍♂️ I actually meant pipelines ("sub-pipelines", really, within a single project / repo). Thx for your suggestion above, it will definitely "do the trick" for local development 👍
@Iñigo Hidalgo & @Nok Lam Chan If it's not too much to ask: how would you handle this in deployment? I'm looking into kedro-airflow, and read the following: "Step 3: Package and install the Kedro pipeline in the Airflow executor's environment. After generating and deploying the DAG file, you will then need to package and install the Kedro pipeline into the Airflow executor's environment." These last words leave me worried: even though Airflow does offer PythonVirtualEnvOperator, ExternalOperator etc…, those operators won't be of any help if I have 2 packaged Kedro pipelines with conflicting dependencies: those won't "peacefully co-exist" in the Airflow executor's env… Is this correct?
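For illustration, here is a minimal sketch of what running a single Kedro pipeline through Airflow's PythonVirtualEnvOperator might look like. This is not taken from the thread: the project path and version pins are placeholders, and the DAG / KedroSession calls assume roughly Airflow 2.4+ and Kedro 0.18+.
Copy code
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonVirtualEnvOperator


def run_data_processing():
    # Imports live inside the callable: it executes in the venv Airflow builds for this task.
    from kedro.framework.session import KedroSession
    from kedro.framework.startup import bootstrap_project

    project_path = "/opt/airflow/kedro_project"  # hypothetical location of the Kedro project
    bootstrap_project(project_path)
    with KedroSession.create(project_path=project_path) as session:
        session.run(pipeline_name="data_processing")


with DAG(
    dag_id="data_processing_isolated",
    start_date=datetime(2023, 1, 1),
    schedule=None,
    catchup=False,
):
    PythonVirtualEnvOperator(
        task_id="data_processing",
        python_callable=run_data_processing,
        requirements=["kedro", "numpy<1.24"],  # hypothetical, pipeline-specific pins
        system_site_packages=False,
    )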
i
Sorry, I'm not familiar with Airflow-Kedro deployment patterns, so I can't be of specific help. But it isn't necessarily a problem to have the pipeline code itself installed together; that's what I suggested in the snippet above. You'd just need to be careful to install the correct dependencies into each environment. Honestly, though, I would try to resolve these inconsistencies if this is a long-term project, as they will only complicate things down the line as the project grows. If the dependencies aren't resolvable, then I would think of splitting data processing and modeling into different projects which can be deployed independently. Data processing will probably need to run before training, so you could use Airflow to trigger first the data processing pipeline in its own project and then the modeling pipeline in its own, separate project. On the specifics of how, I can't be of help unfortunately. Maybe somebody else in the channel has an idea.
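For illustration, a rough sketch of that split-project pattern: two pre-built venvs, one per Kedro project, each triggered from Airflow with a plain BashOperator so the conflicting dependencies never share an interpreter. The venv and project paths are hypothetical, and the DAG arguments assume roughly Airflow 2.4+ (where BashOperator accepts cwd and DAG accepts schedule).
Copy code
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

# Hypothetical pre-built venvs and project checkouts, one per Kedro project.
DATA_PY = "/opt/venvs/data_processing/bin/python"
MODEL_PY = "/opt/venvs/model_training/bin/python"

with DAG(
    dag_id="kedro_split_projects",
    start_date=datetime(2023, 1, 1),
    schedule=None,
    catchup=False,
):
    data_processing = BashOperator(
        task_id="data_processing",
        bash_command=f"{DATA_PY} -m kedro run --pipeline data_processing",
        cwd="/opt/projects/data_processing",  # kedro run expects to be executed from the project root
    )
    model_training = BashOperator(
        task_id="model_training",
        bash_command=f"{MODEL_PY} -m kedro run --pipeline model_training",
        cwd="/opt/projects/model_training",
    )

    data_processing >> model_training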
👍🏼 1
m
Thanks for your comments / suggestions, Iñigo. Have a nice weekend, M.