# questions
r
I'm still reviewing Kedro and how it might be used to architect my team's data science/MLOps projects. I was wondering if there is some clear direction on how Kedro users typically lay out their projects for ease of deployment across the different phases of the MLOps lifecycle. For example: let's say I have three "base" pipelines: data preprocessing, model training, and model scoring. The training and scoring pipelines would both depend on data preprocessing, with scoring running on some regular batch cadence whereas training only needs to run intermittently. You may also have a related pipeline for model monitoring/reporting. I imagine this is a very common use case. How are you architecting your Kedro project in order to most neatly allow for the deployment of multiple related pipelines in such a use case? Or is the intention that these would actually be held in separate projects? It kind of seems like the examples I've seen so far assume you're running every pipeline you develop in one deployment run. For instance, the provided iris example has two pipelines, but it seems like they would both be run in sequence every time. BTW, thanks in advance! Love all the helpful activity in this channel!
y
Hey! This is a great question. I'll lean on the community to provide other perspectives, but I can help with two points:
• I'm happy to be corrected, but all of those pipelines should be in the same project. You can trigger pipeline runs using the CLI or Python (rough sketch below) and only run sections of your project, i.e. if I were using the CLI, I could do `kedro run --pipeline=data_preprocessing` and `kedro run --pipeline=model_training`. This is best seen in our CLI guide. We mention this in our longer Spaceflights tutorial.
• And then you're pretty free to pick how you'd like these pipelines orchestrated, and that's outside the scope of Kedro.
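For illustration, a rough sketch of the Python route, assuming a standard project layout and a registered pipeline named `model_training` (the exact session API differs slightly between Kedro versions):

```python
# Rough sketch: run a single registered pipeline from Python instead of the CLI.
# Assumes a recent Kedro version and a pipeline registered as "model_training".
from pathlib import Path

from kedro.framework.session import KedroSession
from kedro.framework.startup import bootstrap_project

project_path = Path.cwd()          # root of the Kedro project
bootstrap_project(project_path)    # load the project metadata and settings

with KedroSession.create(project_path=project_path) as session:
    # Only the selected pipeline runs; nothing else in the project is triggered.
    session.run(pipeline_name="model_training")
```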
n
Pipelines are flexible objects that you can combine easily. So you can structure your pipelines as preprocessing, training, reporting, etc. Using the --pipeline argument you can select a specific one, but you can also have a main pipeline which is the sum of everything.
Iris is just the simple starter project. If you look at the spaceflights project: https://github.com/kedro-org/kedro-starters/blob/main/spaceflights/{{%20cookiecutter.repo_name%20}}/src/{{%20cookiecutter.python_package%20}}/pipeline_registry.py The default pipeline literally uses the sum function to combine the data engineering and data science pipelines.
In addition, you can also use tags, or just use the --from-nodes / --to-nodes arguments. The CLI guide above provides the list of all options.
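To make that concrete, here is a hypothetical pipeline_registry.py for the three-pipeline layout described above (the package and pipeline module names are placeholders, not from this thread):

```python
# Hypothetical pipeline_registry.py: separate pipelines selectable by name via
# `kedro run --pipeline=<name>`, plus a default pipeline that chains everything.
from typing import Dict

from kedro.pipeline import Pipeline

# Placeholder modules -- each is assumed to expose a create_pipeline() function.
from my_package.pipelines import data_preprocessing, model_scoring, model_training


def register_pipelines() -> Dict[str, Pipeline]:
    preprocessing = data_preprocessing.create_pipeline()
    training = model_training.create_pipeline()
    scoring = model_scoring.create_pipeline()

    return {
        "data_preprocessing": preprocessing,
        "model_training": training,
        "model_scoring": scoring,
        # `kedro run` without --pipeline falls back to this one.
        "__default__": preprocessing + training + scoring,
    }
```

This is the same idea as the spaceflights registry linked above: the `+` (or `sum`) of pipelines is itself a pipeline, so the default run is just the combination of the named ones.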
r
Thanks for the replies! It sounds like my intuition is correct that the pipelines are truly meant to all be in the same project. I've played with adding tags as well as fussing with pipeline_registry.py to get creative with what pipeline chains can be created. If I could modify my question slightly: once I've got these pipelines designated and I want to deploy this on some sort of cloud environment, the various deployment paths I've read about all seem to go down the path of containerizing the whole project and, at the end of the Dockerfile, specifying the pipeline you want to launch with a closing 'kedro run' command. So the question I'm wondering: is there anywhere that shows how to neatly create these deployments of the related but separate pipelines?
y
If you forget the "mlflow" part (or not, after all), this tutorial tries to answer these exact questions and gives some examples of how I personally envision things. This is obviously not the only way to do it, but I think it's worth reading: https://github.com/Galileo-Galilei/kedro-mlflow-tutorial
m
> So the question I'm wondering: is there anywhere that shows how to neatly create these deployments of the related but separate pipelines?
I think the answer depends on whether you are OK with sharing dependencies across pipelines (TensorFlow, Spark, … are big packages that you might not want to include in a container image if you don't need them). If you are OK with sharing (which I think will be the case for the majority of Kedro users), you can create one container image (Dockerfile in the root of the project) and then use that image in your deployment, overriding the `CMD` with the appropriate `kedro run` command.
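Purely as an illustration (not from this thread), a minimal sketch of that "one image, swap the command" pattern using the Docker Python SDK; the image tag and pipeline names are placeholders, and the same idea applies whether the containers are launched by the docker CLI or an orchestrator:

```python
# Sketch: reuse a single project image for every pipeline by overriding the
# command at launch time, instead of baking one fixed `kedro run` into the image.
# The image tag and pipeline names below are placeholders.
import docker

client = docker.from_env()

for pipeline in ("data_preprocessing", "model_scoring"):
    client.containers.run(
        image="my-registry/my-kedro-project:latest",
        command=["kedro", "run", f"--pipeline={pipeline}"],
        remove=True,  # clean up each container after its run completes
    )
```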
r
Both of these comments give me something to think about. We should probably use something like MLflow as well. And I think that's right @Matthias Roels, we probably don't mind if packages are shared in the different images. I think I can envision how that would work, but do you happen to have any examples?
m
Well, it's actually one container image that you reuse for each pipeline. In my specific case, I use Argo Workflows for orchestration, so I would use the same image in each task and set the appropriate command, e.g. `kedro run --pipeline=inference`. Another pattern I sometimes use (usually non-prod) is to use a container image without my project code and use a git artifact to clone the repo into a specific folder mounted to the container. From there, I can then run the image.
👍 2
r
Gotcha, thanks for the insight!