# questions
m
With MLflow, you have to create a custom `PythonModel` in case you want to store a model combined with its preprocessing steps (which you always have to do imo). How can you do that with kedro (or kedro-mlflow)? The problem is that you probably fitted the preprocessors in earlier nodes and persisted the result. As far as I can tell from the docs, MLflow requires the artifacts of a custom model to be persisted on disk (which you can do with the catalog), but those path strings are not readily available inside the kedro nodes to be passed to the constructor of the pyfunc.
Any tips or ideas welcome 😀
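For context, the plain-MLflow approach being described looks roughly like this: a custom `mlflow.pyfunc.PythonModel` whose artifacts are referenced by local file paths at logging time. The class name, artifact keys, and paths below are illustrative, not from the thread.

```python
import pickle

import mlflow.pyfunc


class ModelWithPreprocessing(mlflow.pyfunc.PythonModel):
    """Serve a fitted preprocessor and estimator as a single pyfunc model."""

    def load_context(self, context):
        # MLflow downloads the logged artifacts and exposes their local paths here
        with open(context.artifacts["preprocessor"], "rb") as f:
            self.preprocessor = pickle.load(f)
        with open(context.artifacts["model"], "rb") as f:
            self.model = pickle.load(f)

    def predict(self, context, model_input):
        return self.model.predict(self.preprocessor.transform(model_input))


# Logging needs the on-disk paths of the persisted artifacts -- exactly the
# information that is not visible from inside a kedro node, which only sees
# the loaded objects, not the catalog filepaths (paths below are made up).
mlflow.pyfunc.log_model(
    artifact_path="model",
    python_model=ModelWithPreprocessing(),
    artifacts={
        "preprocessor": "data/06_models/preprocessor.pkl",
        "model": "data/06_models/regressor.pkl",
    },
)
```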
đŸ€© 1
y
Oh my favorite question. It pops up from time to time but it's very difficult to search for.
Fun fact: kedro-mlflow was built exactly to solve this problem, even before its experiment tracking features
So you have a `KedroPipelineModel` class in kedro-mlflow which enables you to create a custom model from any kedro pipeline
❀ 1
But the recommended way is to use the `pipeline_ml_factory` function to create a `PipelineML` object. It behaves like a standard kedro pipeline, but the kedro-mlflow hook will recognize it and automatically log the entire pipeline as a custom model at the end of training
You have a very detailed tutorial here : https://github.com/Galileo-Galilei/kedro-mlflow-tutorial
(just read the readme, it should be quite self-explanatory - basically the only thing to do is to convert your training pipeline with pipeline_ml_factory in the pipeline_registry.py)
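A minimal sketch of what that pipeline_registry.py change might look like; the module paths, pipeline names, and the `input_name` value are assumptions for illustration, not taken from the tutorial verbatim.

```python
from kedro.pipeline import Pipeline
from kedro_mlflow.pipeline import pipeline_ml_factory

# Hypothetical project pipelines
from my_project.pipelines import inference, training


def register_pipelines() -> dict[str, Pipeline]:
    training_pipeline = training.create_pipeline()
    inference_pipeline = inference.create_pipeline()

    # Wrap the training pipeline so the kedro-mlflow hook logs the whole
    # inference pipeline as a custom mlflow model once training has run
    training_pipeline_ml = pipeline_ml_factory(
        training=training_pipeline,
        inference=inference_pipeline,
        input_name="instances",  # the inference pipeline's free input (assumption)
    )

    return {
        "training": training_pipeline_ml,
        "inference": inference_pipeline,
        "__default__": training_pipeline_ml,
    }
```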
I'd be happy to get feedback on this
Hi @Matthias Roels did you have any chance to try this?
m
Yes and no, I did some experimental testing and did a deep dive into the code base. Overall the plugin is really great! One point of immediate improvement that I see: use `mlflow-skinny` instead of `mlflow` as a dependency (and potentially declare `mlflow` as an optional dependency)
y
Thanks for the feedback, glad to have more in-depth thoughts if you experiment further. Unfortunately I've been asked a lot to depend on mlflow-skinny instead of mlflow, but it breaks some functionalities (local UI, model registry) and I am a bit reluctant to remove them by default. I did not find a good way to make this an opt-in functionality, because mlflow does not expose it as optional requirements but as a different package, which creates namespace conflicts in Python.
m
You are right, that’s a tricky point

Today I was playing around with it a bit more and stumbled upon an issue I couldn't resolve. When you train an xgboost model, you ideally want to log it in `ubj` format, as that format is guaranteed to be compatible across different xgboost versions (which is useful for later reuse). However, there is no kedro dataset to store the model in such a way.
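One way around that gap would be a small custom dataset. A minimal sketch, assuming kedro >= 0.19 (where the base class is named `AbstractDataset`; older releases call it `AbstractDataSet`), a local filepath, and a raw `xgb.Booster` object; the class name is hypothetical.

```python
from pathlib import Path

import xgboost as xgb
from kedro.io import AbstractDataset


class XGBoostUBJDataset(AbstractDataset):
    """Hypothetical dataset that persists an XGBoost Booster as .ubj."""

    def __init__(self, filepath: str):
        self._filepath = Path(filepath)

    def _save(self, model: xgb.Booster) -> None:
        self._filepath.parent.mkdir(parents=True, exist_ok=True)
        # xgboost picks the serialization format from the file extension,
        # so a ".ubj" path gives the version-stable Universal Binary JSON format
        model.save_model(str(self._filepath))

    def _load(self) -> xgb.Booster:
        model = xgb.Booster()
        model.load_model(str(self._filepath))
        return model

    def _describe(self) -> dict:
        return {"filepath": str(self._filepath)}
```

Registered in catalog.yml, this would be referenced by its import path (e.g. `type: my_project.datasets.XGBoostUBJDataset` with a `filepath` ending in `.ubj`), again assuming a module layout like the one sketched above.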