In my project I would like to have a pipeline using a pre-trained model that is on MLflow. The problem I am facing is that I don't know how to convey the transformation information on the data in the preprocessing. For example, for normalisation we need a minimum and a maximum value to be able to transform the test set in the same way. Is there a solution to this issue? Do you recommend using sklearn Pipelines ?
In my use case sklearn Pipelines were enough. I export the pipeline as a compressed joblib file/pickle along with the info I might need regarding the model. Maybe not the best way but it gets the job done in our case. Maybe you can get better feedback from more experienced users! 🙂
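A minimal sketch of the approach described above, assuming a toy dataset and a hypothetical temporary file path (neither is from the original poster's project). The fitted `MinMaxScaler` keeps the training min/max inside the pipeline, so the same transformation is applied after reloading:

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Toy training data, for illustration only.
X_train = np.array([[0.0], [5.0], [10.0]])
y_train = np.array([0.0, 1.0, 2.0])

# The scaler learns min/max during fit, so they travel with the pipeline.
pipe = Pipeline([("scale", MinMaxScaler()), ("model", LinearRegression())])
pipe.fit(X_train, y_train)

# Save as a compressed joblib file; the path is a hypothetical example.
model_path = os.path.join(tempfile.mkdtemp(), "pipeline.joblib")
joblib.dump(pipe, model_path, compress=3)

# Later, at inference time: reload and reuse the stored transformation.
loaded = joblib.load(model_path)
X_test = np.array([[2.5], [7.5]])
predictions = loaded.predict(X_test)
```

The reloaded pipeline exposes the learned bounds via `loaded.named_steps["scale"].data_min_` and `data_max_`, in case they are needed separately.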
You should store all artifacts crucial to inference (like the min/max values you have) as MLflow artifacts, so you can retrieve them later on.
Thanks for your answers 🙂
What if you want to combine sklearn pipelines and Kedro pipelines, is there any recommended way to do so?
There's no recommended way to do so, to the best of my knowledge; the default recommendation would be to use a sklearn pipeline within a node, but that's obviously not ideal. I did think quite a bit about implementing a subclass of Kedro pipeline that you can construct given a sklearn pipeline, and the execution would get delegated to sklearn for that piece (but visible in Kedro-Viz as separate steps, etc.). Never did get around to trying it. However, I heard somebody else did something similar and ran into the issue that sklearn pipelines aren't guaranteed to be DAGs. I think you could still restrict to DAG-like pipelines potentially? Or maybe Kedro should support cycles. :)
@Deepyaman Datta that sounds awesome! Well my guess is 90% of sklearn pipelines are DAGs anyway 🙂 Even with the initial version limited to DAG-like pipelines, this would still be an awesome integration.
@Ben Horsburgh in case you have any thoughts, since IIRC you were the one who tried something like this in the past/brought up the DAG thing
I think SKLearn pipelines and Kedro pipelines are very different things, and shouldn't really be equated with each other. I may want an sklearn pipeline that does:
   Remove outliers
Which could be the implementation of a single kedro pipeline node. By defining these as an SKLearn pipeline I get access to the SKLearn ecosystem and can do things like hyperparameter optimization, which by definition is a non-DAG process cycling over the pipeline many times. If I were to define a kedro pipeline with the above steps as different nodes, that is also ok. In this instance though I cannot co-tune the logical SKLearn pipeline steps. Instead from an SKLearn perspective it would look like:
   Remove outliers
What are the pros and cons of each?
• Single SKLearn pipeline
  ◦ + hyperparameter tuning over entire pipeline
  ◦ + export best pipeline model
  ◦ - complex parameterization
  ◦ - complex search space
• Multiple SKLearn pipeline
  ◦ + simple to parameterize
  ◦ + simple search space
  ◦ + export best transformer
  ◦ - No holistic tuning
Which to choose depends very much on the problem you are trying to solve.
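To make the "tuning over the entire pipeline" point concrete, here is a small sketch using `GridSearchCV` on a two-step sklearn pipeline. The dataset, step names, and parameter grid are assumptions chosen for illustration, not anything from the thread:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data, for illustration only.
X, y = make_classification(n_samples=120, n_features=5, random_state=0)

pipe = Pipeline(
    [("scale", StandardScaler()), ("clf", LogisticRegression(max_iter=1000))]
)

# "<step>__<param>" addresses a parameter of a named step, so preprocessing
# and model settings are searched together in one grid.
param_grid = {
    "scale__with_mean": [True, False],
    "clf__C": [0.1, 1.0, 10.0],
}

# Each candidate refits the whole pipeline on every CV fold.
search = GridSearchCV(pipe, param_grid, cv=3)
search.fit(X, y)

# The best candidate is itself a fitted Pipeline and can be exported as one object.
best_pipeline = search.best_estimator_
```

With separate pipelines per Kedro node, each grid would be smaller and simpler, but a preprocessing setting could not be scored against the downstream model in this way.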
For the original question - as mentioned by Lorenzo you can simply output then later input the artifact. Pickle works well for a simple approach. Just make sure that the transformer you are using will deal with previously unseen / out-of-bounds data in a graceful way! For example, if in your test set you see a value greater than the max in your training, do you clip or scale it?
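The clip-or-scale question can be illustrated with sklearn's `MinMaxScaler`, which extrapolates by default and clips when `clip=True` (a parameter available in scikit-learn 0.24 and later). The numbers below are made up for the example:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Training range is [0, 10]; the test set contains an out-of-bounds value (15).
X_train = np.array([[0.0], [10.0]])
X_test = np.array([[5.0], [15.0]])

# Default behaviour: out-of-range values are scaled beyond the [0, 1] range.
extrapolating = MinMaxScaler().fit(X_train)
scaled = extrapolating.transform(X_test)

# With clip=True, transformed values are clipped to the feature range.
clipping = MinMaxScaler(clip=True).fit(X_train)
clipped = clipping.transform(X_test)
```

Here `scaled` contains 0.5 and 1.5, while `clipped` contains 0.5 and 1.0. Which behaviour is appropriate depends on whether the downstream model tolerates inputs outside the range it saw during training.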
Thanks Ben! We're currently testing all three ways:
1. Kedro only + artifacts (via pickle)
2. Kedro + sklearn, each sklearn step is wrapped in a Kedro step
3. Kedro + sklearn, the whole pipeline in a single node
So far, the last option seems best for most cases. I really like the pros and cons you listed, however I believe that all listed benefits of the multiple sklearn pipeline option are still easily achievable in the single sklearn pipeline option. On the other hand, I would add the inability to visualize the pipeline in kedro-viz as a huge drawback of the last solution. It would look like this:
1. Kedro + artifacts
  ◦ + No dependency on sklearn
  ◦ + Visible in kedro-viz
  ◦ --- Need to reproduce all sklearn features, depending on the case (optimizations, handling out-of-scope values, pickling, etc.)
2. Multiple Sklearn pipeline
  ◦ + Visible in kedro-viz
  ◦ --- No holistic tuning
3. Single Sklearn pipeline
  ◦ + Most intuitive for sklearn users
  ◦ + Easy to migrate old projects
  ◦ + Single pickle for all sklearn steps
  ◦ +++ Leverages all sklearn features (holistic tuning, transformers, etc.)
  ◦ - Can't mix sklearn steps with Kedro steps (but would you even need that?)
  ◦ -- Can't see sklearn steps in kedro-viz
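As a rough sketch of the third option, the functions below keep the whole sklearn pipeline inside one training function and one prediction function. The function names and toy data are hypothetical; the functions are plain Python and do not import Kedro themselves:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler


def train_sklearn_pipeline(X_train, y_train):
    """Fit preprocessing and model together and return them as one object."""
    pipe = Pipeline([("scale", MinMaxScaler()), ("model", LinearRegression())])
    return pipe.fit(X_train, y_train)


def predict_with_pipeline(fitted_pipeline, X_new):
    """Apply the stored preprocessing and the model to new data."""
    return fitted_pipeline.predict(X_new)


# Stand-alone check with toy data, for illustration only.
fitted = train_sklearn_pipeline(
    np.array([[0.0], [5.0], [10.0]]), np.array([0.0, 1.0, 2.0])
)
preds = predict_with_pipeline(fitted, np.array([[2.5]]))
```

In a Kedro project, each function would typically be wrapped in its own node, with the fitted pipeline passed between them as a single pickled dataset. Kedro-Viz would then show two nodes, but not the individual sklearn steps inside them.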
Hi @Michał Stachowicz, sorry for being late to the party, but kedro-mlflow handles this use case very well. In fact, this is the reason why the plugin was originally built, before experiment tracking. You can look at the ``kedro mlflow modelify`` command, which automatically converts a Kedro pipeline to an MLflow model (and automatically handles the pickling of required artifacts). If you want this to be done automatically each time you run your training pipeline, you can look at the ``pipeline_ml_factory`` function. The main advantage of going this way is that you automatically benefit from all MLflow serving capabilities (as a batch, as an API, interactively, as a Kedro dataset...). There is also a step-by-step tutorial which explains exactly what you are trying to achieve. If you need more references, you can read this thread.
I'm new with Kedro and I was looking for this explanation of how to package the preprocessing steps and modeling in one artifact as sklearn pipelines do. Thanks @Yolan Honoré-Rougé you saved my life, I'll read more about kedro-mlflow to do this.
