# questions
m
In my project I would like to have a pipeline using a pre-trained model that is on MLflow. The problem I am facing is that I don't know how to pass along the transformation parameters computed during preprocessing. For example, for normalisation we need the minimum and maximum values to be able to transform the test set in the same way. Is there a solution to this issue? Do you recommend using sklearn Pipelines?
l
In my use case sklearn Pipelines were enough. I export the pipeline as a compressed joblib file/pickle, along with any info I might need regarding the model. Maybe not the best way, but it gets the job done in our case. Maybe you can get better feedback from more experienced users! 🙂
👍 1
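For illustration, a minimal sketch of this joblib approach, assuming a scikit-learn pipeline with a MinMaxScaler (the dataset, scaler choice, and file names are illustrative, not from the thread):
```
import joblib
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# The fitted MinMaxScaler keeps the training min/max inside the pipeline,
# so the test set is transformed with exactly the same parameters.
pipe = Pipeline([("scale", MinMaxScaler()), ("model", LogisticRegression(max_iter=1000))])
pipe.fit(X_train, y_train)

# Persist the fitted pipeline (compressed) alongside any extra metadata.
joblib.dump({"pipeline": pipe, "meta": {"data_version": "v1"}}, "model.joblib.gz", compress=3)

# At inference time, reload and predict; no re-fitting needed.
bundle = joblib.load("model.joblib.gz")
print(bundle["pipeline"].score(X_test, y_test))
```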
m
You should store all artifacts crucial to inference (like the min/max values you have) as MLflow artifacts, so you can retrieve them later on
👍 2
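A minimal sketch of that idea (data, run layout, and artifact paths are illustrative): the fitted scaler, or just its min/max values, gets logged to the run and reloaded at inference time.
```
import mlflow
import mlflow.sklearn
import numpy as np
from sklearn.preprocessing import MinMaxScaler

X_train = np.array([[0.0, 10.0], [5.0, 20.0], [10.0, 30.0]])
scaler = MinMaxScaler().fit(X_train)

with mlflow.start_run() as run:
    # Log the fitted scaler itself as a model artifact...
    mlflow.sklearn.log_model(scaler, "scaler")
    # ...and/or the raw normalisation parameters as a JSON artifact.
    mlflow.log_dict(
        {"data_min": scaler.data_min_.tolist(), "data_max": scaler.data_max_.tolist()},
        "preprocessing/minmax.json",
    )

# Later, reload the exact same transformation by run id to process the test set.
scaler = mlflow.sklearn.load_model(f"runs:/{run.info.run_id}/scaler")
```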
m
Thanks for your answers 🙂
p
What if you want to combine sklearn pipelines and Kedro pipelines, is there any recommended way to do so?
d
There's no recommended way to do so, to the best of my knowledge; the default recommendation would be to use a sklearn pipeline within a node, but that's obviously not ideal. I did think quite a bit about implementing a subclass of Kedro pipeline that you could construct from a sklearn pipeline, with execution delegated to sklearn for that piece (but visible in Kedro-Viz as separate steps, etc.). I never did get around to trying it. However, I heard somebody else did something similar and ran into the issue that sklearn pipelines aren't guaranteed to be DAGs. I think you could still restrict it to DAG-like pipelines, potentially? Or maybe Kedro should support cycles. :)
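For reference, a minimal sketch of that default recommendation, where the whole sklearn pipeline lives inside one Kedro node (dataset names like X_train and fitted_model are hypothetical catalog entries):
```
from kedro.pipeline import Pipeline, node
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline as SkPipeline
from sklearn.preprocessing import MinMaxScaler

def fit_sk_pipeline(X_train, y_train):
    # The sklearn pipeline is opaque to Kedro: one node in, one artifact out.
    pipe = SkPipeline([("scale", MinMaxScaler()), ("model", LogisticRegression(max_iter=1000))])
    return pipe.fit(X_train, y_train)

def create_pipeline(**kwargs) -> Pipeline:
    return Pipeline([node(fit_sk_pipeline, ["X_train", "y_train"], "fitted_model")])
```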
p
@Deepyaman Datta that sounds awesome! Well my guess is 90% of sklearn pipelines are DAGs anyway 🙂 Even with the initial version limited to DAG-like pipelines, this would still be an awesome integration.
d
@Ben Horsburgh in case you have any thoughts, since IIRC you were the one who tried something like this in the past/brought up the DAG thing
b
I think SKLearn pipelines and Kedro pipelines are very different things, and shouldn't really be equated with each other. I may want an sklearn pipeline that does:
Pipeline:
   Remove outliers
   Impute
   Normalize
Which could be the implementation of a single Kedro pipeline node. By defining these as an SKLearn pipeline I get access to the SKLearn ecosystem and can do things like hyperparameter optimization, which by definition is a non-DAG process cycling over the pipeline many times. If I were to define a Kedro pipeline with the above steps as different nodes, that is also OK. In this instance, though, I cannot co-tune the logical SKLearn pipeline steps. Instead, from an SKLearn perspective, it would look like:
Pipeline:
   Remove outliers
Pipeline:
   Impute
Pipeline:
   Normalize
What are the pros and cons of each?
• Single SKLearn pipeline
   ◦ + hyperparameter tuning over the entire pipeline
   ◦ + export best pipeline model
   ◦ - complex parameterization
   ◦ - complex search space
• Multiple SKLearn pipelines
   ◦ + simple to parameterize
   ◦ + simple search space
   ◦ + export best transformer
   ◦ - no holistic tuning
Which to choose depends very much on the problem you are trying to solve.
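To illustrate the single-pipeline option and its holistic tuning: a minimal sketch of jointly searching preprocessing and model hyperparameters (steps and grid are illustrative; sklearn has no built-in outlier-removal transformer, so an imputer stands in):
```
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

X, y = load_iris(return_X_y=True)

pipe = Pipeline(
    [
        ("impute", SimpleImputer()),
        ("normalize", MinMaxScaler()),
        ("model", RandomForestClassifier(random_state=0)),
    ]
)

# Step names become parameter prefixes, so preprocessing and model
# hyperparameters are tuned jointly over the whole pipeline.
search = GridSearchCV(
    pipe,
    param_grid={
        "impute__strategy": ["mean", "median"],
        "model__n_estimators": [50, 100],
    },
    cv=3,
)
search.fit(X, y)
print(search.best_params_)
```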
For the original question - as mentioned by Lorenzo, you can simply output the artifact and later input it again. Pickle works well for a simple approach. Just make sure that the transformer you are using will deal with previously unseen / out-of-bounds data in a graceful way! For example, if your test set contains a value greater than the max in your training set, do you clip it or scale it? (See the sketch below.)
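On that last point, a small sketch: by default MinMaxScaler scales unseen values past the training range, while clip=True (scikit-learn >= 0.24) clips them into the fitted feature range instead:
```
import numpy as np
from sklearn.preprocessing import MinMaxScaler

X_train = np.array([[0.0], [10.0]])
X_test = np.array([[15.0]])  # greater than the training max

# Default behaviour: the unseen value scales beyond 1.0
print(MinMaxScaler().fit(X_train).transform(X_test))           # [[1.5]]

# clip=True: out-of-range values are clipped to the [0, 1] range
print(MinMaxScaler(clip=True).fit(X_train).transform(X_test))  # [[1.0]]
```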
p
Thanks Ben! We're currently testing all three ways:
1. Kedro only + artifacts (via pickle)
2. Kedro + sklearn, each sklearn step is wrapped in a Kedro step
3. Kedro + sklearn, the whole pipeline in a single node
So far, the last option seems best for most cases. I really like the pros and cons you listed, however I believe that all the listed benefits of the multiple-sklearn-pipeline option are still easily achievable with the single-sklearn-pipeline option. On the other hand, I would add the inability to visualize the pipeline in kedro-viz as a huge drawback of the last solution. It would look like this:
1. Kedro + artifacts
   ◦ + No dependency on sklearn
   ◦ + Visible in kedro-viz
   ◦ --- Need to reproduce all sklearn features, depending on the case (optimizations, handling out-of-scope values, pickling, etc.)
2. Multiple sklearn pipelines
   ◦ + Visible in kedro-viz
   ◦ --- No holistic tuning
3. Single sklearn pipeline
   ◦ + Most intuitive for sklearn users
   ◦ + Easy to migrate old projects
   ◦ + Single pickle for all sklearn steps
   ◦ +++ Leverages all sklearn features (holistic tuning, transformers, etc.)
   ◦ - Can't mix sklearn steps with Kedro steps (but would you even need that?)
   ◦ -- Can't see sklearn steps in kedro-viz
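For completeness, a rough sketch of option 2, where each sklearn step becomes its own Kedro node so kedro-viz shows every stage (function and dataset names are made up for illustration):
```
from kedro.pipeline import Pipeline, node
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import MinMaxScaler

def impute(X_train):
    # Fit and apply one sklearn step; also return the fitted transformer
    # so it can be persisted as its own catalog entry.
    imputer = SimpleImputer().fit(X_train)
    return imputer.transform(X_train), imputer

def normalize(X_imputed):
    scaler = MinMaxScaler().fit(X_imputed)
    return scaler.transform(X_imputed), scaler

def create_pipeline(**kwargs) -> Pipeline:
    # Each sklearn step is a separate node, visible in kedro-viz,
    # at the cost of losing holistic tuning across steps.
    return Pipeline(
        [
            node(impute, "X_train", ["X_imputed", "imputer"]),
            node(normalize, "X_imputed", ["X_normalized", "scaler"]),
        ]
    )
```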
y
Hi @Michał Stachowicz, sorry for being late to the party, but kedro-mlflow handles this use case very well. In fact, this is the reason why the plugin was originally built, before experiment tracking. You can look at the ``kedro mlflow modelify`` command, which automatically converts a Kedro pipeline into an mlflow model (and handles the pickling of required artifacts automatically). If you want this to be done automatically each time you run your training pipeline, you can look at the ``pipeline_ml_factory`` function. The main advantage of going this way is that you automatically benefit from all mlflow serving capabilities (as a batch, as an API, interactively, as a kedro dataset...). There is also a step-by-step tutorial which explains exactly what you are trying to achieve. If you need more references, you can read this thread.
👍 1
🥳 1
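A rough sketch of the ``pipeline_ml_factory`` approach, based on the kedro-mlflow docs (the pipeline contents, dataset names, and ``input_name`` value are illustrative):
```
from kedro.pipeline import Pipeline, node
from kedro_mlflow.pipeline import pipeline_ml_factory
from sklearn.preprocessing import MinMaxScaler

def fit_scaler(training_data):
    return MinMaxScaler().fit(training_data)

def apply_scaler(scaler, instances):
    return scaler.transform(instances)

training = Pipeline([node(fit_scaler, "training_data", "scaler")])
inference = Pipeline([node(apply_scaler, ["scaler", "instances"], "predictions")])

# In pipeline_registry.py: each run of "training" also logs an mlflow model;
# the fitted scaler is pickled automatically as an artifact of that model,
# and "instances" is the free input the served model expects at inference.
def register_pipelines():
    return {
        "training": pipeline_ml_factory(
            training=training,
            inference=inference,
            input_name="instances",
        ),
    }
```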
s
I'm new to Kedro and I was looking for exactly this explanation of how to package the preprocessing steps and modeling into one artifact, as sklearn pipelines do. Thanks @Yolan Honoré-Rougé, you saved my life. I'll read more about kedro-mlflow to do this.
👍 1