# questions
m
In my project I would like to have a pipeline using a pre-trained model that is on MLflow. The problem I am facing is that I don't know how to pass along the transformation parameters computed during preprocessing. For example, for normalisation we need the minimum and maximum values to be able to transform the test set in the same way. Is there a solution to this issue? Do you recommend using sklearn Pipelines?
l
In my use case sklearn Pipelines were enough. I export the pipeline as a compressed joblib file/pickle, along with any info I might need regarding the model. Maybe not the best way, but it gets the job done in our case. Maybe you can get better feedback from more experienced users! 🙂
👍 1
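For illustration, a minimal sketch of this joblib approach, assuming a scikit-learn pipeline with a MinMaxScaler (the dataset, scaler choice, and file names are illustrative, not from the thread):
```
import joblib
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# The fitted MinMaxScaler keeps the training min/max inside the pipeline,
# so the test set is transformed with exactly the same parameters.
pipe = Pipeline([("scale", MinMaxScaler()), ("model", LogisticRegression(max_iter=1000))])
pipe.fit(X_train, y_train)

# Persist the fitted pipeline (compressed) alongside any extra metadata.
joblib.dump({"pipeline": pipe, "meta": {"data_version": "v1"}}, "model.joblib.gz", compress=3)

# At inference time, reload and predict; no re-fitting needed.
bundle = joblib.load("model.joblib.gz")
print(bundle["pipeline"].score(X_test, y_test))
```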
m
You should store all artifacts crucial to inference (like the min/max values you have) as MLflow artifacts, so you can retrieve them later on
👍 2
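A minimal sketch of that idea (data, run layout, and artifact paths are illustrative): the fitted scaler, or just its min/max values, gets logged to the run and reloaded at inference time.
```
import mlflow
import mlflow.sklearn
import numpy as np
from sklearn.preprocessing import MinMaxScaler

X_train = np.array([[0.0, 10.0], [5.0, 20.0], [10.0, 30.0]])
scaler = MinMaxScaler().fit(X_train)

with mlflow.start_run() as run:
    # Log the fitted scaler itself as a model artifact...
    mlflow.sklearn.log_model(scaler, "scaler")
    # ...and/or the raw normalisation parameters as a JSON artifact.
    mlflow.log_dict(
        {"data_min": scaler.data_min_.tolist(), "data_max": scaler.data_max_.tolist()},
        "preprocessing/minmax.json",
    )

# Later, reload the exact same transformation by run id to process the test set.
scaler = mlflow.sklearn.load_model(f"runs:/{run.info.run_id}/scaler")
```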
m
Thanks for your answers 🙂
p
What if you want to combine sklearn pipelines and Kedro pipelines, is there any recommended way to do so?
d
There's no recommended way to do so, to the best of my knowledge; the default recommendation would be to use a sklearn pipeline within a node, but that's obviously not ideal. I did think quite a bit about implementing a subclass of Kedro pipeline that you could construct from a sklearn pipeline, with execution delegated to sklearn for that piece (but visible in Kedro-Viz as separate steps, etc.). I never did get around to trying it. However, I heard somebody else did something similar and ran into the issue that sklearn pipelines aren't guaranteed to be DAGs. I think you could still restrict it to DAG-like pipelines, potentially? Or maybe Kedro should support cycles. :)
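For reference, a minimal sketch of that default recommendation, where the whole sklearn pipeline lives inside one Kedro node (dataset names like X_train and fitted_model are hypothetical catalog entries):
```
from kedro.pipeline import Pipeline, node
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline as SkPipeline
from sklearn.preprocessing import MinMaxScaler

def fit_sk_pipeline(X_train, y_train):
    # The sklearn pipeline is opaque to Kedro: one node in, one artifact out.
    pipe = SkPipeline([("scale", MinMaxScaler()), ("model", LogisticRegression(max_iter=1000))])
    return pipe.fit(X_train, y_train)

def create_pipeline(**kwargs) -> Pipeline:
    return Pipeline([node(fit_sk_pipeline, ["X_train", "y_train"], "fitted_model")])
```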
p
@Deepyaman Datta that sounds awesome! Well my guess is 90% of sklearn pipelines are DAGs anyway 🙂 Even with the initial version limited to DAG-like pipelines, this would still be an awesome integration.
d
@Ben Horsburgh in case you have any thoughts, since IIRC you were the one who tried something like this in the past/brought up the DAG thing
b
I think SKLearn pipelines and Kedro pipelines are very different things, and shouldn't really be equated with each other. I may want an sklearn pipeline that does:
Pipeline:
   Remove outliers
   Impute
   Normalize
Which could be the implementation of a single Kedro pipeline node. By defining these as an SKLearn pipeline I get access to the SKLearn ecosystem and can do things like hyperparameter optimization, which by definition is a non-DAG process cycling over the pipeline many times. If I were to define a Kedro pipeline with the above steps as different nodes, that is also OK. In this instance, though, I cannot co-tune the logical SKLearn pipeline steps. Instead, from an SKLearn perspective, it would look like:
Pipeline:
   Remove outliers
Pipeline:
   Impute
Pipeline:
   Normalize
What are the pros and cons of each?
• Single SKLearn pipeline
   ◦ + hyperparameter tuning over the entire pipeline
   ◦ + export best pipeline model
   ◦ - complex parameterization
   ◦ - complex search space
• Multiple SKLearn pipelines
   ◦ + simple to parameterize
   ◦ + simple search space
   ◦ + export best transformer
   ◦ - no holistic tuning
Which to choose depends very much on the problem you are trying to solve.
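To illustrate the single-pipeline option and its holistic tuning: a minimal sketch of jointly searching preprocessing and model hyperparameters (steps and grid are illustrative; sklearn has no built-in outlier-removal transformer, so an imputer stands in):
```
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

X, y = load_iris(return_X_y=True)

pipe = Pipeline(
    [
        ("impute", SimpleImputer()),
        ("normalize", MinMaxScaler()),
        ("model", RandomForestClassifier(random_state=0)),
    ]
)

# Step names become parameter prefixes, so preprocessing and model
# hyperparameters are tuned jointly over the whole pipeline.
search = GridSearchCV(
    pipe,
    param_grid={
        "impute__strategy": ["mean", "median"],
        "model__n_estimators": [50, 100],
    },
    cv=3,
)
search.fit(X, y)
print(search.best_params_)
```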
For the original question - as mentioned by Lorenzo, you can simply output the artifact and later input it again. Pickle works well for a simple approach. Just make sure that the transformer you are using will deal with previously unseen / out-of-bounds data in a graceful way! For example, if your test set contains a value greater than the max in your training set, do you clip it or scale it? (See the sketch below.)
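On that last point, a small sketch: by default MinMaxScaler scales unseen values past the training range, while clip=True (scikit-learn >= 0.24) clips them into the fitted feature range instead:
```
import numpy as np
from sklearn.preprocessing import MinMaxScaler

X_train = np.array([[0.0], [10.0]])
X_test = np.array([[15.0]])  # greater than the training max

# Default behaviour: the unseen value scales beyond 1.0
print(MinMaxScaler().fit(X_train).transform(X_test))           # [[1.5]]

# clip=True: out-of-range values are clipped to the [0, 1] range
print(MinMaxScaler(clip=True).fit(X_train).transform(X_test))  # [[1.0]]
```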
p
Thanks Ben! We're currently testing all three ways:
1. Kedro only + artifacts (via pickle)
2. Kedro + sklearn, each sklearn step is wrapped in a Kedro step
3. Kedro + sklearn, the whole pipeline in a single node
So far, the last option seems best for most cases. I really like the pros and cons you listed, however I believe that all the listed benefits of the multiple-sklearn-pipeline option are still easily achievable with the single-sklearn-pipeline option. On the other hand, I would add the inability to visualize the pipeline in kedro-viz as a huge drawback of the last solution. It would look like this:
1. Kedro + artifacts
   ◦ + No dependency on sklearn
   ◦ + Visible in kedro-viz
   ◦ --- Need to reproduce all sklearn features, depending on the case (optimizations, handling out-of-scope values, pickling, etc.)
2. Multiple sklearn pipelines
   ◦ + Visible in kedro-viz
   ◦ --- No holistic tuning
3. Single sklearn pipeline
   ◦ + Most intuitive for sklearn users
   ◦ + Easy to migrate old projects
   ◦ + Single pickle for all sklearn steps
   ◦ +++ Leverages all sklearn features (holistic tuning, transformers, etc.)
   ◦ - Can't mix sklearn steps with Kedro steps (but would you even need that?)
   ◦ -- Can't see sklearn steps in kedro-viz
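For completeness, a rough sketch of option 2, where each sklearn step becomes its own Kedro node so kedro-viz shows every stage (function and dataset names are made up for illustration):
```
from kedro.pipeline import Pipeline, node
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import MinMaxScaler

def impute(X_train):
    # Fit and apply one sklearn step; also return the fitted transformer
    # so it can be persisted as its own catalog entry.
    imputer = SimpleImputer().fit(X_train)
    return imputer.transform(X_train), imputer

def normalize(X_imputed):
    scaler = MinMaxScaler().fit(X_imputed)
    return scaler.transform(X_imputed), scaler

def create_pipeline(**kwargs) -> Pipeline:
    # Each sklearn step is a separate node, visible in kedro-viz,
    # at the cost of losing holistic tuning across steps.
    return Pipeline(
        [
            node(impute, "X_train", ["X_imputed", "imputer"]),
            node(normalize, "X_imputed", ["X_normalized", "scaler"]),
        ]
    )
```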
y
Hi @Michał Stachowicz, sorry for being late to the party, but kedro-mlflow handles this use case very well. In fact, this is the reason why the plugin was originally built, before experiment tracking. You can look at the ``kedro mlflow modelify`` command, which automatically converts a Kedro pipeline into an mlflow model (and handles the pickling of required artifacts automatically). If you want this to be done automatically each time you run your training pipeline, you can look at the ``pipeline_ml_factory`` function. The main advantage of going this way is that you automatically benefit from all mlflow serving capabilities (as a batch, as an API, interactively, as a kedro dataset...). There is also a step-by-step tutorial which explains exactly what you are trying to achieve. If you need more references, you can read this thread.
👍 1
🥳 1
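A rough sketch of the ``pipeline_ml_factory`` approach, based on the kedro-mlflow docs (the pipeline contents, dataset names, and ``input_name`` value are illustrative):
```
from kedro.pipeline import Pipeline, node
from kedro_mlflow.pipeline import pipeline_ml_factory
from sklearn.preprocessing import MinMaxScaler

def fit_scaler(training_data):
    return MinMaxScaler().fit(training_data)

def apply_scaler(scaler, instances):
    return scaler.transform(instances)

training = Pipeline([node(fit_scaler, "training_data", "scaler")])
inference = Pipeline([node(apply_scaler, ["scaler", "instances"], "predictions")])

# In pipeline_registry.py: each run of "training" also logs an mlflow model;
# the fitted scaler is pickled automatically as an artifact of that model,
# and "instances" is the free input the served model expects at inference.
def register_pipelines():
    return {
        "training": pipeline_ml_factory(
            training=training,
            inference=inference,
            input_name="instances",
        ),
    }
```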
s
I'm new to Kedro and I was looking for exactly this explanation of how to package the preprocessing steps and modeling into one artifact, as sklearn pipelines do. Thanks @Yolan Honoré-Rougé, you saved my life. I'll read more about kedro-mlflow to do this.
👍 1