# user-research
d
I am trying to understand how users do model preprocessing using Kedro.
• The usual process is to create a train/test (or train/test/val) split, perform some preprocessing steps on the train set, and train a model (maybe using cross-validation). In addition to model parameters, any learned preprocessing parameters must be recorded for use during inference. Does this sound right?
• Are preprocessing steps modelled as individual Kedro nodes (separate from model training)? Why or why not?
• In some cases, I've seen a separate abstraction for preprocessing (and training, e.g. an XGBoost model) as a reusable Kedro pipeline. What does the structure of this look like?
• Do people use scikit-learn pipelines in Kedro? If so, how?
Any insight would be much appreciated!
K 2
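For illustration, here is a minimal sketch of the flow described in the first bullet as three Kedro nodes, with the learned preprocessing parameters persisted as their own catalog entry. All dataset, parameter and function names are hypothetical, not taken from any particular project:

```python
from kedro.pipeline import Pipeline, node, pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler


def split_data(model_input, split_params):
    # Assumes a `split` block in parameters.yml with `target` and `test_size` keys.
    X = model_input.drop(columns=[split_params["target"]])
    y = model_input[split_params["target"]]
    return train_test_split(X, y, test_size=split_params["test_size"], random_state=42)


def fit_preprocessor(X_train):
    # The learned preprocessing parameters (means/scales) live in the fitted scaler,
    # which becomes its own catalog entry so inference can reuse it later.
    return StandardScaler().fit(X_train)


def train_model(preprocessor, X_train, y_train):
    return LogisticRegression(max_iter=1000).fit(preprocessor.transform(X_train), y_train)


def create_pipeline(**kwargs) -> Pipeline:
    return pipeline(
        [
            node(split_data, ["model_input", "params:split"],
                 ["X_train", "X_test", "y_train", "y_test"]),
            node(fit_preprocessor, "X_train", "preprocessor"),
            node(train_model, ["preprocessor", "X_train", "y_train"], "model"),
        ]
    )
```

Saving `preprocessor` and `model` as pickle datasets in the catalog keeps the fitted preprocessing alongside the model for inference.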
c
Hey! I'll answer from my case:
1. My current process is similar to the one described: first I do some data processing, then feature engineering, and lastly modelling, with a separate pipeline for each of these three stages. The last one handles train/test/val splitting, model creation, model validation for threshold extraction (I mainly do unsupervised anomaly detection), model explainability and experiment tracking.
2. For preprocessing and feature engineering, I normally aggregate all steps into one node and use it in the pipeline, although I sometimes think about splitting it into several nodes for better code reusability.
3. I am currently working on separating the model logic (writing wrappers for my models and putting them in a src/project_name/models folder) from the model train/test/val pipeline, so that pipeline is reusable for different models with a common interface.
4. I am planning to use sklearn pipelines in Kedro very soon, because for neural nets I use a preprocessing sklearn pipeline (for scaling/normalization) that I need to register to MLflow, so I can apply the same fitted pipeline to new data during inference (rough sketch below).
Any insights on this would be appreciated, since I am fairly new to Kedro!
💯 2
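Point 4 could look roughly like the sketch below, assuming plain mlflow and scikit-learn; the run handling and model URI are illustrative rather than a prescribed setup:

```python
import mlflow
import mlflow.sklearn
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


def fit_and_log_preprocessor(X_train):
    # Fit the scaling/normalization pipeline on training data only.
    preprocessor = Pipeline([("scaler", StandardScaler())]).fit(X_train)
    # Log the fitted pipeline so the exact same transform can be pulled at inference time.
    with mlflow.start_run():
        mlflow.sklearn.log_model(preprocessor, artifact_path="preprocessor")
    return preprocessor


def preprocess_for_inference(model_uri, X_new):
    # e.g. model_uri = "runs:/<run_id>/preprocessor" once the run id is known.
    preprocessor = mlflow.sklearn.load_model(model_uri)
    return preprocessor.transform(X_new)
```

The kedro-mlflow plugin can also take care of the run/artifact bookkeeping, but the plain-mlflow version keeps the idea visible.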
d
> 2. For preprocessing and feature eng, I normally aggregate all steps in a node and use it in the pipeline, although I sometimes think about splitting it in several nodes for better code reusability.

Would it be easy to have it as several nodes, or would that result in other issues/inefficiencies? On larger projects, I have traditionally had each feature (or group of feature variants, like `avg_transaction_amount_1_day`, `avg_transaction_amount_7_days`, `avg_transaction_amount_30_days`) be one node, but I think some other teams I'm aware of prefer to do more feature engineering within a node. (That said, I wasn't using sklearn pipelines for this back then; when another framework handles a set of transformations that need to go together, that could be a reason to keep them in a single node.) A sketch of the feature-per-node layout follows below.
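A rough sketch of that feature-per-node layout, reusing the aggregate names above; the source columns (`customer_id`, `timestamp`, `amount`) and the parameter keys are hypothetical:

```python
import pandas as pd
from kedro.pipeline import Pipeline, node, pipeline


def avg_transaction_amount(transactions: pd.DataFrame, days: int) -> pd.DataFrame:
    # Rolling mean of transaction amount per customer; assumes `timestamp` is a datetime column.
    suffix = "day" if days == 1 else "days"
    return (
        transactions.sort_values("timestamp")
        .set_index("timestamp")
        .groupby("customer_id")["amount"]
        .rolling(f"{days}D")
        .mean()
        .rename(f"avg_transaction_amount_{days}_{suffix}")
        .reset_index()
    )


def create_pipeline(**kwargs) -> Pipeline:
    # One node per feature variant; window sizes come from parameters.yml,
    # e.g. feature_windows: {one_day: 1, seven_days: 7, thirty_days: 30}.
    return pipeline(
        [
            node(avg_transaction_amount, ["transactions", "params:feature_windows.one_day"],
                 "avg_transaction_amount_1_day"),
            node(avg_transaction_amount, ["transactions", "params:feature_windows.seven_days"],
                 "avg_transaction_amount_7_days"),
            node(avg_transaction_amount, ["transactions", "params:feature_windows.thirty_days"],
                 "avg_transaction_amount_30_days"),
        ]
    )
```

Each feature then has its own catalog entry, which makes it easy to reuse or recompute one feature without touching the rest.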
c
I think it would be easy, but with respect to efficiency I'm not sure, I would have to try it out! One thing I'm also thinking is that tutorials like the spaceflights one kind of encourage newcomers to aggregate multiple processing steps into a single node (or at least that was the impression I had when I saw the preprocess_companies and preprocess_shuttles examples). Not sure if what I'm saying makes sense 😅
👀 1
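Purely as a contrast of granularities, here are the same two (invented) cleaning steps written once as a single aggregated node and once as separate nodes; none of the names below are taken from the spaceflights tutorial itself:

```python
import pandas as pd
from kedro.pipeline import Pipeline, node, pipeline


# Option A: one node that aggregates several preprocessing steps (tutorial-style).
def preprocess_customers(customers: pd.DataFrame) -> pd.DataFrame:
    customers = customers.copy()
    customers["is_active"] = customers["is_active"].astype(bool)
    customers["rating"] = customers["rating"].fillna(customers["rating"].median())
    return customers


# Option B: the same steps as separate, individually reusable nodes.
def cast_flags(customers: pd.DataFrame) -> pd.DataFrame:
    return customers.assign(is_active=customers["is_active"].astype(bool))


def impute_rating(customers: pd.DataFrame) -> pd.DataFrame:
    return customers.assign(rating=customers["rating"].fillna(customers["rating"].median()))


def create_pipeline(**kwargs) -> Pipeline:
    return pipeline(
        [
            node(cast_flags, "customers", "customers_typed"),
            node(impute_rating, "customers_typed", "preprocessed_customers"),
        ]
    )
```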
s
What I do is build some Kedro nodes for the pre-processing steps that don't need to infer anything from the data (e.g. calculating an age as of today) and are very basic. For other feature engineering steps like normalization, standardization and feature selection, I wrap them in a sklearn pipeline, using libraries like sklearn, imblearn, feature-engine, etc. The advantage is that you avoid data leakage in the cross-validation phase, you can optimize all the hyperparameters of your training pipeline if you want, and the output is a single .pkl or .joblib file with your trained pipeline that stores all the hyperparameters and fitted parameters. Anyway, the line between using a pure sklearn pipeline and a pure Kedro pipeline is not clear (rough sketch below).
👀 1
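A minimal sketch of that split, assuming scikit-learn only; the estimator choices and grid values are placeholders:

```python
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


def train_sklearn_pipeline(X_train, y_train):
    # Everything that learns from the data lives inside the sklearn pipeline,
    # so cross-validation refits scaler + selector per fold (no leakage).
    pipe = Pipeline(
        [
            ("scale", StandardScaler()),
            ("select", SelectKBest(score_func=f_classif)),
            ("clf", LogisticRegression(max_iter=1000)),
        ]
    )
    # Hyperparameters of any step can be tuned together over the whole pipeline.
    search = GridSearchCV(
        pipe,
        param_grid={"select__k": [10, 20, "all"], "clf__C": [0.1, 1.0, 10.0]},
        cv=5,
    )
    search.fit(X_train, y_train)
    # The returned estimator is the single object to persist (e.g. as a pickle
    # dataset in the Kedro catalog): one file carrying all fitted parameters.
    return search.best_estimator_
```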