f
Hello! I have a pipeline "training" that fits a model (a Splink entity resolution model) and another one, "inference", that takes the trained model and applies it to an inference dataset. In both pipelines, I want to use a node (or modular pipeline) "preprocess" to preprocess the data before feeding it to the model (either for training or for inference). I obviously don't want to copy-paste the same "preprocess" function into both training/nodes.py and inference/nodes.py, so I was wondering about the best practices around this. The following would probably work:
```python
# training/nodes.py

def preprocess(data):
    # Preprocessing logic here
    return data
```
```python
# training/pipeline.py
from kedro.pipeline import Pipeline, node, pipeline

from .nodes import preprocess


def create_pipeline(**kwargs) -> Pipeline:
    return pipeline(
        [
            node(
                func=preprocess,
                inputs='data_training',
                outputs='preprocessed_data_training',
                name='preprocess_training',
            )
        ]
    )
```
```python
# inference/pipeline.py
from kedro.pipeline import Pipeline, node, pipeline

from ..training.nodes import preprocess  # import from training/nodes.py


def create_pipeline(**kwargs) -> Pipeline:
    return pipeline(
        [
            node(
                func=preprocess,
                inputs='data_inference',
                outputs='preprocessed_data_inference',
                name='preprocess_inference',
            )
        ]
    )
```
However, I feel like this is not elegant and probably not optimal. Is there a better way of doing this? Maybe a "meta" nodes.py that can be used by all pipelines? Maybe rearranging the whole pipeline? Thanks!
h
Someone will reply to you shortly. In the meantime, this might help:
j
a typical thing people do is
```python
from ...utils.preprocessing import preprocess
```
even if having a `utils` package/module is not very elegant, the point is still to not tie it to any pipeline. it could be `src/utils`, or `src/preprocessing`, or anything else that makes sense. I'd save `src/pipelines` for modular pipelines. does it make sense?
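as a minimal sketch of that layout (the `my_project` package name is just a placeholder):
```python
# src/my_project/utils/preprocessing.py

def preprocess(data):
    # shared preprocessing logic, not tied to any one pipeline
    return data
```
and then both training/pipeline.py and inference/pipeline.py import it, e.g. `from my_project.utils.preprocessing import preprocess` (or the relative form above, depending on where the pipeline files sit)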
f
Thanks! Makes sense, yes, but I'm not sure I get what you mean about the modular pipelines. Suppose I want to reuse not a single function but a whole "preprocessing" modular pipeline containing many nodes in both the "training" and "inference" pipelines. If I understand correctly, you would recommend making another pipeline with `kedro pipeline create utils`, in which I would define a "preprocessing" modular pipeline:
```python
# utils/pipeline.py
from kedro.pipeline import Pipeline, node, pipeline

from .nodes import preprocess_func_1, preprocess_func_2


def preprocess_template() -> Pipeline:
    return pipeline(
        pipe=[
            node(
                func=preprocess_func_1,
                inputs='raw_data',
                outputs='preprocessed_data_1',
                name='preprocess_1',
            ),
            node(
                func=preprocess_func_2,
                inputs='preprocessed_data_1',
                outputs='preprocessed_data_2',
                name='preprocess_2',
            ),
        ]
    )
```
that I would then import and use in both training/pipeline.py and inference/pipeline.py?
```python
from ..utils.pipeline import preprocess_template
```
j
oh, what I said applies to reusing functions. about reusing pipelines, I'd definitely not call the pipeline `utils`, but something more meaningful. in addition, you can parametrize the `create_pipeline` function to your needs, so that you can instantiate such a pipeline with, say, different inputs
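as a rough sketch of what that could look like (the `preprocessing` package and the dataset names are placeholders borrowed from your snippets):
```python
# preprocessing/pipeline.py
from kedro.pipeline import Pipeline, node, pipeline

from .nodes import preprocess_func_1, preprocess_func_2


def create_pipeline(**kwargs) -> Pipeline:
    # the "template": free inputs/outputs that callers remap
    return pipeline(
        [
            node(preprocess_func_1, inputs='raw_data', outputs='preprocessed_data_1', name='preprocess_1'),
            node(preprocess_func_2, inputs='preprocessed_data_1', outputs='preprocessed_data_2', name='preprocess_2'),
        ]
    )
```
```python
# training/pipeline.py
from kedro.pipeline import Pipeline, pipeline

from ..preprocessing.pipeline import create_pipeline as create_preprocessing


def create_pipeline(**kwargs) -> Pipeline:
    # remap the template's endpoints onto the training datasets;
    # the namespace prefixes everything else (intermediate datasets,
    # node names) with "training." so it can't clash with inference
    return pipeline(
        create_preprocessing(),
        inputs={'raw_data': 'data_training'},
        outputs={'preprocessed_data_2': 'preprocessed_data_training'},
        namespace='training',
    )
```
the inference pipeline does the same with `inputs={'raw_data': 'data_inference'}` and `namespace='inference'`, so the two instances of the template coexist in one project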
f
Ok thanks! I'll try that