f
Hello! I have a pipeline "training" that fits a model (a Splink entity resolution model) and another one, "inference", that takes the trained model and applies it to an inference dataset. In both pipelines, I want to use a node (or modular pipeline) "preprocess" to preprocess the data before feeding it to the model (either for training or for inference). I obviously don't want to copy-paste the same "preprocess" function into both training/nodes.py and inference/nodes.py, so I was wondering about the best practices around this. The following would probably work:
```python
# training/nodes.py

def preprocess(data):
    # Preprocessing logic here
    return data
```
```python
# training/pipeline.py
from kedro.pipeline import Pipeline, node, pipeline

from .nodes import preprocess


def create_pipeline(**kwargs) -> Pipeline:
    return pipeline(
        [
            node(
                func=preprocess,
                inputs='data_training',
                outputs='preprocessed_data_training',
                name='preprocess_training',
            )
        ]
    )
```
```python
# inference/pipeline.py
from kedro.pipeline import Pipeline, node, pipeline

from ..training.nodes import preprocess  # import from training/nodes.py


def create_pipeline(**kwargs) -> Pipeline:
    return pipeline(
        [
            node(
                func=preprocess,
                inputs='data_inference',
                outputs='preprocessed_data_inference',
                name='preprocess_inference',
            )
        ]
    )
```
However, I feel like this is not elegant and probably not optimal. Is there a better way of doing this? Maybe a "meta" nodes.py that can be used by all pipelines? Maybe rearranging the whole pipeline? Thanks!
h
Someone will reply to you shortly. In the meantime, this might help:
j
a typical thing people do is
```python
from ...utils.preprocessing import preprocess
```
even if having a `utils` package/module is not very elegant, the point is still to not tie it to any pipeline. it could be `src/utils`, or `src/preprocessing`, or anything else that makes sense. I'd save `src/pipelines` for modular pipelines. does it make sense?
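as a minimal sketch of that layout (the `my_project` package name is just a placeholder):
```python
# src/my_project/utils/preprocessing.py

def preprocess(data):
    # shared preprocessing logic, not tied to any one pipeline
    return data
```
and then both training/pipeline.py and inference/pipeline.py import it, e.g. `from my_project.utils.preprocessing import preprocess` (or the relative form above, depending on where the pipeline files sit)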
f
Thanks! Makes sense, yes, but I'm not sure I get what you mean about the modular pipelines. Suppose I want to reuse not a single function but a whole "preprocessing" modular pipeline containing many nodes in both the "training" and "inference" pipelines. If I understand correctly, you would recommend making another pipeline with `kedro pipeline create utils`, in which I would define a "preprocessing" modular pipeline:
```python
# utils/pipeline.py
from kedro.pipeline import Pipeline, node, pipeline

from .nodes import preprocess_func_1, preprocess_func_2


def preprocess_template() -> Pipeline:
    return pipeline(
        pipe=[
            node(
                func=preprocess_func_1,
                inputs='raw_data',
                outputs='preprocessed_data_1',
                name='preprocess_1',
            ),
            node(
                func=preprocess_func_2,
                inputs='preprocessed_data_1',
                outputs='preprocessed_data_2',
                name='preprocess_2',
            ),
        ]
    )
```
that I would then import and use in both training/pipeline.py and inference/pipeline.py?
```python
from ..utils.pipeline import preprocess_template
```
j
oh, what I said applies to reusing functions. about reusing pipelines, I'd definitely not call the pipeline `utils`, but something more meaningful. in addition, you can parametrize the `create_pipeline` function to your needs, so that you can instantiate such a pipeline with, say, different inputs
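as a rough sketch of what that could look like (the `preprocessing` package and the dataset names are placeholders borrowed from your snippets):
```python
# preprocessing/pipeline.py
from kedro.pipeline import Pipeline, node, pipeline

from .nodes import preprocess_func_1, preprocess_func_2


def create_pipeline(**kwargs) -> Pipeline:
    # the "template": free inputs/outputs that callers remap
    return pipeline(
        [
            node(preprocess_func_1, inputs='raw_data', outputs='preprocessed_data_1', name='preprocess_1'),
            node(preprocess_func_2, inputs='preprocessed_data_1', outputs='preprocessed_data_2', name='preprocess_2'),
        ]
    )
```
```python
# training/pipeline.py
from kedro.pipeline import Pipeline, pipeline

from ..preprocessing.pipeline import create_pipeline as create_preprocessing


def create_pipeline(**kwargs) -> Pipeline:
    # remap the template's endpoints onto the training datasets;
    # the namespace prefixes everything else (intermediate datasets,
    # node names) with "training." so it can't clash with inference
    return pipeline(
        create_preprocessing(),
        inputs={'raw_data': 'data_training'},
        outputs={'preprocessed_data_2': 'preprocessed_data_training'},
        namespace='training',
    )
```
the inference pipeline does the same with `inputs={'raw_data': 'data_inference'}` and `namespace='inference'`, so the two instances of the template coexist in one project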
f
Ok thanks! I'll try that