# questions
s
Hi, continuing with my experimentation with namespaces and inheriting/extending pipelines, I have a situation. My current workflow is as follows: I have namespaces implemented for each of the demo (train and evaluate an LR model) and extended (train and evaluate an RF model) pipelines. (Continued in the replies to this message... )
A collapsed view of these namespaces looks like this:
In this setup, the demo_modelling_pipeline receives presence_motion_co2_combined_cleaned_neg_removed, applies split_data_node, and then trains and evaluates a LinearRegression model. After this, in the extended_modelling_pipeline, the split dataframes (X_train, X_test, y_train, y_test) are passed as inputs, and two nodes train and evaluate a RandomForestRegressor.
I want to organise this better so that there are two separate paths after splitting the DataFrame: one for LR and another for RF. For this, I'm applying the following logic. In the data_science/pipeline.py file, I have the base_data_science pipeline structure defined as:
# assuming the usual Kedro imports, and the node functions (show_data, split_data, ...)
# coming from this pipeline's nodes.py
from kedro.pipeline import Node, Pipeline

base_data_science = Pipeline(
    [
        Node(
            func=show_data,
            inputs=["presence_motion_co2_combined_cleaned_neg_removed", "parameters"],
            outputs=None,
            name="show_data_node",
        ),
        Node(
            func=split_data,
            inputs=["presence_motion_co2_combined_cleaned_neg_removed", "params:split_data"],
            outputs=["X_train", "X_test", "y_train", "y_test"],
            name="split_data_node",
        ),
    ]
)
After that, I create the demo_modelling_pipeline to execute just the LR model training and evaluation, combining the previous two nodes (show and split) with the LR train and eval nodes:
def create_pipeline(**kwargs) -> Pipeline:
    return Pipeline(
        [
        base_data_science,
        Node(
            func=train_model_LR,
            inputs=["X_train", "y_train", "params:train_data"],
            outputs="model_LR",
            name="train_model_LR_node"
        ),
        Node(
            func=evaluate_model,
            inputs=["model_LR", "X_test", "y_test", "params:eval_model"],
            outputs="metrics_LR",
            name="evaluate_model_LR_node"
        )],
        namespace="demo_modelling_pipeline",
        prefix_datasets_with_namespace=False,
        parameters={"params:train_data": "params:demo_train_data", "params:eval_model": "params:demo_eval_model"},
        inputs={"presence_motion_co2_combined_cleaned_neg_removed"}
        # inputs={"X_train", "y_train", "X_test", "y_test"}
    )
Additionally, in the data_science_ext/pipeline.py file, I have another pipeline to execute the RF model training and evaluation after show and split:
def create_pipeline(**kwargs) -> Pipeline:
    return Pipeline(
        [
        base_data_science,
        Node(
            func=train_model_RF,
            inputs=["X_train", "y_train", "params:train_data"],
            outputs="model_RF",
            name="train_model_RF_node"
        ),
        Node(
            func=evaluate_model,
            inputs=["model_RF", "X_test", "y_test", "params:eval_model"],
            outputs="metrics_RF",
            name="evaluate_model_RF_node"
        )],
        namespace="extended_modelling_pipeline",
        prefix_datasets_with_namespace=False,
        parameters={"params:train_data": "params:ext_train_data", "params:eval_model": "params:ext_eval_model"},
        inputs={"presence_motion_co2_combined_cleaned_neg_removed"}
        # inputs={"X_train", "y_train", "X_test", "y_test"}
    )
Thus, both pipelines have a similar structure: • show data, • split data, • train model (LR/RF), • eval model (LR/RF). Both of them start from the same presence_motion_co2_combined_cleaned_neg_removed table, and the train-test split sets (X, y) are created internally. However, I get the following error when running kedro viz or kedro registry list:
OutputNotUniqueError: Output(s) ['X_test', 'X_train', 'y_test', 'y_train'] are returned by more than one nodes. Node outputs must be unique.
I even tried creating separate unique nodes for these four outputs, each with the corresponding namespace prefix, yet the error persists.
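A minimal reproduction of the clash, as far as I can tell (dataset and node names here are illustrative, not my real ones), looks like this:

from kedro.pipeline import Node, Pipeline  # Node may live in kedro.pipeline.node in older Kedro versions

def split_data(df):
    # stand-in for the real split function
    return df, df, df, df

# Two pipelines, each containing its own copy of a node that writes to the same
# un-prefixed dataset names:
p_demo = Pipeline([Node(split_data, "raw_table", ["X_train", "X_test", "y_train", "y_test"], name="split_demo")])
p_ext = Pipeline([Node(split_data, "raw_table", ["X_train", "X_test", "y_train", "y_test"], name="split_ext")])

# Summing them (which is what the default pipeline registry does to build "__default__")
# raises OutputNotUniqueError, because the X_* and y_* datasets are produced twice.
combined = p_demo + p_ext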
My goal is to have two nodes branching out after split_data, like this:
However, the only time I managed to achieve it, the X_train, y_train, X_test, y_test datasets were not shown as outputs of the previous table or of split_data_node, but were somehow hanging in the air, which is strange, since they are created as outputs of split_data_node.
On the other hand, if I switch the inputs as follows in both pipelines:
# inputs={"presence_motion_co2_combined_cleaned_neg_removed"}
inputs={"X_train", "y_train", "X_test", "y_test"}
the error is:
PipelineError: Inputs must not be outputs from another node in the same pipeline
which is equally strange, since normally the inputs of a node are the outputs of another node in the same pipeline, as far as I understand.
Update: allowing the namespace prefixes (the default behaviour) has done the trick. Now data_science/pipeline.py contains:
namespace="demo_ds",
        # prefix_datasets_with_namespace=False,
        parameters={"params:split_data": "params:demo_split_data", "params:train_data": "params:demo_train_data", "params:eval_model": "params:demo_eval_model"},
        inputs={"presence_motion_co2_combined_cleaned_neg_removed"}
        # inputs={"X_train", "y_train", "X_test", "y_test"}
and data_science_ext/pipeline.py contains:
namespace="ext_ds",
        # prefix_datasets_with_namespace=False,
        parameters={"params:split_data": "params:ext_split_data", "params:train_data": "params:ext_train_data", "params:eval_model": "params:ext_eval_model"},
        inputs={"presence_motion_co2_combined_cleaned_neg_removed"}
        # inputs={"X_train", "y_train", "X_test", "y_test"}
The overall pipeline is now displayed as:
r
great, I was going to suggest the same: the datasets needed to be namespaced 🙂
s
Thanks. I was avoiding it because, in the earlier attempt (shared at the beginning of the thread), keeping prefix_datasets_with_namespace=False was working well. So it seemed logical that the same would continue, since I was supplying a unique set of parameters; but this time the same inputs are shared by two different nodes, and perhaps that was the pain point. Suggestion: the above two error messages were confusing and somewhat misleading, in particular the second one, because normally the output of one node is the input to the next node, so that error seemed inappropriate. I may be wrong in my understanding, though.
a
Hey Shah, thanks for the detailed explanation of your issue. The idea with prefix_datasets_with_namespace was to disable the prefixing in situations where you're using the namespaces as a "deployment unit". When re-using a base pipeline with different parameters/inputs, it is recommended to keep the datasets namespaced as well, to avoid ambiguity (or to provide an explicit mapping in inputs and outputs, roughly along the lines of the sketch below). I'll take a closer look soon; the error messages do seem a bit cryptic.
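A sketch of that explicit-mapping alternative for the LR pipeline, reusing the node definitions from above (the demo_* dataset names are just illustrative, and the extended pipeline would use ext_* names in the same way):

def create_pipeline(**kwargs) -> Pipeline:
    return Pipeline(
        [
            base_data_science,
            Node(func=train_model_LR, inputs=["X_train", "y_train", "params:train_data"],
                 outputs="model_LR", name="train_model_LR_node"),
            Node(func=evaluate_model, inputs=["model_LR", "X_test", "y_test", "params:eval_model"],
                 outputs="metrics_LR", name="evaluate_model_LR_node"),
        ],
        namespace="demo_modelling_pipeline",
        # keep the automatic prefixing off, but give the datasets that both pipelines
        # would otherwise share explicit per-pipeline names:
        prefix_datasets_with_namespace=False,
        inputs={"presence_motion_co2_combined_cleaned_neg_removed"},
        outputs={"X_train": "demo_X_train", "X_test": "demo_X_test",
                 "y_train": "demo_y_train", "y_test": "demo_y_test"},
        parameters={"params:train_data": "params:demo_train_data",
                    "params:eval_model": "params:demo_eval_model"},
    )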
s
Hi Ankita, right, thanks for that confirmation, and also for looking into it. Additionally, is it possible to manually move the nodes from their default placement in the visualisation, to make it look cleaner and more appealing, like in mind-mapping software?
Now, in my experimental journey towards eliminating redundant functions and reusing nodes, I am trying to 'rejoin' the two pipelines after the model training is over. In the above case, the evaluation method is common, so I thought of creating a flow where the separate pipelines (demo_ds and ext_ds), after training their individual models, come together for the evaluation step. I am not sure if that's even possible; if so, could you please elaborate on the best way to achieve it? I tried creating the evaluate_model_node in both pipelines, with exactly the same inputs and output. I even changed model_LR and model_RF to just model, so they refer to the same file. Yet, the best I could get was the following (two separate nodes for each pipeline, although with the same name):
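For reference, the shared node I added to both pipelines looks roughly like this (a sketch of the attempt; the output dataset name is illustrative):

# The evaluation node placed in both demo_ds and ext_ds, with identical inputs and
# output, after renaming model_LR / model_RF to just "model":
Node(
    func=evaluate_model,
    inputs=["model", "X_test", "y_test", "params:eval_model"],
    outputs="metrics",          # output name is illustrative
    name="evaluate_model_node",
)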