# questions
s
Hi, continuing with my experimentation with namespaces and inheriting/extending pipelines, I have a situation. My current workflow is as follows: I have namespaces implemented for each of the demo (train and evaluate an LR model) and extended (train and evaluate an RF model) pipelines. (Continued in the replies to this message... )
A collapsed view of these namespaces looks like this:
In this setup, the demo_modelling_pipeline receives presence_motion_co2_combined_cleaned_neg_removed, applies split_data_node, and then trains and evaluates a LinearRegression model. After this, in the extended_modelling_pipeline, the split dataframes (X_train, X_test, y_train, y_test) are passed as inputs, and two nodes train and evaluate a RandomForestRegressor.
I want to organise this better so that there are two separate paths after splitting the DataFrame: one for LR and another for RF. For this, I'm applying the following logic. In the data_science/pipeline.py file, I have the base_data_science pipeline structure defined as:
# assuming the usual Kedro imports, and the node functions (show_data, split_data, ...)
# coming from this pipeline's nodes.py
from kedro.pipeline import Node, Pipeline

base_data_science = Pipeline(
    [
        Node(
            func=show_data,
            inputs=["presence_motion_co2_combined_cleaned_neg_removed", "parameters"],
            outputs=None,
            name="show_data_node",
        ),
        Node(
            func=split_data,
            inputs=["presence_motion_co2_combined_cleaned_neg_removed", "params:split_data"],
            outputs=["X_train", "X_test", "y_train", "y_test"],
            name="split_data_node",
        ),
    ]
)
After that, I create the demo_modelling_pipeline to execute just the LR model training and evaluation, combining the previous two nodes (show and split) with the LR train and eval nodes:
def create_pipeline(**kwargs) -> Pipeline:
    return Pipeline(
        [
        base_data_science,
        Node(
            func=train_model_LR,
            inputs=["X_train", "y_train", "params:train_data"],
            outputs="model_LR",
            name="train_model_LR_node"
        ),
        Node(
            func=evaluate_model,
            inputs=["model_LR", "X_test", "y_test", "params:eval_model"],
            outputs="metrics_LR",
            name="evaluate_model_LR_node"
        )],
        namespace="demo_modelling_pipeline",
        prefix_datasets_with_namespace=False,
        parameters={"params:train_data": "params:demo_train_data", "params:eval_model": "params:demo_eval_model"},
        inputs={"presence_motion_co2_combined_cleaned_neg_removed"}
        # inputs={"X_train", "y_train", "X_test", "y_test"}
    )
Additionally, in the data_science_ext/pipeline.py file, I have another pipeline to execute the RF model training and evaluation after show and split:
def create_pipeline(**kwargs) -> Pipeline:
    return Pipeline(
        [
        base_data_science,
        Node(
            func=train_model_RF,
            inputs=["X_train", "y_train", "params:train_data"],
            outputs="model_RF",
            name="train_model_RF_node"
        ),
        Node(
            func=evaluate_model,
            inputs=["model_RF", "X_test", "y_test", "params:eval_model"],
            outputs="metrics_RF",
            name="evaluate_model_RF_node"
        )],
        namespace="extended_modelling_pipeline",
        prefix_datasets_with_namespace=False,
        parameters={"params:train_data": "params:ext_train_data", "params:eval_model": "params:ext_eval_model"},
        inputs={"presence_motion_co2_combined_cleaned_neg_removed"}
        # inputs={"X_train", "y_train", "X_test", "y_test"}
    )
Thus, both pipelines have a similar structure: • show data, • split data, • train model (LR/RF), • eval model (LR/RF). Both of them start from the same presence_motion_co2_combined_cleaned_neg_removed table, and the train-test split sets (X, y) are created internally. However, I get the following error when running kedro viz or kedro registry list:
OutputNotUniqueError: Output(s) ['X_test', 'X_train', 'y_test', 'y_train'] are returned by more than one nodes. Node outputs must be unique.
I even tried creating separate unique nodes for these four outputs, each with the corresponding namespace prefix, yet the error persists.
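A minimal reproduction of the clash, as far as I can tell (dataset and node names here are illustrative, not my real ones), looks like this:

from kedro.pipeline import Node, Pipeline  # Node may live in kedro.pipeline.node in older Kedro versions

def split_data(df):
    # stand-in for the real split function
    return df, df, df, df

# Two pipelines, each containing its own copy of a node that writes to the same
# un-prefixed dataset names:
p_demo = Pipeline([Node(split_data, "raw_table", ["X_train", "X_test", "y_train", "y_test"], name="split_demo")])
p_ext = Pipeline([Node(split_data, "raw_table", ["X_train", "X_test", "y_train", "y_test"], name="split_ext")])

# Summing them (which is what the default pipeline registry does to build "__default__")
# raises OutputNotUniqueError, because the X_* and y_* datasets are produced twice.
combined = p_demo + p_ext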
My goal is to have two nodes branching out after split_data, like this:
However, the only time I managed to achieve it, the X_train, y_train, X_test, y_test datasets were not shown as outputs of the previous table or of split_data_node, but were somehow hanging in the air, which is strange, since they are created as outputs of split_data_node.
On the other hand, if I switch the inputs as follows in both pipelines:
# inputs={"presence_motion_co2_combined_cleaned_neg_removed"}
inputs={"X_train", "y_train", "X_test", "y_test"}
the error is:
PipelineError: Inputs must not be outputs from another node in the same pipeline
which is equally strange, since normally the inputs of a node are the outputs of another node in the same pipeline, as far as I understand.
Update: allowing the namespace prefixes (the default behaviour) has done the trick. Now data_science/pipeline.py contains:
namespace="demo_ds",
        # prefix_datasets_with_namespace=False,
        parameters={"params:split_data": "params:demo_split_data", "params:train_data": "params:demo_train_data", "params:eval_model": "params:demo_eval_model"},
        inputs={"presence_motion_co2_combined_cleaned_neg_removed"}
        # inputs={"X_train", "y_train", "X_test", "y_test"}
and data_science_ext/pipeline.py contains:
namespace="ext_ds",
        # prefix_datasets_with_namespace=False,
        parameters={"params:split_data": "params:ext_split_data", "params:train_data": "params:ext_train_data", "params:eval_model": "params:ext_eval_model"},
        inputs={"presence_motion_co2_combined_cleaned_neg_removed"}
        # inputs={"X_train", "y_train", "X_test", "y_test"}
The overall pipeline is now displayed as:
r
great, I was going to suggest the same: the datasets needed to be namespaced 🙂
s
Thanks. I was avoiding it because, in the earlier attempt (shared at the beginning of the thread), keeping prefix_datasets_with_namespace=False was working well. So it seemed logical that the same would continue, since I was supplying a unique set of parameters; but this time the same inputs are shared by two different nodes, and perhaps that was the pain point. Suggestion: the above two error messages were confusing and somewhat misleading, in particular the second one, because normally the output of one node is the input to the next node, so that error seemed inappropriate. I may be wrong in my understanding, though.
a
Hey Shah, thanks for the detailed explanation of your issue. The idea with prefix_datasets_with_namespace was to disable the prefixing in situations where you're using the namespaces as a "deployment unit". When re-using a base pipeline with different parameters/inputs, it is recommended to keep the datasets namespaced as well, to avoid ambiguity (or to provide an explicit mapping in inputs and outputs, roughly along the lines of the sketch below). I'll take a closer look soon; the error messages do seem a bit cryptic.
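A sketch of that explicit-mapping alternative for the LR pipeline, reusing the node definitions from above (the demo_* dataset names are just illustrative, and the extended pipeline would use ext_* names in the same way):

def create_pipeline(**kwargs) -> Pipeline:
    return Pipeline(
        [
            base_data_science,
            Node(func=train_model_LR, inputs=["X_train", "y_train", "params:train_data"],
                 outputs="model_LR", name="train_model_LR_node"),
            Node(func=evaluate_model, inputs=["model_LR", "X_test", "y_test", "params:eval_model"],
                 outputs="metrics_LR", name="evaluate_model_LR_node"),
        ],
        namespace="demo_modelling_pipeline",
        # keep the automatic prefixing off, but give the datasets that both pipelines
        # would otherwise share explicit per-pipeline names:
        prefix_datasets_with_namespace=False,
        inputs={"presence_motion_co2_combined_cleaned_neg_removed"},
        outputs={"X_train": "demo_X_train", "X_test": "demo_X_test",
                 "y_train": "demo_y_train", "y_test": "demo_y_test"},
        parameters={"params:train_data": "params:demo_train_data",
                    "params:eval_model": "params:demo_eval_model"},
    )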
s
Hi Ankita, right, thanks for that confirmation, and also for looking into it. Additionally, is it possible to manually move the nodes from their default placement in the visualisation, to make it look cleaner and more appealing, like in mind-mapping software?
Now, in my experimental journey towards eliminating redundant functions and reusing nodes, I am trying to 'rejoin' the two pipelines after the model training is over. In the above case, the evaluation method is common, so I thought of creating a flow where the separate pipelines (demo_ds and ext_ds), after training their individual models, come together for the evaluation step. I am not sure if that's even possible; if so, could you please elaborate on the best way to achieve it? I tried creating the evaluate_model_node in both pipelines, with exactly the same inputs and output. I even changed model_LR and model_RF to just model, so they refer to the same file. Yet, the best I could get was the following (two separate nodes for each pipeline, although with the same name):
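For reference, the shared node I added to both pipelines looks roughly like this (a sketch of the attempt; the output dataset name is illustrative):

# The evaluation node placed in both demo_ds and ext_ds, with identical inputs and
# output, after renaming model_LR / model_RF to just "model":
Node(
    func=evaluate_model,
    inputs=["model", "X_test", "y_test", "params:eval_model"],
    outputs="metrics",          # output name is illustrative
    name="evaluate_model_node",
)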