Hi I was reviewing a code with a pattern where they define s Kedro #questions


Artur Dobrogowski

08/23/2024, 11:09 AM

Hi, I was reviewing some code with a pattern where they define sub-pipelines and use one pipeline's outputs to define the inputs to the next layer/pipeline. That's when I discovered that the outputs of internal nodes can't actually be specified as pipeline outputs. Is this an anti-pattern, or is it a bug? Picture in the thread.

Artur Dobrogowski

08/23/2024, 11:09 AM

Screenshot 2024-08-23 at 13.07.01.png

Artur Dobrogowski

08/23/2024, 11:09 AM

^ kedro new project with example pipeline modified to have namespace and outputs

Artur Dobrogowski

08/23/2024, 11:10 AM

So what's happening is that the outputs= argument strips the namespace as it should, but when you reference the pipeline's outputs, the dataset is not listed.

Artur Dobrogowski

08/23/2024, 11:10 AM

So one solution here could be to add a node that does nothing and just passes the dataset through as both input and output, but that's ugly and adds overhead to processing the pipeline. Another solution would be to not rely on the pipeline's outputs and define them statically.

Artur Dobrogowski

08/23/2024, 11:10 AM

I think that, in general, dynamically using all of one pipeline's outputs (pipeline A) as the inputs to another pipeline (pipeline B) is an anti-pattern. That's because if you edit pipeline A to do something extra and produce more outputs, you automatically break pipeline B, which now has an extra input that is never consumed (and that's an error in Kedro's validation).

Artur Dobrogowski

08/23/2024, 11:17 AM

Still, I can't help but think that something is wrong when I look at the picture ^

Dmitry Sorokin

08/23/2024, 12:57 PM

Hi Artur, so currently the Pipeline outputs() method works like this:

def outputs(self) -> set[str]:
    """The names of outputs produced when the whole pipeline is run.

    Does not include intermediate outputs that are consumed by
    other pipeline nodes. Resolves transcoded names where necessary.

    Returns:
        The set of final pipeline outputs.

    """

As you can see, it only returns the final outputs. I'm trying to understand why this is an issue for you. Could you clarify what you're trying to achieve with this method?
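To make the docstring's rule concrete, here is a minimal sketch of the set arithmetic it describes. It is a toy model in plain Python, not Kedro's implementation: the node tuples, the dataset names, and the `final_outputs` helper are all made up for illustration.

```python
# Toy model, NOT Kedro code: each node is an (inputs, outputs) pair of
# dataset-name sets. All names here are hypothetical.
nodes = [
    ({"raw"}, {"intermediate"}),
    ({"intermediate"}, {"final"}),
]


def final_outputs(nodes):
    """Return datasets produced by some node and consumed by no node."""
    produced = set().union(*(outs for _, outs in nodes))
    consumed = set().union(*(ins for ins, _ in nodes))
    return produced - consumed


print(final_outputs(nodes))  # {'final'}
```

Under this model, "intermediate" is dropped because the second node consumes it, which matches the behaviour reported in the thread.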

Artur Dobrogowski

08/23/2024, 2:50 PM

The only thing I want to highlight is that pipeline.outputs() is not the same as pipeline(..., outputs=...), and that this is confusing. I was expecting pipeline.outputs() to include intermediate datasets if they were specified explicitly in pipeline(..., outputs=), or perhaps to just return the defined outputs if they are defined

Artur Dobrogowski

08/23/2024, 2:50 PM

and I wanted to discuss whether it should be like that or not

Dmitry Sorokin

08/23/2024, 3:08 PM

I agree that the difference between the two outputs can be confusing, and this is likely because they serve distinct purposes:

• pipeline.outputs() is designed to return the final outputs of the pipeline, as described in the documentation.

• The outputs parameter inside the pipeline() creation function, on the other hand, is used for mapping outputs, particularly when namespaces are involved (as detailed in Kedro's documentation on namespaces). In the example you provided, it's used to avoid changing the output name by adding the .namespace_name prefix.

Even if you don't specify anything in the outputs parameter of the pipeline() creation function, pipeline.outputs() will still return the final outputs. These two types of outputs currently serve different purposes, but we would be happy to hear ideas on how they should be connected.

👍 1

Nok Lam Chan

08/23/2024, 4:20 PM

https://docs.kedro.org/en/stable/nodes_and_pipelines/pipeline_introduction.html

pipeline.outputs() has nothing to do with individual nodes. If you want all outputs from nodes, you can iterate through the pipeline with pipeline.nodes and get the attribute from each node.
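A minimal sketch of that suggestion, assuming the object passed in exposes a `nodes` iterable whose items each have an `outputs` list of dataset names (as the message above implies for Kedro pipelines). The helper name `all_node_outputs` is made up, and the stand-in objects exist only so the example runs without Kedro installed.

```python
from types import SimpleNamespace


def all_node_outputs(pipeline):
    """Collect every dataset name produced by any node, intermediate or final."""
    return {name for node in pipeline.nodes for name in node.outputs}


# Stand-in objects with hypothetical data; in a real project you would pass
# an actual pipeline object instead.
fake_pipeline = SimpleNamespace(
    nodes=[
        SimpleNamespace(outputs=["intermediate"]),
        SimpleNamespace(outputs=["ns.final"]),
    ]
)

print(sorted(all_node_outputs(fake_pipeline)))  # ['intermediate', 'ns.final']
```

Unlike the final-outputs set, this collection includes datasets that are consumed by later nodes.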

🧐 1
