# questions
a
Hi, I was reviewing some code that defines sub-pipelines and uses one pipeline's outputs to define the inputs to the next layer/pipeline. That's when I discovered that internal nodes can't actually be specified as outputs. Is this an anti-pattern, or is it a bug? Picture in the thread.
Screenshot 2024-08-23 at 13.07.01.png
^ A new Kedro project with the example pipeline modified to have a namespace and outputs
So what's happening is that this `outputs=` strips the namespace as it should, but when you reference the pipeline's outputs, the dataset is not listed.
One solution could be to add a pass-through node that does nothing except take the dataset as input and return it as output, but that's ugly and slows down pipeline processing. Another solution would be to not rely on the pipeline's outputs and define them statically instead.
I think that, in general, dynamically passing all of one pipeline's outputs (pipeline A) as the inputs to another pipeline (pipeline B) is an anti-pattern. If you edit pipeline A to do something extra and produce more outputs, you automatically break pipeline B, which now has an extra input that never gets consumed (and that's an error in Kedro's validation).
Still, I can't help but think that something's wrong when I look at the picture ^
d
Hi Artur, currently the `Pipeline.outputs()` method works like this:
def outputs(self) -> set[str]:
    """The names of outputs produced when the whole pipeline is run.
    Does not include intermediate outputs that are consumed by
    other pipeline nodes. Resolves transcoded names where necessary.

    Returns:
        The set of final pipeline outputs.
    """
As you can see, it only returns the final outputs. I'm trying to understand why this is an issue for you. Could you clarify what you're trying to achieve with this method?
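One way to picture what the docstring describes is the rough plain-Python model below. It is not Kedro's actual implementation (which also resolves transcoded names); `ToyNode` and `final_outputs` are hypothetical names used only for this sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToyNode:
    """Hypothetical stand-in for a pipeline node; not a Kedro class."""

    inputs: frozenset
    outputs: frozenset


def final_outputs(nodes: list) -> set:
    """Return datasets produced by some node and consumed by none."""
    produced = set().union(*(n.outputs for n in nodes))
    consumed = set().union(*(n.inputs for n in nodes))
    return produced - consumed


toy_nodes = [
    ToyNode(inputs=frozenset({"raw"}), outputs=frozenset({"cleaned"})),
    ToyNode(inputs=frozenset({"cleaned"}), outputs=frozenset({"summary"})),
]

# "cleaned" is consumed by the second node, so only "summary" remains.
result = final_outputs(toy_nodes)
print(result)
```

Under this model, an intermediate dataset drops out of the result as soon as any node in the same pipeline consumes it, regardless of how it is named.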
a
The only thing I want to highlight is that `pipeline.outputs()` != `pipeline(..., outputs=...)`, and that it's confusing. I was expecting `pipeline.outputs()` to include intermediary datasets if they were specified explicitly in `pipeline(..., outputs=)`, or perhaps just to return the defined outputs if they are defined.
I wanted to discuss whether it should be like that or not.
d
I agree that the difference between the two outputs can be confusing, and this is likely because they serve distinct purposes:
• `pipeline.outputs()` is designed to return the final outputs of the pipeline, as described in the documentation.
• The `outputs` parameter of the `pipeline()` creation function, on the other hand, is used for mapping outputs, particularly when namespaces are involved (as detailed in Kedro's documentation on namespaces). In the example you provided, it's used to keep the output name unchanged, so that the `namespace_name.` prefix is not added.
Even if you don't specify anything in the `outputs` parameter of the `pipeline()` creation function, `pipeline.outputs()` will still return the final outputs. These two types of outputs currently serve different purposes, but we'd be happy to hear new ideas on how they could be connected.
šŸ‘ 1
n
https://docs.kedro.org/en/stable/nodes_and_pipelines/pipeline_introduction.html
`pipeline.outputs()` has nothing to do with individual nodes. If you want all outputs from nodes, you can iterate through the pipeline with `pipeline.nodes` and read the `outputs` attribute of each node.
🧐 1