mattia.paterna
06/06/2024, 2:24 PM
• namespaced datasets following the pattern <namespace>.<name>, e.g. train.docs, evaluate.docs, etc.
• one global data catalog where the datasets are defined without namespaces; ideally, they are shared across pipelines, e.g. full-data.
I then compose and run the pipelines; however, I notice that two artifacts are created for the dataset defined inside the global data catalog: train-full-data and evaluate-full-data.
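To make it concrete, this is roughly how I compose the pipelines (a minimal sketch; make_base_pipeline and its node are placeholders, only full-data and the namespaces match my actual setup):

from kedro.pipeline import node, pipeline

def make_base_pipeline():
    # Placeholder base pipeline; the real one reads the shared dataset.
    return pipeline([
        node(lambda data: data, inputs="full-data", outputs="docs", name="prepare_docs"),
    ])

# Namespacing prefixes every dataset name, including the shared input:
train_pipeline = pipeline(make_base_pipeline(), namespace="train")
evaluate_pipeline = pipeline(make_base_pipeline(), namespace="evaluate")

# train_pipeline now reads train.full-data and evaluate_pipeline reads
# evaluate.full-data, so two artifacts get created instead of one.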
I read in the documentation that you can create a dataset factory pattern if you have, for example, the same output across namespaced modular pipelines. What about namespaced modular pipelines that share the same input instead?
I would expect a behaviour where:
• if datasets have a namespace, they are associated with the pipeline of the corresponding namespace;
• if datasets do not have a namespace, they are shared across all namespaced modular pipelines that reference them.
I hope it makes sense.

Ian Whalen
06/06/2024, 2:43 PM
"What about namespaced modular pipelines that share the same input instead?" If I'm understanding correctly, I think you want the inputs keyword in the pipeline function. (See docs here)
So defining:
from kedro.pipeline import pipeline

train_pipeline = pipeline(
    base_pipeline,
    namespace="train",
    inputs={"full-data"},  # Note this is a set.
)
Then your pipeline will read from full-data, not train.full-data.
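For completeness, a sketch applying the same keyword to both of your pipelines (assuming base_pipeline is your shared modular pipeline) so they read the one shared entry:

from kedro.pipeline import pipeline

# Declare full-data as a shared, non-namespaced input in both pipelines.
train_pipeline = pipeline(base_pipeline, namespace="train", inputs={"full-data"})
evaluate_pipeline = pipeline(base_pipeline, namespace="evaluate", inputs={"full-data"})

# Both now read the single full-data entry from the global catalog and
# can be summed into one pipeline:
full_pipeline = train_pipeline + evaluate_pipeline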
Does that help?

mattia.paterna
06/06/2024, 3:04 PM
I have used the outputs keyword for composing disconnected pipelines, but I was not aware of inputs.
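(For reference, this is roughly how I have been using outputs; the dataset names in this sketch are made up:)

from kedro.pipeline import pipeline

# Expose a namespaced output under a shared, un-namespaced name so that
# a downstream, otherwise disconnected pipeline can consume it.
train_pipeline = pipeline(
    base_pipeline,  # assumed shared modular pipeline, as above
    namespace="train",
    outputs={"model": "trained-model"},  # made-up dataset names
)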
I will give it a try and let you know. Thanks! 🙌

mattia.paterna
06/11/2024, 2:48 PM