# questions
m
Hello people 👋 I have the following setup:
• 2 modular pipelines, train and evaluate, each defined with its own namespace, i.e. `train` and `evaluate`
• 2 data catalogs, one for each pipeline, where the dataset names look like `<namespace>.<name>`, e.g. `train.docs`, `evaluate.docs`, etc.
• one global data catalog where the datasets are defined without namespaces; ideally, they are shared across pipelines, e.g. `full-data`
I then compose and run the pipelines; however, I notice that 2 artifacts are created for the dataset defined inside the global data catalog: `train-full-data` and `evaluate-full-data`. I read in the documentation that you can create a dataset factory pattern if you have, e.g., the same output for namespaced modular pipelines. What about namespaced modular pipelines that share the same input instead? I would expect the following behaviour:
• if datasets have a namespace, they are associated with the pipeline that has the corresponding namespace
• if datasets do not have a namespace, they are shared across all namespaced modular pipelines that reference them
I hope that makes sense.
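For context, a minimal sketch of the setup being described. The `process` function and single-node `base_pipeline` are illustrative placeholders, not the poster's actual code; only the dataset names and namespaces come from the thread.

```python
from kedro.pipeline import node, pipeline

def process(full_data):
    return full_data  # placeholder transformation

base_pipeline = pipeline([node(process, inputs="full-data", outputs="docs")])

# By default, namespace= prefixes every free input/output that is not
# explicitly remapped, so each composed pipeline reads its own copy:
train_pipeline = pipeline(base_pipeline, namespace="train")        # reads train.full-data
evaluate_pipeline = pipeline(base_pipeline, namespace="evaluate")  # reads evaluate.full-data
```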
i
> What about namespaced modular pipelines that share the same input instead?

If I'm understanding correctly, I think you want the `inputs` keyword in the `pipeline` function. (See docs here) So defining:
```python
from kedro.pipeline import pipeline

train_pipeline = pipeline(
    base_pipeline,
    namespace="train",
    inputs={"full-data"},  # Note this is a set, not a dict.
)
```
Then your pipeline will read from `full-data`, not `train.full-data`. Does that help?
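A sketch of that fix applied to both compositions, reusing the illustrative `base_pipeline` from the sketch above: listing `full-data` in the `inputs` set tells Kedro not to namespace that name, so both pipelines read the single shared catalog entry.

```python
# Both pipelines keep "full-data" un-namespaced and therefore share one artifact;
# outputs are still prefixed (train.docs / evaluate.docs), so they stay separate.
train_pipeline = pipeline(base_pipeline, namespace="train", inputs={"full-data"})
evaluate_pipeline = pipeline(base_pipeline, namespace="evaluate", inputs={"full-data"})

combined = train_pipeline + evaluate_pipeline  # Pipelines compose with "+".
```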
m
Brilliant! I was aware of the `outputs` keyword for composing disconnected pipelines, but I was not aware of `inputs`. I will give it a try and let you know. Thanks! 🙌
So, it seems to work as you explained and I can see that the artifact is not namespaced. 👌