mattia.paterna  (05/23/2024, 8:34 AM)
    from typing import Dict

    from kedro.framework.project import find_pipelines
    from kedro.pipeline import Pipeline


    def register_pipelines() -> Dict[str, Pipeline]:
        pipelines = find_pipelines()
        return {
            "__default__": pipelines["preprocess"] + pipelines["train"],
        }
but I would like to choose the pipelines to concatenate at runtime. I know that I could e.g. provide the pipeline names via the --params argument, but I fail to understand how I can use it inside the code. Is what I would like to do possible and good practice in Kedro? Thank you! 🙂

datajoely  (05/23/2024, 8:36 AM)
kedro run --pipeline A & kedro run --pipeline B
(use & for concurrent, && for in series IIRC)
mattia.paterna  (05/23/2024, 8:43 AM)
&&, however, I believe has a shortcoming if you don't know the number of pipelines beforehand.
Ideally, I am looking at a solution where the end user defines which pipelines to run and their order of execution programmatically, and the CI/CD orchestration pipeline then passes this information to Kedro.

Nok Lam Chan  (05/23/2024, 9:01 AM)
1. pipeline_registry.py doesn't have access to params or runtime parameters; IIRC there are some hacky ways to do it but I cannot recall them now.
2. "End user defines which pipelines and their order of execution programmatically": this requires a clearer definition before I can suggest any solution.
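A sketch of the kind of hacky workaround mentioned here: since register_pipelines() cannot see --params, the CI/CD job could export an environment variable before calling kedro run, and the registry could read it to build the composition. The variable name KEDRO_COMPOSITION is invented for illustration, and plain lists stand in for Kedro Pipeline objects (which also support +) so the snippet runs on its own:

```python
import os


def compose_default(pipelines, spec):
    """Sum the named pipelines in the order given by a comma-separated spec.

    `pipelines` mimics the dict returned by find_pipelines(); since Kedro
    Pipeline objects support `+`, plain lists work as stand-ins here.
    """
    names = [name.strip() for name in spec.split(",") if name.strip()]
    composed = pipelines[names[0]]
    for name in names[1:]:
        composed = composed + pipelines[name]
    return composed


def register_pipelines():
    # KEDRO_COMPOSITION is a hypothetical variable the CI/CD job would export,
    # e.g. KEDRO_COMPOSITION="preprocess,train" kedro run
    pipelines = {"preprocess": ["node_p"], "train": ["node_t"]}  # stand-in
    spec = os.environ.get("KEDRO_COMPOSITION", "preprocess,train")
    return {"__default__": compose_default(pipelines, spec)}
```

The CI/CD job would then run e.g. KEDRO_COMPOSITION="preprocess,train" kedro run, with the default applying when the variable is unset.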
mattia.paterna  (05/23/2024, 9:16 PM)
> this requires some more clear definition before I can suggest any solution
I can try: let's say the project has 4 Kedro pipelines: preprocess, train, evaluate, and deploy. The user only wants to run the ML workflow up to model training and invokes
kedro run --pipeline_chain="preprocess,train"
I would rather talk about dependency than order of execution: what is important is that the data preprocessing pipeline is executed prior to the training pipeline.
I hope it makes more sense.

datajoely  (05/24/2024, 9:29 AM)
…, and then find the pipelines and + / sum them together
Nok Lam Chan  (05/24/2024, 10:08 AM)
> I would rather talk about dependency than order of execution: what is important is that the data preprocessing pipeline is executed prior to the training pipeline. I hope it makes more sense.
I am not sure if I understand this: Kedro resolves dependencies based on the inputs/outputs pairs, so this should already be the case. One thing to note is that conceptually kedro run is always ONE pipeline. Let's say you have kedro run --pipeline="preprocess,train" as Joel suggested: they will be merged into one Pipeline object, as in the link I shared earlier, so there is nothing to worry about regarding the dependency.

Nok Lam Chan  (05/24/2024, 10:09 AM)
kedro run --tags preprocess,train
I think there is a subtle difference here: most operators are applied in an AND fashion, only tags uses OR logic.
Cc @datajoely?
mattia.paterna  (05/24/2024, 3:00 PM)
Is preprocess+train == train+preprocess? (It might be a noob question, apologies if it is.)
I agree that conceptually a composite pipeline shall become one, which is also reflected in one execution and e.g. one run registered inside an experiment tracker.
@datajoely at the moment, we decided to follow this approach:
• We derived a version of _PipelineRegistry that we initialise by passing it some extra runtime parameters such as e.g. --composite.
• We then use the parameter inside the implementation of register_pipelines(), create the pipeline composition, and register it as __default__.
The CLI command looks like the following:
poetry run kedro run --params composition=preprocess+train,run_name=some-run-"$(date +%s)",...
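For illustration, the composition value in a command like that can be parsed inside a custom registry by splitting the --params string on , into key=value pairs and the composition itself on +. The helper names below are invented for this sketch, not Kedro API:

```python
def parse_params(raw: str) -> dict:
    """Parse a comma-separated key=value string like a kedro --params value."""
    params = {}
    for pair in raw.split(","):
        key, _, value = pair.partition("=")
        params[key.strip()] = value.strip()
    return params


def composition_order(params: dict) -> list:
    """Return pipeline names in execution order from an 'a+b+c' spec."""
    return params["composition"].split("+")
```

A custom registry could then look each name up in the dict returned by find_pipelines() and sum the results, as discussed above.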
datajoely  (05/24/2024, 3:02 PM)
__add__ literally combines the set() of nodes in the two groups and then recalculates an execution order:
https://github.com/kedro-org/kedro/blob/d219e403a7e2929dd710ea8781ec0ae30ccec0df/kedro/pipeline/pipeline.py#L177
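That is why preprocess+train and train+preprocess give the same pipeline: the node sets are merged and the order is recomputed from the inputs/outputs. A toy sketch of the idea, with nodes modelled as (name, inputs, outputs) tuples and a simple depth-first topological sort; this is not Kedro's actual implementation:

```python
def execution_order(nodes):
    """Topologically sort (name, inputs, outputs) nodes by dataset dependencies."""
    nodes = set(nodes)  # merging two pipelines is just a set union
    produced_by = {out: node for node in nodes for out in node[2]}
    ordered, seen = [], set()

    def visit(node):
        if node in seen:
            return
        seen.add(node)
        for inp in node[1]:  # visit the producers of our inputs first
            if inp in produced_by:
                visit(produced_by[inp])
        ordered.append(node)

    for node in sorted(nodes):  # deterministic tie-breaking
        visit(node)
    return [node[0] for node in ordered]


preprocess = [("clean", ("raw",), ("clean_data",))]
train = [("fit", ("clean_data",), ("model",))]

# order of addition does not matter: same node set, same recomputed order
print(execution_order(preprocess + train))  # ['clean', 'fit']
print(execution_order(train + preprocess))  # ['clean', 'fit']
```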
datajoely  (05/24/2024, 3:02 PM)
…multiple --pipeline arguments?

Nok Lam Chan  (05/24/2024, 6:36 PM)