# questions
m
Hi everyone 👋 I am new to Kedro and have been exploring the pipeline registry. I am interested in concatenating several pipelines into one. Say that I want this behaviour:
```python
from typing import Dict

from kedro.framework.project import find_pipelines
from kedro.pipeline import Pipeline


def register_pipelines() -> Dict[str, Pipeline]:
    pipelines = find_pipelines()

    return {
        "__default__": pipelines["preprocess"] + pipelines["train"],
    }
```
but I would like to choose the pipelines to concatenate at runtime. I know that I could e.g. provide the pipeline names via the `--params` argument, but I fail to understand how I can use it inside the code. Is what I would like to do possible, and is it good practice in Kedro? Thank you! 🙏
d
hiya this is interesting
so the dumbest way is to do two different CLI commands:
`kedro run --pipeline A & kedro run --pipeline B`
(use `&` for concurrent, `&&` for in series IIRC)
πŸ‘ 1
you can probably get the pipeline registry to react to CLI commands but it’s a bit more complex
m
Interesting, the use of `&&`; however, I believe this has a shortcoming if you don't know the number of pipelines beforehand. Ideally, I am looking for a solution where the end user defines which pipelines to run and their order of execution programmatically, and the CI/CD orchestration pipeline then passes this information to Kedro.
how would you go about using CLI commands? We already specify CLI arguments, as we are developing a Kedro plugin.
n
On pipeline concat: https://noklam.github.io/blog/posts/kedro-pipeline-slicing-pipeline/2024-03-06-Kedro-Pipeline-Slicing-Pipeline.html#more-notes
On execution order: https://github.com/kedro-org/kedro/discussions/3758
It is very much possible to concat at runtime; it seems to be uncommon, as defining it statically is enough most of the time.
I think there are 2 key questions here:
1. `pipeline_registry.py` doesn't have access to `params` or runtime parameters; IIRC there are some hacky ways to do it, but I cannot recall them now.
2. "End user defines which pipelines and their order of execution programmatically": this requires a clearer definition before I can suggest any solution
m
@Nok Lam Chan I read about the pipeline arithmetic, very interesting. I think this is what I am after, but in a dynamic fashion.
this requires some more clear definition before I can suggest any solution
I can try: let's say the project has 4 Kedro pipelines: _preprocess_, _train_, _evaluate_, and _deploy_. The user only wants to run the ML workflow up to model training and invokes `kedro run --pipeline_chain="preprocess,train"`. I would rather talk about dependency than order of execution: what is important is that the data preprocessing pipeline is executed prior to the training pipeline. I hope it makes more sense.
d
I think we should consider adding this, it has come up a few times. I think if we changed the logic here it would suddenly work in the CLI and other places: https://github.com/kedro-org/kedro/blob/545cab7bd8dbc6194f8c18bde11c851e6d4eeace/kedro/framework/session/session.py#L338 Essentially split on `,`, then find the pipelines and `+`/`sum` them together
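Joel's split-and-sum idea could look roughly like this (a hedged sketch, not Kedro's actual session code; the dict of plain lists stands in for the project's registered `Pipeline` objects, which also support `+`):

```python
from functools import reduce
from operator import add

# Stand-in for the project's registered pipelines; in Kedro these would be
# Pipeline objects, which can likewise be merged with "+".
pipelines = {
    "preprocess": ["clean_node", "split_node"],
    "train": ["fit_node"],
}


def resolve_pipeline(pipeline_name: str):
    """Split a comma-separated --pipeline value and sum the parts."""
    names = [name.strip() for name in pipeline_name.split(",")]
    return reduce(add, (pipelines[name] for name in names))


print(resolve_pipeline("preprocess,train"))
# ['clean_node', 'split_node', 'fit_node']
```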
n
I think the implementation is simple; we need to decide on the syntax. @mattia.paterna
I would rather talk about dependency than order of execution: what is important is that the data preprocessing pipeline is executed prior to the training pipeline.
I hope it makes more sense.
I am not sure if I understand this; Kedro resolves dependencies based on the inputs/outputs pairs, so this should be the case already. One thing to note here: conceptually, `kedro run` is always ONE pipeline. Let's say you have `kedro run --pipeline="preprocess,train"` as Joel suggested: they will be merged into one Pipeline object, as in the link I shared earlier, so there is nothing to worry about regarding the dependency.
πŸ‘ 1
I suspect you can already do this with tags: `kedro run --tags preprocess,train`. I think there is a subtle difference here: most operators are applied in an `AND` fashion; only `tags` uses `OR` logic. Cc @datajoely?
m
@Nok Lam Chan okay, so this means that pipeline composition is commutative, e.g. `preprocess+train == train+preprocess`? (It might be a noob question, apologies if it is.) I agree that conceptually a composite pipeline shall become one, which is also reflected in one execution and e.g. one run registered inside an experiment tracker.
@datajoely at the moment, we decided to follow this approach:
• We derived a version of `_PipelineRegistry` that we initialise by passing it some extra runtime parameters, such as `--composite`.
• We then use the parameter inside the implementation of `register_pipelines()`, create the pipeline composition, and register it as `__default__`.
The CLI command looks like the following:
`poetry run kedro run --params composition=preprocess+train,run_name=some-run-"$(date +%s)",...`
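A rough sketch of what the registry could do with that `composition` parameter (the parameter name and helper are assumptions, not Kedro API; plain lists stand in for `Pipeline` objects, which also support `+`):

```python
from functools import reduce
from operator import add


def compose_default(pipelines: dict, composition: str) -> dict:
    """Build the "__default__" entry from e.g. composition="preprocess+train"."""
    parts = [part.strip() for part in composition.split("+")]
    return {"__default__": reduce(add, (pipelines[part] for part in parts))}


# With lists standing in for Pipeline objects:
registry = compose_default(
    {"preprocess": ["clean"], "train": ["fit"]}, "preprocess+train"
)
print(registry)
# {'__default__': ['clean', 'fit']}
```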
d
`__add__` literally combines the `set()` of nodes in the two groups and then recalculates an execution order: https://github.com/kedro-org/kedro/blob/d219e403a7e2929dd710ea8781ec0ae30ccec0df/kedro/pipeline/pipeline.py#L177
πŸ‘ 1
out of interest, do we support multiple `--pipeline` arguments?
n
No