# questions
m
Hi everyone 👋 I am new to Kedro and have been exploring the pipeline registry. I am interested in concatenating several pipelines into one. Say that I want this behaviour:
```python
from typing import Dict

from kedro.framework.project import find_pipelines
from kedro.pipeline import Pipeline


def register_pipelines() -> Dict[str, Pipeline]:
    pipelines = find_pipelines()

    return {
        "__default__": pipelines["preprocess"] + pipelines["train"],
    }
```
but I would like to choose the pipelines to concatenate at runtime. I know that I could e.g. provide the pipeline names via the `--params` argument, but I fail to understand how I can use it inside the code. Is what I would like to do possible, and is it good practice in Kedro? Thank you! 🙏
d
hiya this is interesting
so the dumbest way is to do two different CLI commands:
`kedro run --pipeline A & kedro run --pipeline B`
(use `&` for concurrent, `&&` for in series IIRC)
πŸ‘ 1
you can probably get the pipeline registry to react to CLI commands but it’s a bit more complex
m
Interesting, the use of `&&`; however, I believe this has a shortcoming if you don't know the number of pipelines beforehand. Ideally, I am looking for a solution where the end user defines which pipelines to run and their order of execution programmatically, and the CI/CD orchestration pipeline then passes this information to Kedro.
how would you go about using CLI commands? We already specify CLI arguments, as we are developing a Kedro plugin.
n
On pipeline concat: https://noklam.github.io/blog/posts/kedro-pipeline-slicing-pipeline/2024-03-06-Kedro-Pipeline-Slicing-Pipeline.html#more-notes
On execution order: https://github.com/kedro-org/kedro/discussions/3758
It is very much possible to concat at runtime; it seems to be uncommon, as defining it statically is enough most of the time.
I think there are 2 key questions here:
1. `pipeline_registry.py` doesn't have access to `params` or runtime parameters; IIRC there are some hacky ways to do it, but I cannot recall them now.
2. "End user defines which pipelines and their order of execution programmatically": this requires a clearer definition before I can suggest any solution
m
@Nok Lam Chan I read about the pipeline arithmetic, very interesting. I think this is what I am after, but in a dynamic fashion.
this requires some more clear definition before I can suggest any solution
I can try: let's say the project has 4 Kedro pipelines: _preprocess_, _train_, _evaluate_, and _deploy_. The user only wants to run the ML workflow up to model training and invokes `kedro run --pipeline_chain="preprocess,train"`. I would rather talk about dependency than order of execution: what is important is that the data preprocessing pipeline is executed prior to the training pipeline. I hope it makes more sense.
d
I think we should consider adding this, it has come up a few times. I think if we changed the logic here it would suddenly work in the CLI and other places: https://github.com/kedro-org/kedro/blob/545cab7bd8dbc6194f8c18bde11c851e6d4eeace/kedro/framework/session/session.py#L338 Essentially split on `,`, then find the pipelines and `+`/`sum` them together
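Joel's split-and-sum idea could look roughly like this (a hedged sketch, not Kedro's actual session code; the dict of plain lists stands in for the project's registered `Pipeline` objects, which also support `+`):

```python
from functools import reduce
from operator import add

# Stand-in for the project's registered pipelines; in Kedro these would be
# Pipeline objects, which can likewise be merged with "+".
pipelines = {
    "preprocess": ["clean_node", "split_node"],
    "train": ["fit_node"],
}


def resolve_pipeline(pipeline_name: str):
    """Split a comma-separated --pipeline value and sum the parts."""
    names = [name.strip() for name in pipeline_name.split(",")]
    return reduce(add, (pipelines[name] for name in names))


print(resolve_pipeline("preprocess,train"))
# ['clean_node', 'split_node', 'fit_node']
```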
n
I think the implementation is simple; we need to decide on the syntax. @mattia.paterna
I would rather talk about dependency than order of execution: what is important is that the data preprocessing pipeline is executed prior to the training pipeline.
I hope it makes more sense.
I am not sure if I understand this; Kedro resolves dependencies based on the inputs/outputs pairs, so this should be the case already. One thing to note here: conceptually, `kedro run` is always ONE pipeline. Let's say you have `kedro run --pipeline="preprocess,train"` as Joel suggested: they will be merged into one Pipeline object, as in the link I shared earlier, so there is nothing to worry about regarding the dependency.
πŸ‘ 1
I suspect you can already do this with tags: `kedro run --tags preprocess,train`. I think there is a subtle difference here: most operators are applied in an `AND` fashion; only `tags` uses `OR` logic. Cc @datajoely?
m
@Nok Lam Chan okay, so this means that pipeline composition is commutative, e.g. `preprocess+train == train+preprocess`? (It might be a noob question, apologies if it is.) I agree that conceptually a composite pipeline shall become one, which is also reflected in one execution and e.g. one run registered inside an experiment tracker.
@datajoely at the moment, we decided to follow this approach:
• We derived a version of `_PipelineRegistry` that we initialise by passing it some extra runtime parameters, such as `--composite`.
• We then use the parameter inside the implementation of `register_pipelines()`, create the pipeline composition, and register it as `__default__`.
The CLI command looks like the following:
`poetry run kedro run --params composition=preprocess+train,run_name=some-run-"$(date +%s)",...`
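A rough sketch of what the registry could do with that `composition` parameter (the parameter name and helper are assumptions, not Kedro API; plain lists stand in for `Pipeline` objects, which also support `+`):

```python
from functools import reduce
from operator import add


def compose_default(pipelines: dict, composition: str) -> dict:
    """Build the "__default__" entry from e.g. composition="preprocess+train"."""
    parts = [part.strip() for part in composition.split("+")]
    return {"__default__": reduce(add, (pipelines[part] for part in parts))}


# With lists standing in for Pipeline objects:
registry = compose_default(
    {"preprocess": ["clean"], "train": ["fit"]}, "preprocess+train"
)
print(registry)
# {'__default__': ['clean', 'fit']}
```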
d
`__add__` literally combines the `set()` of nodes in the two groups and then recalculates an execution order: https://github.com/kedro-org/kedro/blob/d219e403a7e2929dd710ea8781ec0ae30ccec0df/kedro/pipeline/pipeline.py#L177
πŸ‘ 1
out of interest, do we support multiple `--pipeline` arguments?
n
No