Hello everyone Our team is relatively new to Kedro and we ai Kedro #questions


Matěj Pekár

11/16/2023, 8:22 PM

Hello everyone, our team is relatively new to Kedro, and we aim to use it primarily for data processing; in particular, we are working with large images. Each pipeline is designed to process a single image. However, when transitioning to production, we would like to apply these pipelines to a whole set of images. Our current approach involves parametrizing the filepath in the data catalog and passing it as a CLI argument. This is then executed by a wrapper script which applies the same pipeline to a whole folder by simply calling kedro with different arguments. Unfortunately, this method is very inefficient and seems suboptimal. We are considering an alternative: wrapping the pipelines within another pipeline that loads the filenames in a directory and calls the main pipeline with the given filename as an argument. However, this approach appears to lack scalability and may be confusing. Ideally, we envision creating some kind of "collection" dataset that would dynamically generate pipelines for a given image at runtime. This should also allow execution on either a single image or a whole set by simply changing the dataset type in the data catalog. While this seems promising, we aren’t sure if Kedro supports such an implementation. Any suggestions on how to properly handle this scenario in a scalable and reproducible manner would be appreciated.

Takieddine Kadiri

11/17/2023, 8:18 AM

You can look at the PartitionedDataset
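
To make the suggestion concrete, here is a minimal sketch of what a node function consuming a PartitionedDataset input might look like. To my knowledge, Kedro passes such a node a dictionary mapping each partition id to a load function, and a callable returned as a value is only evaluated when that partition is saved. The names process_image and process_all_images, and the simulated partitions at the bottom, are hypothetical and only for illustration; the snippet does not import Kedro and fakes the loaded dictionary.

```python
from typing import Any, Callable, Dict


def process_image(image: Any) -> Any:
    """Hypothetical stand-in for the real single-image processing logic."""
    return [pixel * 2 for pixel in image]


def process_all_images(
    partitions: Dict[str, Callable[[], Any]],
) -> Dict[str, Callable[[], Any]]:
    """Node function: apply process_image to every partition lazily.

    Assumption: `partitions` has the shape a PartitionedDataset hands to a
    node, i.e. partition id -> zero-argument load function.
    """
    results: Dict[str, Callable[[], Any]] = {}
    for partition_id, load_image in sorted(partitions.items()):
        # Bind the current loader as a default argument so each lambda keeps
        # its own loader instead of the last one seen in the loop.
        results[partition_id] = lambda load=load_image: process_image(load())
    return results


# Simulated partitions standing in for image files in a folder.
fake_partitions = {
    "image_b": lambda: [3, 4],
    "image_a": lambda: [1, 2],
}

lazy_results = process_all_images(fake_partitions)
# Nothing has been loaded or processed yet; evaluate each partition on demand.
processed = {name: compute() for name, compute in lazy_results.items()}
print(processed)
```

Because only one image needs to be in memory at a time, this pattern may suit large images, and the same node can cover a single image or a whole folder depending on what the catalog entry points at.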


Nok Lam Chan

11/17/2023, 8:51 AM

Our current approach involves parametrizing the filepath in the data catalog and passing it as a CLI argument. This is then executed by a wrapper script which applies the same pipeline to a whole folder by simply calling kedro with different arguments. Unfortunately, this method is very inefficient and seems suboptimal.

Is that a second-order pipeline? That is, do you use a Kedro pipeline to generate another Kedro pipeline? I agree with @Takieddine Kadiri: PartitionedDataset, and additionally dataset factories, may help. docs.kedro.org

Matěj Pekár

11/17/2023, 9:58 PM

Thanks for your suggestions, but I think I wasn't quite clear about our scenario, so I drew a little sketch. We have the Tile Pipeline, which we want to run either independently or inside the Image Pipeline (not quite sure how to do this properly). The second part is being able to execute the "Image Pipeline" for a set of images. Currently, we achieve this by parameterizing the data catalog entry and executing Kedro with the file name as a CLI parameter. However, this is very inefficient. We have multiple pipelines like this, so we are seeking a general solution. I don't see how PartitionedDataset could help us. We are likely dealing with second-order pipelines, but I can't find any information about that in the docs.

Matěj Pekár

11/17/2023, 10:00 PM

Nok Lam Chan

11/18/2023, 5:23 AM

Are you currently doing this in 2 steps? Step 1: Run the pipeline up to "create N tiles". Step 2: Run the individual “tile pipeline” and concat.

Currently, we achieve this by parameterizing the data catalog entry and executing Kedro with the file name as CLI parameter. However, this is very inefficient. We have multiple pipelines like this so we are seeking for a general solution.

Would you be able to give an example of how exactly you do this?

Nok Lam Chan

11/18/2023, 5:26 AM

In my mind, Kedro deals mostly with static pipelines. I think it inevitably requires 2 steps, because N is not determined until the node gets executed. Cc @marrrcin to see if you have any ideas, since I know you have experience dealing with image processing pipelines!
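
One way to picture the static-pipeline constraint: the number of nodes stays fixed, and the variable N lives inside the data passed between them. The sketch below is plain Python under that assumption, not Kedro API. The functions create_tiles, process_tile and process_and_concat are hypothetical, the "image" is a toy list of pixel rows, and the tiles are assumed to be handed from one node to the next in memory as a dictionary.

```python
from typing import Dict, List

Image = List[List[int]]  # toy stand-in for a real image array


def create_tiles(image: Image, tile_height: int) -> Dict[str, Image]:
    """Step 1: split an image into horizontal strips.

    The number of tiles N depends on the image size, so it is only known
    once this function actually runs.
    """
    return {
        f"tile_{index:04d}": image[start:start + tile_height]
        for index, start in enumerate(range(0, len(image), tile_height))
    }


def process_tile(tile: Image) -> Image:
    """Hypothetical per-tile logic: invert 8-bit pixel values."""
    return [[255 - pixel for pixel in row] for row in tile]


def process_and_concat(tiles: Dict[str, Image]) -> Image:
    """Step 2: process every tile and stitch the rows back together in order."""
    result: Image = []
    for tile_id in sorted(tiles):
        result.extend(process_tile(tiles[tile_id]))
    return result


image = [[0, 10], [20, 30], [40, 50]]
tiles = create_tiles(image, tile_height=2)  # N = 2 for this image
output = process_and_concat(tiles)
print(len(tiles), output)
```

The trade-off is that the per-tile work is hidden inside a single node, so it would not show up as separate nodes in the pipeline graph.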

Takieddine Kadiri

11/18/2023, 10:12 AM

You can explicitly compose your pipelines' execution order and solve your second-order (or dynamic) pipeline problem by using kedro boot. Here is an example of using kedro boot for a similar problem (Monte Carlo simulation): https://github.com/takikadiri/kedro-boot-examples/tree/main#bonus-example--monte-carlo-simulation You can declare your image pipeline and tile pipeline as an AppPipeline, then use them in a kedro boot app, where you can explicitly define your orchestration logic.

🔥 1

Lukas Innig

11/18/2023, 4:08 PM

Here's another approach, but it relies on knowing the number of sub pipelines in advance: https://getindata.com/blog/kedro-dynamic-pipelines/ Courtesy of @marrrcin

👍 1
