# questions
j
Hi everyone, I have a Kedro pipeline and want it to run on multiple datasets (the raw input data are different but follow the same structure). I also want to keep the outputs in the same folder structure. What is the best practice in Kedro for dealing with this kind of problem?
dataset1
|--01_raw
|--02_intermediate
|--03_primary
|--...
dataset2
|--01_raw
|--02_intermediate
|--03_primary
|--...
i
Modular pipelines are super handy for this. You set up a "template" pipeline and then map its inputs and outputs to the different datasets. Btw, the folder structure I've generally seen is
01_raw
|--dataset1
|--dataset2
02_intermediate
|--dataset1
|--dataset2
...
This lines up with having different layers in different locations (containers, servers, etc.).
👍 1
j
Thanks @Iñigo Hidalgo, I will take a look at this function
j
Hi @Jason, do you want to run the exact same pipeline, for example changing only the country/region, without overwriting your datasets?
01_raw
|--dataset_US.csv
|--dataset_EU.csv
02_intermediate
|--dataset_US.csv
|--dataset_EU.csv
...
j
Hi @Javier del Villar, thanks for asking. The problem can be simplified to that. Say I have two datasets (which will later expand to a larger number, e.g. 6-10). I want to run a Kedro pipeline on all the datasets and generate outputs in an organized manner, so that one can easily collect the outputs for a given dataset. One way I know is to use 'namespace' and programmatically create multiple pipelines. One then needs to define different catalog entries (and parameters if needed). The only limitation is that the output follows the overall folder structure, so one needs to go through all the sub-folders to collect the outputs for a single dataset. I haven't tried @Iñigo Hidalgo's modular pipelines yet; I read through the docs and think they should work too. Do you have any other suggestions? @Javier del Villar
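For the catalog side of the namespace approach, one option is a dataset factory pattern that resolves the namespace into a folder path, so each dataset's outputs land under its own folder and you don't have to write an entry per dataset. A sketch, assuming Kedro >= 0.18.12 (when dataset factories were introduced) and hypothetical file names:

```yaml
# catalog.yml -- one pattern covers dataset1, dataset2, ...
# "{namespace}" is filled in from the namespaced dataset name,
# e.g. "dataset1.primary" resolves to data/dataset1/03_primary/primary.parquet
"{namespace}.primary":
  type: pandas.ParquetDataset
  filepath: data/{namespace}/03_primary/primary.parquet
```

This keeps the per-dataset folder layout you described while only maintaining one catalog pattern per layer.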
j
Hi! What I was trying to ask: will all "sets" of datasets go through the same pipelines and nodes, maybe with different parameters, with the catalog configured so that the different sets of datasets don't overwrite each other?