# questions
j
Hi everyone, I have a Kedro pipeline and want it to run on multiple datasets (the raw input data are different but follow the same structure). I also want to keep the outputs in the same folder structure. What is the best practice in Kedro for dealing with this kind of problem?
dataset1
|--01_raw
|--02_intermediate
|--03_primary
|--...
dataset2
|--01_raw
|--02_intermediate
|--03_primary
|--...
i
Modular pipelines are super handy for this. You set up a "template" pipeline and then map its inputs and outputs to the different datasets. Btw, the folder structure I've generally seen is
01_raw
|--dataset1
|--dataset2
02_intermediate
|--dataset1
|--dataset2
...
This lines up with having different layers in different locations (containers, servers, etc.).
👍 1
j
Thanks @Iñigo Hidalgo, I will take a look at this function
j
Hi @Jason, do you want to run the exact same pipeline, for example changing only the country/region, without overwriting your datasets?
01_raw
|--dataset_US.csv
|--dataset_EU.csv
02_intermediate
|--dataset_US.csv
|--dataset_EU.csv
...
j
Hi @Javier del Villar, thanks for asking. The problem can be simplified to that. Say I have two datasets (which will later expand to a larger number, e.g. 6-10). I want to run a Kedro pipeline on all the datasets and generate outputs in an organized manner, so that one can easily collect the outputs for a given dataset. One way I know is to use 'namespace' and programmatically create multiple pipelines. One then needs to define different catalog entries (and parameters if needed). The only limitation is that the output follows the overall folder structure, so one needs to go through all the sub-folders to collect the outputs for a single dataset. I haven't tried @Iñigo Hidalgo's modular pipelines yet; I read through the docs and think they should work too. Do you have any other suggestions? @Javier del Villar
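For the catalog side of the namespace approach, one option is a dataset factory pattern that resolves the namespace into a folder path, so each dataset's outputs land under its own folder and you don't have to write an entry per dataset. A sketch, assuming Kedro >= 0.18.12 (when dataset factories were introduced) and hypothetical file names:

```yaml
# catalog.yml -- one pattern covers dataset1, dataset2, ...
# "{namespace}" is filled in from the namespaced dataset name,
# e.g. "dataset1.primary" resolves to data/dataset1/03_primary/primary.parquet
"{namespace}.primary":
  type: pandas.ParquetDataset
  filepath: data/{namespace}/03_primary/primary.parquet
```

This keeps the per-dataset folder layout you described while only maintaining one catalog pattern per layer.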
j
Hi! What I was trying to ask: will all "sets" of datasets go through the same pipelines and nodes, maybe with different parameters, with the catalog configured so that the different sets of datasets don't overwrite each other?