Kedro is an open-source Python framework for creating maintainable and modular data science code.

Hello Kedro experts!

Coming here for a bit of advice on Kedro Pipeline design choice.

Consider the following scenario:

1. You have catalog entries for datasets A to E
2. The datasets have different numbers of rows and different schemas
3. The datasets cannot be joined together to form a master table
4. Based on the value of a parameter, I want to pick one of the datasets and run the downstream pipeline with it
5. The pipeline is generic enough to handle any of the datasets

Defining a node just to select the data based on a parameter leads to unnecessary I/O, since all five datasets would be loaded as inputs to that node. Is there any other design choice I could make here?

Thanks! :slightly_smiling_face:

So we don’t encourage this sort of conditional dynamism or event-based routing, because it breaks all of the assumptions around reproducibility.

With Kedro in its current form I’m not sure I’d recommend it here

Could you use a modular pipeline instead, and invoke it either by namespace or by tags?

<@U045L91RV9D> Thanks for this. Yes, I agree that namespacing/tags could be a viable approach here.