# questions
Robert
Hi! I'm looking for a way to run the entire Kedro pipeline in parallel, where each instance takes a chunk of data from a huge database, to speed up the process. It's essentially a map-reduce process. What's the best way to do it? To give you more context: we have a products database with 8000+ categories, and the number of categories can change. Each category has to be processed separately, and the results are concatenated at the end.
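For reference, one common shape for this pattern in Kedro is a partitioned input feeding a single map-reduce node. A minimal sketch, assuming pandas data and a hypothetical `process_category` standing in for the real per-category work:

```python
# Minimal sketch (not the thread author's code): one Kedro node that
# receives a dict of lazy per-category loaders (the shape that
# kedro-datasets' PartitionedDataset passes to a node) and fans the
# work out over processes, concatenating at the end.
from concurrent.futures import ProcessPoolExecutor
from typing import Callable

import pandas as pd


def process_category(df: pd.DataFrame) -> pd.DataFrame:
    """Placeholder for the real per-category work (embeddings, tree build)."""
    return df


def map_reduce_categories(
    partitions: dict[str, Callable[[], pd.DataFrame]],
    max_workers: int = 8,
) -> pd.DataFrame:
    """Map: process each category in a worker process. Reduce: concatenate."""
    with ProcessPoolExecutor(max_workers=max_workers) as pool:
        # Loading happens in the parent process; only the heavy
        # per-category computation is shipped to the workers.
        futures = [
            pool.submit(process_category, load()) for load in partitions.values()
        ]
        results = [f.result() for f in futures]
    return pd.concat(results, ignore_index=True)
```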
h
Someone will reply to you shortly. In the meantime, this might help:
d
Hi Robert, where do you plan to execute your calculations?
n
> Huge database
How big in terms of size? High TBs? In that case you shouldn't even do the map-reduce in Kedro; you should use tools like Spark, and Kedro has built-in connectors for that purpose.
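For reference, a minimal sketch of that Spark route, where Spark handles the per-category fan-out and the inputs/outputs could be wired through kedro-datasets' `spark.SparkDataset`; the source paths, the `category` column, and the `process_category` body are assumptions:

```python
# Hypothetical sketch: Spark owns the per-category fan-out instead of Kedro.
import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("category-map-reduce").getOrCreate()

products = spark.read.parquet("path/to/products")  # illustrative source


def process_category(pdf: pd.DataFrame) -> pd.DataFrame:
    """Placeholder for the per-category work; must return the same schema."""
    return pdf


# groupBy + applyInPandas runs process_category once per category,
# distributed over the cluster; Spark concatenates the results.
result = products.groupBy("category").applyInPandas(
    process_category, schema=products.schema
)
result.write.parquet("path/to/result")
```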
Robert
In terms of running the pipeline, the current approach is to use Azure Batch. In terms of size, it's TBs of data, but the size itself isn't the problem; the number of subsets (= the number of categories) is. Unfortunately, each subset has to be processed separately (e.g. the process includes embedding extraction and building a search tree). A single run takes about 30-60 seconds.
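For reference, a rough sketch of that fan-out pattern: launching one `kedro run` per category with bounded concurrency, which mirrors what an Azure Batch pool does with one task per category. The pipeline name and `category` parameter are hypothetical, and the `--params` syntax (shown for Kedro 0.18+) should be checked against your Kedro version:

```python
# Hypothetical local fan-out: one `kedro run` per category, bounded pool.
import subprocess
from concurrent.futures import ThreadPoolExecutor

categories = ["cat_0001", "cat_0002"]  # in practice: queried from the database


def run_for_category(category: str) -> None:
    subprocess.run(
        [
            "kedro", "run",
            "--pipeline", "per_category",        # hypothetical pipeline name
            "--params", f"category={category}",  # each run sees one category
        ],
        check=True,
    )


# Bounded pool so 8000+ runs don't all start at once; on Azure Batch the
# pool's node and task limits play the same role.
with ThreadPoolExecutor(max_workers=16) as pool:
    list(pool.map(run_for_category, categories))
```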