# questions
Robert
Hi! I'm looking for a way to run the entire Kedro pipeline in parallel, where each instance takes a chunk of data from a huge database, to speed up the process. It's essentially a map-reduce process. What's the best way to do it? To give you more context: we have a products database with 8000+ categories, and the number of categories can change. Each category has to be processed separately, and the results are concatenated at the end.
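For reference, one common shape for this pattern in Kedro is a partitioned input feeding a single map-reduce node. A minimal sketch, assuming pandas data and a hypothetical `process_category` standing in for the real per-category work:

```python
# Minimal sketch (not the thread author's code): one Kedro node that
# receives a dict of lazy per-category loaders (the shape that
# kedro-datasets' PartitionedDataset passes to a node) and fans the
# work out over processes, concatenating at the end.
from concurrent.futures import ProcessPoolExecutor
from typing import Callable

import pandas as pd


def process_category(df: pd.DataFrame) -> pd.DataFrame:
    """Placeholder for the real per-category work (embeddings, tree build)."""
    return df


def map_reduce_categories(
    partitions: dict[str, Callable[[], pd.DataFrame]],
    max_workers: int = 8,
) -> pd.DataFrame:
    """Map: process each category in a worker process. Reduce: concatenate."""
    with ProcessPoolExecutor(max_workers=max_workers) as pool:
        # Loading happens in the parent process; only the heavy
        # per-category computation is shipped to the workers.
        futures = [
            pool.submit(process_category, load()) for load in partitions.values()
        ]
        results = [f.result() for f in futures]
    return pd.concat(results, ignore_index=True)
```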
h
Someone will reply to you shortly. In the meantime, this might help:
d
Hi Robert, where do you plan to execute your calculations?
n
> Huge database
How big in terms of size? High TBs? In that case you shouldn't even do the map-reduce in Kedro; you should use tools like Spark, and Kedro has built-in connectors for that purpose.
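For reference, a minimal sketch of that Spark route, where Spark handles the per-category fan-out and the inputs/outputs could be wired through kedro-datasets' `spark.SparkDataset`; the source paths, the `category` column, and the `process_category` body are assumptions:

```python
# Hypothetical sketch: Spark owns the per-category fan-out instead of Kedro.
import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("category-map-reduce").getOrCreate()

products = spark.read.parquet("path/to/products")  # illustrative source


def process_category(pdf: pd.DataFrame) -> pd.DataFrame:
    """Placeholder for the per-category work; must return the same schema."""
    return pdf


# groupBy + applyInPandas runs process_category once per category,
# distributed over the cluster; Spark concatenates the results.
result = products.groupBy("category").applyInPandas(
    process_category, schema=products.schema
)
result.write.parquet("path/to/result")
```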
Robert
In terms of running the pipeline, the current approach is to use Azure Batch. In terms of size, it's TBs of data, but the size itself isn't the problem; the number of subsets (= the number of categories) is. Unfortunately, each subset has to be processed separately (e.g. the process includes embedding extraction and building a search tree). A single run takes about 30-60 seconds.
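For reference, a rough sketch of that fan-out pattern: launching one `kedro run` per category with bounded concurrency, which mirrors what an Azure Batch pool does with one task per category. The pipeline name and `category` parameter are hypothetical, and the `--params` syntax (shown for Kedro 0.18+) should be checked against your Kedro version:

```python
# Hypothetical local fan-out: one `kedro run` per category, bounded pool.
import subprocess
from concurrent.futures import ThreadPoolExecutor

categories = ["cat_0001", "cat_0002"]  # in practice: queried from the database


def run_for_category(category: str) -> None:
    subprocess.run(
        [
            "kedro", "run",
            "--pipeline", "per_category",        # hypothetical pipeline name
            "--params", f"category={category}",  # each run sees one category
        ],
        check=True,
    )


# Bounded pool so 8000+ runs don't all start at once; on Azure Batch the
# pool's node and task limits play the same role.
with ThreadPoolExecutor(max_workers=16) as pool:
    list(pool.map(run_for_category, categories))
```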