Hello, team!
I have a question regarding best practices. I am developing a relatively classic ML solution that reads data from S3, runs ETL, and then trains and serves multiple models. Each model has a different preprocessing pipeline, while the ETL contains model-independent logic. I plan to use Kedro with the Kedro-MLflow plugin. I think the suggested
application architecture works great for me, but I have doubts about separation of concerns. My main concern is whether to keep the ETL and ML applications together in one repository. Here are some thoughts and inputs that I think will be useful for the decision:
1. I think each model will have its own repository with its own Kedro + Kedro-MLflow setup. The logic between the models and their pipelines is very different, and the teams working on them are expected to be independent. However, all teams depend on the same ETL output and will therefore have to sync on contract changes
2. The ETL and ML apps will very likely use different infrastructure: for example, AWS Batch and AWS SageMaker, respectively.
3. Both the ETL and ML apps are expected to be orchestrated with Kedro-Airflow
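To make point 1 more concrete, one option I am considering is having the ETL team publish an explicit, versioned data contract that each model repo pins as a dependency, instead of relying on implicit column names. A minimal sketch (all names, columns, and dtypes here are hypothetical placeholders, not our actual schema):

```python
# Hypothetical data-contract module that the ETL team would publish as a
# small versioned package; each model repo imports it and validates the
# ETL output schema before running its own preprocessing pipeline.

CONTRACT_VERSION = "1.2.0"  # placeholder version string

# Placeholder column names and dtypes for the ETL output dataset.
EXPECTED_COLUMNS = {
    "customer_id": "string",
    "event_ts": "timestamp",
    "amount": "double",
}


def validate_columns(columns: dict) -> list:
    """Compare an observed {name: dtype} schema against the contract.

    Returns a list of human-readable mismatches; an empty list means the
    ETL output satisfies the contract.
    """
    problems = []
    for name, dtype in EXPECTED_COLUMNS.items():
        if name not in columns:
            problems.append(f"missing column: {name}")
        elif columns[name] != dtype:
            problems.append(f"wrong dtype for {name}: {columns[name]} != {dtype}")
    return problems
```

A model repo could then call `validate_columns` in a small gate node at the start of its pipeline, so a contract change in the ETL fails fast with a clear message rather than surfacing as a training error downstream.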
Thank you very much for your help!