# questions
v
Hello everyone. I have a question related to a particular use case and its best practices. I'm building different pipelines designed to follow the classic lifecycle of building a model, from preprocessing the training data to fine-tuning the model and evaluating its results. However, I'm now concerned about the case in which I intend to use the model to evaluate new subjects. In particular, this scenario has the following characteristics:

- First, the data doesn't arrive en masse, nor is it expected to do so at any point. The cases to be evaluated are limited.
- The prediction is generated on demand and asynchronously (it's understood that for some cases preprocessing may take time, so the associated routine is executed in a parallel task for the user).
- The data would come from a server other than the one where Kedro would be running.

Given this, what would be the most recommended complementary tools to serve the model and its results? What would be the most appropriate functional architecture? I have tools like Airflow at my disposal, but I'm not sure whether that's enough, whether I should use another tool to set up an API, or whether Kedro alone is enough to do it all. The possibilities are endless, but I want to avoid reinventing the wheel as much as possible. Any recommendations are welcome. Thanks in advance.
n
> The prediction is generated on demand and asynchronously
How does this work? Can you elaborate a bit more?
> The data would come from a server other than the one where Kedro would be running.
When you say server, does that mean it comes in as a web request?
Without more context, I think there are two common ways:

- Batch prediction: in cases where it's not time-sensitive, you may just need a scheduled job running periodically (hourly, daily), with the output written somewhere, e.g. a database table. Serving is then really just a retrieval from your database corresponding to the request (maybe a user_id if it's a prediction about a user, or an item_id for a recommendation).
- Per-request prediction: in this case you probably want a Kedro pipeline serving behind an endpoint (e.g. a FastAPI server calling a Kedro pipeline), as in the sketch below. Depending on the performance needs, you may need something more.
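For the per-request option, here is a minimal sketch of what that could look like, assuming a FastAPI app wrapping a `KedroSession`. The project path and the `inference` pipeline name are placeholders, not from the thread, and `BackgroundTasks` is used so the endpoint returns immediately while the slow preprocessing runs in the background:

```python
# Hypothetical sketch: a FastAPI endpoint that queues a Kedro pipeline run.
# PROJECT_PATH and the "inference" pipeline name are assumptions; the
# pipeline is expected to persist its own output (e.g. to a database).
from fastapi import BackgroundTasks, FastAPI
from kedro.framework.session import KedroSession
from kedro.framework.startup import bootstrap_project

PROJECT_PATH = "/path/to/kedro-project"  # placeholder
bootstrap_project(PROJECT_PATH)

app = FastAPI()

def run_inference(subject_id: str) -> None:
    # One Kedro session per job; subject_id is passed in as an extra parameter.
    with KedroSession.create(
        project_path=PROJECT_PATH,
        extra_params={"subject_id": subject_id},
    ) as session:
        session.run(pipeline_name="inference")

@app.post("/predict/{subject_id}", status_code=202)
def predict(subject_id: str, background_tasks: BackgroundTasks):
    # Return 202 immediately; the heavy preprocessing runs in the background.
    background_tasks.add_task(run_inference, subject_id)
    return {"subject_id": subject_id, "status": "queued"}
```

For multi-minute jobs, the same pattern works with a proper task queue (Celery, RQ, etc.) in place of FastAPI's in-process background tasks.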
v
Thank you very much for your response. Perhaps I should have included a little more context.

This is an application that uses Django as a backend (with the Django Rest Framework) and Vue.js as a frontend. It was created about 5 years ago as a prototype and is currently being updated and extended. Its main purpose is multiclass classification of subjects based on three types of raw data that characterize them. Some of these files are several MB in size. Since the raw files used to obtain the attributes are large, processing them takes several minutes, so for a new subject the response from a classifier cannot be obtained immediately. The data does not arrive consistently at a certain frequency (it is not transactional data), nor is a massive number of subjects expected. The application's user base is very limited and specialized, so classifications will be performed on demand rather than in bulk on all subjects uploaded to the application. However, since obtaining the attributes takes a long time, the result will be generated asynchronously.

One of the functionalities planned at some point was to maintain different versions of the prediction models built to classify subjects, for research and documentation purposes. The problem is that Django doesn't provide the necessary tools for this (it would mean building the required environment from scratch), considering that each model could (but wouldn't necessarily) require a new version of the file-processing functions that obtain the features (i.e., the attributes could differ between models). To avoid a development disaster, we looked for a more professional option and chose Kedro.

To further justify our choice: in my organization we have several projects of this nature, where we experiment with machine learning and applied deep learning. The experimentation process often involves a lot of messy code scattered across, for example, Jupyter Notebooks, so keeping track of experiments becomes chaotic and makes projects like this one very difficult, where there is an interest in maintaining order and different versions of the models. That's why we considered making organizational use of Kedro and complementary tools such as MLFlow, Airflow, Optuna, etc.

So, on a dedicated server with a GPU, we set up a Docker container with Airflow as the base, which also contains tools such as Kedro, MLFlow, Optuna, and various ML and DL packages like scikit-learn, PyTorch, RAPIDS, etc. This container is intended for the final step, when the models move to a production stage, for each specific project. As we are still in a learning process, we have not yet used it; in this particular case, development is still being done with Kedro outside the container. The Django application therefore runs on a separate server, and indeed, communication between Django and the Kedro pipelines could be built using web requests (sketched below).

So far, using Kedro, MLFlow, and Optuna, we have reached the point where we have several trained models and want to test deploying one of them, in order to design the functional architecture that will allow communication between Django and this server. As I mentioned, we don't want to program something that already exists, so for the moment we have avoided writing an API directly for the deployment step, as that would be the crudest way to do it.
We understand that Kedro integrates with other deployment tools, such as Apache Airflow, but we don't know if this would be the best option for this case. So we're simply looking for recommendations. I hope I've been clearer this time. Thank you so much for the support.
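For concreteness, the web-request communication mentioned above could look roughly like this on the Django side. The model-server address and the `/predict` endpoint are hypothetical, matching the FastAPI sketch earlier in the thread:

```python
# Rough sketch of the Django side of the web-request idea. The model-server
# address and /predict endpoint are assumptions, not an existing API.
import requests
from rest_framework.decorators import api_view
from rest_framework.response import Response

MODEL_SERVER = "http://gpu-server:8000"  # placeholder address

@api_view(["POST"])
def classify_subject(request, subject_id):
    # Ask the model server to queue the classification, then report
    # "pending" to the frontend; the result is fetched or polled later.
    resp = requests.post(f"{MODEL_SERVER}/predict/{subject_id}", timeout=10)
    resp.raise_for_status()
    return Response({"subject_id": subject_id, "status": "pending"}, status=202)
```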
n
Do I get this correctly?

- The request is heavy, so a user would trigger the prediction job manually without getting a response immediately?
- So you don't want the web request to trigger a Kedro run directly, but instead trigger a scheduled DAG in Airflow, for example?
It sounds like what you need is some kind of queue to handle the async processing part. If you want to use Airflow as the task-queuing system, I think that's fine, given that you can generate a parameterised DAG (with parameters from your request used to trigger a certain job), along the lines of the sketch below.
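If you go the Airflow route, triggering a parameterised DAG run from the web app via Airflow's stable REST API could look roughly like this. The host, DAG id, and credentials are placeholders; the `conf` payload becomes `dag_run.conf` inside the DAG:

```python
# Hypothetical sketch: trigger a parameterised DAG run over Airflow's
# stable REST API. AIRFLOW_URL, DAG_ID, and the credentials are placeholders.
import requests

AIRFLOW_URL = "http://airflow-host:8080/api/v1"
DAG_ID = "classify_subject"  # placeholder DAG id

def trigger_classification(subject_id: str) -> dict:
    resp = requests.post(
        f"{AIRFLOW_URL}/dags/{DAG_ID}/dagRuns",
        json={"conf": {"subject_id": subject_id}},  # becomes dag_run.conf
        auth=("user", "password"),  # basic auth must be enabled in Airflow
        timeout=10,
    )
    resp.raise_for_status()
    return resp.json()
```

Inside the DAG, a task can then read `context["dag_run"].conf["subject_id"]` and pass it to the Kedro run as a parameter.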