# questions
j
hi all, I need some help! we are a startup working with real estate data, and we are defining the deployment strategy for kedro on AWS. I would love to discuss this with some experienced kedro users. this is our use case: we have a set of kedro projects with different models and processing pipelines. a typical project for us has the following components:

[1] a kedro data processing pipeline
[2] a kedro model training and evaluation pipeline
[3] an inference/prediction pipeline
[4] an S3 bucket for each project
[5] source data, usually in our data lake on AWS, i.e. also an S3 bucket

our models are not that big, so we usually run [1] and [2] locally, and the artifacts (model.pkl and other stuff) are generated and stored in the S3 location [4]. so far so good. with [3], we have different requirements:

[3.0] post-processing steps: run the prediction on the model, then, depending on the results from the model, extract some data from other datasets for that prediction, and format some of the data. we don't want to duplicate this code between [3.1] and [3.2].
[3.1] scheduled execution: we need to run the inference pipeline every x time period, let's say every month. so, every month the generated data about real estate properties [5] needs to be processed, and the output dataset is stored in S3 [4]. -> we have tried a docker image on EC2 with a scheduled task, which kind of works, but this doesn't play well with [3.2].
[3.2] on-demand execution: we need to run the 'same' pipeline when a user registers a new property on our website. basically we need the same output information as in [3.1], but for one single registry. -> we have tried a lambda function that has the code for [3.0] and has access to the S3 location [4]. this works, but the code for the post-processing [3.0] is duplicated.

questions:
? what would be a good architecture to solve this without duplicating the code in [3.0], the inference pipeline?
? how can we have an inference pipeline that can be run both on demand and on a schedule?
? what AWS services are recommended for this?

thank you very much for any help or advice you can give me! best, camilo
d
> we have tried a lambda function that has the code for [3.0] and has access to the S3 location [4]. this works, but the code for the post-processing [3.0] is duplicated.
Why is the code duplicated? I don't think I understood from your writeup. Is the problem that you don't package the code for [3.0] in a way that it can be run from both the service doing scheduled execution and the one for on-demand execution?
j
I understand your point. I guess I could write a Python library for the inference pipeline [3.0] and import it in both the kedro pipeline and the lambda… I've done that with other shared code. I wrote [3.0] as a kedro pipeline initially; I thought this would be the proper kedro way of doing things, I don't know. • I tried micro-packaging the pipeline and importing it in the lambda, but the dependency was too large and Jenkins had some issues with it.
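roughly what I have in mind for the shared lib (just a sketch, all the names are made up):

```python
# shared package, e.g. inference_lib/core.py (hypothetical name)
import pandas as pd


def run_inference(model, properties: pd.DataFrame) -> pd.DataFrame:
    """Score properties and apply the post-processing steps from [3.0]."""
    scored = properties.copy()
    scored["prediction"] = model.predict(properties)
    # ...the extraction/formatting steps from [3.0] would also live here...
    return scored
```

then the kedro node in [3.1] just wraps run_inference, and the lambda for [3.2] would be something like:

```python
# lambda_handler.py (hypothetical): same function, single registry
import pickle

import boto3
import pandas as pd

from inference_lib.core import run_inference

s3 = boto3.client("s3")


def handler(event, context):
    # load the model artifact produced by the training pipeline [2] from [4]
    obj = s3.get_object(Bucket="my-project-bucket", Key="models/model.pkl")
    model = pickle.loads(obj["Body"].read())
    single = pd.DataFrame([event["property"]])
    return run_inference(model, single).to_dict(orient="records")
```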
I think this is more about what the proper way of doing this is. what do you think @Deepyaman Datta?
d
> • I tried micro-packaging the pipeline and importing it in the lambda, but the dependency was too large and Jenkins had some issues with it.
This is in line with the approach I'd take. You could push the generated wheel to a shared space, like S3 or even an internal PyPI repository. That wheel would be used by both deployments. You could alternatively have a CI process that builds the Docker image for EC2 + the Docker image for Lambda (I don't have enough recent hands-on experience with AWS to say whether you could use the same image), and use those in your deployments. There are also "hackier" ways of doing things, like dynamically referencing the pipeline code from a shared location, but I'm not sure why you would do that instead of the above two options.
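For example, once the wheel is installed, both deployments could trigger the packaged pipeline the same way. A minimal sketch, assuming the package is named my_project (the exact session API depends on your Kedro version):

```python
# run the packaged "inference" pipeline from any entry point,
# assuming the project wheel is installed as `my_project`
from kedro.framework.project import configure_project
from kedro.framework.session import KedroSession

configure_project("my_project")  # package name from the wheel

with KedroSession.create() as session:
    session.run(pipeline_name="inference")
```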
j
the thing with the wheel is that the resulting dependencies for kedro are too large. I had a hard time getting that to run with our Jenkins CI. maybe two different docker images are a good alternative
what AWS services would you use for these use cases? are EC2 and Lambda OK for this? or would you recommend some SageMaker stuff?
d
I haven't used AWS much, primarily GCP and Azure. I'll defer to somebody else on that. (But, if I had to guess, I'd say Lambda is good for dynamic execution, maybe Batch for scheduled execution, and I'm sure there's a way to do something involving SageMaker, but I've never used it.)
j
ok thanks @Deepyaman Datta, that sheds some light!
👍 1
@Deepyaman Datta do you have any experience with kedro-mlflow? would it be a good alternative for our use case?
maybe there are some AWS + kedro ninjas here who can also provide an opinion regarding AWS services and architecture?
d
I haven't kept up with the latest developments, but the main part of kedro-mlflow I'm familiar with is the model registry. So, if you want to publish/load a versioned model, that makes sense (and it could be a good addition to your workflow), but it doesn't directly answer your initial question.
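To be concrete about the publish/load flow I mean, in plain MLflow terms (a toy sketch; it assumes a tracking server with the model registry enabled, and "property_model" is a made-up name):

```python
import mlflow
import pandas as pd
from sklearn.linear_model import LinearRegression

# training side [2]: fit a toy model, then log and register it
X = pd.DataFrame({"sqm": [50, 80, 120]})
y = [100_000, 160_000, 250_000]
model = LinearRegression().fit(X, y)

with mlflow.start_run():
    mlflow.sklearn.log_model(model, "model", registered_model_name="property_model")

# inference side [3]: load a pinned, versioned model by registry URI
loaded = mlflow.pyfunc.load_model("models:/property_model/1")
print(loaded.predict(pd.DataFrame({"sqm": [95]})))
```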
j
hi @J. Camilo V. Tieck, on top of everything that @Deepyaman Datta said, have a look at https://github.com/getindata/kedro-sagemaker for a Kedro & AWS SageMaker integration created by the GetInData folks cc @marrrcin
j
hi @Juan Luis, I looked at kedro-sagemaker, but if I got it right, it is for running the training nodes, right? our model is not that large and can be run locally. basically, I need to deploy the inference pipeline in two scenarios: batch/scheduled from any AWS service, and on demand from an AWS Lambda function.
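what I'm picturing for the two scenarios is a single entry point that branches on the trigger, something like this (untested sketch; the inference_lib imports are the hypothetical shared package from above, and the EventBridge payload check is an assumption):

```python
import pandas as pd

from inference_lib.core import run_inference  # shared code sketched above
from inference_lib.io import load_model, load_monthly_properties  # hypothetical helpers


def handler(event, context):
    model = load_model()  # model.pkl from the project bucket [4]
    if event.get("source") == "aws.events":
        # scheduled trigger from EventBridge -> batch path [3.1]
        batch = load_monthly_properties()
        return run_inference(model, batch).to_dict(orient="records")
    # direct invocation from the website -> on-demand path [3.2]
    single = pd.DataFrame([event["property"]])
    return run_inference(model, single).to_dict(orient="records")
```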
j
I see - I took a quick look at our docs and couldn't find anything, so maybe re-posting outside of this thread for higher visibility might help
j
alright!