# random
o
Hi Everyone, does anyone know whether Kedro supports Prefect 2.0? The deployment configuration with Prefect in the official documentation seems to refer to Prefect 1.0.
m
In theory, yes. In practice, however, I would not recommend using kedro if you are already using Prefect (and vice versa).
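In theory it would look something like wrapping a Kedro session run inside a Prefect 2 flow and letting Prefect schedule it as one unit. Just a rough sketch, assuming a recent Kedro (0.18+) and Prefect 2 — the project path and pipeline name are placeholders:
```python
from pathlib import Path

from prefect import flow
from kedro.framework.session import KedroSession
from kedro.framework.startup import bootstrap_project

# assumption: the flow is started from the root of the Kedro project
PROJECT_PATH = Path.cwd()

@flow
def run_kedro_pipeline(pipeline_name: str = "__default__") -> None:
    # the whole Kedro pipeline runs as a single Prefect flow step here;
    # you could also map individual nodes to Prefect tasks, as the 1.0 docs do
    bootstrap_project(PROJECT_PATH)
    with KedroSession.create(project_path=PROJECT_PATH) as session:
        session.run(pipeline_name=pipeline_name)

if __name__ == "__main__":
    run_kedro_pipeline()
```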
t
@Matthias Roels why wouldn’t you recommend using them together? Could you elaborate?
m
In my opinion, apart from the orchestration component of Prefect, they largely overlap in functionality. With Prefect’s decorators for tasks and flows, it is easy enough to create pipeline definitions. The only additional advantage of kedro is its data catalog and datasets. Other than that, Prefect + hydra can cover most of your needs.
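For example, a pipeline definition in plain Prefect 2 can be as small as this (a sketch only; the step functions are made up for illustration):
```python
from prefect import flow, task

@task
def extract() -> list[int]:
    # placeholder for a real extraction step
    return [1, 2, 3]

@task
def transform(rows: list[int]) -> list[int]:
    return [r * 2 for r in rows]

@task
def load(rows: list[int]) -> None:
    print(f"loaded {len(rows)} rows")

@flow
def etl():
    # tasks called inside a flow run immediately and return their results
    load(transform(extract()))

if __name__ == "__main__":
    etl()
```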
o
Thanks @Matthias Roels. It’s really a puzzle for us and I’d love to hear your thoughts on this
We have a data science team that uses Kedro as their SDLC platform and DS framework to keep everything neat, organized, and reproducible
This is as far as the development goes
Now as far as deployment goes for data science pipelines, a lot of companies are using Apache Airflow
Now Prefect seems to address a lot of the gaps that Airflow has, and the key selling points for me in an orchestrator like Prefect were the DAG visualization and the UI for tracking and submitting jobs
And its obvious ecosystem and native support for docker compose as well as Kubernetes
The question is whether we can have the best of both worlds of Kedro and Prefect as part of our platform and development and deployment cycle
m
If you are already using kedro and are looking for an orchestrator, I would either opt for a managed Airflow (Astronomer, AWS managed Airflow, GCP Cloud Composer), for the simple reason that this is still the number one most used orchestrator (and hence a lot of support/troubleshooting available), OR, if you are heavily invested in k8s, Argo Workflows, as this is a k8s-native workflow engine and slightly more flexible/modern. In my project, we went for kedro + Argo Workflows because we are indeed heavily invested in k8s!
o
Thanks @Matthias Roels! We are heavily invested. How do you run a lighter-weight Kedro deployment? e.g. for local development work we use docker compose, and for simple GitHub workflows (CI jobs) we don’t necessarily want to provision a cluster
Do you use k8s for your day to day development and testing workflows or just for deployment?
m
Well, that’s something else I forgot to mention. We also use Argo Events, and combined with Argo Workflows we have an event-driven workflow orchestration framework. It is so general purpose and powerful that we use it to run our CI workloads too (with Kaniko for building container images) instead of our enterprise Jenkins setup. Hence, we can run a kedro pipeline (or a set of kedro pipelines) during CI with a simple argo submit to let it run in our dev cluster.
For actual development, we use openVSCode in a docker image together with all required dependencies, deployed in our cluster and exposed with an ingress (one set for every developer). This setup is inspired by gitpod’s workspace offerings. So a developer can run a small pipeline “locally” (which means it still runs in our dev cluster connected to S3 to fetch the required raw data)
It is not perfect from a setup point of view, but it works and gives you the benefit that everyone is using the same setup with minimal effort (a k8s operator takes care of deploying the required resources for such a setup using a CRD per dev)
As next steps, we want to further streamline this process/setup as well as look into ways to automatically generate Workflow CRDs for our kedro pipelines. Note that we do not want to convert every kedro node into a task in an Argo DAG, as that would result in too much overhead from pod startups. Instead, we want to split our kedro pipeline into several sub-pipelines and run each of these pipelines in a task
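One way to set that up (a sketch only, not our actual code — the sub-pipeline names and package paths are made up) is to register the sub-pipelines explicitly in pipeline_registry.py, so each Argo DAG task can just call `kedro run --pipeline=<name>`:
```python
# src/my_project/pipeline_registry.py -- hypothetical project layout
from kedro.pipeline import Pipeline

# hypothetical sub-pipeline packages, each exposing create_pipeline()
from my_project.pipelines import preprocessing, training, scoring

def register_pipelines() -> dict[str, Pipeline]:
    preprocess = preprocessing.create_pipeline()
    train = training.create_pipeline()
    score = scoring.create_pipeline()
    return {
        # each entry below maps to one Argo DAG task that runs
        # `kedro run --pipeline=<name>` inside a single pod
        "preprocessing": preprocess,
        "training": train,
        "scoring": score,
        # full pipeline, still available for local runs
        "__default__": preprocess + train + score,
    }
```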
o
Thanks Matthias, that’s a lot of new terms for me but it looks very interesting and I’ll read about it. One thing to note is that we are deploying on-prem in air-gapped managed k8s environments (e.g. AKS / Rancher), so no access whatsoever to SaaS/cloud-based managed services
m
No worries, we are not using any SaaS service either (too much of a hassle from a compliance POV). We have made our own implementations, so it all runs in our own environment! It is actually not that hard. If creating an operator feels too difficult, you could start with a simple Helm chart as well
o
I’ll need 2 DevOps to maintain the setup you just described haha
m
Then who’s managing your clusters, docker images, …?
o
and someone who is very hands-on with the technology to do the initial bring-up of the stack (or myself)
Currently we are using “DevOps as a service”, i.e. professional services, to do the bootstrapping - Terraform files, Helm charts, etc.
m
In our case, it’s a project with roughly 35 developers (data scientists and data engineers). We have a separate infra team to manage our cloud setup (terraform) and we have 1.6FTE on the project (incl. me) to manage the rest (cluster apps with Helm, …)
It’s doable if you have the right amount of automation (incl. GitOps/ArgoCD) in place
o
Yeah I guess there’s a lot of grunt work to be done to have a good infra and a working setup
And cool, 35 sounds like a lot of R&D activity
m
It’s actually just 1 project running at massive scale (1 solution rolled out per country x product combination across EMEA), and we are still scaling up in the region 😅. So a lot of work to be done to onboard new country and product combinations (data integration, model building with input from business stakeholders, …). But indeed, we could only do the platform side with 1.6FTE thanks to the solid processes we have in place. In the early days, we were a 3FTE team to give room for extensive feature development.