# random
o
Hi Everyone, does anyone know whether Kedro supports Prefect 2.0? The deployment configuration with Prefect in the official documentation seems to refer to Prefect 1.0.
m
In theory, yes. In practice, however, I would not recommend using kedro if you are already using Prefect (and vice versa).
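In theory it would look something like wrapping a Kedro session run inside a Prefect 2 flow and letting Prefect schedule it as one unit. Just a rough sketch, assuming a recent Kedro (0.18+) and Prefect 2 — the project path and pipeline name are placeholders:
```python
from pathlib import Path

from prefect import flow
from kedro.framework.session import KedroSession
from kedro.framework.startup import bootstrap_project

# assumption: the flow is started from the root of the Kedro project
PROJECT_PATH = Path.cwd()

@flow
def run_kedro_pipeline(pipeline_name: str = "__default__") -> None:
    # the whole Kedro pipeline runs as a single Prefect flow step here;
    # you could also map individual nodes to Prefect tasks, as the 1.0 docs do
    bootstrap_project(PROJECT_PATH)
    with KedroSession.create(project_path=PROJECT_PATH) as session:
        session.run(pipeline_name=pipeline_name)

if __name__ == "__main__":
    run_kedro_pipeline()
```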
t
@Matthias Roels why wouldn’t you recommend using them together? Could you elaborate?
m
In my opinion, apart from the orchestration component of Prefect, they largely overlap in functionality. With Prefect’s decorators for tasks and flows, it is easy enough to create pipeline definitions. The only additional advantage of kedro is its data catalog and datasets. Other than that, Prefect + hydra can cover most of your needs.
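For example, a pipeline definition in plain Prefect 2 can be as small as this (a sketch only; the step functions are made up for illustration):
```python
from prefect import flow, task

@task
def extract() -> list[int]:
    # placeholder for a real extraction step
    return [1, 2, 3]

@task
def transform(rows: list[int]) -> list[int]:
    return [r * 2 for r in rows]

@task
def load(rows: list[int]) -> None:
    print(f"loaded {len(rows)} rows")

@flow
def etl():
    # tasks called inside a flow run immediately and return their results
    load(transform(extract()))

if __name__ == "__main__":
    etl()
```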
o
Thanks @Matthias Roels. It’s really a puzzle for us and I’d love to hear your thoughts on this
We have a data science team that uses Kedro as their SDLC platform and DS framework to keep everything neat, organized, and reproducible
This is as far as the development goes
Now as far as deployment goes for data science pipelines, a lot of companies are using Apache Airflow
Now Prefect seems to address a lot of the gaps that Airflow has, and the key selling points for me in an orchestrator like Prefect were the DAG visualization and the UI for tracking and submitting jobs
And its obvious ecosystem and native support for docker compose as well as Kubernetes
The question is whether we can have the best of both worlds of Kedro and Prefect as part of our platform and development and deployment cycle
m
If you are already using kedro and are looking for an orchestrator, I would either opt for a managed Airflow (Astronomer, AWS managed Airflow, GCP Cloud Composer), for the simple reason that this is still the number one most used orchestrator (and hence a lot of support/troubleshooting available), OR, if you are heavily invested in k8s, Argo Workflows, as this is a k8s-native workflow engine and slightly more flexible/modern. In my project, we went for kedro + Argo Workflows because we are indeed heavily invested in k8s!
o
Thanks @Matthias Roels! We are heavily invested. How do you run a lighter-weight Kedro deployment? e.g. for local development work we use docker compose, and for simple GitHub workflows (CI jobs) we don’t necessarily want to provision a cluster
Do you use k8s for your day to day development and testing workflows or just for deployment?
m
Well, that’s something else I forgot to mention. We also use Argo Events, and combined with Argo Workflows we have an event-driven workflow orchestration framework. It is so general purpose and powerful that we use it to run our CI workloads too (with Kaniko for building container images) instead of our enterprise Jenkins setup. Hence, we can run a kedro pipeline (or a set of kedro pipelines) during CI with a simple argo submit to let it run in our dev cluster.
For actual development, we use openVSCode in a docker image together with all required dependencies, deployed in our cluster and exposed with an ingress (one set for every developer). This setup is inspired by gitpod’s workspace offerings. So a developer can run a small pipeline “locally” (which means it still runs in our dev cluster connected to S3 to fetch the required raw data)
It is not perfect from a setup point of view, but it works and gives you the benefit that everyone is using the same setup with minimal effort (a k8s operator takes care of deploying the required resources for such a setup using a CRD per dev)
As next steps, we want to further streamline this process/setup as well as look into ways to automatically generate Workflow CRDs for our kedro pipelines. Note that we do not want to convert every kedro node into a task in an Argo DAG, as that would result in too much overhead from pod startups. Instead, we want to split our kedro pipeline into several sub-pipelines and run each of these pipelines in a task
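One way to set that up (a sketch only, not our actual code — the sub-pipeline names and package paths are made up) is to register the sub-pipelines explicitly in pipeline_registry.py, so each Argo DAG task can just call `kedro run --pipeline=<name>`:
```python
# src/my_project/pipeline_registry.py -- hypothetical project layout
from kedro.pipeline import Pipeline

# hypothetical sub-pipeline packages, each exposing create_pipeline()
from my_project.pipelines import preprocessing, training, scoring

def register_pipelines() -> dict[str, Pipeline]:
    preprocess = preprocessing.create_pipeline()
    train = training.create_pipeline()
    score = scoring.create_pipeline()
    return {
        # each entry below maps to one Argo DAG task that runs
        # `kedro run --pipeline=<name>` inside a single pod
        "preprocessing": preprocess,
        "training": train,
        "scoring": score,
        # full pipeline, still available for local runs
        "__default__": preprocess + train + score,
    }
```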
o
Thanks Matthias, that’s a lot of new terms for me but it looks very interesting and I’ll read about it. One thing to note is that we are deploying on-prem in air-gapped managed k8s environments (e.g. AKS / Rancher), so no access whatsoever to SaaS/cloud-based managed services
m
No worries, we are not using any SaaS service either (too much of a hassle from a compliance POV). We have made our own implementations, so it all runs in our own environment! It is actually not that hard. If creating an operator feels too difficult, you could start with a simple Helm chart as well
o
I’ll need 2 DevOps to maintain the setup you just described haha
m
Then who’s managing your clusters, docker images, …?
o
and someone who is very hands-on with the technology to do the initial bring-up of the stack (or myself)
Currently we are using “DevOps as a service”, i.e. professional services, to do the bootstrapping - Terraform files, Helm charts, etc.
m
In our case, it’s a project with roughly 35 developers (data scientists and data engineers). We have a separate infra team to manage our cloud setup (terraform) and we have 1.6FTE on the project (incl. me) to manage the rest (cluster apps with Helm, …)
It’s doable if you have the right amount of automation (incl. GitOps/ArgoCD) in place
o
Yeah I guess there’s a lot of grunt work to be done to have a good infra and a working setup
And cool, 35 sounds like a lot of R&D activity
m
It’s actually just 1 project running at massive scale (1 solution rolled out per country x product combination across EMEA), and we are still scaling up in the region 😅. So a lot of work to be done to onboard new country and product combinations (data integration, model building with input from business stakeholders, …). But indeed, we could only do the platform side with 1.6FTE thanks to the solid processes we have in place. In the early days, we were a 3FTE team to give room for extensive feature development.