Ofir
03/05/2023, 10:46 PM

Deepyaman Datta
03/06/2023, 5:53 AM

Ofir
03/06/2023, 2:55 PM
Can we call subprocess.call(['kedro', 'run']) from one of the tasks in our Prefect workflow?
Deepyaman Datta
03/06/2023, 3:21 PM
You could do this with a PrefectRunner, without having to containerize anything. Basically, it should be possible to automatically convert Kedro node -> Prefect task and Kedro pipeline -> Prefect flow (or maybe even modular pipeline -> flow and the overall pipeline -> flow of flows). I think mapping dataset behavior is a bit more challenging, since Prefect has a different method of handling task results (can you control where to write each result from a task? I'm not sure...).
I can try to write up a Kedro issue on this (or feel free to post one to our GitHub), as I've also been keen on exploring Prefect 2 deployment for a while (I personally think it would be very cool to have a PrefectRunner for Kedro out of the box as a default deployment option), but I can't promise any timeline. Let me see if anybody else on the team has heard much in the way of general Prefect deployment demand. 🙂
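(As a rough illustration of that node -> task / pipeline -> flow mapping: a minimal sketch assuming Prefect 2's task/flow decorators, with a plain dict standing in for the DataCatalog. This is not an actual PrefectRunner; all names here are illustrative.)

from prefect import flow, task
from kedro.pipeline import Pipeline


def pipeline_to_flow(pipeline: Pipeline, free_inputs: dict):
    """Wrap each Kedro node in a Prefect task and run the pipeline as one flow."""

    @flow(name="kedro-pipeline")
    def kedro_flow():
        data = dict(free_inputs)  # stands in for the DataCatalog
        for node in pipeline.nodes:  # Kedro returns these topologically sorted
            run_node = task(name=node.name)(node.func)  # node -> Prefect task
            results = run_node(*[data[name] for name in node.inputs])
            if node.outputs:
                if len(node.outputs) == 1:
                    results = [results]
                data.update(zip(node.outputs, results))
        return data

    return kedro_flow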
Ofir
03/06/2023, 3:22 PM
Can the kedro docker build build a deployment package of our Kedro pipeline onto a Docker image?
Deepyaman Datta
03/06/2023, 8:43 PM
> @Deepyaman Datta what are the most common deployment scenarios you experience with Kedro's customers? Dask? Airflow?
I think @Yetunde, @Ivan Danov, or @Merel can probably answer this with more data, but from what I've seen/recall:
• Databricks is very common, although that is probably skewed by the high use of Spark and Databricks among QuantumBlack/McKinsey projects.
• Other cloud-based workflow orchestrators are often seen, such as Vertex AI (managed Kubeflow Pipelines) and Azure ML pipelines. GetInData (a company unaffiliated with QuantumBlack/McKinsey) has developed a number of plugins in this space, and we see a fair number of people looking to use such things in #plugins-integrations (anecdotally).
• I think Airflow is the next most common, but I could be wrong on this (it could be above some of the cloud-based orchestrators).
• We've also seen SageMaker, Argo Workflows, etc., and even more closed platforms like Dataiku.
• Dask is quite uncommon (😢).
> Can the kedro docker build build a deployment package of our Kedro pipeline onto a Docker image?
It can. kedro docker is quite simple, in that it packages everything into a single Docker image, and then you can run whatever pipelines you want. If you're happy with this, it's a fine option.
If you want each node to map to a task in an orchestrator, it's not as great for that approach.
Ofir
03/07/2023, 1:12 PM

Zhe An
03/16/2023, 4:44 PM

Deepyaman Datta
03/16/2023, 5:24 PM
Ofir
05/05/2023, 10:05 AM
Is it a bad practice to run os.system("kedro run") from a Prefect workflow? What are the disadvantages of doing such a thing?
Deepyaman Datta
05/06/2023, 9:11 PM
os.system("kedro run") is using Python to call a CLI endpoint that itself runs Python functions.
• At the very minimum, I don't think os.system is great for capturing potential errors, etc. But this can be remedied by using subprocess module methods (which are actually used heavily within Kedro), whose output, logs, and error codes you can check; see the first sketch after this list.
• Better yet, if you do essentially want to call the run method, you could do something like:
from pathlib import Path

from kedro.framework.session import KedroSession
from kedro.framework.startup import bootstrap_project

project_root = Path.cwd()  # the root directory of the Kedro project
bootstrap_project(project_root)
with KedroSession.create(project_path=project_root, env="prefect") as session:
    session.run()
(copied the boilerplate from https://docs.kedro.org/en/stable/integrations/databricks_workspace.html#running-kedro-project-from-a-databricks-notebook actually)
• Finally, if you want more fine-grained control and/or are using bits and pieces of Kedro with more Prefect, you can construct the runner + pipelines independently, configure them, and run; see the second sketch at the end of this message. Alternatively (and ideally), a plugin/hook handles most of the dirty work here.
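(First sketch: calling kedro run through the subprocess module from a Prefect task, so the exit code and output can actually be checked. Assumes Prefect 2 and that the task runs from the Kedro project root; names are illustrative.)

import subprocess

from prefect import flow, task


@task
def run_kedro():
    # check=True raises CalledProcessError on a non-zero exit code,
    # so the Prefect task fails loudly instead of silently succeeding.
    result = subprocess.run(
        ["kedro", "run"], check=True, capture_output=True, text=True
    )
    return result.stdout


@flow
def kedro_flow():
    run_kedro()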
The main reason I think some of the existing deployment guides do the shelled execution is that a lot of workflow orchestrators expect a shell script of sorts as a task, or Python tasks are often newer features. Where orchestrators do support Python tasks, I think we may as well take advantage of that.
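(Second sketch: the fine-grained option, constructing the catalog, pipeline, and runner yourself instead of calling session.run(). "__default__" is Kedro's standard default pipeline name; the rest is illustrative boilerplate, not a finished integration.)

from pathlib import Path

from kedro.framework.project import pipelines
from kedro.framework.session import KedroSession
from kedro.framework.startup import bootstrap_project
from kedro.runner import SequentialRunner

project_root = Path.cwd()
bootstrap_project(project_root)
with KedroSession.create(project_path=project_root) as session:
    context = session.load_context()  # gives access to the DataCatalog
    runner = SequentialRunner()
    runner.run(pipelines["__default__"], context.catalog)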