# questions
o
Hi Everyone, does anyone know whether Kedro supports Prefect 2.0? The deployment configuration with Prefect in the official documentation seems to refer to Prefect 1.0.
d
Kedro doesn't support Prefect 2.0 per se, so much as there's an (outdated) guide, and I don't really see any reason why Kedro wouldn't work well with it.
By any chance, do you and @David work in the same org, or is it just two Prefect 2 deployment questions arriving by chance today? I think we'll have to see if the team can prioritize updating the guide based on demand (among other things to do :)).
o
It’s not by chance 😉 That’s my bad.
👍 1
It’d be really nice to have the official Kedro documentation updated; we are keen to deploy and integrate our Kedro DS pipeline with our Prefect 2.0 orchestrator.
@Deepyaman Datta what we are currently (unfortunately) doing is to call:
subprocess.call(['kedro', 'run'])
From one of the tasks in our Prefect workflow
It would be really nice to see what the deployment should look like for a production-grade setup. We have a data science team that is developing models with Kedro, and then we have the data engineering team that is in charge of the MLOps pipeline with Prefect. What is the optimal way to allow these two teams to work in parallel and have a fast and robust CI/CD pipeline across these two code bases and components? I guess building a wheel out of the Kedro project, building a custom Docker image for Prefect that `pip install`’s the Kedro wheel (from the DS team), and sticking to semver for backward compatibility. Right?
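For reference, a minimal sketch of that kind of wrapper using Prefect 2's `@flow`/`@task` decorators (the flow/task names are illustrative, and this mirrors the subprocess approach described above rather than an officially documented pattern):

import subprocess

from prefect import flow, task


@task
def run_kedro_project():
    # Shell out to the Kedro CLI; subprocess.call returns the exit code
    # and does not raise on failure.
    return subprocess.call(["kedro", "run"])


@flow
def kedro_flow():
    run_kedro_project()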
d
Since Prefect 2 supports running native Python code in your workflows, I personally think it should be possible to orchestrate a Kedro pipeline with a `PrefectRunner` without having to containerize anything. Basically, it should be possible to automatically convert Kedro node -> Prefect task and Kedro pipeline -> Prefect flow (or maybe even modular pipeline -> flow and overall pipeline -> flow of flows). I think mapping dataset behavior is a bit more challenging, since Prefect has a different method of handling task results (can you control where to write each result from a task? I'm not sure...). I can try to write up a Kedro issue on this (or feel free to post one to our GitHub), as I've also been keen on exploring Prefect 2 deployment for a while (I personally think it would be very cool to have a `PrefectRunner` for Kedro out of the box as a default deployment option), but I can't promise any timeline. Let me see if anybody else on the team has heard much in the way of general Prefect deployment demand. 🙂
👍 2
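For illustration, a rough sketch of what that node -> task mapping might look like. This is an assumption rather than an existing Kedro or Prefect API: the helper and flow names are made up, and it glosses over things a real runner handles (unregistered intermediate MemoryDatasets, hooks, parallelism):

from kedro.framework.project import pipelines
from kedro.framework.session import KedroSession
from kedro.framework.startup import bootstrap_project
from prefect import flow, task


def node_to_task(node, catalog):
    # Hypothetical helper: wrap a single Kedro node as a Prefect task.
    @task(name=node.name)
    def _run():
        inputs = {name: catalog.load(name) for name in node.inputs}
        for name, data in node.run(inputs).items():
            catalog.save(name, data)

    return _run


@flow
def kedro_flow(project_root: str):
    bootstrap_project(project_root)
    with KedroSession.create(project_path=project_root) as session:
        catalog = session.load_context().catalog
        # Pipeline.nodes is topologically sorted, so calling the tasks in
        # order respects dependencies; this assumes every dataset (including
        # intermediates) is declared in the catalog.
        for node in pipelines["__default__"].nodes:
            node_to_task(node, catalog)()

Iterating over `Pipeline.nodes` keeps the sketch simple but serializes execution; a real `PrefectRunner` would presumably submit tasks based on the dependency graph so that independent branches can run concurrently.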
o
Thanks @Deepyaman Datta, highly appreciated!
@Deepyaman Datta what are the most common deployment scenarios you experience with Kedro’s customers? Dask? Airflow?
Can the `kedro docker build` command build a deployment package of our Kedro pipeline onto a Docker image?
d
@Deepyaman Datta what are the most common deployment scenarios you experience with Kedro’s customers? Dask? Airflow?
I think @Yetunde, @Ivan Danov, or @Merel can probably answer this with more data, but from what I've seen/recall:
• Databricks is very common, although that is probably skewed due to the high use of Spark and Databricks amongst QuantumBlack/McKinsey projects.
• Other cloud-based workflow orchestrators are often seen, such as Vertex AI (managed Kubeflow Pipelines) and Azure ML pipelines. GetInData (a company unaffiliated with QuantumBlack/McKinsey) has developed a number of plugins in this space, and we see a fair number of people looking to use such things in #plugins-integrations (anecdotally).
• I think Airflow is next most common, but I could be wrong on this (it could be above some of the cloud-based orchestrators).
• I've also seen SageMaker, Argo Workflows, etc., or even more closed platforms like Dataiku.
• Dask is quite uncommon (😢).
Can the `kedro docker build` command build a deployment package of our Kedro pipeline onto a Docker image?
It can. `kedro docker` is quite simple, in that it will package everything into a single Docker image, and then you can run whatever pipelines you want. If you're happy with this, it's a fine option. If you want each node to map to a task in an orchestrator, though, it's not as great a fit.
👍 1
o
Thanks!
Useful and cool stuff
z
Just want to add: at my company (a fintech startup), we have also adopted Prefect 2 as our pipeline orchestrator.
d
I've created an issue to track this: https://github.com/kedro-org/kedro/issues/2431. But I don't know where this falls in our set of priorities. I would suggest commenting there if it's important to you and your use cases; we're more likely to tackle things that have a lot of community demand first.
o
Thanks @Deepyaman Datta! What about simply calling `os.system("kedro run")` from a Prefect workflow? What are the disadvantages of doing that?
i.e. shelling out the Kedro execution
d
@Ofir In my opinion (and not necessarily an expert opinion :P), it's unnecessary indirection and less transparent to the executor (and whoever is looking at those logs). In order of least deviation from what you said:
• Doing `os.system("kedro run")` is using Python to call a CLI endpoint that runs Python functions.
• At the very minimum, I don't think `os.system` is great for capturing potential errors, etc. But this can be remedied by using `subprocess` module methods (which are used heavily in Kedro, actually), which let you check output/logs/error codes (see the sketch at the end of this message).
• Better yet, if you do want to essentially call the `run` method, you could do something like:
from kedro.framework.session import KedroSession
from kedro.framework.startup import bootstrap_project

bootstrap_project(project_root)  # project_root: path to the Kedro project's root directory

with KedroSession.create(project_path=project_root, env="prefect") as session:
    session.run()
(I copied the boilerplate from https://docs.kedro.org/en/stable/integrations/databricks_workspace.html#running-kedro-project-from-a-databricks-notebook actually.)
• Finally, if you want more fine-grained control and/or are using bits and pieces of Kedro with more Prefect, you can construct the runner + pipelines independently, configure them, and run. Alternatively (and ideally), a plugin/hook handles most of the dirty work here.
The main reason I think some of the existing deployment guides do the shelled execution is that a lot of workflow orchestrators expect a shell script of sorts as a task, or Python tasks are often newer features. Where orchestrators support Python tasks, we may as well take advantage of that.
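To illustrate the `subprocess` remedy from the second bullet above, a minimal sketch of shelling out with error capture (the flow/task names and CLI arguments are illustrative):

import subprocess

from prefect import flow, task, get_run_logger


@task
def run_kedro(*cli_args: str):
    # check=True raises CalledProcessError on a non-zero exit code,
    # so a failed `kedro run` fails the Prefect task rather than passing silently.
    result = subprocess.run(
        ["kedro", "run", *cli_args],
        check=True,
        capture_output=True,
        text=True,
    )
    get_run_logger().info(result.stdout)


@flow
def kedro_flow():
    run_kedro("--env", "prefect")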