Ofir
03/05/2023, 10:46 PM

Deepyaman Datta
03/06/2023, 5:53 AM

Ofir
03/06/2023, 2:55 PM
Can we call subprocess.call(['kedro', 'run']) from one of the tasks in our Prefect workflow?
Deepyaman Datta
03/06/2023, 3:21 PM
You could do this with a PrefectRunner, without having to containerize anything. Basically, it should be possible to automatically convert Kedro node -> Prefect task and Kedro pipeline -> Prefect flow (or maybe even modular pipeline -> flow and the overall pipeline -> flow of flows). I think mapping dataset behavior is a bit more challenging, since Prefect has a different method of handling task results (can you control where to write each result from a task? I'm not sure...).
I can try to write up a Kedro issue on this (or feel free to post one to our GitHub), as I've also been keen on exploring Prefect 2 deployment for a while (I personally think it would be very cool to have a PrefectRunner for Kedro out of the box as a default deployment option), but I can't promise any timeline. Let me see if anybody else on the team has heard much in the way of general Prefect deployment demand. 🙂
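(As a rough illustration of that node -> task / pipeline -> flow mapping: a minimal sketch assuming Prefect 2's task/flow decorators, with a plain dict standing in for the DataCatalog. This is not an actual PrefectRunner; all names here are illustrative.)

from prefect import flow, task
from kedro.pipeline import Pipeline


def pipeline_to_flow(pipeline: Pipeline, free_inputs: dict):
    """Wrap each Kedro node in a Prefect task and run the pipeline as one flow."""

    @flow(name="kedro-pipeline")
    def kedro_flow():
        data = dict(free_inputs)  # stands in for the DataCatalog
        for node in pipeline.nodes:  # Kedro returns these topologically sorted
            run_node = task(name=node.name)(node.func)  # node -> Prefect task
            results = run_node(*[data[name] for name in node.inputs])
            if node.outputs:
                if len(node.outputs) == 1:
                    results = [results]
                data.update(zip(node.outputs, results))
        return data

    return kedro_flow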
Ofir
03/06/2023, 3:22 PM
Can the kedro docker build build a deployment package of our Kedro pipeline onto a Docker image?
Deepyaman Datta
03/06/2023, 8:43 PM
> @Deepyaman Datta what are the most common deployment scenarios you experience with Kedro's customers? Dask? Airflow?
I think @Yetunde, @Ivan Danov, or @Merel can probably answer this with more data, but from what I've seen/recall:
• Databricks is very common, although that is probably skewed by the high use of Spark and Databricks among QuantumBlack/McKinsey projects.
• Other cloud-based workflow orchestrators are often seen, such as Vertex AI (managed Kubeflow Pipelines) and Azure ML pipelines. GetInData (a company unaffiliated with QuantumBlack/McKinsey) has developed a number of plugins in this space, and we see a fair number of people looking to use such things in #plugins-integrations (anecdotally).
• I think Airflow is the next most common, but I could be wrong on this (it could be above some of the cloud-based orchestrators).
• We've also seen SageMaker, Argo Workflows, etc., and even more closed platforms like Dataiku.
• Dask is quite uncommon (😢).
> Can the kedro docker build build a deployment package of our Kedro pipeline onto a Docker image?
It can. kedro docker is quite simple, in that it packages everything into a single Docker image, and then you can run whatever pipelines you want. If you're happy with this, it's a fine option.
If you want each node to map to a task in an orchestrator, it's not as great for that approach.
Ofir
03/07/2023, 1:12 PM

Zhe An
03/16/2023, 4:44 PM

Deepyaman Datta
03/16/2023, 5:24 PM
Ofir
05/05/2023, 10:05 AM
Is it a bad practice to run os.system("kedro run") from a Prefect workflow? What are the disadvantages of doing such a thing?
Deepyaman Datta
05/06/2023, 9:11 PM
os.system("kedro run") is using Python to call a CLI endpoint that itself runs Python functions.
• At the very minimum, I don't think os.system is great for capturing potential errors, etc. But this can be remedied by using subprocess module methods (which are actually used heavily within Kedro), whose output, logs, and error codes you can check; see the first sketch after this list.
• Better yet, if you do essentially want to call the run method, you could do something like:
from pathlib import Path

from kedro.framework.session import KedroSession
from kedro.framework.startup import bootstrap_project

project_root = Path.cwd()  # the root directory of the Kedro project
bootstrap_project(project_root)
with KedroSession.create(project_path=project_root, env="prefect") as session:
    session.run()
(copied the boilerplate from https://docs.kedro.org/en/stable/integrations/databricks_workspace.html#running-kedro-project-from-a-databricks-notebook actually)
• Finally, if you want more fine-grained control and/or are using bits and pieces of Kedro with more Prefect, you can construct the runner + pipelines independently, configure them, and run; see the second sketch at the end of this message. Alternatively (and ideally), a plugin/hook handles most of the dirty work here.
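(First sketch: calling kedro run through the subprocess module from a Prefect task, so the exit code and output can actually be checked. Assumes Prefect 2 and that the task runs from the Kedro project root; names are illustrative.)

import subprocess

from prefect import flow, task


@task
def run_kedro():
    # check=True raises CalledProcessError on a non-zero exit code,
    # so the Prefect task fails loudly instead of silently succeeding.
    result = subprocess.run(
        ["kedro", "run"], check=True, capture_output=True, text=True
    )
    return result.stdout


@flow
def kedro_flow():
    run_kedro()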
The main reason I think some of the existing deployment guides do the shelled execution is that a lot of workflow orchestrators expect a shell script of sorts as a task, or Python tasks are often newer features. Where orchestrators do support Python tasks, I think we may as well take advantage of that.
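(Second sketch: the fine-grained option, constructing the catalog, pipeline, and runner yourself instead of calling session.run(). "__default__" is Kedro's standard default pipeline name; the rest is illustrative boilerplate, not a finished integration.)

from pathlib import Path

from kedro.framework.project import pipelines
from kedro.framework.session import KedroSession
from kedro.framework.startup import bootstrap_project
from kedro.runner import SequentialRunner

project_root = Path.cwd()
bootstrap_project(project_root)
with KedroSession.create(project_path=project_root) as session:
    context = session.load_context()  # gives access to the DataCatalog
    runner = SequentialRunner()
    runner.run(pipelines["__default__"], context.catalog)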