j
# Announcing `kedro-databricks`: A New Kedro Plugin for Seamless Databricks Integration
šŸš€ Developing pipelines on Databricks just got a whole lot easier! I am excited to introduce `kedro-databricks`, a powerful new plugin designed to enhance your Kedro experience on Databricks. This plugin provides a streamlined, efficient, and developer-friendly approach to deploying and managing Kedro pipelines on the Databricks platform.

## Key Features

- **Initialization:** Transform your local Kedro project into a Databricks Asset Bundle project with a single command.
- **Generation:** Effortlessly generate Asset Bundle resource definitions.
- **Deployment:** Simplify the deployment of your Kedro projects to Databricks.

# How to get started

## Prerequisites

Before you begin, ensure that the Databricks CLI is installed and configured. For more information on installation and configuration, please refer to the Databricks CLI documentation:

- Installation Help
- Configuration Help

## Creating a new project

Before creating a new project, ensure you have installed Kedro into a virtual environment. Then use the following command:
```bash
pip install kedro
```
Initialize a new Kedro project with the `databricks-iris` starter using the following command:
```bash
kedro new --starter="databricks-iris"
```
After the project is created, navigate to the newly created project directory:
```bash
cd <my-project-name>  # change directory
```
Install the required dependencies:
```bash
pip install -r requirements.txt
pip install kedro-databricks
```
Now you can initialize the Databricks Asset Bundle:
```bash
kedro databricks init
```
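The `init` step scaffolds the Asset Bundle configuration at the project root. As a rough sketch only (the plugin's actual output will differ; the host and bundle name below are placeholders), a minimal `databricks.yml` looks something like this:

```yaml
# Minimal Asset Bundle config sketch (illustrative, not the plugin's exact output)
bundle:
  name: my_project

targets:
  local:
    mode: development
    default: true
    workspace:
      host: https://<your-workspace>.cloud.databricks.com
```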
Next, generate the Asset Bundle resource definitions:
```bash
kedro databricks bundle
```
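This writes job definitions for your pipelines into the bundle. The shape below is only a hedged sketch following the Databricks Jobs API, with hypothetical task and entry-point names, not the plugin's exact output:

```yaml
# Illustrative job resource: one Databricks task per Kedro node (all names are placeholders)
resources:
  jobs:
    my_project:
      name: my_project
      tasks:
        - task_key: first_node
          python_wheel_task:
            package_name: my_project
            entry_point: main                     # hypothetical entry point
            parameters: ["--nodes", "first_node"]
        - task_key: second_node
          depends_on:
            - task_key: first_node
```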
Finally, deploy the Kedro project to Databricks:
```bash
kedro databricks deploy
```
That's it! Your pipelines have now been deployed as a workflow to Databricks as `[dev user] project_name`. Try running the workflow to see the results. You're all set to start developing your Kedro pipelines on Databricks. For more detailed information and documentation, visit the GitHub repository: https://github.com/JenspederM/kedro-databricks
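If you prefer the terminal to the Workflows UI, a deployed bundle job can also be triggered with the standard Databricks CLI; the resource key here is a placeholder, so check your generated resources for the real one:

```bash
databricks bundle run --target local my_project
```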
j
amazing work @Jens Peder Meldgaard!! šŸ™ŒšŸ¼
y
CC: @Richard Purvis, @Yaroslav Starukhin
r
Super cool, will have to try it out!
d
HERO!
ā¤ļø 1
n
Hello @Jens Peder Meldgaard, thanks for the plugin! I've just tested it and got an exception after a long wait at this step:
Uploading bundle files to ...
```
Exception: Deploying to Databricks: Command '['databricks', 'bundle', 'deploy', '--target', 'local']' returned non-zero exit status 1.
```
I can confirm the bundle is created in `Home/.bundle`, but there is no workflow in Workflows.
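A general debugging note for failures like this: rerunning the command from the exception directly in a terminal usually surfaces the full error from the Databricks CLI, and `validate` is a standard bundle subcommand that can catch configuration problems first:

```bash
databricks bundle validate --target local  # check the bundle configuration
databricks bundle deploy --target local    # the same command the plugin runs
```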
j
Sent PM
p
Thank you @Jens Peder Meldgaard! I have been trying out your plugin today. It sucks that Databricks limits a single job to a max of 100 tasks. I have more than 100 nodes in my default pipeline T.T, which is why it is crashing...
j
Yeah, I didn't really test it out with pipelines that size
j
Should we think of grouping the tasks then, like we do with `kedro-airflow`? cc @Ankita Katiyar @Simon Brugman @datajoely
šŸ‘ 1
šŸ‘šŸ¼ 1
p
That would be a great add-on.
I was also suggesting another feature: setting `cluster_id` at init time, in case people want to leverage an existing cluster.
j
When I first tested the plugin, I was able to set the cluster ID by tweaking the config files.
p
Yes, we can set the cluster ID by tweaking the config files.
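For illustration, attaching tasks to an existing cluster uses the `existing_cluster_id` field from the Databricks Jobs API. A hedged sketch of such a tweak, with placeholder names and ID:

```yaml
# Point a generated job task at an existing cluster (placeholder values)
resources:
  jobs:
    my_project:
      tasks:
        - task_key: first_node
          existing_cluster_id: "0123-456789-abcdefgh"
```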
Can probably also give an option to use `--pipeline` instead of `--nodes`. That would help group tasks together. There's a downside, but it's also an option the user can benefit from.
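For context, the two flags change what each generated Databricks task would execute. A hedged sketch with a hypothetical node name (`__default__` is Kedro's name for the default pipeline):

```bash
# One task per Kedro node: fine-grained, but runs into the 100-task job limit
kedro run --nodes first_node

# One task per pipeline: groups everything into a single task
kedro run --pipeline __default__
```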
j
Opened an issue: https://github.com/JenspederM/kedro-databricks/issues/32. I'd say let's continue the conversation there!
ā¤ļø 1
l
Thanks!