I have a Kedro project where I want to use PySpark...
# questions
a
I have a Kedro project where I want to use PySpark when running in a cloud / production environment, but for experimentation in a local environment I don't necessarily want to bother with standing up an entire Spark env. Looking for strategy advice. Solution areas as I see them so far:
• somehow make the SparkHook conditional on the environment?
• a really, really simple Spark setup (e.g. via Docker; I don't want to install Java natively)
d
You can try a few approaches that let the same code run against both a local and a Spark backend:
• Write code in pandas, run it using `pyspark.pandas` in production
• Write code in pandas, use `modin` to scale
• Write code in Fugue, choose your backend
• Write code in Ibis, choose your backend
For Kedro, I'd recommend one of the first options (sketched below), and you can potentially look at https://github.com/mzjp2/kedro-dataframe-dropin (very out of date) to see how this could be achieved. Or you can set up a Spark env 🙂
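To illustrate the first option, here's a minimal sketch (the node and column names are hypothetical) of a Kedro node written against the pandas API that should run unchanged on pandas-on-Spark, since `pyspark.pandas` mirrors most of pandas:

```python
# Locally this imports plain pandas; in the cloud/production env you would
# swap the import for the pandas-on-Spark API. Node/column names are hypothetical.
import pandas as pd
# import pyspark.pandas as pd  # production: same code, Spark-backed DataFrames


def add_revenue(orders: pd.DataFrame) -> pd.DataFrame:
    """Add a revenue column; works with both pandas and pandas-on-Spark."""
    return orders.assign(revenue=orders["price"] * orders["quantity"])
```

In practice the node code stays the same and the per-environment difference lives in the catalog, e.g. a pandas-based dataset in `conf/local` and a Spark-based dataset in the production conf environment.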
d
I’m also super keen to bring Polars to Kedro
w
I use Docker for my dev environments and install PySpark inside, with great success
👍 1
a
Yeah I might have to just run a container locally.
@datajoely do you think it is possible to condition the loading of the Spark hook on the environment selected? Maybe in `settings.py`?
d
so in the hook you could do `context.env` and do your condition there?
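For illustration, a minimal sketch of that idea, loosely based on the SparkHooks example in the Kedro docs and assuming Kedro >= 0.18 (where `after_context_created` receives the context) with a `spark` config pattern registered; the env name to skip is an assumption:

```python
from kedro.framework.hooks import hook_impl


class SparkHooks:
    @hook_impl
    def after_context_created(self, context) -> None:
        # Assumption: skip Spark entirely for the default "local" env,
        # so a plain `kedro run` needs no JVM / Spark installation.
        if context.env == "local":
            return

        # Import lazily so a local run doesn't even need pyspark installed
        from pyspark import SparkConf
        from pyspark.sql import SparkSession

        # Assumes a "spark" pattern is registered with the config loader
        parameters = context.config_loader["spark"]
        spark_conf = SparkConf().setAll(parameters.items())

        spark_session = (
            SparkSession.builder.appName(context.project_path.name)
            .enableHiveSupport()
            .config(conf=spark_conf)
            .getOrCreate()
        )
        spark_session.sparkContext.setLogLevel("WARN")
```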
a
oh yes, I see... the `context` is passed into the hook
d
yeah, `settings.py` is evaluated before the `env` is known
👍 1
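So the hook stays registered unconditionally in `settings.py`, and the env check happens inside the hook at runtime once the context exists. A minimal sketch (the package/module path is hypothetical):

```python
# settings.py
# Register the hook unconditionally; whether Spark is actually started is
# decided inside SparkHooks.after_context_created, once the env is known.
from my_project.hooks import SparkHooks  # hypothetical package/module path

HOOKS = (SparkHooks(),)
```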