#questions

Andrew Stewart

02/01/2023, 1:35 AM
I have a Kedro project where I want to use PySpark when running in a cloud / production environment, but for experimentation in a local environment I don't necessarily want to bother with standing up an entire Spark env. Looking for strategy advice. Solution areas as I see them so far:
• somehow make the SparkHooks conditional on the environment?
• a really, really simple Spark setup (e.g. via Docker; I don't want to install Java natively)

Deepyaman Datta

02/01/2023, 3:09 AM
There are a few ways to run the same code on both a local and a Spark backend:
• Write code in pandas, run it using `pyspark.pandas` in production
• Write code in pandas, use `modin` to scale
• Write code in Fugue, choose your backend
• Write code in Ibis, choose your backend
For Kedro, I'd recommend one of the first options, and you can potentially look at https://github.com/mzjp2/kedro-dataframe-dropin (very out of date) to see how this could be achieved. Or you can set up a Spark env 🙂
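A minimal sketch of the first option: write node code against the pandas API and choose the backend at import time. The `KEDRO_ENV` environment variable and the `summarise` function here are hypothetical, not Kedro conventions; the point is that `pyspark.pandas` mirrors most of the pandas API, so the function body stays unchanged either way.

```python
import importlib
import os

# Pick the DataFrame backend from an environment variable (hypothetical
# convention): pandas locally, pyspark.pandas in production. Both expose
# a largely compatible API, so the node code below is backend-agnostic.
if os.environ.get("KEDRO_ENV") == "prod":
    pd = importlib.import_module("pyspark.pandas")
else:
    pd = importlib.import_module("pandas")


def summarise(raw):
    """Group and aggregate -- identical under either backend."""
    df = pd.DataFrame(raw)
    return df.groupby("key")["value"].sum()
```

Locally this runs on plain pandas with no JVM involved; in production the same module picks up `pyspark.pandas` and distributes the work.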

datajoely

02/01/2023, 8:36 AM
I’m also super keen to bring Polars to Kedro

William Caicedo

02/01/2023, 8:58 AM
I use Docker for my dev environments and install PySpark inside, with great success
👍 1

Andrew Stewart

02/01/2023, 5:18 PM
Yeah I might have to just run a container locally.
@datajoely do you think it is possible to condition the loading of the Spark hook on the selected environment? Maybe in `settings.py`?

datajoely

02/01/2023, 5:23 PM
so in the hook you could do `context.env` and put your condition there?
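A minimal sketch of that idea, assuming the local config environment is named `local`. The hook method name follows Kedro's `after_context_created` hook spec, but the `@hook_impl` decorator and the `HOOKS` registration in `settings.py` are omitted here for brevity; add both in a real project.

```python
class ConditionalSparkHooks:
    # In a real Kedro project, decorate this method with @hook_impl
    # (from kedro.framework.hooks) and register the class in settings.py.
    def after_context_created(self, context) -> None:
        if context.env == "local":
            # Local experimentation: skip Spark entirely.
            return
        # Lazy import, so a local run never needs pyspark installed.
        from pyspark.sql import SparkSession

        (
            SparkSession.builder
            .appName("kedro-spark")  # hypothetical app name
            .getOrCreate()
        )
```

Because the `pyspark` import sits inside the non-local branch, a `kedro run --env=local` never touches Spark or the JVM at all.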

Andrew Stewart

02/01/2023, 5:29 PM
oh yes, I see... the `context` is passed into the hook

datajoely

02/01/2023, 5:29 PM
yeah, `settings.py` is evaluated before the `env` is known
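In other words, the registration itself can't be conditional, so the hook is registered unconditionally and the env check lives inside the hook. A sketch of the `settings.py` side (the module path and class name are hypothetical):

```python
# settings.py (sketch) -- this module is imported once at startup,
# before the CLI's --env option is resolved, so everything here runs
# for every environment. Register the hook unconditionally and let
# the hook itself decide, based on context.env, whether to do anything.
from my_project.hooks import ConditionalSparkHooks  # hypothetical path

HOOKS = (ConditionalSparkHooks(),)
```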
👍 1