I have a Kedro project where I want to use PySpark...
# questions
a
I have a Kedro project where I want to use PySpark when running in a cloud / production environment, but for experimentation in a local environment I don't necessarily want to bother with standing up an entire Spark env. Looking for strategy advice. Solution areas as I see them so far:
• somehow make the SparkHook conditional on the environment?
• a really, really simple Spark setup (e.g. via Docker; I don't want to install Java natively)
d
You can try a few approaches that let the same code run against both a local and a Spark backend:
• Write code in pandas, run it using `pyspark.pandas` in production
• Write code in pandas, use `modin` to scale
• Write code in Fugue, choose your backend
• Write code in Ibis, choose your backend
For Kedro, I'd recommend one of the first options (sketched below), and you can potentially look at https://github.com/mzjp2/kedro-dataframe-dropin (very out of date) to see how this could be achieved. Or you can set up a Spark env 🙂
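To illustrate the first option, here's a minimal sketch (the node and column names are hypothetical) of a Kedro node written against the pandas API that should run unchanged on pandas-on-Spark, since `pyspark.pandas` mirrors most of pandas:

```python
# Locally this imports plain pandas; in the cloud/production env you would
# swap the import for the pandas-on-Spark API. Node/column names are hypothetical.
import pandas as pd
# import pyspark.pandas as pd  # production: same code, Spark-backed DataFrames


def add_revenue(orders: pd.DataFrame) -> pd.DataFrame:
    """Add a revenue column; works with both pandas and pandas-on-Spark."""
    return orders.assign(revenue=orders["price"] * orders["quantity"])
```

In practice the node code stays the same and the per-environment difference lives in the catalog, e.g. a pandas-based dataset in `conf/local` and a Spark-based dataset in the production conf environment.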
d
I’m also super keen to bring Polars to Kedro
w
I use Docker for my dev environments and install PySpark inside, with great success
👍 1
a
Yeah I might have to just run a container locally.
@datajoely do you think it is possible to condition the loading of the Spark hook on the environment selected? Maybe in `settings.py`?
d
so in the hook you could do `context.env` and do your condition there?
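For illustration, a minimal sketch of that idea, loosely based on the SparkHooks example in the Kedro docs and assuming Kedro >= 0.18 (where `after_context_created` receives the context) with a `spark` config pattern registered; the env name to skip is an assumption:

```python
from kedro.framework.hooks import hook_impl


class SparkHooks:
    @hook_impl
    def after_context_created(self, context) -> None:
        # Assumption: skip Spark entirely for the default "local" env,
        # so a plain `kedro run` needs no JVM / Spark installation.
        if context.env == "local":
            return

        # Import lazily so a local run doesn't even need pyspark installed
        from pyspark import SparkConf
        from pyspark.sql import SparkSession

        # Assumes a "spark" pattern is registered with the config loader
        parameters = context.config_loader["spark"]
        spark_conf = SparkConf().setAll(parameters.items())

        spark_session = (
            SparkSession.builder.appName(context.project_path.name)
            .enableHiveSupport()
            .config(conf=spark_conf)
            .getOrCreate()
        )
        spark_session.sparkContext.setLogLevel("WARN")
```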
a
oh yes, I see... the `context` is passed into the hook
d
yeah, `settings.py` is evaluated before the `env` is known
👍 1
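So the hook stays registered unconditionally in `settings.py`, and the env check happens inside the hook at runtime once the context exists. A minimal sketch (the package/module path is hypothetical):

```python
# settings.py
# Register the hook unconditionally; whether Spark is actually started is
# decided inside SparkHooks.after_context_created, once the env is known.
from my_project.hooks import SparkHooks  # hypothetical package/module path

HOOKS = (SparkHooks(),)
```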