# questions
a
I am having an issue with using Spark sessions in Kedro memory datasets. I have a function:
```python
def get_spark() -> SparkSession: ...
```
I have in my catalog:
```yaml
spark_session:
  type: MemoryDataset
  copy_mode: assign
```
then my nodes are:
```python
node(func=get_spark, outputs="spark_session")
```
and I get this error:
```
[CONTEXT_ONLY_VALID_ON_DRIVER] It appears that you are attempting to reference SparkContext from a broadcast variable, action, or transformation. SparkContext can only be used on the driver, not in code that it run on workers. For more information, see SPARK-5063.
```
Is there another way to pass the session around so it's available to my nodes? Maybe I should be doing this in hooks?

Edit: I found https://docs.kedro.org/en/stable/integrations/pyspark_integration.html#initialise-a-sparksession-using-a-hook, which might do just this.
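For reference, the linked docs create the session in an `after_context_created` hook. A minimal sketch of that pattern (assuming Spark options live in a `spark.yml` under the project's `conf` folder, and that a matching `"spark"` pattern is registered with the config loader as the linked page describes):

```python
# src/<your_package>/hooks.py
from kedro.framework.hooks import hook_impl
from pyspark import SparkConf
from pyspark.sql import SparkSession


class SparkHooks:
    @hook_impl
    def after_context_created(self, context) -> None:
        """Initialise a SparkSession once, on the driver, before the run starts."""
        # Load Spark options from conf/base/spark.yml via the config loader
        parameters = context.config_loader["spark"]
        spark_conf = SparkConf().setAll(parameters.items())

        # Build (or reuse) the session; later getOrCreate() calls return this one
        spark_session_conf = (
            SparkSession.builder.appName(context.project_path.name)
            .enableHiveSupport()
            .config(conf=spark_conf)
        )
        _spark_session = spark_session_conf.getOrCreate()
        _spark_session.sparkContext.setLogLevel("WARN")
```

The hook then gets registered in `settings.py` with `HOOKS = (SparkHooks(),)`.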
m
Yes, hooks are definitely the way to go!
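Once the hook owns the session, you don't need to pass it between nodes as a dataset at all: inside any node you can fetch the active session with `SparkSession.builder.getOrCreate()`. A hypothetical node, just to illustrate:

```python
from pyspark.sql import DataFrame, SparkSession


def make_lookup_table() -> DataFrame:
    # getOrCreate() returns the session the hook already built on the driver
    spark = SparkSession.builder.getOrCreate()
    return spark.createDataFrame(
        [("a", 1), ("b", 2)], schema="key string, value int"
    )
```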