# questions
y
I copied @Ashish Verma's question from the #introduce-yourself channel to here: Hey team, I am still struggling with Kedro + Databricks integrations. After resolving all the package conflicts, I am encountering a never-seen-before error. While creating the Kedro session, I am facing a Py4JSecurityException. Error and screenshot below for reference. py4j.security.Py4JSecurityException: Constructor public org.apache.spark.SparkConf(boolean) is not whitelisted. Can you please help me with this? The solutions I found on Google say to create a new cluster, which is not an option for us. I also tried removing the context.py which initialises the custom Spark context, but this is not working either. Let me know if there is something else I need to do, thanks. 🙂 Thanks Ashish Verma
Hi Ashish, could you describe what steps you took that caused this?
a
from pathlib import Path
from kedro.framework.session import KedroSession
from kedro.framework.startup import bootstrap_project

metadata = bootstrap_project(Path.cwd())
with KedroSession.create(metadata.package_name) as session:
    session.run()
y
Awesome, thanks for sharing this. Let me get some additional information: what version of Kedro are you using?
t
m
Hi @Ashish Verma, this doesn’t look like a Kedro issue, but rather something to do with your cluster setup. Looking at the stacktrace it seems to come from where the spark session is initialised.
a
@Yetunde I am using 0.17.7 and Python version 3.8.10.
@Tynan Thank you, yes I am using Azure Databricks.
👍 1
t
got it. as Merel said, this is most likely an Azure Databricks problem and not something to do with Kedro. hope that link I sent helps out 🙂
r
Seems like we face a conflict between ADLS credential passthrough (a feature of Azure Databricks that ties storage access to the AD role) and the SparkSession that Kedro tries to create (regarded as unsafe by Azure). Has anyone encountered this before? Or are there experts who have modified the KedroContext to import the SparkSession from the environment before, who could quickly brainstorm?
a
@Rogier Vlijm what code do you have in your KedroContext? Does it instantiate Spark or just use the built-in spark object in Databricks?
r
Now it instantiates a new spark session through .getOrCreate(), but we’re trying to assign the built-in spark now
👍 1
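(For reference, the getOrCreate() initialisation Rogier describes typically looks something like the sketch below — a rough reconstruction based on Kedro's pyspark starter for 0.17.x, not his exact code; the config keys and app name are illustrative. Building a fresh SparkConf like this is exactly the constructor the passthrough cluster refuses to whitelist.)

# Rough sketch, assuming a pyspark-starter-style context.py (Kedro 0.17.x).
from kedro.framework.context import KedroContext
from pyspark import SparkConf
from pyspark.sql import SparkSession


class ProjectContext(KedroContext):
    def init_spark_session(self) -> None:
        """Build a SparkSession from the spark.yml parameters in conf."""
        parameters = self.config_loader.get("spark*", "spark*/**")
        # SparkConf(boolean) is the Java constructor that Py4J reports as
        # "not whitelisted" on credential-passthrough clusters.
        spark_conf = SparkConf().setAll(parameters.items())
        spark_session = (
            SparkSession.builder.appName("my-kedro-project")  # illustrative name
            .enableHiveSupport()
            .config(conf=spark_conf)
            .getOrCreate()
        )
        spark_session.sparkContext.setLogLevel("WARN")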
a
Yeah, that’s what I was thinking of. Please can you try this to get the spark instance:
import IPython


def _get_databricks_object(name: str):
    """Gets object called `name` from the user namespace."""
    return IPython.get_ipython().user_ns.get(name)  # pragma: no cover

_get_databricks_object("spark")
Just to check, is the code you’re using that doesn’t work the same as this? https://github.com/kedro-org/kedro-starters/blob/0.18.2/pyspark/%7B%7B%20cookiecut[…]7D/src/%7B%7B%20cookiecutter.python_package%20%7D%7D/context.py
I have heard of problems like this before but am not sure exactly what the solution was - will need to do some searching and remind myself. Ultimately we should have something built into Kedro that figures out if you’re on Databricks and reuses the spark object rather than calling getOrCreate
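(A rough sketch of that idea, combining Antony's helper with a runtime check — the get_spark name is hypothetical, and checking the DATABRICKS_RUNTIME_VERSION environment variable is just one way to detect a Databricks runtime:)

import os

import IPython
from pyspark.sql import SparkSession


def get_spark() -> SparkSession:
    """Hypothetical helper: reuse Databricks' own `spark`, else getOrCreate."""
    if "DATABRICKS_RUNTIME_VERSION" in os.environ:
        # On Databricks, grab the session the runtime already created from
        # the notebook user namespace instead of constructing a new
        # SparkConf, which credential passthrough forbids.
        spark = IPython.get_ipython().user_ns.get("spark")
        if spark is not None:
            return spark
    # Outside Databricks it is safe to create (or reuse) a session ourselves.
    return SparkSession.builder.appName("kedro").getOrCreate()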
r
Thanks for chipping in Antony! That code snippet might be useful, giving it a try in a few moments
Our code is similar to your link, although from Kedro 0.17.7 instead
Seems like we get the built-in spark through this method, but now running into:
Object 'SparkDataSet' cannot be loaded from 'kedro.extras.datasets.spark'. Please see the documentation on how to install relevant dependencies for kedro.extras.datasets.spark.SparkDataSet
and when running pip install kedro[spark.SparkDataSet]:
(we’re trying to pip install hdfs now)
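(One possible gotcha with that pip command: in zsh the square brackets are glob characters, so the extra usually needs quoting — pinning to their 0.17.7 for safety, something like the line below. If I remember the extras right, this should also pull in the hdfs and s3fs dependencies that SparkDataSet needs, which may make the separate hdfs install unnecessary.)

pip install "kedro[spark.SparkDataSet]==0.17.7"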