#questions

Yetunde

09/14/2022, 8:43 AM
I copied @Ashish Verma's question from the #introduce-yourself channel to here: Hey team, I am still struggling with Kedro + Databricks integration. After resolving all the package conflicts, I am encountering a never-before-seen error. While creating the Kedro session, I am facing a Py4JSecurityException. Error and screenshot below for reference. py4j.security.Py4JSecurityException: Constructor public org.apache.spark.SparkConf(boolean) is not whitelisted. Can you please help me with this? The solutions I find on Google say to create a new cluster, which is not an option for us. I also tried removing the context.py which initializes the custom Spark context, but that is not working either. Let me know if there is something else I need to do, thanks. 🙂 Thanks Ashish Verma
Hi Ashish, could you describe what steps you took that caused this?

Ashish Verma

09/14/2022, 9:05 AM
from pathlib import Path

from kedro.framework.session import KedroSession
from kedro.framework.startup import bootstrap_project

metadata = bootstrap_project(Path.cwd())
with KedroSession.create(metadata.package_name) as session:
    session.run()

Yetunde

09/14/2022, 9:34 AM
Awesome, thanks for sharing this. Let me get some additional information: what version of Kedro are you using?

Tynan

09/14/2022, 9:43 AM

Merel

09/14/2022, 9:44 AM
Hi @Ashish Verma, this doesn’t look like a Kedro issue, but rather something to do with your cluster setup. Looking at the stack trace, it seems to come from where the Spark session is initialised.

Ashish Verma

09/14/2022, 9:53 AM
@Yetunde I am using 0.17.7 and Python version 3.8.10.
@Tynan Thank you. Yes, I am using Azure Databricks.
👍 1

Tynan

09/14/2022, 10:00 AM
Got it. As Merel said, this is most likely an Azure Databricks problem and not something to do with Kedro. Hope that link I sent helps out 🙂

Rogier Vlijm

09/15/2022, 7:58 AM
Seems like we face a conflict between ADLS credential passthrough (a feature of Azure Databricks that ties storage access to the AD role) and the SparkSession that Kedro tries to create (regarded as unsafe by Azure). Has anyone encountered this before? Or are there experts who have modified the KedroContext to import the SparkSession from the environment who could quickly brainstorm?

Antony Milne

09/15/2022, 9:11 AM
@Rogier Vlijm what code do you have in your KedroContext? Does it instantiate Spark itself or just use the built-in spark object in Databricks?

Rogier Vlijm

09/15/2022, 9:12 AM
Right now it instantiates a new Spark session through .getOrCreate(), but we’re trying to assign the built-in spark instead
👍 1

Antony Milne

09/15/2022, 9:19 AM
Yeah, that’s what I was thinking of. Please can you try this to get the spark instance:
import IPython

def _get_databricks_object(name: str):
    """Gets object called `name` from the user namespace."""
    return IPython.get_ipython().user_ns.get(name)  # pragma: no cover

_get_databricks_object("spark")
Just to check, is the code you’re using that doesn’t work the same as this? https://github.com/kedro-org/kedro-starters/blob/0.18.2/pyspark/%7B%7B%20cookiecut[…]7D/src/%7B%7B%20cookiecutter.python_package%20%7D%7D/context.py I have heard of problems like this before but am not sure exactly what the solution was - will need to do some searching and remind myself. Ultimately we should have something built into Kedro that figures out if you’re on Databricks and reuses the spark object rather than calling getOrCreate()

Rogier Vlijm

09/15/2022, 9:23 AM
Thanks for chipping in, Antony! That code snippet might be useful; giving it a try in a few moments
Our code is similar to your link, although from Kedro 0.17.7 instead
Seems like we get the built-in spark through this method, but now running into:
Object 'SparkDataSet' cannot be loaded from 'kedro.extras.datasets.spark'. Please see the documentation on how to install relevant dependencies for kedro.extras.datasets.spark.SparkDataSet
and when running pip install kedro[spark.SparkDataSet]:
(we’re trying to pip install hdfs now)
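[Editor's note: one common stumbling block with that command, separate from anything reported in this thread: the extras spec contains square brackets, which zsh (the default on macOS and some Databricks setups) expands as a glob pattern and rejects. Quoting the argument keeps the brackets literal; the import check afterwards is a quick way to confirm the optional dependencies resolved:]

```shell
# Quote the extras spec so the shell passes the brackets through literally
pip install "kedro[spark.SparkDataSet]"

# Sanity check: the dataset class should now import cleanly
python -c "from kedro.extras.datasets.spark import SparkDataSet"
```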