# questions
y
I copied @Ashish Verma's question from the #introduce-yourself channel to here: Hey team, I am still struggling with Kedro + Databricks integrations. After resolving all the package conflicts, I am encountering a never-seen-before error. While creating the Kedro session, I am facing a Py4JSecurityException. Error and screenshot below for reference. py4j.security.Py4JSecurityException: Constructor public org.apache.spark.SparkConf(boolean) is not whitelisted. Can you please help me with this? The solutions I found on Google say to create a new cluster, which is not an option for us. I also tried removing the context.py which initialises the custom Spark context, but this is not working either. Let me know if there is something else I need to do, thanks. 🙂 Thanks Ashish Verma
Hi Ashish, could you describe what steps you took that caused this?
a
from pathlib import Path
from kedro.framework.session import KedroSession
from kedro.framework.startup import bootstrap_project

metadata = bootstrap_project(Path.cwd())
with KedroSession.create(metadata.package_name) as session:
    session.run()
y
Awesome, thanks for sharing this. Let me get some additional information: what version of Kedro are you using?
t
m
Hi @Ashish Verma, this doesn’t look like a Kedro issue, but rather something to do with your cluster setup. Looking at the stacktrace it seems to come from where the spark session is initialised.
a
@Yetunde I am using 0.17.7 and Python version 3.8.10.
@Tynan Thank you, yes I am using Azure Databricks.
👍 1
t
got it. as Merel said, this is most likely an Azure Databricks problem and not something to do with Kedro. hope that link I sent helps out 🙂
r
Seems like we face a conflict between ADLS credential passthrough (a feature of Azure Databricks that ties storage access to the AD role) and the SparkSession that Kedro tries to create (regarded as unsafe by Azure). Has anyone encountered this before? Or are there experts who have modified the KedroContext to import the SparkSession from the environment before, who could quickly brainstorm?
a
@Rogier Vlijm what code do you have in your KedroContext? Does it instantiate Spark or just use the built-in spark object in Databricks?
r
Now it instantiates a new spark session through .getOrCreate(), but we’re trying to assign the built-in spark now
👍 1
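(For reference, the getOrCreate() initialisation Rogier describes typically looks something like the sketch below — a rough reconstruction based on Kedro's pyspark starter for 0.17.x, not his exact code; the config keys and app name are illustrative. Building a fresh SparkConf like this is exactly the constructor the passthrough cluster refuses to whitelist.)

# Rough sketch, assuming a pyspark-starter-style context.py (Kedro 0.17.x).
from kedro.framework.context import KedroContext
from pyspark import SparkConf
from pyspark.sql import SparkSession


class ProjectContext(KedroContext):
    def init_spark_session(self) -> None:
        """Build a SparkSession from the spark.yml parameters in conf."""
        parameters = self.config_loader.get("spark*", "spark*/**")
        # SparkConf(boolean) is the Java constructor that Py4J reports as
        # "not whitelisted" on credential-passthrough clusters.
        spark_conf = SparkConf().setAll(parameters.items())
        spark_session = (
            SparkSession.builder.appName("my-kedro-project")  # illustrative name
            .enableHiveSupport()
            .config(conf=spark_conf)
            .getOrCreate()
        )
        spark_session.sparkContext.setLogLevel("WARN")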
a
Yeah, that’s what I was thinking of. Please can you try this to get the spark instance:
import IPython


def _get_databricks_object(name: str):
    """Gets object called `name` from the user namespace."""
    return IPython.get_ipython().user_ns.get(name)  # pragma: no cover

_get_databricks_object("spark")
Just to check, is the code you’re using that doesn’t work the same as this? https://github.com/kedro-org/kedro-starters/blob/0.18.2/pyspark/%7B%7B%20cookiecut[…]7D/src/%7B%7B%20cookiecutter.python_package%20%7D%7D/context.py
I have heard of problems like this before but am not sure exactly what the solution was - will need to do some searching and remind myself. Ultimately we should have something built into Kedro that figures out if you’re on Databricks and reuses the spark object rather than calling getOrCreate
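(A rough sketch of that idea, combining Antony's helper with a runtime check — the get_spark name is hypothetical, and checking the DATABRICKS_RUNTIME_VERSION environment variable is just one way to detect a Databricks runtime:)

import os

import IPython
from pyspark.sql import SparkSession


def get_spark() -> SparkSession:
    """Hypothetical helper: reuse Databricks' own `spark`, else getOrCreate."""
    if "DATABRICKS_RUNTIME_VERSION" in os.environ:
        # On Databricks, grab the session the runtime already created from
        # the notebook user namespace instead of constructing a new
        # SparkConf, which credential passthrough forbids.
        spark = IPython.get_ipython().user_ns.get("spark")
        if spark is not None:
            return spark
    # Outside Databricks it is safe to create (or reuse) a session ourselves.
    return SparkSession.builder.appName("kedro").getOrCreate()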
r
Thanks for chipping in Antony! That code snippet might be useful, giving it a try in a few moments
Our code is similar to your link, although from Kedro 0.17.7 instead
Seems like we get the built-in spark through this method, but now running into:
Object 'SparkDataSet' cannot be loaded from 'kedro.extras.datasets.spark'. Please see the documentation on how to install relevant dependencies for kedro.extras.datasets.spark.SparkDataSet
and when running pip install kedro[spark.SparkDataSet]:
(we’re trying to pip install hdfs now)
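(One possible gotcha with that pip command: in zsh the square brackets are glob characters, so the extra usually needs quoting — pinning to their 0.17.7 for safety, something like the line below. If I remember the extras right, this should also pull in the hdfs and s3fs dependencies that SparkDataSet needs, which may make the separate hdfs install unnecessary.)

pip install "kedro[spark.SparkDataSet]==0.17.7"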