# questions
Rob
Hi again everyone, when I set a `spark.yml` file in the configuration folder to run the code from a Databricks cluster (using a workflow job, so my `run.py` is in the DBFS), is it required to specify the Spark master URL? Or is there an alternative that omits the `spark.yml` and lets Databricks manage my configuration? (I mean, to omit the manual setting of the master URL.) Thanks in advance!
Maybe a way to make `init_spark_session` optional? This is the error that is returned:
```
raise Exception("A master URL must be set in your configuration")
Exception: A master URL must be set in your configuration
```
And after adding the URL (using `spark.master`):
```
py4j.protocol.Py4JJavaError: An error occurred while calling None.org.apache.spark.api.java.JavaSparkContext.
: org.apache.spark.SparkException: In Databricks, developers should utilize the shared SparkContext instead of creating one using the constructor. In Scala and Python notebooks, the shared context can be accessed as sc. When running a job, you can access the shared context by calling SparkContext.getOrCreate()
```
So I guess I'm missing something...
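For context, the `init_spark_session` referred to here follows the template from the Kedro PySpark docs linked in the configs below; a rough sketch (assuming the 0.17-era `ProjectContext` pattern from those docs, so names and the config-loader call may differ from the actual project) looks like this:
```python
# Sketch of the docs-style init_spark_session (assumed template, not this project's exact code).
from kedro.framework.context import KedroContext
from pyspark import SparkConf
from pyspark.sql import SparkSession


class ProjectContext(KedroContext):
    def init_spark_session(self) -> None:
        """Initialise a SparkSession using the config defined in conf/<env>/spark.yml."""
        # Load every spark.* entry from the active environment's spark.yml
        parameters = self.config_loader.get("spark*", "spark*/**")
        spark_conf = SparkConf().setAll(parameters.items())

        # getOrCreate() should attach to Databricks' shared session instead of
        # constructing a new SparkContext, which is what the error above forbids.
        spark_session = (
            SparkSession.builder.appName("kedro")  # app name is arbitrary here
            .enableHiveSupport()
            .config(conf=spark_conf)
            .getOrCreate()
        )
        spark_session.sparkContext.setLogLevel("WARN")
```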
Hi everyone, can someone help me clarify this issue? I'd appreciate it very much, thanks
y
I'm wondering who can help here. @datajoely, @Adit Tiwari or @poornima p, would you have any thoughts?
datajoely
If you let Databricks manage things, does it work?
Rob
First I set the `spark.yml` as:
```yaml
# You can define spark specific configuration here.
spark.hadoop.fs.s3a.impl: org.apache.hadoop.fs.s3a.S3AFileSystem
spark.sql.execution.arrow.enabled: true

# https://kedro.readthedocs.io/en/stable/11_tools_integration/01_pyspark.html#tips-for-maximising-concurrency-using-threadrunner
spark.scheduler.mode: FAIR
```
And got this error:
```
raise Exception("A master URL must be set in your configuration")
Exception: A master URL must be set in your configuration
```
So what I did was to hardcode the `spark.master` as:
```yaml
# You can define spark specific configuration here.
spark.hadoop.fs.s3a.impl: org.apache.hadoop.fs.s3a.S3AFileSystem
spark.sql.execution.arrow.enabled: true
spark.master: spark://10.0.2.6:7077

# https://kedro.readthedocs.io/en/stable/11_tools_integration/01_pyspark.html#tips-for-maximising-concurrency-using-threadrunner
spark.scheduler.mode: FAIR
```
And then got this error:
```
py4j.protocol.Py4JJavaError: An error occurred while calling None.org.apache.spark.api.java.JavaSparkContext.
: org.apache.spark.SparkException: In Databricks, developers should utilize the shared SparkContext instead of creating one using the constructor. In Scala and Python notebooks, the shared context can be accessed as sc. When running a job, you can access the shared context by calling SparkContext.getOrCreate()
```
So I'm using a Databricks workflow job, where I set a path invoking a Python script located in the DBFS. This script invokes the subprocess `kedro run` this way:
```python
import subprocess


def subprocess_call(cmd: str) -> None:
    """Call subprocess with error check."""
    print("=========================================")
    print(f"Calling: {cmd}")
    print("=========================================")
    subprocess.run(cmd, check=True, shell=True)


dbfs_loc = "/dbfs/location"
kedro_call = "kedro run --env=databricks"
cmd = f"cd {dbfs_loc} && {kedro_call}"
subprocess_call(cmd)
```
Olivia
Hi @Rob! I am currently running Kedro projects in Databricks and the following `spark.yml` configuration works:
```yaml
# You can define spark specific configuration here.

spark.driver.maxResultSize: 3g
spark.hadoop.fs.s3a.impl: org.apache.hadoop.fs.s3a.S3AFileSystem
spark.sql.execution.arrow.pyspark.enabled: true

# https://kedro.readthedocs.io/en/stable/tools_integration/pyspark.html#tips-for-maximising-concurrency-using-threadrunner
spark.scheduler.mode: FAIR

# Default settings
spark.driver.memory: 16g
spark.sql.shuffle.partitions: 16
spark.default.parallelism: 16
```
There's no need to specify the Spark master URL, at least when running the pipeline from a Databricks notebook. Is this how you are running your project?
Rob
Thanks Olivia, yes, it's running now. In the end my issue was that I was pointing to the wrong path for the Kedro context creation; since I'm running from a .py that executes a command, maybe something was overlapping the session.
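For anyone hitting the same path issue: a hypothetical alternative to shelling out to `kedro run` is to create the Kedro session directly from the driver script with an explicit project path. This is only a sketch, not the code from this thread; the `bootstrap_project`/`KedroSession` arguments vary slightly across Kedro versions, and the DBFS path is a placeholder.
```python
# Hypothetical driver sketch: run the pipeline in-process with an explicit project path
# instead of `cd ... && kedro run`, so the Kedro context is created from the right folder.
from pathlib import Path

from kedro.framework.session import KedroSession
from kedro.framework.startup import bootstrap_project

project_path = Path("/dbfs/location")  # placeholder path to the Kedro project root
bootstrap_project(project_path)        # register the project's settings and pipelines

with KedroSession.create(project_path=project_path, env="databricks") as session:
    session.run()
```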