# questions
Rob
Hi again everyone, when I set a `spark.yml` file in the configuration folder to run the code from a Databricks cluster (using a workflow job, so my `run.py` is in the DBFS), is it required to specify the Spark master URL? Or is there an alternative that omits the `spark.yml` and lets Databricks manage my configuration? (I mean, to omit the manual setting of the master URL.) Thanks in advance!
Maybe a way to make `init_spark_session` optional? This is the error that is returned:
```
raise Exception("A master URL must be set in your configuration")
Exception: A master URL must be set in your configuration
```
And after adding the URL (using `spark.master`):
```
py4j.protocol.Py4JJavaError: An error occurred while calling None.org.apache.spark.api.java.JavaSparkContext.
: org.apache.spark.SparkException: In Databricks, developers should utilize the shared SparkContext instead of creating one using the constructor. In Scala and Python notebooks, the shared context can be accessed as sc. When running a job, you can access the shared context by calling SparkContext.getOrCreate()
```
So I guess I'm missing something...
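For context, the `init_spark_session` referred to here follows the template from the Kedro PySpark docs linked in the configs below; a rough sketch (assuming the 0.17-era `ProjectContext` pattern from those docs, so names and the config-loader call may differ from the actual project) looks like this:
```python
# Sketch of the docs-style init_spark_session (assumed template, not this project's exact code).
from kedro.framework.context import KedroContext
from pyspark import SparkConf
from pyspark.sql import SparkSession


class ProjectContext(KedroContext):
    def init_spark_session(self) -> None:
        """Initialise a SparkSession using the config defined in conf/<env>/spark.yml."""
        # Load every spark.* entry from the active environment's spark.yml
        parameters = self.config_loader.get("spark*", "spark*/**")
        spark_conf = SparkConf().setAll(parameters.items())

        # getOrCreate() should attach to Databricks' shared session instead of
        # constructing a new SparkContext, which is what the error above forbids.
        spark_session = (
            SparkSession.builder.appName("kedro")  # app name is arbitrary here
            .enableHiveSupport()
            .config(conf=spark_conf)
            .getOrCreate()
        )
        spark_session.sparkContext.setLogLevel("WARN")
```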
Hi everyone, can someone help me clarify this issue? I'd appreciate it very much, thanks
y
I'm wondering who can help here. @datajoely, @Adit Tiwari or @poornima p, would you have any thoughts?
datajoely
If you let Databricks manage things, does it work?
Rob
First I set the `spark.yml` as:
```yaml
# You can define spark specific configuration here.
spark.hadoop.fs.s3a.impl: org.apache.hadoop.fs.s3a.S3AFileSystem
spark.sql.execution.arrow.enabled: true

# https://kedro.readthedocs.io/en/stable/11_tools_integration/01_pyspark.html#tips-for-maximising-concurrency-using-threadrunner
spark.scheduler.mode: FAIR
```
And got this error:
```
raise Exception("A master URL must be set in your configuration")
Exception: A master URL must be set in your configuration
```
So what I did was to hardcode the `spark.master` as:
```yaml
# You can define spark specific configuration here.
spark.hadoop.fs.s3a.impl: org.apache.hadoop.fs.s3a.S3AFileSystem
spark.sql.execution.arrow.enabled: true
spark.master: spark://10.0.2.6:7077

# https://kedro.readthedocs.io/en/stable/11_tools_integration/01_pyspark.html#tips-for-maximising-concurrency-using-threadrunner
spark.scheduler.mode: FAIR
```
And then got this error:
```
py4j.protocol.Py4JJavaError: An error occurred while calling None.org.apache.spark.api.java.JavaSparkContext.
: org.apache.spark.SparkException: In Databricks, developers should utilize the shared SparkContext instead of creating one using the constructor. In Scala and Python notebooks, the shared context can be accessed as sc. When running a job, you can access the shared context by calling SparkContext.getOrCreate()
```
So I'm using a Databricks workflow job, where I set a path invoking a Python script located in the DBFS. This script invokes the subprocess `kedro run` this way:
```python
import subprocess


def subprocess_call(cmd: str) -> None:
    """Call subprocess with error check."""
    print("=========================================")
    print(f"Calling: {cmd}")
    print("=========================================")
    subprocess.run(cmd, check=True, shell=True)


dbfs_loc = "/dbfs/location"
kedro_call = "kedro run --env=databricks"
cmd = f"cd {dbfs_loc} && {kedro_call}"
subprocess_call(cmd)
```
Olivia
Hi @Rob! I am currently running Kedro projects in Databricks and the following `spark.yml` configuration works:
```yaml
# You can define spark specific configuration here.

spark.driver.maxResultSize: 3g
spark.hadoop.fs.s3a.impl: org.apache.hadoop.fs.s3a.S3AFileSystem
spark.sql.execution.arrow.pyspark.enabled: true

# https://kedro.readthedocs.io/en/stable/tools_integration/pyspark.html#tips-for-maximising-concurrency-using-threadrunner
spark.scheduler.mode: FAIR

# Default settings
spark.driver.memory: 16g
spark.sql.shuffle.partitions: 16
spark.default.parallelism: 16
```
There's no need to specify the Spark master URL, at least when running the pipeline from a Databricks notebook. Is this how you are running your project?
Rob
Thanks Olivia, yes, it's running now. In the end my issue was that I was pointing to the wrong path for the Kedro context creation; since I'm running from a .py that executes a command, maybe something was overlapping the session.
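For anyone hitting the same path issue: a hypothetical alternative to shelling out to `kedro run` is to create the Kedro session directly from the driver script with an explicit project path. This is only a sketch, not the code from this thread; the `bootstrap_project`/`KedroSession` arguments vary slightly across Kedro versions, and the DBFS path is a placeholder.
```python
# Hypothetical driver sketch: run the pipeline in-process with an explicit project path
# instead of `cd ... && kedro run`, so the Kedro context is created from the right folder.
from pathlib import Path

from kedro.framework.session import KedroSession
from kedro.framework.startup import bootstrap_project

project_path = Path("/dbfs/location")  # placeholder path to the Kedro project root
bootstrap_project(project_path)        # register the project's settings and pipelines

with KedroSession.create(project_path=project_path, env="databricks") as session:
    session.run()
```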