Rob
01/29/2023, 6:21 PM

I have a spark.yml file in the configuration folder; this is to run the code from a Databricks cluster (using a workflow job, so my run.py is in the DBFS). Is it required to specify the Spark master URL? Or is there an alternative that omits the spark.yml and lets Databricks manage my configuration? (I mean, to omit the manual setting of the master URL.) Thanks in advance!

Is init_spark_session optional?
This is the error that is returned:
raise Exception("A master URL must be set in your configuration")
Exception: A master URL must be set in your configuration
And this is the error after hardcoding the spark.master:
py4j.protocol.Py4JJavaError: An error occurred while calling None.org.apache.spark.api.java.JavaSparkContext.
: org.apache.spark.SparkException: In Databricks, developers should utilize the shared SparkContext instead of creating one using the constructor. In Scala and Python notebooks, the shared context can be accessed as sc. When running a job, you can access the shared context by calling SparkContext.getOrCreate()
So I guess I'm missing something...

Yetunde
01/30/2023, 2:38 PM

datajoely
01/30/2023, 2:39 PM

Rob
01/30/2023, 2:59 PM

My spark.yml is set as:
# You can define spark specific configuration here.
spark.hadoop.fs.s3a.impl: org.apache.hadoop.fs.s3a.S3AFileSystem
spark.sql.execution.arrow.enabled: true
# https://kedro.readthedocs.io/en/stable/11_tools_integration/01_pyspark.html#tips-for-maximising-concurrency-using-threadrunner
spark.scheduler.mode: FAIR
And got this error:

raise Exception("A master URL must be set in your configuration")
Exception: A master URL must be set in your configuration

So what I did was to hardcode the spark.master as:
# You can define spark specific configuration here.
spark.hadoop.fs.s3a.impl: org.apache.hadoop.fs.s3a.S3AFileSystem
spark.sql.execution.arrow.enabled: true
spark.master: spark://10.0.2.6:7077
# https://kedro.readthedocs.io/en/stable/11_tools_integration/01_pyspark.html#tips-for-maximising-concurrency-using-threadrunner
spark.scheduler.mode: FAIR
And then got this error:
py4j.protocol.Py4JJavaError: An error occurred while calling None.org.apache.spark.api.java.JavaSparkContext.
: org.apache.spark.SparkException: In Databricks, developers should utilize the shared SparkContext instead of creating one using the constructor. In Scala and Python notebooks, the shared context can be accessed as sc. When running a job, you can access the shared context by calling SparkContext.getOrCreate()
So I'm using a Databricks workflow job, where I set a path invoking a Python script located in the DBFS. This script invokes 'kedro run' as a subprocess, this way:
import subprocess


def subprocess_call(cmd: str) -> None:
    """Call subprocess with error check."""
    print("=========================================")
    print(f"Calling: {cmd}")
    print("=========================================")
    subprocess.run(cmd, check=True, shell=True)


dbfs_loc = "/dbfs/location"
kedro_call = "kedro run --env=databricks"
cmd = f"cd {dbfs_loc} && {kedro_call}"
subprocess_call(cmd)
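For comparison, a minimal sketch of an in-process entry point rather than a shell call; this assumes Kedro 0.18.x APIs (bootstrap_project, KedroSession.create) and reuses the DBFS path and environment from the snippet above, so it is illustrative rather than Rob's actual script:

# Sketch: run the Kedro session inside the job's Python process instead of
# spawning `kedro run` as a subprocess (assumed Kedro 0.18.x APIs).
from pathlib import Path

from kedro.framework.session import KedroSession
from kedro.framework.startup import bootstrap_project

project_path = Path("/dbfs/location")  # illustrative, same location as dbfs_loc above
bootstrap_project(project_path)        # load the project's settings and pipelines
with KedroSession.create(project_path=project_path, env="databricks") as session:
    session.run()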
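As for the master URL itself: in a Kedro project the Spark session is normally built by a hook (the init_spark_session pattern Rob mentions, or SparkHooks.after_context_created in Kedro 0.18.x). A rough sketch of such a hook, following the Kedro PySpark docs linked in the spark.yml comments rather than Rob's code; because it only calls getOrCreate() and never sets a master, on Databricks it should attach to the cluster's shared SparkContext instead of constructing a new one:

# Sketch of a Spark initialisation hook (src/<package>/hooks.py), assuming
# Kedro 0.18.x; names follow the Kedro PySpark docs, not Rob's project.
from kedro.framework.hooks import hook_impl
from pyspark import SparkConf
from pyspark.sql import SparkSession


class SparkHooks:
    @hook_impl
    def after_context_created(self, context) -> None:
        """Initialise a SparkSession from conf/<env>/spark.yml."""
        parameters = context.config_loader.get("spark*", "spark*/**")
        spark_conf = SparkConf().setAll(parameters.items())

        # No .master(...) here: getOrCreate() reuses the SparkContext that
        # Databricks already provides, so spark.master can stay out of spark.yml.
        spark_session_conf = (
            SparkSession.builder.appName(context.project_path.name)
            .enableHiveSupport()
            .config(conf=spark_conf)
        )
        _spark_session = spark_session_conf.getOrCreate()
        _spark_session.sparkContext.setLogLevel("WARN")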
Olivia Lihn
01/30/2023, 3:05 PM

# You can define spark specific configuration here.
spark.driver.maxResultSize: 3g
spark.hadoop.fs.s3a.impl: org.apache.hadoop.fs.s3a.S3AFileSystem
spark.sql.execution.arrow.pyspark.enabled: true
# https://kedro.readthedocs.io/en/stable/tools_integration/pyspark.html#tips-for-maximising-concurrency-using-threadrunner
spark.scheduler.mode: FAIR
# Default settings
spark.driver.memory: 16g
spark.sql.shuffle.partitions: 16
spark.default.parallelism: 16
There's no need to specify the spark master URL, at least when running the pipeline from a Databricks notebook. Is this how you are running your project?

Rob
01/30/2023, 4:24 PM