# questions
l
Hi, everyone! I have a question about how to use Kedro inside Databricks: whenever I try a `kedro run` in the repository, a Spark-related error happens. Apparently Databricks' native Spark conflicts with the Spark session created inside the project in a Hook (in the code below, you can see the Hook definition and the error).
    # Imports used by the Hook (not shown in the original snippet)
    from pyspark import SparkConf, SparkContext
    from pyspark.sql import SparkSession

    # spark_conf is a SparkConf built elsewhere in the Hook from the
    # project's Spark settings; this constructor call is the line
    # Databricks rejects (see the error below)
    sc = SparkContext(conf=spark_conf, appName="Kedro")

    _spark_session = (
        SparkSession.builder
        .appName(context._package_name)
        .enableHiveSupport()
        .master("local[*,4]")
        .getOrCreate()
    )

    _spark_session.sparkContext.setLogLevel("WARN")
Error:

    py4j.protocol.Py4JJavaError: An error occurred while calling
    None.org.apache.spark.api.java.JavaSparkContext.
    : org.apache.spark.SparkException: In Databricks, developers should utilize
    the shared SparkContext instead of creating one using the constructor. In
    Scala and Python notebooks, the shared context can be accessed as sc. When
    running a job, you can access the shared context by calling
    SparkContext.getOrCreate(). The other SparkContext was created at:
    CallSite(SparkContext at DatabricksILoop.scala:353,
    org.apache.spark.SparkContext.<init>(SparkContext.scala:114))

I've tried deleting the Hook and making the Spark settings directly in the cluster, without success. I've also tried configuring it directly in the Spark session, again without success. I also followed the documentation's instructions for using a repository within Databricks, but since that base project does not use this Hook, it did not reproduce the error. Has anyone had a similar error? I thought I could get it to run by turning the project into a wheel, but I can't use `kedro package` since the project can't run inside Databricks. I would be grateful for any ideas, thank you!
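The error message itself points at the workaround: on Databricks you attach to the already-running shared context rather than constructing one. A minimal plain-PySpark sketch of that pattern (outside any Kedro Hook, just to illustrate what the message asks for):

    # Attach to Databricks' shared SparkContext/SparkSession instead of
    # calling the SparkContext constructor.
    from pyspark import SparkContext
    from pyspark.sql import SparkSession

    sc = SparkContext.getOrCreate()             # returns the shared context
    spark = SparkSession.builder.getOrCreate()  # likewise reuses the active session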
p
Hi Luiz, Databricks already has a Spark session, so you might want to add a condition to your Hook that skips creating a new session if one already exists or when executing on Databricks. I do that (also for other reasons) by checking for the existence of the env variable DATABRICKS_RUNTIME_VERSION (that one looked like the most telling). That way we can still run our Kedro pipeline outside Databricks without any ad-hoc adjustments. By the way, I don't think `kedro run` will be the way to go; you might rather want to create a Kedro session as described in https://kedro.readthedocs.io/en/stable/deployment/databricks.html
👍 1
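For reference, a minimal sketch of the condition Paweł describes, assuming a recent Kedro with the `after_context_created` hook spec and the `spark.yml` config pattern from the Kedro docs (the class and config names are illustrative, not the poster's actual code):

    import os

    from kedro.framework.hooks import hook_impl
    from pyspark import SparkConf
    from pyspark.sql import SparkSession


    class SparkHooks:
        @hook_impl
        def after_context_created(self, context):
            if "DATABRICKS_RUNTIME_VERSION" in os.environ:
                # On Databricks: reuse the shared session; never construct
                # a SparkContext ourselves.
                spark = SparkSession.builder.getOrCreate()
            else:
                # Locally: build the session from the project's spark.yml,
                # loaded via the config loader.
                spark_conf = SparkConf().setAll(
                    context.config_loader.get("spark*", "spark*/**").items()
                )
                spark = (
                    SparkSession.builder.appName("Kedro")
                    .enableHiveSupport()
                    .master("local[*,4]")
                    .config(conf=spark_conf)
                    .getOrCreate()
                )
            spark.sparkContext.setLogLevel("WARN")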
Actually, you can pass an env variable in and check it in the context in your Hook to find out whether you're on Databricks: https://kedro.readthedocs.io/en/stable/kedro_project_setup/configuration.html#additional-configuration-environments
👍 1
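A hedged sketch of that configuration-environment route, along the lines of the linked docs (the project path and the "databricks" environment name are assumptions):

    # In a Databricks notebook: create a KedroSession explicitly (instead of
    # `kedro run`) with a dedicated "databricks" configuration environment.
    from kedro.framework.session import KedroSession
    from kedro.framework.startup import bootstrap_project

    project_path = "/dbfs/path/to/your-kedro-project"  # hypothetical path
    bootstrap_project(project_path)

    with KedroSession.create(project_path=project_path, env="databricks") as session:
        session.run()

With that in place, the Hook can branch on `context.env == "databricks"` instead of (or in addition to) checking DATABRICKS_RUNTIME_VERSION.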
l
Thanks! I think the problem was with `kedro run`. Now it's working without Spark errors 🙂