Jamal Sealiti
08/19/2025, 10:40 AM
• spark.submit.deployMode = "cluster"
• spark.master = "yarn"
My goal is to run this setup within a datafabric. However, I came across a discussion online stating that Kedro internally uses the PySpark shell to instantiate the SparkSession, which is incompatible with YARN's cluster deploy mode. As cluster mode requires spark-submit rather than interactive shells, this presents a challenge.
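For context, a standard Kedro + PySpark project normally creates the SparkSession through a project-level hook rather than through the PySpark shell. The sketch below shows that common pattern; the file path, the app name, and the assumption that a "spark" config pattern is registered with the config loader are illustrative, not details from this thread.

```python
# src/<package_name>/hooks.py -- sketch of the usual Kedro + PySpark hook pattern.
# Assumes conf/<env>/spark.yml holds key/value pairs such as spark.master and
# spark.submit.deployMode, and that a "spark" pattern is registered with the config loader.
from kedro.framework.hooks import hook_impl
from pyspark import SparkConf
from pyspark.sql import SparkSession


class SparkHooks:
    @hook_impl
    def after_context_created(self, context) -> None:
        """Create (or attach to) the SparkSession when the Kedro context is created."""
        parameters = context.config_loader["spark"]
        spark_conf = SparkConf().setAll(parameters.items())

        spark = (
            SparkSession.builder.appName("my-kedro-project")  # hypothetical app name
            .config(conf=spark_conf)
            .getOrCreate()
        )
        spark.sparkContext.setLogLevel("WARN")
```

The hook is registered via HOOKS = (SparkHooks(),) in settings.py; because it calls getOrCreate(), under spark-submit it attaches to the session the launcher has already configured instead of starting an interactive shell.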
A suggested workaround involves:
• Packaging the Kedro project as a Python wheel (.whl) or zip archive.
• Using spark-submit to deploy the packaged project to the cluster.
But I'm not sure this workaround avoids dependency issues...
Do you have any recommendations or best practices for this deployment approach? Is there a more streamlined way to integrate Kedro with Spark in cluster mode within a datafabric context?

Huong Nguyen
08/19/2025, 3:00 PM

Nok Lam Chan
08/19/2025, 3:37 PM
> However, I came across a discussion online stating that Kedro internally uses the PySpark shell to instantiate the SparkSession
I don't think this is the case. Can you share the link please?

Nok Lam Chan
08/19/2025, 3:38 PM

Nok Lam Chan
08/19/2025, 3:39 PM
spark-submit \
--deploy-mode cluster \
--master yarn \
--conf spark.submit.pyFiles=s3://{S3_BUCKET}/<whl-file>.whl \
--archives=s3://{S3_BUCKET}/pyspark_deps.tar.gz#environment,s3://{S3_BUCKET}/conf.tar.gz#conf \
--conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=environment/bin/python \
--conf spark.executorEnv.PYSPARK_PYTHON=environment/bin/python \
--conf spark.yarn.appMasterEnv.<env-var-here>={ENV} \
--conf spark.executorEnv.<env-var-here>={ENV} \
s3://{S3_BUCKET}/run.py --env base --pipeline my_new_pipeline --params run_date:2023-03-07,runtime:cloud
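The run.py referenced above is a thin entry point that bootstraps the Kedro session on the cluster. A minimal sketch of what such a script might look like is below; the package name my_project and the exact argument handling are assumptions for illustration, not taken from this thread.

```python
# run.py -- illustrative spark-submit entry point for a packaged Kedro project (sketch).
import argparse

from kedro.framework.project import configure_project
from kedro.framework.session import KedroSession


def main() -> None:
    parser = argparse.ArgumentParser()
    parser.add_argument("--env", default="base")
    parser.add_argument("--pipeline", default="__default__")
    parser.add_argument("--params", default="")
    args = parser.parse_args()

    # Turn "run_date:2023-03-07,runtime:cloud" into {"run_date": "...", "runtime": "..."}
    extra_params = dict(item.split(":", 1) for item in args.params.split(",") if item)

    configure_project("my_project")  # hypothetical package name inside the wheel
    with KedroSession.create(env=args.env, extra_params=extra_params) as session:
        session.run(pipeline_name=args.pipeline)


if __name__ == "__main__":
    main()
```

Because spark-submit launches this script directly, the SparkSession created inside the project via getOrCreate() picks up the submitted cluster-mode configuration, so no PySpark shell is involved.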

Jamal Sealiti
08/20/2025, 8:14 AM