# questions
Hi, I have a question regarding best practices for deploying a Kedro project in a distributed environment. Currently, I have Kedro running inside a container with the following Spark configuration:
• spark.submit.deployMode = "cluster"
• spark.master = "yarn"
My goal is to run this setup within a datafabric. However, I came across a discussion online stating that Kedro internally uses the PySpark shell to instantiate the SparkSession, which is incompatible with YARN's cluster deploy mode. Since cluster mode requires spark-submit rather than an interactive shell, this presents a challenge. A suggested workaround involves:
• Packaging the Kedro project as a Python wheel (.whl) or zip archive.
• Using spark-submit to deploy the packaged project to the cluster.
This workaround may help avoid dependency issues... Do you have any recommendations or best practices for this deployment approach? Is there a more streamlined way to integrate Kedro with Spark in cluster mode within a datafabric context?
hey @Jamal Sealiti, I'm looking into this for you. Will get back as soon as I can šŸ˜„
šŸ‘ 1
n
> However, I came across a discussion online stating that Kedro internally uses the PySpark shell to instantiate the SparkSession
I don't think this is the case šŸ‘€ Can you share the link please?
In fact, Kedro only builds the SparkSession for the Spark app in a project hook (https://github.com/kedro-org/kedro-starters/blob/main/spaceflights-pyspark/%7B%7B%[…]D%7D/src/%7B%7B%20cookiecutter.python_package%20%7D%7D/hooks.py); I don't think it handles the cluster side of things.
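For reference, that hook is roughly the following (a paraphrased sketch, not a verbatim copy, so check the linked hooks.py for the exact code; it assumes the starter's spark.yml config pattern is registered in settings.py). It only reads the project's Spark settings and builds a SparkSession on the driver:

```python
# Rough sketch of the SparkHooks hook in the spaceflights-pyspark starter
# (paraphrased; see the linked hooks.py for the exact version).
from kedro.framework.context import KedroContext
from kedro.framework.hooks import hook_impl
from pyspark import SparkConf
from pyspark.sql import SparkSession


class SparkHooks:
    @hook_impl
    def after_context_created(self, context: KedroContext) -> None:
        """Initialise a SparkSession from the project's spark.yml config."""
        # Read the Spark settings (spark.master, spark.submit.deployMode, ...)
        # from conf/<env>/spark.yml via the config loader.
        parameters = context.config_loader["spark"]
        spark_conf = SparkConf().setAll(parameters.items())

        # Build (or reuse) the SparkSession in the current process;
        # nothing here submits the application to a cluster.
        spark_session_conf = (
            SparkSession.builder.appName(context.project_path.name)
            .enableHiveSupport()
            .config(conf=spark_conf)
        )
        _spark_session = spark_session_conf.getOrCreate()
        _spark_session.sparkContext.setLogLevel("WARN")
```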
I am not particularly familiar with DataFabrics, but maybe this article, which uses Hadoop and Amazon EMR, provides some guidance: https://kedro.org/blog/how-to-deploy-kedro-pipelines-on-amazon-emr
Here is an example of running this in cluster mode on EMR:
```bash
spark-submit \
    --deploy-mode cluster \
    --master yarn \
    --conf spark.submit.pyFiles=s3://{S3_BUCKET}/<whl-file>.whl \
    --archives=s3://{S3_BUCKET}/pyspark_deps.tar.gz#environment,s3://{S3_BUCKET}/conf.tar.gz#conf \
    --conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=environment/bin/python \
    --conf spark.executorEnv.PYSPARK_PYTHON=environment/bin/python \
    --conf spark.yarn.appMasterEnv.<env-var-here>={ENV} \
    --conf spark.executorEnv.<env-var-here>={ENV} \
    s3://{S3_BUCKET}/run.py --env base --pipeline my_new_pipeline --params run_date:2023-03-07,runtime:cloud
```
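The run.py at the end is just a thin entry point that hands control to the packaged project's CLI. A minimal sketch (assuming the packaged project is named my_project; the blog post has the exact entry point it uses):

```python
# run.py: minimal entry-point sketch for spark-submit (hypothetical package
# name "my_project"; adjust to the wheel produced by `kedro package`).
from my_project.__main__ import main

if __name__ == "__main__":
    # Kedro's packaged __main__.main() invokes the project's `run` CLI command,
    # which picks up --env, --pipeline and --params from sys.argv.
    main()
```

The wheel itself comes from `kedro package`, and the conf/ directory is shipped separately (the conf.tar.gz#conf archive above) because it is not bundled into the wheel.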
šŸ‘ 1