# questions
Hi, I have a question regarding best practices for deploying a Kedro project in a distributed environment. Currently, I have Kedro running inside a container with the following Spark configuration:
• spark.submit.deployMode = "cluster"
• spark.master = "yarn"
My goal is to run this setup within a datafabric. However, I came across a discussion online stating that Kedro internally uses the PySpark shell to instantiate the SparkSession, which is incompatible with YARN's cluster deploy mode. Since cluster mode requires spark-submit rather than an interactive shell, this presents a challenge. A suggested workaround involves:
• Packaging the Kedro project as a Python wheel (.whl) or zip archive.
• Using spark-submit to deploy the packaged project to the cluster.
This workaround may help avoid dependency issues... Do you have any recommendations or best practices for this deployment approach? Is there a more streamlined way to integrate Kedro with Spark in cluster mode within a datafabric context?
hey @Jamal Sealiti, I'm looking into this for you. Will get back as soon as I can šŸ˜„
šŸ‘ 1
n
> However, I came across a discussion online stating that Kedro internally uses the PySpark shell to instantiate the SparkSession
I don't think this is the case šŸ‘€ Can you share the link please?
In fact, Kedro only builds the SparkSession for the Spark app in a project hook (https://github.com/kedro-org/kedro-starters/blob/main/spaceflights-pyspark/%7B%7B%[…]D%7D/src/%7B%7B%20cookiecutter.python_package%20%7D%7D/hooks.py); I don't think it handles the cluster side of things.
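For reference, that hook is roughly the following (a paraphrased sketch, not a verbatim copy, so check the linked hooks.py for the exact code; it assumes the starter's spark.yml config pattern is registered in settings.py). It only reads the project's Spark settings and builds a SparkSession on the driver:

```python
# Rough sketch of the SparkHooks hook in the spaceflights-pyspark starter
# (paraphrased; see the linked hooks.py for the exact version).
from kedro.framework.context import KedroContext
from kedro.framework.hooks import hook_impl
from pyspark import SparkConf
from pyspark.sql import SparkSession


class SparkHooks:
    @hook_impl
    def after_context_created(self, context: KedroContext) -> None:
        """Initialise a SparkSession from the project's spark.yml config."""
        # Read the Spark settings (spark.master, spark.submit.deployMode, ...)
        # from conf/<env>/spark.yml via the config loader.
        parameters = context.config_loader["spark"]
        spark_conf = SparkConf().setAll(parameters.items())

        # Build (or reuse) the SparkSession in the current process;
        # nothing here submits the application to a cluster.
        spark_session_conf = (
            SparkSession.builder.appName(context.project_path.name)
            .enableHiveSupport()
            .config(conf=spark_conf)
        )
        _spark_session = spark_session_conf.getOrCreate()
        _spark_session.sparkContext.setLogLevel("WARN")
```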
I am not particularly familiar with DataFabrics, but maybe this article, which uses Hadoop and Amazon EMR, provides some guidance: https://kedro.org/blog/how-to-deploy-kedro-pipelines-on-amazon-emr
Here is an example of running this in cluster mode on EMR:
```bash
spark-submit \
    --deploy-mode cluster \
    --master yarn \
    --conf spark.submit.pyFiles=s3://{S3_BUCKET}/<whl-file>.whl \
    --archives=s3://{S3_BUCKET}/pyspark_deps.tar.gz#environment,s3://{S3_BUCKET}/conf.tar.gz#conf \
    --conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=environment/bin/python \
    --conf spark.executorEnv.PYSPARK_PYTHON=environment/bin/python \
    --conf spark.yarn.appMasterEnv.<env-var-here>={ENV} \
    --conf spark.executorEnv.<env-var-here>={ENV} \
    s3://{S3_BUCKET}/run.py --env base --pipeline my_new_pipeline --params run_date:2023-03-07,runtime:cloud
```
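The run.py at the end is just a thin entry point that hands control to the packaged project's CLI. A minimal sketch (assuming the packaged project is named my_project; the blog post has the exact entry point it uses):

```python
# run.py: minimal entry-point sketch for spark-submit (hypothetical package
# name "my_project"; adjust to the wheel produced by `kedro package`).
from my_project.__main__ import main

if __name__ == "__main__":
    # Kedro's packaged __main__.main() invokes the project's `run` CLI command,
    # which picks up --env, --pipeline and --params from sys.argv.
    main()
```

The wheel itself comes from `kedro package`, and the conf/ directory is shipped separately (the conf.tar.gz#conf archive above) because it is not bundled into the wheel.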
šŸ‘ 1