# questions
a
Hi all! I have a kedro project that is being initialised with a pyspark session. Until today, I never had any issues running pipelines or opening a jupyter notebook from my project's directory. However, today I am facing this error -
Copy code
Exception: Java gateway process exited before sending its port number
Has anyone faced this error before?
d
Can you kill the kernel and see if it persists?
a
It does. Not sure, what is causing the error.
d
and are you running against a remote spark cluster or all locally
a
remote spark cluster
d
and you’re configuring your spark stuff with
spark.yaml
in Kedro or environment variables?
a
spark.yml in kedro
d
Okay messing around with those variables is really the only course of action
all Kedro is doing is passing those arguments to the session builder stuff
Copy code
SparkSession.builder.appName('myapp').getOrCreate()
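(For context: the usual way spark.yml reaches that builder is a project hook, roughly like the sketch below. This is a generic illustration, assuming a SparkHooks class registered in the project's settings.py; the exact config-loader call varies between Kedro versions.)
Copy code
# Sketch of a typical Kedro SparkHooks hook (illustrative, not this project's actual hook).
from kedro.framework.hooks import hook_impl
from pyspark import SparkConf
from pyspark.sql import SparkSession


class SparkHooks:
    @hook_impl
    def after_context_created(self, context) -> None:
        # Read spark.yml from the project's conf/ directory via the config loader
        parameters = context.config_loader["spark"]
        spark_conf = SparkConf().setAll(parameters.items())

        # Everything in spark.yml ends up here, passed straight to the session builder
        spark_session = (
            SparkSession.builder.appName(context.project_path.name)
            .config(conf=spark_conf)
            .getOrCreate()
        )
        spark_session.sparkContext.setLogLevel("WARN")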
a
Alright thank you - will look into it, would it be helpful if I paste the contents of my spark.yml here?
d
Maybe if someone else sees it
but I’m not sure
a
Alright, let me fiddle around with it
Thank you for your help
d
good luck! sorry can’t do more here
a
That's alright, you've given me something to work with
o
hi @Anirudh Dahiya if you want, share the contents of spark.yml and i can take a look at those
a
Hi @Olivia Lihn, thank you. Please find below the configuration of spark.yml
Copy code
# You can define spark specific configuration here.

spark.sql.execution.arrow.pyspark.enabled: true

spark.ui.port: 4050
spark.driver.bindAddress: 127.0.0.1
spark.driver.memory: 180g
spark.driver.maxResultSize: 70g
spark.driver.memoryOverhead: 40g
spark.network.timeout: 1000s
spark.hadoop.fs.s3a.connection.maximum: 1000
spark.debug.maxToStringFields: 10000
spark.hadoop.fs.s3a.impl: org.apache.hadoop.fs.s3a.S3AFileSystem
spark.sql.broadcastTimeout: 600
spark.executor.extraJavaOptions: -Dcom.amazonaws.services.s3.enableV4=true
spark.driver.extraJavaOptions: -Dcom.amazonaws.services.s3.enableV4=true
spark.hadoop.fs.s3a.aws.credentials.provider: org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider
spark.local.dir: /data1/temp
spark.sql.autoBroadcastJoinThreshold: 50000000
spark.speculation: true
spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version: 2
spark.jars.packages: org.apache.hadoop:hadoop-aws:2.9.2,com.databricks:spark-redshift_2.11:2.0.1,org.apache.avro:avro:1.8.1,org.apache.spark:spark-avro_2.11:2.4.4
spark.jars: /packages/RedshiftJDBC42-no-awssdk-1.2.55.1083.jar

# https://kedro.readthedocs.io/en/stable/tools_integration/pyspark.html#tips-for-maximising-concurrency-using-threadrunner
spark.scheduler.mode: FAIR
o
are you sure the cluster is configured correctly? If this same spark.yml worked before when running pipelines from your cli and in a jupyter notebook, the error might come from a mismatch between your remote cluster's configuration and this file
what i would do, if you can, is connect to or create a spark session in your cluster and verify the config above works (as a first step)
a
Alright let me do that
Thank you - is there a way to check the configuration of the remote cluster?
o
try connecting to the cluster and sending a very simple spark job (something like "hello world")
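(A minimal version of that smoke test, outside Kedro, might look like the sketch below. It assumes the same spark.yml is readable from the working directory and that whatever normally points the session at the remote cluster is filled in; the path and master line are placeholders, not details from this thread.)
Copy code
# Standalone smoke test: build a session from the same spark.yml and run a trivial job.
import yaml
from pyspark import SparkConf
from pyspark.sql import SparkSession

with open("conf/base/spark.yml") as f:  # placeholder path, adjust to your project
    settings = yaml.safe_load(f)

conf = SparkConf().setAll(settings.items())

builder = SparkSession.builder.appName("smoke-test").config(conf=conf)
# builder = builder.master("spark://<host>:7077")  # placeholder: however you reach the remote cluster

spark = builder.getOrCreate()

# "hello world" job: count a tiny range
print(spark.range(10).count())

# Print the configuration the session actually picked up
for key, value in sorted(spark.sparkContext.getConf().getAll()):
    print(key, "=", value)

spark.stop()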