# questions
a
Hi all! I have a kedro project that is being initialised with a pyspark session. Until today, I never had any issues running pipelines or opening a jupyter notebook from my project's directory. However, today I am facing this error -
Copy code
Exception: Java gateway process exited before sending its port number
Has anyone faced this error before?
d
Can you kill the kernel and see if it persists?
a
It does. Not sure, what is causing the error.
d
and are you running against a remote spark cluster or all locally
a
remote spark cluster
d
and you’re configuring your spark stuff with
spark.yaml
in Kedro or environment variables?
a
spark.yml in kedro
d
Okay messing around with those variables is really the only course of action
all Kedro is doing is passing those arguments to the session builder stuff
Copy code
SparkSession.builder.appName('myapp').getOrCreate()
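(For context: the usual way spark.yml reaches that builder is a project hook, roughly like the sketch below. This is a generic illustration, assuming a SparkHooks class registered in the project's settings.py; the exact config-loader call varies between Kedro versions.)
Copy code
# Sketch of a typical Kedro SparkHooks hook (illustrative, not this project's actual hook).
from kedro.framework.hooks import hook_impl
from pyspark import SparkConf
from pyspark.sql import SparkSession


class SparkHooks:
    @hook_impl
    def after_context_created(self, context) -> None:
        # Read spark.yml from the project's conf/ directory via the config loader
        parameters = context.config_loader["spark"]
        spark_conf = SparkConf().setAll(parameters.items())

        # Everything in spark.yml ends up here, passed straight to the session builder
        spark_session = (
            SparkSession.builder.appName(context.project_path.name)
            .config(conf=spark_conf)
            .getOrCreate()
        )
        spark_session.sparkContext.setLogLevel("WARN")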
a
Alright thank you - will look into it, would it be helpful if I paste the contents of my spark.yml here?
d
Maybe if someone else sees it
but I’m not sure
a
Alright, let me fiddle around with it
Thank you for your help
d
good luck! sorry can’t do more here
a
That's alright, you've given me something to work with
o
hi @Anirudh Dahiya if you want, share the contents of spark.yml and i can take a look at those
a
Hi @Olivia Lihn, thank you. Please find below the configuration of spark.yml
Copy code
# You can define spark specific configuration here.

spark.sql.execution.arrow.pyspark.enabled: true

spark.ui.port: 4050
spark.driver.bindAddress: 127.0.0.1
spark.driver.memory: 180g
spark.driver.maxResultSize: 70g
spark.driver.memoryOverhead: 40g
spark.network.timeout: 1000s
spark.hadoop.fs.s3a.connection.maximum: 1000
spark.debug.maxToStringFields: 10000
spark.hadoop.fs.s3a.impl: org.apache.hadoop.fs.s3a.S3AFileSystem
spark.sql.broadcastTimeout: 600
spark.executor.extraJavaOptions: -Dcom.amazonaws.services.s3.enableV4=true
spark.driver.extraJavaOptions: -Dcom.amazonaws.services.s3.enableV4=true
spark.hadoop.fs.s3a.aws.credentials.provider: org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider
spark.local.dir: /data1/temp
spark.sql.autoBroadcastJoinThreshold: 50000000
spark.speculation: true
spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version: 2
spark.jars.packages: org.apache.hadoop:hadoop-aws:2.9.2,com.databricks:spark-redshift_2.11:2.0.1,org.apache.avro:avro:1.8.1,org.apache.spark:spark-avro_2.11:2.4.4
spark.jars: /packages/RedshiftJDBC42-no-awssdk-1.2.55.1083.jar

# https://kedro.readthedocs.io/en/stable/tools_integration/pyspark.html#tips-for-maximising-concurrency-using-threadrunner
spark.scheduler.mode: FAIR
o
are you sure the cluster is configured correctly? If this same spark.yml worked before when running pipelines from your cli and in a jupyter notebook, the error might come from a mismatch between your remote cluster's configuration and this file
what i would do, if you can, is connect to or create a spark session in your cluster and verify the config above works (as a first step)
a
Alright let me do that
Thank you - is there a way to check the configuration of the remote cluster?
o
try connecting to the cluster and sending a very simple spark job (something like "hello world")
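(A minimal version of that smoke test, outside Kedro, might look like the sketch below. It assumes the same spark.yml is readable from the working directory and that whatever normally points the session at the remote cluster is filled in; the path and master line are placeholders, not details from this thread.)
Copy code
# Standalone smoke test: build a session from the same spark.yml and run a trivial job.
import yaml
from pyspark import SparkConf
from pyspark.sql import SparkSession

with open("conf/base/spark.yml") as f:  # placeholder path, adjust to your project
    settings = yaml.safe_load(f)

conf = SparkConf().setAll(settings.items())

builder = SparkSession.builder.appName("smoke-test").config(conf=conf)
# builder = builder.master("spark://<host>:7077")  # placeholder: however you reach the remote cluster

spark = builder.getOrCreate()

# "hello world" job: count a tiny range
print(spark.range(10).count())

# Print the configuration the session actually picked up
for key, value in sorted(spark.sparkContext.getConf().getAll()):
    print(key, "=", value)

spark.stop()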