Hi Kedro Team Getting attached error when we submit job in d Kedro #questions

Balachandran Ponnusamy

03/22/2023, 2:51 PM

Hi Kedro Team, we are getting the attached error when we submit a job to a Dataproc cluster to run a Data Engineering pipeline. We have a data file in ".txt.gz" format. If we run the same job with .master("local[*]"), it works fine, but it fails when we submit with spark.master: yarn and spark.submit.deployMode: client. Any idea where it is going wrong?

datajoely

03/22/2023, 3:20 PM

it’s a bit hard to work out from this

datajoely

03/22/2023, 3:20 PM

where is the data being persisted?

Balachandran Ponnusamy

03/22/2023, 6:48 PM

data is in GCS bucket

datajoely

03/22/2023, 6:55 PM

and it works okay when run from a single node, but not when distributed?

datajoely

03/22/2023, 6:57 PM

and if you exclude this txt.gz file it works correctly?

Balachandran Ponnusamy

03/22/2023, 7:18 PM

Yes, it works on a single node. I need to load this file for further pipeline runs, so I could not exclude it.

datajoely

03/23/2023, 9:45 AM

I’m unsure on how to deal with this - are you using the ThreadRunner?

Balachandran Ponnusamy

03/23/2023, 12:59 PM

No. Is there a way we can connect so I can show you what is happening?

datajoely

03/23/2023, 1:00 PM

We’re really outside of my area of expertise here unfortunately

datajoely

03/23/2023, 1:00 PM

our current view of best practice is here

datajoely

03/23/2023, 1:00 PM

https://docs.kedro.org/en/stable/tools_integration/pyspark.html

Olivia Lihn

03/23/2023, 3:51 PM

Hi @Balachandran Ponnusamy, it looks like the distributed cluster might be missing the security permissions needed to access the data. Have you checked that?
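
One way to narrow this down is to take Kedro out of the picture and submit a tiny read probe with the same YARN settings as the failing job. The snippet below is only a sketch under assumptions: `probe_read` is a hypothetical helper name, the bucket path in the usage comment is a placeholder, and it assumes a `SparkSession` already exists on the cluster. It relies on `spark.read.text`, which returns a DataFrame with a single `value` column and decompresses gzip files based on the `.gz` extension.

```python
def probe_read(spark, path, n=5):
    """Read the first n lines of a text file through Spark and return them as strings.

    `spark` is expected to be a pyspark.sql.SparkSession (assumption: it is
    created elsewhere, for example by spark-submit or by the project hooks).
    """
    # spark.read.text yields one row per line in a column named "value";
    # a ".gz" suffix is decompressed transparently by the reader.
    rows = spark.read.text(path).limit(n).collect()
    return [row["value"] for row in rows]


# Usage on the cluster (placeholder path, replace with the real object):
# print(probe_read(spark, "gs://<your-bucket>/<path>/data.txt.gz"))
```

If this probe fails under YARN with the same error, the pipeline code itself is less likely to be the cause, and cluster-side access to the bucket would be the next thing to check.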
