#questions

Balachandran Ponnusamy

03/22/2023, 2:51 PM
Hi Kedro Team, we are getting the attached error when we submit a job to a Dataproc cluster to run a data engineering pipeline. One of the inputs is a data file in ".txt.gz" format. The same pipeline works fine when we run it with .master("local[*]"), but it fails when we submit with spark.master: yarn and spark.submit.deployMode: client. Any idea where it is going wrong?

datajoely

03/22/2023, 3:20 PM
it’s a bit hard to work out from this
where is the data being persisted?

Balachandran Ponnusamy

03/22/2023, 6:48 PM
data is in GCS bucket

datajoely

03/22/2023, 6:55 PM
and it works okay when run from a single node, but not when distributed?
and if you exclude this txt.gz file it works correctly?

Balachandran Ponnusamy

03/22/2023, 7:18 PM
Yes, it works on a single node. I need to load this file for later pipeline runs, so I could not exclude it.

datajoely

03/23/2023, 9:45 AM
I’m unsure how to deal with this. Are you using the ThreadRunner?

Balachandran Ponnusamy

03/23/2023, 12:59 PM
No. Is there a way we can connect so I can show you what is happening?

datajoely

03/23/2023, 1:00 PM
We’re really outside of my area of expertise here unfortunately
our current view of best practice is here

Olivia Lihn

03/23/2023, 3:51 PM
Hi @Balachandran Ponnusamy, it looks like the distributed cluster might be missing the security permissions needed to access the data. Have you checked that?
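
One way to test the permissions theory is to try reading a single row of the file through Spark on the cluster and report whatever exception comes back. The snippet below is only a minimal sketch: the helper name probe_read is hypothetical (it is not part of Kedro or PySpark), and it assumes spark is an existing SparkSession created with the same YARN settings as the failing job. It relies only on spark.read.text, limit and collect.

```python
def probe_read(spark, path):
    """Try to read one row of `path` through Spark.

    Hypothetical diagnostic helper, not part of Kedro or PySpark.
    `spark` is assumed to be an existing SparkSession (or anything
    exposing the same `read.text(...).limit(...).collect()` chain).
    Returns a tuple (ok, message).
    """
    try:
        # collect() triggers a Spark job, so under YARN the read is
        # performed by the cluster rather than only by the driver.
        rows = spark.read.text(path).limit(1).collect()
    except Exception as exc:  # broad on purpose: this is a diagnostic
        return False, f"{type(exc).__name__}: {exc}"
    return True, f"read {len(rows)} row(s) from {path}"
```

Calling probe_read(spark, "gs://<bucket>/<file>.txt.gz") with the real bucket and file name (placeholders here) in both local and YARN mode should show whether the failure is specific to the cluster. An access-denied message in YARN mode only would be consistent with missing permissions, although it would not prove that is the only cause.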