# questions
s
Hi all! I am trying to run my Kedro 0.18.14 pipeline in a Databricks 12.2 environment but getting a
No such file or directory:
error. I am using transcoding to store this dataframe as a spark.SparkDataSet and then load it as a pandas.ParquetDataSet when this error happens. I tried loading the same path in
spark.read.parquet
and it loads. I have also reinstalled kedro using the notation
kedro[spark.SparkDataSet,pandas.ParquetDataSet]
I have also set the following in the spark.yml file:
spark.sql.execution.arrow.pyspark.enabled: true
Could you please suggest what I might be missing?
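(For context, a transcoded spark/pandas pair like the one described would look roughly like this in `conf/base/catalog.yml` — the dataset name and path here are hypothetical, not taken from the thread:)

```yaml
# Same logical dataset: written by Spark, read back by pandas (transcoding)
my_data@spark:
  type: spark.SparkDataSet
  filepath: abfss://container@account.dfs.core.windows.net/data/my_data
  file_format: parquet

my_data@pandas:
  type: pandas.ParquetDataSet
  filepath: abfss://container@account.dfs.core.windows.net/data/my_data
```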
n
If you just try to load both datasets individually, does it work? You can test it quickly in the notebook and load it with the catalog.
s
it doesn’t load with catalog
n
Which dataset doesn’t load? Is it both spark and pandas?
s
the pandas.. I write as spark, success.. but loading it into pandas fails.. even if I do it directly using pandas or the catalog
n
Are you saving it to DBFS/S3? By default Spark saves stuff to the Spark driver node, so sometimes it causes issues because pandas won't be able to see it
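(A common cause of exactly this symptom on Databricks, offered here only as a guess: Spark addresses DBFS with `dbfs:/...` URIs, while local-file libraries like pandas only see the FUSE mount at `/dbfs/...`, so the same file needs a different path spelling. A minimal sketch of that translation, with made-up paths:)

```python
def to_fuse_path(path: str) -> str:
    """Translate a Spark-style dbfs:/ URI into the /dbfs FUSE mount path
    that local-file libraries such as pandas can open directly."""
    if path.startswith("dbfs:/"):
        return "/dbfs/" + path[len("dbfs:/"):].lstrip("/")
    return path  # non-DBFS paths are left untouched

# Spark can read dbfs:/tmp/my_data, but pandas needs /dbfs/tmp/my_data
print(to_fuse_path("dbfs:/tmp/my_data"))  # /dbfs/tmp/my_data
```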
s
Yep, I checked in the azure browser as well.. the “missing” dataset is there
n
Were you able to load other data saved on Azure that is not Spark?
s
this is the first pandas dataset i encounter.. rest are spark
n
Second guess is a credentials issue, because Spark has its own native authentication mechanism
Is it a versioned dataset?
s
not a versioned dataset.. but I can try saving to local DBFS rather than mounted.. I think I got the same error yesterday and that's why I tried saving to ADLS.. thinking Databricks might not be liking this conversion
tried saving to a local workspace path in Databricks as well.. Spark works fine, pandas parquet doesn't
n
That's worth trying. I think you can try to load some non-Spark datasets first; my guess is it's not a transcoding-specific issue
s
ok.. but this is the suggested way to use spark and pandas together, right? just confirming if I am doing something wrong 😓
s
Hi @Shubham Agrawal, were you able to resolve this issue? I also have the same configuration, and I also tried loading non-Spark datasets using pandas.. and that works.. but loading a pandas.ParquetDataSet gives the same error..