#questions

Shubham Agrawal

01/26/2024, 9:59 AM
Hi all! I am trying to run my Kedro 0.18.14 pipeline in a Databricks 12.2 environment but am getting a `No such file or directory:` error. I am using transcoding to store this dataframe as a spark.SparkDataSet and then load it as a pandas.ParquetDataSet, which is when the error happens. I tried loading the same path with `spark.read.parquet` and it loads. I have also reinstalled kedro using the notation `kedro[spark.SparkDataSet,pandas.ParquetDataSet]`. I have also set the following in the spark.yml file: `spark.sql.execution.arrow.pyspark.enabled: true`. Could you please suggest what I might be missing?

Nok Lam Chan

01/26/2024, 10:28 AM
If you just try to load both datasets individually, does it work? You can test it quickly in the notebook and load them with catalog.
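
A minimal sketch of that check, assuming a Kedro notebook session in which a `catalog` object with a `load(name)` method is available. The helper name `check_loads` and the dataset names `my_data@spark` and `my_data@pandas` are hypothetical placeholders for the two transcoded entries, not names taken from this thread.

```python
def check_loads(catalog, dataset_names):
    """Try to load each dataset; map name -> None on success, or the error text."""
    results = {}
    for name in dataset_names:
        try:
            catalog.load(name)
            results[name] = None
        except Exception as exc:
            # Record the failure and keep going, so every dataset gets tested.
            results[name] = f"{type(exc).__name__}: {exc}"
    return results


# Hypothetical usage inside a notebook where `catalog` is already defined:
# check_loads(catalog, ["my_data@spark", "my_data@pandas"])
```

Loading each side of the transcoded pair separately shows whether the failure belongs to one dataset type or to both.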

Shubham Agrawal

01/26/2024, 10:29 AM
it doesn’t load with catalog

Nok Lam Chan

01/26/2024, 10:30 AM
Which dataset doesn’t load? Is it both spark and pandas?

Shubham Agrawal

01/26/2024, 10:30 AM
the pandas one.. i write as spark, success.. but loading it into pandas fails.. even if I do it directly using pandas or the catalog

Nok Lam Chan

01/26/2024, 10:31 AM
Are you saving it to DBFS/S3? By default Spark saves stuff to the Spark driver node, so sometimes it causes issues because pandas won’t be able to see it
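
One way to test whether pandas can see the path is to check it with plain Python file APIs, which is how a local-filesystem read would reach it. This is a rough sketch, assuming a Databricks cluster where DBFS is exposed to local file APIs under `/dbfs/`. The helper names are made up for illustration, and it does not cover cloud URIs such as `abfss://`.

```python
import os


def to_local_dbfs_path(path):
    """Map a 'dbfs:/...' URI to the '/dbfs/...' local path; leave other paths unchanged."""
    prefix = "dbfs:/"
    if path.startswith(prefix):
        return "/dbfs/" + path[len(prefix):].lstrip("/")
    return path


def describe_path(path):
    """Report what a non-Spark Python process sees at the given path."""
    local = to_local_dbfs_path(path)
    return {
        "local_path": local,
        "exists": os.path.exists(local),
        # Spark writes Parquet output as a directory of part files.
        "is_dir": os.path.isdir(local),
    }
```

If `exists` is false here while `spark.read.parquet` succeeds on the same location, the two readers are probably resolving the path differently.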

Shubham Agrawal

01/26/2024, 10:32 AM
Yep, I checked in the Azure browser as well.. the “missing” dataset is there

Nok Lam Chan

01/26/2024, 10:34 AM
Were you able to load other data saved on Azure that is not Spark?

Shubham Agrawal

01/26/2024, 10:35 AM
this is the first pandas dataset i encounter.. rest are spark

Nok Lam Chan

01/26/2024, 10:35 AM
My second guess is a credentials issue, because Spark has its own native authentication mechanism
Is it a versioned dataset?

Shubham Agrawal

01/26/2024, 10:37 AM
not a versioned dataset.. but i can try saving to local dbfs rather than mounted.. i think i got the same error yesterday and that’s why i tried saving to ADLS.. thinking databricks might not be liking this conversion
tried saving to a local workspace path in databricks as well.. spark works fine, pandas parquet doesn’t

Nok Lam Chan

01/26/2024, 10:41 AM
That’s worth trying. I think you can try to load some non-Spark datasets first; my guess is that it is not a transcoding-specific issue

Shubham Agrawal

01/26/2024, 10:43 AM
ok.. but this is the suggested way to use spark and pandas right? just confirming if i am doing something wrong 😓

Samiksha Jain

03/11/2024, 10:02 AM
Hi @Shubham Agrawal, were you able to resolve this issue? I have the same configuration and I also tried loading non-Spark datasets using pandas.. and that works.. but loading a pandas.ParquetDataSet gives the same error..