# questions
s
Hi all! I am trying to run my Kedro 0.18.14 pipeline in a Databricks 12.2 environment but getting a
No such file or directory:
error. I am using transcoding to store this dataframe as a spark.SparkDataSet and then load it as a pandas.ParquetDataSet when this error happens. I tried loading the same path in
spark.read.parquet
and it loads. I have also reinstalled kedro using the notation
kedro[spark.SparkDataSet,pandas.ParquetDataSet]
I have also set the following in the spark.yml file:
spark.sql.execution.arrow.pyspark.enabled: true
Could you please suggest what I might be missing?
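(For context, a transcoded spark/pandas pair like the one described would look roughly like this in `conf/base/catalog.yml` — the dataset name and path here are hypothetical, not taken from the thread:)

```yaml
# Same logical dataset: written by Spark, read back by pandas (transcoding)
my_data@spark:
  type: spark.SparkDataSet
  filepath: abfss://container@account.dfs.core.windows.net/data/my_data
  file_format: parquet

my_data@pandas:
  type: pandas.ParquetDataSet
  filepath: abfss://container@account.dfs.core.windows.net/data/my_data
```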
n
If you just try to load both datasets individually, does it work? You can test it quickly in the notebook and load it with the catalog.
s
it doesn’t load with catalog
n
Which dataset doesn’t load? Is it both spark and pandas?
s
the pandas.. I write as spark, success.. but loading it into pandas fails.. even if I do it directly using pandas or the catalog
n
Are you saving it to DBFS/S3? By default Spark saves stuff to the Spark driver node, so sometimes it causes issues because pandas won't be able to see it
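(A common cause of exactly this symptom on Databricks, offered here only as a guess: Spark addresses DBFS with `dbfs:/...` URIs, while local-file libraries like pandas only see the FUSE mount at `/dbfs/...`, so the same file needs a different path spelling. A minimal sketch of that translation, with made-up paths:)

```python
def to_fuse_path(path: str) -> str:
    """Translate a Spark-style dbfs:/ URI into the /dbfs FUSE mount path
    that local-file libraries such as pandas can open directly."""
    if path.startswith("dbfs:/"):
        return "/dbfs/" + path[len("dbfs:/"):].lstrip("/")
    return path  # non-DBFS paths are left untouched

# Spark can read dbfs:/tmp/my_data, but pandas needs /dbfs/tmp/my_data
print(to_fuse_path("dbfs:/tmp/my_data"))  # /dbfs/tmp/my_data
```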
s
Yep, I checked in the azure browser as well.. the “missing” dataset is there
n
Were you able to load other data saved on Azure that is not Spark?
s
this is the first pandas dataset i encounter.. rest are spark
n
Second guess is a credentials issue, because Spark has its own native authentication mechanism
Is it a versioned dataset?
s
not a versioned dataset.. but I can try saving to local DBFS rather than mounted.. I think I got the same error yesterday and that's why I tried saving to ADLS.. thinking Databricks might not be liking this conversion
tried saving to a local workspace path in Databricks as well.. Spark works fine, pandas parquet doesn't
n
That's worth trying. I think you can try to load some non-Spark datasets first; my guess is it's not a transcoding-specific issue
s
ok.. but this is the suggested way to use spark and pandas together, right? just confirming if I am doing something wrong 😓
s
Hi @Shubham Agrawal, were you able to resolve this issue? I also have the same configuration, and I also tried loading non-Spark datasets using pandas.. and that works.. but loading a pandas.ParquetDataSet gives the same error..