# questions
y
Hello everyone! I am trying to set up Kedro on my machine for an existing project and pipeline. My colleague and I have similar dependencies, and the project works perfectly fine on their machine. The error I get is related to writing a parquet file. To debug, I have:
• validated that PySpark works when reading and writing a parquet file, including overwriting an existing file
• loaded Kedro Jupyter Lab and tried to load and write a parquet file; loading works, but writing gives me the same error message as when I run the pipeline (Failed while saving data to data set)
@Rabeez Riaz
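For reference, a minimal sketch of the second debugging step above, assuming the dataset is registered in the catalog under an illustrative name like data_example. In a session started with kedro jupyter lab, Kedro already exposes a catalog object:

# loading works, saving reproduces the "Failed while saving data to data set" error
df = catalog.load("data_example")    # reads the parquet file through the Spark dataset
catalog.save("data_example", df)     # triggers the underlying Spark write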
d
Thanks for sharing those debugging steps upfront! Can you share:
1. The vanilla PySpark command that works to write the dataset
2. The catalog entry
3. The error message
Kedro datasets are little more than a wrapper around the underlying engines, so we should be able to see if the code that worked in vanilla PySpark isn't equivalent to the one getting run by the dataset.
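For context, a simplified sketch of what a Kedro Spark dataset does on save (not the exact library code; the function and argument names here are illustrative). It hands the filepath, file format, and any save_args straight to Spark's DataFrameWriter:

# simplified sketch of the dataset's save path
def save_with_spark(df, filepath, file_format="parquet", save_args=None):
    df.write.save(filepath, file_format, **(save_args or {}))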
y
Hi! Thanks for the response! Here is the information:
1.
import pyspark
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("parquetFile").getOrCreate()
df = spark.read.parquet('test.parquet')
df.write.mode('overwrite').parquet("test.parquet")
2.
data_example:
  <<: *sp_pq
  filepath: data/filename
  layer: cleaned
3.
2022-10-25 15:02:07,057 - kedro.io.data_catalog - INFO - Saving data to `filename` (SparkDataSet)...
22/10/25 13:02:07 ERROR Utils: Aborting task
java.io.FileNotFoundException: 
/Users/mynamehere/Documents/project_folder/data/filename (Is a directory)

It is possible the underlying files have been updated. You can explicitly invalidate
the cache in Spark by running 'REFRESH TABLE tableName' command in SQL or by
recreating the Dataset/DataFrame involved.
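One detail worth keeping in mind when reading that traceback: Spark writes parquet output as a directory of part files, not a single file, so a catalog filepath like data/filename is expected to resolve to a directory. A sketch of what that looks like on disk (file names are typical examples, not the exact ones in this project):

df.write.mode("overwrite").parquet("data/filename")
# on disk this produces a directory, e.g.
#   data/filename/_SUCCESS
#   data/filename/part-00000-<uuid>-c000.snappy.parquet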
d
What if you change your catalog entry to test.parquet? That would be more equivalent. Similarly, what if you do df.write to "data/filename"?
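In other words, something like the following (paths are the placeholder names used in this thread), so the vanilla test and the catalog entry point at the same location:

# same write as the working vanilla test, but against the path from the catalog entry
df = spark.read.parquet("test.parquet")
df.write.mode("overwrite").parquet("data/filename")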
y
Oh sorry, that is just a random name I replaced; in my local project both files have the same name, "data/filename"
Additional question: should I specify the extension in the catalog entry, such as "data/filename.parquet"?
d
Extension shouldn't be necessary
Oh, also, your catalog entry doesn't specify overwrite mode
Can you add
save_args:
  mode: overwrite
Unless sp_parquet includes that already
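Put together, the suggested entry would look roughly like this (sp_pq stands in for whatever shared anchor the project actually defines; shown only for illustration):

data_example:
  <<: *sp_pq
  filepath: data/filename
  layer: cleaned
  save_args:
    mode: overwrite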
Also, are you reading and writing to the same location by chance?
y
Hi again, I specified the overwrite mode at the beginning:
save_args:
  mode: overwrite
  header: true
  sep: ','
  decimal: '.'
and yes, I am writing and reading to the same location
d
Sorry, I miswrote that. I meant, are you writing to the same location in any process earlier in the pipeline? I was taking a look at https://stackoverflow.com/questions/42607591/filenotfoundexception-when-trying-to-save-dataframe-to-parquet-format-with-ove
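For reference, the failure mode behind that Stack Overflow question: with overwrite mode, Spark clears the target directory before the lazy read of that same directory has fully materialised, so the job can die mid-write with a FileNotFoundException. A sketch of the pattern and one common workaround (paths are illustrative):

# problematic: source and overwrite target are the same directory
df = spark.read.parquet("data/filename")
df.write.mode("overwrite").parquet("data/filename")  # may fail with FileNotFoundException

# one workaround: materialise to a temporary location first, then overwrite the original
df.write.mode("overwrite").parquet("data/filename_tmp")
spark.read.parquet("data/filename_tmp").write.mode("overwrite").parquet("data/filename")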
r
@Deepyaman Datta thanks for the help so far. The “FileNotFound” issue was a wild goose chase: during debugging we accidentally tried saving a dataframe back to its source file and PySpark complained. The actual issue is a generic “DataSetError” on the save method for the Spark dataset (parquet), but only when running through the Kedro catalog. As shown above, both load and save work correctly when using the direct PySpark code.
👍 1
d
If you're able to reproduce this in a codebase where nothing is sensitive, I can potentially hop on a call and take a look. In short, I think the main process would be:
1. See exactly what command is being run to save the data when run from the Jupyter notebook
2. Try to run that exact same command outside of Kedro, from the Jupyter notebook
3. If 2 doesn't work, identify the difference between that and the command that you can run in vanilla PySpark; if 2 does work, then we have to investigate what's happening more deeply, but I imagine (hope) this isn't the case
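For step 1, one way to inspect what the catalog will actually run, from the same kedro jupyter lab session (attribute access on catalog.datasets works in Kedro versions of this era; the dataset name is the placeholder from this thread):

ds = catalog.datasets.data_example   # the resolved dataset object
print(type(ds))                      # which dataset class (and therefore which pyspark) is in play
print(ds)                            # its filepath, file format, and load/save args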
r
Figured out the issue there. There was a hidden installation of pyspark from an earlier setup attempt via pip, which was being picked up instead of the new one we were installing with conda. Removing all traces of pyspark from the system and then doing a single clean install of the required version resolved the issue.
🥳 2
conda list and pip list were only showing one of the versions, so we didn't pick up on the issue initially
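For anyone hitting the same thing, a quick check of which pyspark the interpreter actually imports, independent of what conda list and pip list report:

import pyspark
print(pyspark.__version__)  # the version actually being imported
print(pyspark.__file__)     # where it lives, e.g. a pip vs conda site-packages path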
d
Cool. Glad it's resolved!