# questions
y
Hello everyone! I am trying to set up Kedro on my machine for an existing project and pipeline. My colleague and I have similar dependencies, and the project works perfectly fine on their machine. The error I get is related to writing a parquet file. To debug, I have:
• validated that PySpark works when reading and writing a parquet file, including overwriting an existing file
• loaded Kedro Jupyter Lab and tried to load and write a parquet file; loading works, but writing gives me the same error message as when I run the pipeline (Failed while saving data to data set)
@Rabeez Riaz
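For reference, a minimal sketch of the second debugging step above, assuming the dataset is registered in the catalog under an illustrative name like data_example. In a session started with kedro jupyter lab, Kedro already exposes a catalog object:

# loading works, saving reproduces the "Failed while saving data to data set" error
df = catalog.load("data_example")    # reads the parquet file through the Spark dataset
catalog.save("data_example", df)     # triggers the underlying Spark write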
d
Thanks for sharing those debugging steps upfront! Can you share:
1. The vanilla PySpark command that works to write the dataset
2. The catalog entry
3. The error message
Kedro datasets are little more than a wrapper around the underlying engines, so we should be able to see if the code that worked in vanilla PySpark isn't equivalent to the one getting run by the dataset.
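For context, a simplified sketch of what a Kedro Spark dataset does on save (not the exact library code; the function and argument names here are illustrative). It hands the filepath, file format, and any save_args straight to Spark's DataFrameWriter:

# simplified sketch of the dataset's save path
def save_with_spark(df, filepath, file_format="parquet", save_args=None):
    df.write.save(filepath, file_format, **(save_args or {}))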
y
Hi! Thanks for the response! Here is the information:
1.
import pyspark
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("parquetFile").getOrCreate()
df = spark.read.parquet('test.parquet')
df.write.mode('overwrite').parquet("test.parquet")
2.
data_example:
  <<: *sp_pq
  filepath: data/filename
  layer: cleaned
3.
2022-10-25 15:02:07,057 - kedro.io.data_catalog - INFO - Saving data to `filename` (SparkDataSet)...
22/10/25 13:02:07 ERROR Utils: Aborting task
java.io.FileNotFoundException: 
/Users/mynamehere/Documents/project_folder/data/filename (Is a directory)

It is possible the underlying files have been updated. You can explicitly invalidate
the cache in Spark by running 'REFRESH TABLE tableName' command in SQL or by
recreating the Dataset/DataFrame involved.
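One detail worth keeping in mind when reading that traceback: Spark writes parquet output as a directory of part files, not a single file, so a catalog filepath like data/filename is expected to resolve to a directory. A sketch of what that looks like on disk (file names are typical examples, not the exact ones in this project):

df.write.mode("overwrite").parquet("data/filename")
# on disk this produces a directory, e.g.
#   data/filename/_SUCCESS
#   data/filename/part-00000-<uuid>-c000.snappy.parquet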
d
What if you change your catalog entry to test.parquet? That would be more equivalent. Similarly, what if you do df.write to "data/filename"?
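In other words, something like the following (paths are the placeholder names used in this thread), so the vanilla test and the catalog entry point at the same location:

# same write as the working vanilla test, but against the path from the catalog entry
df = spark.read.parquet("test.parquet")
df.write.mode("overwrite").parquet("data/filename")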
y
Oh sorry, that is just a random name I replaced; in my local project both files have the same name, "data/filename"
Additional question: should I specify the extension in the catalog entry, such as "data/filename.parquet"?
d
Extension shouldn't be necessary
Oh, also, your catalog entry doesn't specify overwrite mode
Can you add
save_args:
  mode: overwrite
Unless sp_parquet includes that already
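Put together, the suggested entry would look roughly like this (sp_pq stands in for whatever shared anchor the project actually defines; shown only for illustration):

data_example:
  <<: *sp_pq
  filepath: data/filename
  layer: cleaned
  save_args:
    mode: overwrite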
Also, are you reading and writing to the same location by chance?
y
Hi again, I specified the overwrite mode at the beginning:
save_args:
  mode: overwrite
  header: true
  sep: ','
  decimal: '.'
and yes, I am writing and reading to the same location
d
Sorry, I miswrote that. I meant, are you writing to the same location in any process earlier in the pipeline? I was taking a look at https://stackoverflow.com/questions/42607591/filenotfoundexception-when-trying-to-save-dataframe-to-parquet-format-with-ove
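For reference, the failure mode behind that Stack Overflow question: with overwrite mode, Spark clears the target directory before the lazy read of that same directory has fully materialised, so the job can die mid-write with a FileNotFoundException. A sketch of the pattern and one common workaround (paths are illustrative):

# problematic: source and overwrite target are the same directory
df = spark.read.parquet("data/filename")
df.write.mode("overwrite").parquet("data/filename")  # may fail with FileNotFoundException

# one workaround: materialise to a temporary location first, then overwrite the original
df.write.mode("overwrite").parquet("data/filename_tmp")
spark.read.parquet("data/filename_tmp").write.mode("overwrite").parquet("data/filename")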
r
@Deepyaman Datta thanks for the help so far. The “FileNotFound” issue was a wild goose chase: during debugging we accidentally tried saving a dataframe back to its source file and PySpark complained. The actual issue is a generic “DataSetError” on the save method for the Spark dataset (parquet), but only when running through the Kedro catalog. As shown above, both load and save work correctly when using the direct PySpark code.
👍 1
d
If you're able to reproduce this in a codebase where nothing is sensitive, I can potentially hop on a call and take a look. In short, I think the main process would be:
1. See exactly what command is being run to save the data when run from the Jupyter notebook
2. Try to run that exact same command outside of Kedro, from the Jupyter notebook
3. If 2 doesn't work, identify the difference between that and the command that you can run in vanilla PySpark; if 2 does work, then we have to investigate what's happening more deeply, but I imagine (hope) this isn't the case
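For step 1, one way to inspect what the catalog will actually run, from the same kedro jupyter lab session (attribute access on catalog.datasets works in Kedro versions of this era; the dataset name is the placeholder from this thread):

ds = catalog.datasets.data_example   # the resolved dataset object
print(type(ds))                      # which dataset class (and therefore which pyspark) is in play
print(ds)                            # its filepath, file format, and load/save args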
r
Figured out the issue there. There was a hidden installation of pyspark from an earlier setup attempt via pip, which was being picked up instead of the new one we were installing with conda. Removing all traces of pyspark from the system and then doing a single clean install of the required version resolved the issue.
🥳 2
conda list and pip list were only showing one of the versions, so we didn't pick up on the issue initially
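For anyone hitting the same thing, a quick check of which pyspark the interpreter actually imports, independent of what conda list and pip list report:

import pyspark
print(pyspark.__version__)  # the version actually being imported
print(pyspark.__file__)     # where it lives, e.g. a pip vs conda site-packages path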
d
Cool. Glad it's resolved!