# questions
d
Hi Team, seeing this error while trying to load a parquet file using `catalog.load` in Kedro 0.18.4:
```
Py4JJavaError: An error occurred while calling o186.load.
: java.lang.RuntimeException: java.lang.ClassNotFoundException: Class
org.apache.hadoop.fs.s3a.S3AFileSystem not found
```
any thoughts on what could be going on here?
w
You’re missing some jars in your classpath, specifically `hadoop-aws-x.x.x.jar` and `aws-java-sdk-bundle-x.x.xxxx.jar`. You need to add the right jar versions for your PySpark version to the classpath, otherwise it won’t work. To prompt PySpark to download the jars (I’m assuming you’re running PySpark locally), you can set an environment variable like this (I do it in an `after_context_created` hook):
```python
import os

os.environ['PYSPARK_SUBMIT_ARGS'] = f'--packages "org.apache.hadoop:hadoop-aws:{version}" pyspark-shell'
```
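For context, here's a minimal sketch of such a hook (assuming Kedro 0.18.x and local PySpark; `SparkEnvHooks` is just an illustrative name, register it in your project's `settings.py` via `HOOKS = (SparkEnvHooks(),)`):
```python
# hooks.py -- minimal sketch, assuming Kedro 0.18.x running PySpark locally
import os

from kedro.framework.hooks import hook_impl


class SparkEnvHooks:  # illustrative name, not a Kedro built-in
    @hook_impl
    def after_context_created(self, context) -> None:
        # Must run before the first SparkSession is created, so that
        # spark-submit sees --packages and pulls hadoop-aws (plus its
        # aws-java-sdk-bundle dependency) into the local ivy2 cache.
        version = "3.3.2"  # match the Hadoop version your PySpark was built with
        os.environ["PYSPARK_SUBMIT_ARGS"] = (
            f'--packages "org.apache.hadoop:hadoop-aws:{version}" pyspark-shell'
        )
```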
If you’re using PySpark 3.3.0 then `version = "3.3.2"` (the Hadoop version that PySpark 3.3.0 is built against).
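If you're unsure which Hadoop version your PySpark build ships with, you can ask the JVM directly (a sketch; `_jvm` is PySpark's py4j gateway, technically a private attribute):
```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").getOrCreate()
# Prints the Hadoop version baked into this PySpark build, e.g. "3.3.2"
print(spark.sparkContext._jvm.org.apache.hadoop.util.VersionInfo.getVersion())
```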
You might also need to add the jar paths to your spark.yml conf file, for example:
```yaml
spark.driver.extraClassPath: "/home/user/.ivy2/cache/org.apache.hadoop/hadoop-aws/jars/hadoop-aws-3.3.2.jar:/home/user/.ivy2/cache/com.amazonaws/aws-java-sdk-bundle/jars/aws-java-sdk-bundle-1.11.1026.jar"
```
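And roughly how that spark.yml gets applied, following the pattern from the Kedro PySpark docs (a sketch; the app name and glob patterns are placeholders to adapt):
```python
# hooks.py -- sketch of a hook that builds the SparkSession from spark.yml,
# based on the pattern in the Kedro 0.18 PySpark docs
from kedro.framework.hooks import hook_impl
from pyspark import SparkConf
from pyspark.sql import SparkSession


class SparkHooks:
    @hook_impl
    def after_context_created(self, context) -> None:
        # Load conf/base/spark.yml (plus any environment overrides)
        parameters = context.config_loader.get("spark*", "spark*/**")
        spark_conf = SparkConf().setAll(parameters.items())

        # extraClassPath from spark.yml is applied when the session is built
        _ = (
            SparkSession.builder.appName("my-kedro-project")  # placeholder name
            .config(conf=spark_conf)
            .getOrCreate()
        )
```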