divas verma
04/14/2023, 11:17 AM
catalog.load in kedro 0.18.4
Py4JJavaError: An error occurred while calling o186.load.
: java.lang.RuntimeException: java.lang.ClassNotFoundException: Class
org.apache.hadoop.fs.s3a.S3AFileSystem not found
any thoughts on what could be going on here?

William Caicedo
04/14/2023, 11:51 AM
You need hadoop-aws-x.x.x.jar and aws-java-sdk-bundle-x.x.xxxx.jar on the classpath. Add the jar versions that match your PySpark version, otherwise it won’t work.
To prompt PySpark to download the jars (I’m assuming you’re running PySpark locally), you can set an environment variable like this (I do it in an after_context_created hook):
os.environ['PYSPARK_SUBMIT_ARGS'] = f'--packages "org.apache.hadoop:hadoop-aws:{version}" pyspark-shell'
If you’re using PySpark 3.3.0, then version = 3.3.2.
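Put together, the hook William describes might look roughly like this. This is a sketch, not a verbatim project file: the class name is made up, and Kedro's @hook_impl decorator and the settings.py registration are omitted so the snippet stays dependency-free.

```python
import os


class SparkPackagesHook:
    """Sketch of an after_context_created hook that asks Spark to fetch
    the hadoop-aws connector from Maven at startup.

    In a real Kedro project this method would carry the @hook_impl
    decorator and the class would be registered in settings.py HOOKS.
    """

    def after_context_created(self, context):
        # PySpark 3.3.0 bundles Hadoop 3.3.2, so request the matching
        # hadoop-aws jar (mismatched versions cause the ClassNotFoundException
        # above).
        version = "3.3.2"
        # Must be set before the SparkSession is created; Spark then downloads
        # the jars (by default into ~/.ivy2/cache) when the JVM starts.
        os.environ["PYSPARK_SUBMIT_ARGS"] = (
            f'--packages "org.apache.hadoop:hadoop-aws:{version}" pyspark-shell'
        )


# Simulate the hook firing (Kedro would pass a real context object).
SparkPackagesHook().after_context_created(context=None)
```

The key point is ordering: the environment variable only has an effect if it is set before any SparkSession is built, which is why a hook that runs early in the Kedro lifecycle is a convenient place for it.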
You might also need to add the downloaded jar paths to your spark.yml conf file, for example:
spark.driver.extraClassPath: "/home/user/.ivy2/cache/org.apache.hadoop/hadoop-aws/jars/hadoop-aws-3.3.2.jar:/home/user/.ivy2/cache/com.amazonaws/aws-java-sdk-bundle/jars/aws-java-sdk-bundle-1.11.1026.jar"