Rob
04/29/2023, 10:01 PM
How should I define the `catalog.yml` entry for a parquet dataset of type `spark.SparkDataSet`? I'm trying to use the credentials `.json` file from Google Cloud, but I don't know how to define it in the catalog. Thanks in advance 🙂

I have the `GOOGLE_APPLICATION_CREDENTIALS` env variable pointing to my ADC json file as described here.
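For reference, pointing the env variable at the ADC file looks like this; the path below is the default location written by `gcloud auth application-default login`, so adjust it if your file lives elsewhere:

```shell
# Default ADC location; substitute your own path if it differs.
export GOOGLE_APPLICATION_CREDENTIALS="$HOME/.config/gcloud/application_default_credentials.json"
echo "$GOOGLE_APPLICATION_CREDENTIALS"
```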
But then when I try to read the parquet file, which I set in the catalog as usual:
```yaml
_pyspark: &pyspark
  type: spark.SparkDataSet
  file_format: parquet
  load_args:
    header: true
  save_args:
    mode: overwrite
    sep: ','
    header: True

user_activity_data@pyspark:
  <<: *pyspark
  filepath: ${gcp.enriched_data}/user_activity_data.parquet
```
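(For anyone reading along: the `${gcp.enriched_data}` placeholder implies a templated config. A minimal sketch of the corresponding `globals.yml`, assuming Kedro's `TemplatedConfigLoader` and using the bucket path visible in the error below, with the names being placeholders:)

```yaml
gcp:
  enriched_data: gs://<my-bucket>/<my-project>/04_enriched_data
```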
I get the following `DataSetError`:
```
kedro.io.core.DataSetError: Failed while loading data from data set SparkDataSet(file_format=parquet, filepath=gs://<my-bucket>/<my-project>/04_enriched_data/user_activity_data.parquet, load_args={'header': True}, save_args={'header': True, 'mode': overwrite, 'sep': ,}).
An error occurred while calling o51.load.
: org.apache.hadoop.fs.UnsupportedFileSystemException: No FileSystem for scheme "gs"
```
My bucket is correctly populated, so I don't know what the issue is.
I also tried setting the following Spark (version 3.3.1) config:
```yaml
spark.hadoop.fs.gs.impl: com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem
spark.hadoop.fs.AbstractFileSystem.gs.impl: com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS
```
But the error is different:
```
kedro.io.core.DataSetError: Failed while loading data from data set SparkDataSet(file_format=parquet, filepath=<gs://third-echelon-bucket/brawlstars/04_enriched_data/user_activity_data.parquet>, load_args={'header': True}, save_args={'header': True, 'mode': overwrite, 'sep': ,}).
An error occurred while calling o55.load.
: java.lang.RuntimeException: java.lang.ClassNotFoundException: Class com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem not found
```
Solved: the GCS Hadoop connector needed to be installed 🙂
Anyone with a similar issue: download the `.jar` from the following link and place it at `$SPARK_HOME/jars`:
https://cloud.google.com/dataproc/docs/concepts/connectors/cloud-storage
And set your `spark.yml` as:
```yaml
# Google Cloud Service Config
com.google.cloud.bigdataoss: gcs-connector:hadoop3-2.3.0
spark.hadoop.fs.gs.impl: com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem
# The AbstractFileSystem for 'gs:' URIs
spark.hadoop.fs.AbstractFileSystem.gs.impl: com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS
```
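A quick way to sanity-check the manual install before rerunning the pipeline is to look for the connector jar under `$SPARK_HOME/jars`. A small sketch (the `/opt/spark` fallback is only a guess at a common install location):

```python
import glob
import os


def find_gcs_connector(spark_home: str) -> list:
    """Return any gcs-connector jars found under <spark_home>/jars."""
    return sorted(glob.glob(os.path.join(spark_home, "jars", "gcs-connector*.jar")))


if __name__ == "__main__":
    # /opt/spark is an assumed default; set SPARK_HOME to your real install.
    spark_home = os.environ.get("SPARK_HOME", "/opt/spark")
    jars = find_gcs_connector(spark_home)
    if jars:
        print("GCS connector found:", jars[0])
    else:
        print("gcs-connector jar not found; Spark cannot resolve gs:// paths")
```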