# questions
Hello everyone, happy weekend! Does anyone have an example of how to set GCP bucket credentials from the `catalog.yml` for a parquet of type `spark.SparkDataSet`? I'm trying to use the `.json` key file from Google Cloud but I don't know how to define it in the catalog. Thanks in advance 🙂
Update: my issue is with how the data is being read. I already set the `GOOGLE_APPLICATION_CREDENTIALS` env variable to point to my ADC json file as described here. But when I try to read the parquet file, which I set in the catalog as usual:
```yaml
_pyspark: &pyspark
  type: spark.SparkDataSet
  file_format: parquet
  load_args:
    header: true
  save_args:
    mode: overwrite
    sep: ','
    header: True

user_activity_data@pyspark:
  <<: *pyspark
  filepath: ${gcp.enriched_data}/user_activity_data.parquet
```
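(For context, `${gcp.enriched_data}` is a templated value; assuming a `TemplatedConfigLoader`-style `globals.yml`, the entry behind it would look roughly like this, with the bucket and project names as placeholders:)
```yaml
# conf/base/globals.yml -- sketch only; bucket/project names are placeholders
gcp:
  enriched_data: gs://<my-bucket>/<my-project>/04_enriched_data
```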
I'm getting the following `DataSetError`:
```
kedro.io.core.DataSetError: Failed while loading data from data set SparkDataSet(file_format=parquet, filepath=gs://<my-bucket>/<my-project>/04_enriched_data/user_activity_data.parquet, load_args={'header': True}, save_args={'header': True, 'mode': overwrite, 'sep': ,}).
An error occurred while calling o51.load.
: org.apache.hadoop.fs.UnsupportedFileSystemException: No FileSystem for scheme "gs"
```
My bucket is correctly populated, so I don't know what the issue is. I also tried setting the following Spark (version 3.3.1) config:
```yaml
spark.hadoop.fs.gs.impl: com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem
spark.hadoop.fs.AbstractFileSystem.gs.impl: com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS
```
But the error is different:
```
kedro.io.core.DataSetError: Failed while loading data from data set SparkDataSet(file_format=parquet, filepath=<gs://third-echelon-bucket/brawlstars/04_enriched_data/user_activity_data.parquet>, load_args={'header': True}, save_args={'header': True, 'mode': overwrite, 'sep': ,}).
An error occurred while calling o55.load.
: java.lang.RuntimeException: java.lang.ClassNotFoundException: Class com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem not found
```
Guys, don't worry. I didn't have the GCS Hadoop connector installed 🙂 Anyone with a similar issue: download the `.jar` from https://cloud.google.com/dataproc/docs/concepts/connectors/cloud-storage and place it at `$SPARK_HOME/jars`, then set your `spark.yml` as:
```yaml
# Google Cloud Service Config
spark.jars.packages: com.google.cloud.bigdataoss:gcs-connector:hadoop3-2.3.0
spark.hadoop.fs.gs.impl: com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem
# The AbstractFileSystem for 'gs:' URIs
spark.hadoop.fs.AbstractFileSystem.gs.impl: com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS
```
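And to close the loop on the original question (pointing Spark at the service-account `.json` explicitly instead of relying only on `GOOGLE_APPLICATION_CREDENTIALS`): the GCS connector can also pick up a key file from Hadoop properties. A minimal sketch for `spark.yml`, assuming a hypothetical key path and that these property names are still valid for your connector version (double-check the connector docs):
```yaml
# Sketch only -- untested. Property names may differ between connector versions;
# newer releases prefer the fs.gs.auth.* keys. The key path below is a placeholder.
spark.hadoop.google.cloud.auth.service.account.enable: true
spark.hadoop.google.cloud.auth.service.account.json.keyfile: /path/to/service-account-key.json
```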