# questions
Hello everyone, happy weekend! Does anyone have an example of how to set GCP bucket credentials from the `catalog.yml` for a parquet of type `spark.SparkDataSet`? I'm trying to use the `.json` key file from Google Cloud but I don't know how to define it in the catalog. Thanks in advance 🙂
Update: my issue is with how the data is being read. I already set the `GOOGLE_APPLICATION_CREDENTIALS` env variable to point to my ADC json file as described here. But when I try to read the parquet file, which I set in the catalog as usual:
```yaml
_pyspark: &pyspark
  type: spark.SparkDataSet
  file_format: parquet
  load_args:
    header: true
  save_args:
    mode: overwrite
    sep: ','
    header: True

user_activity_data@pyspark:
  <<: *pyspark
  filepath: ${gcp.enriched_data}/user_activity_data.parquet
```
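(For context, `${gcp.enriched_data}` is a templated value; assuming a `TemplatedConfigLoader`-style `globals.yml`, the entry behind it would look roughly like this, with the bucket and project names as placeholders:)
```yaml
# conf/base/globals.yml -- sketch only; bucket/project names are placeholders
gcp:
  enriched_data: gs://<my-bucket>/<my-project>/04_enriched_data
```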
I'm getting the following `DataSetError`:
```
kedro.io.core.DataSetError: Failed while loading data from data set SparkDataSet(file_format=parquet, filepath=gs://<my-bucket>/<my-project>/04_enriched_data/user_activity_data.parquet, load_args={'header': True}, save_args={'header': True, 'mode': overwrite, 'sep': ,}).
An error occurred while calling o51.load.
: org.apache.hadoop.fs.UnsupportedFileSystemException: No FileSystem for scheme "gs"
```
My bucket is correctly populated, so I don't know what the issue is. I also tried setting the following Spark (version 3.3.1) config:
```yaml
spark.hadoop.fs.gs.impl: com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem
spark.hadoop.fs.AbstractFileSystem.gs.impl: com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS
```
But the error is different:
```
kedro.io.core.DataSetError: Failed while loading data from data set SparkDataSet(file_format=parquet, filepath=<gs://third-echelon-bucket/brawlstars/04_enriched_data/user_activity_data.parquet>, load_args={'header': True}, save_args={'header': True, 'mode': overwrite, 'sep': ,}).
An error occurred while calling o55.load.
: java.lang.RuntimeException: java.lang.ClassNotFoundException: Class com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem not found
```
Guys, don't worry. I didn't have the GCS Hadoop connector installed 🙂 Anyone with a similar issue: download the `.jar` from https://cloud.google.com/dataproc/docs/concepts/connectors/cloud-storage and place it at `$SPARK_HOME/jars`, then set your `spark.yml` as:
```yaml
# Google Cloud Service Config
spark.jars.packages: com.google.cloud.bigdataoss:gcs-connector:hadoop3-2.3.0
spark.hadoop.fs.gs.impl: com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem
# The AbstractFileSystem for 'gs:' URIs
spark.hadoop.fs.AbstractFileSystem.gs.impl: com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS
```
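And to close the loop on the original question (pointing Spark at the service-account `.json` explicitly instead of relying only on `GOOGLE_APPLICATION_CREDENTIALS`): the GCS connector can also pick up a key file from Hadoop properties. A minimal sketch for `spark.yml`, assuming a hypothetical key path and that these property names are still valid for your connector version (double-check the connector docs):
```yaml
# Sketch only -- untested. Property names may differ between connector versions;
# newer releases prefer the fs.gs.auth.* keys. The key path below is a placeholder.
spark.hadoop.google.cloud.auth.service.account.enable: true
spark.hadoop.google.cloud.auth.service.account.json.keyfile: /path/to/service-account-key.json
```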