# questions
Mohamed:
Hello šŸ™‚ I'd like to read a BigQuery table using spark.SparkDataset, but I'm getting an error saying that I need to configure the project ID. Has anyone encountered this issue before?
Spark session:
```yaml
spark.jars.packages: io.delta:delta-spark_2.12:3.2.0
spark.jars: https://storage.googleapis.com/hadoop-lib/gcs/gcs-connector-hadoop3-latest.jar,https://repo1.maven.org/maven2/com/google/cloud/spark/spark-bigquery-with-dependencies_2.12/0.36.1/spark-bigquery-with-dependencies_2.12-0.36.1.jar
spark.sql.extensions: io.delta.sql.DeltaSparkSessionExtension
spark.sql.catalog.spark_catalog: org.apache.spark.sql.delta.catalog.DeltaCatalog
spark.hadoop.fs.gs.impl: com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem
```
Error:
```
DatasetError: Failed while loading data from dataset SparkDataset(file_format=bigquery, filepath=/tmp/dummy.parquet, load_args={'table': project_id.dataset_id.table_id}, save_args={}).
An error occurred while calling o45.load.
: com.google.cloud.spark.bigquery.repackaged.com.google.inject.ProvisionException: Unable to provision, see the following errors:

1) [Guice/ErrorInCustomProvider]: IllegalArgumentException: A project ID is required for this service but could not be determined from the builder or the environment. Please set a project ID using the builder.
  at SparkBigQueryConnectorModule.provideSparkBigQueryConfig(SparkBigQueryConnectorModule.java:102)
  while locating SparkBigQueryConfig

Learn more:
  https://github.com/google/guice/wiki/ERROR_IN_CUSTOM_PROVIDER

1 error

======================
Full classname legend:
======================
SparkBigQueryConfig:          "com.google.cloud.spark.bigquery.SparkBigQueryConfig"
SparkBigQueryConnectorModule: "com.google.cloud.spark.bigquery.SparkBigQueryConnectorModule"
========================
End of classname legend:
========================
```
Laura Couto:
Hey Mohamed, I believe you can add your project ID in the spark.yml file in your Kedro project. How are you passing the project ID to the Spark session?
Mohamed:
Hello @Laura Couto šŸ™‚, the only place I have declared the project ID is in the catalog, to identify the table in question. Do you think I should add any specific configuration to the spark.yml file?
catalog.yml:
```yaml
table_name:
  type: spark.SparkDataset
  file_format: bigquery
  filepath: "/tmp/dummy.parquet"
  load_args:
    table: "project_id.dataset_id.table_id"
```
Laura Couto:
I think you have to pass it to the Spark session, either by declaring it in the spark.yml file or in the hook where you initialise the session. https://docs.kedro.org/en/stable/integrations/pyspark_integration.html#centralise-spark-configuration-in-conf-base-spark-yml
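For reference, the approach on that docs page is a project hook that builds the SparkSession from conf/base/spark.yml. Below is a rough sketch of such a hook with a hypothetical place to inject the GCP project ID; the parentProject key is an assumption based on the BigQuery connector's options, not something taken from the Kedro docs, and the exact config-loader access may vary by Kedro version.

```python
# Sketch of a hooks.py following the linked Kedro docs page (paraphrased, not
# verified against a specific Kedro version).
from kedro.framework.hooks import hook_impl
from pyspark import SparkConf
from pyspark.sql import SparkSession


class SparkHooks:
    @hook_impl
    def after_context_created(self, context) -> None:
        """Initialise a SparkSession from conf/base/spark.yml."""
        # Assumes "spark" is registered as a config pattern for the config loader.
        parameters = context.config_loader["spark"]
        spark_conf = SparkConf().setAll(parameters.items())

        spark_session = (
            SparkSession.builder.appName(context.project_path.name)
            .config(conf=spark_conf)
            # Hypothetical: pass the GCP project ID to the BigQuery connector here.
            .config("parentProject", "my-gcp-project-id")
            .getOrCreate()
        )
        spark_session.sparkContext.setLogLevel("WARN")
```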
Mohamed:
Yes, that's a good lead, thank you @Laura Couto šŸ™‚. However, I can't find anything on the internet or in the Kedro documentation that explains how to properly configure these parameters in the Spark session. I haven't found any information related to these two classes, which seem to be the ones I need to configure:
• com.google.cloud.spark.bigquery.SparkBigQueryConfig
• com.google.cloud.spark.bigquery.SparkBigQueryConnectorModule
Laura Couto:
Would you mind sharing how you're configuring your Spark session?
Mohamed:
I am using the spark.yml file with this configuration:
```yaml
spark.jars.packages: io.delta:delta-spark_2.12:3.2.0
spark.jars: https://storage.googleapis.com/hadoop-lib/gcs/gcs-connector-hadoop3-latest.jar,https://repo1.maven.org/maven2/com/google/cloud/spark/spark-bigquery-with-dependencies_2.12/0.36.1/spark-bigquery-with-dependencies_2.12-0.36.1.jar
spark.sql.extensions: io.delta.sql.DeltaSparkSessionExtension
spark.sql.catalog.spark_catalog: org.apache.spark.sql.delta.catalog.DeltaCatalog
spark.hadoop.fs.gs.impl: com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem
```
And I am using the basic Hook provided by Kedro to configure a Spark session.
Nok Lam Chan:
This sounds more like a Spark configuration issue; you should consult the BigQuery/Spark connector docs instead.
šŸ‘ 1
Laura Couto:
Try passing this to the session builder; it's what I could find in the Spark docs.
```python
.config('parentProject', 'google-project-ID')
```
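For illustration, this is roughly where that option could sit if you wanted to test the read outside Kedro. The project and table names below are placeholders, and since whether a builder-level setting is picked up can depend on the connector version, the same option is also shown as a read option.

```python
# Hypothetical standalone check (placeholder project / table names).
# Assumes the spark-bigquery connector jar is already on the classpath.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("bq-read-test")
    # The option Laura mentioned: which GCP project the connector should use.
    .config("parentProject", "google-project-ID")
    .getOrCreate()
)

df = (
    spark.read.format("bigquery")
    .option("parentProject", "google-project-ID")  # same option at read time
    .option("table", "project_id.dataset_id.table_id")
    .load()
)
df.show(5)
```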
Mohamed:
Hi @Nok Lam Chan and @Laura Couto šŸ™‚, I wanted to share how I successfully read the BigQuery table. It turns out that some configurations were missing for reading the table. Here is the configuration I used:
```yaml
table_name:
  type: spark.SparkDataset
  file_format: bigquery
  filepath: "tmp/dummy.parquet"
  load_args:
    dataset: "project_id"
    table: "project_id.dataset_id.table_name"
```
I also verified that the service account (SA) had the following roles:
• Storage Object Viewer
• BigQuery Data Viewer
• BigQuery Read Session User
After properly configuring the credentials, and without altering the Spark configuration I shared with you, I was able to read the BigQuery table successfully.
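For anyone landing here later: as far as I understand, kedro-datasets' SparkDataset roughly forwards file_format and load_args to spark.read, so the working catalog entry above should be approximately equivalent to a plain PySpark call like the sketch below. The names are the placeholders from the entry, not real resources, and this is my reading of the behaviour rather than the exact kedro-datasets implementation.

```python
# Rough PySpark equivalent of the catalog entry above.
from pyspark.sql import SparkSession

# Picks up the session (and spark.yml settings) built by the project hook.
spark = SparkSession.builder.getOrCreate()

df = spark.read.load(
    "tmp/dummy.parquet",                        # dummy filepath from the entry
    format="bigquery",                          # file_format
    dataset="project_id",                       # load_args, forwarded as connector options
    table="project_id.dataset_id.table_name",
)
```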
Nok Lam Chan:
^ Which configuration was the missing one?
Maybe it's something worth mentioning in the docs as an example, at least.
šŸ‘ 1
Mohamed:
Yes, what was missing was the addition of the dataset parameter in load_args. I believe it would indeed be useful to include this configuration in the documentation.