# questions
Mohamed:
Hello šŸ™‚ I'd like to read a BigQuery table using spark.SparkDataset, but I'm getting an error saying that I need to configure the project ID. Has anyone encountered this issue before?
Spark session:
```yaml
spark.jars.packages: io.delta:delta-spark_2.12:3.2.0
spark.jars: https://storage.googleapis.com/hadoop-lib/gcs/gcs-connector-hadoop3-latest.jar,https://repo1.maven.org/maven2/com/google/cloud/spark/spark-bigquery-with-dependencies_2.12/0.36.1/spark-bigquery-with-dependencies_2.12-0.36.1.jar
spark.sql.extensions: io.delta.sql.DeltaSparkSessionExtension
spark.sql.catalog.spark_catalog: org.apache.spark.sql.delta.catalog.DeltaCatalog
spark.hadoop.fs.gs.impl: com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem
```
Error:
```
DatasetError: Failed while loading data from dataset SparkDataset(file_format=bigquery, filepath=/tmp/dummy.parquet, load_args={'table': project_id.dataset_id.table_id}, save_args={}).
An error occurred while calling o45.load.
: com.google.cloud.spark.bigquery.repackaged.com.google.inject.ProvisionException: Unable to provision, see the following errors:

1) [Guice/ErrorInCustomProvider]: IllegalArgumentException: A project ID is required for this service but could not be determined from the builder or the environment. Please set a project ID using the builder.
  at SparkBigQueryConnectorModule.provideSparkBigQueryConfig(SparkBigQueryConnectorModule.java:102)
  while locating SparkBigQueryConfig

Learn more:
  https://github.com/google/guice/wiki/ERROR_IN_CUSTOM_PROVIDER

1 error

======================
Full classname legend:
======================
SparkBigQueryConfig:          "com.google.cloud.spark.bigquery.SparkBigQueryConfig"
SparkBigQueryConnectorModule: "com.google.cloud.spark.bigquery.SparkBigQueryConnectorModule"
========================
End of classname legend:
========================
```
Laura Couto:
Hey Mohamed, I believe you can add your project ID in the spark.yml file in your Kedro project. How are you passing the project ID to the Spark session?
Mohamed:
Hello @Laura Couto šŸ™‚, the only place I have declared the project ID is in the catalog, to identify the table in question. Do you think I should add any specific configuration to the spark.yml file?
catalog.yml:
```yaml
table_name:
  type: spark.SparkDataset
  file_format: bigquery
  filepath: "/tmp/dummy.parquet"
  load_args:
    table: "project_id.dataset_id.table_id"
```
Laura Couto:
I think you have to pass it to the Spark session, either by declaring it in the spark.yml file or in the hook where you initialise the session. https://docs.kedro.org/en/stable/integrations/pyspark_integration.html#centralise-spark-configuration-in-conf-base-spark-yml
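For reference, the approach on that docs page is a project hook that builds the SparkSession from conf/base/spark.yml. Below is a rough sketch of such a hook with a hypothetical place to inject the GCP project ID; the parentProject key is an assumption based on the BigQuery connector's options, not something taken from the Kedro docs, and the exact config-loader access may vary by Kedro version.

```python
# Sketch of a hooks.py following the linked Kedro docs page (paraphrased, not
# verified against a specific Kedro version).
from kedro.framework.hooks import hook_impl
from pyspark import SparkConf
from pyspark.sql import SparkSession


class SparkHooks:
    @hook_impl
    def after_context_created(self, context) -> None:
        """Initialise a SparkSession from conf/base/spark.yml."""
        # Assumes "spark" is registered as a config pattern for the config loader.
        parameters = context.config_loader["spark"]
        spark_conf = SparkConf().setAll(parameters.items())

        spark_session = (
            SparkSession.builder.appName(context.project_path.name)
            .config(conf=spark_conf)
            # Hypothetical: pass the GCP project ID to the BigQuery connector here.
            .config("parentProject", "my-gcp-project-id")
            .getOrCreate()
        )
        spark_session.sparkContext.setLogLevel("WARN")
```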
Mohamed:
Yes, that's a good lead, thank you @Laura Couto šŸ™‚. However, I can't find anything on the internet or in the Kedro documentation that explains how to properly configure these parameters in the Spark session. I haven't found any information related to these two classes, which seem to be the ones I need to configure:
• com.google.cloud.spark.bigquery.SparkBigQueryConfig
• com.google.cloud.spark.bigquery.SparkBigQueryConnectorModule
Laura Couto:
Would you mind sharing how you're configuring your Spark session?
Mohamed:
I am using the spark.yml file with this configuration:
```yaml
spark.jars.packages: io.delta:delta-spark_2.12:3.2.0
spark.jars: https://storage.googleapis.com/hadoop-lib/gcs/gcs-connector-hadoop3-latest.jar,https://repo1.maven.org/maven2/com/google/cloud/spark/spark-bigquery-with-dependencies_2.12/0.36.1/spark-bigquery-with-dependencies_2.12-0.36.1.jar
spark.sql.extensions: io.delta.sql.DeltaSparkSessionExtension
spark.sql.catalog.spark_catalog: org.apache.spark.sql.delta.catalog.DeltaCatalog
spark.hadoop.fs.gs.impl: com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem
```
And I am using the basic Hook provided by Kedro to configure a Spark session.
Nok Lam Chan:
This sounds more like a Spark configuration issue; you should consult the BigQuery/Spark connector docs instead.
šŸ‘ 1
Laura Couto:
Try passing this to the session builder; it's what I could find in the Spark docs.
```python
.config('parentProject', 'google-project-ID')
```
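For illustration, this is roughly where that option could sit if you wanted to test the read outside Kedro. The project and table names below are placeholders, and since whether a builder-level setting is picked up can depend on the connector version, the same option is also shown as a read option.

```python
# Hypothetical standalone check (placeholder project / table names).
# Assumes the spark-bigquery connector jar is already on the classpath.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("bq-read-test")
    # The option Laura mentioned: which GCP project the connector should use.
    .config("parentProject", "google-project-ID")
    .getOrCreate()
)

df = (
    spark.read.format("bigquery")
    .option("parentProject", "google-project-ID")  # same option at read time
    .option("table", "project_id.dataset_id.table_id")
    .load()
)
df.show(5)
```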
Mohamed:
Hi @Nok Lam Chan and @Laura Couto šŸ™‚, I wanted to share how I successfully read the BigQuery table. It turns out that some configurations were missing for reading the table. Here is the configuration I used:
```yaml
table_name:
  type: spark.SparkDataset
  file_format: bigquery
  filepath: "tmp/dummy.parquet"
  load_args:
    dataset: "project_id"
    table: "project_id.dataset_id.table_name"
```
I also verified that the service account (SA) had the following roles:
• Storage Object Viewer
• BigQuery Data Viewer
• BigQuery Read Session User
After properly configuring the credentials, and without altering the Spark configuration I shared with you, I was able to read the BigQuery table successfully.
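For anyone landing here later: as far as I understand, kedro-datasets' SparkDataset roughly forwards file_format and load_args to spark.read, so the working catalog entry above should be approximately equivalent to a plain PySpark call like the sketch below. The names are the placeholders from the entry, not real resources, and this is my reading of the behaviour rather than the exact kedro-datasets implementation.

```python
# Rough PySpark equivalent of the catalog entry above.
from pyspark.sql import SparkSession

# Picks up the session (and spark.yml settings) built by the project hook.
spark = SparkSession.builder.getOrCreate()

df = spark.read.load(
    "tmp/dummy.parquet",                        # dummy filepath from the entry
    format="bigquery",                          # file_format
    dataset="project_id",                       # load_args, forwarded as connector options
    table="project_id.dataset_id.table_name",
)
```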
Nok Lam Chan:
^ Which configuration was the missing one?
Maybe it's something worth mentioning in the docs as an example, at least.
šŸ‘ 1
Mohamed:
Yes, what was missing was the addition of the dataset parameter in load_args. I believe it would indeed be useful to include this configuration in the documentation.