Hi Team I am trying to read a bigquery table using the `spar Kedro #questions

Hi Team! I am trying to read a bigquery table usi...

Abhishek Bhatia

12/17/2024, 2:02 PM

Hi Team! I am trying to read a bigquery table using the

spark.SparkDataSet

with an arbitrary query as follows

Copy code

trx_agg_data:
 type: spark.SparkDataSet
 file_format: bigquery
 load_args:
   viewsEnabled: true
   query: | 
     SELECT ph.category, MAX(trx.sales) 
     FROM {project}.{dataset}.trx_data trx
     LEFT JOIN {project}.{dataset}.prod_hierarchy ph
 filepath: <gs://my-bucket/trx_agg_data.parquet>

The dataset complains that the

filepath

is not in the correct format (BigQuery expected

<project>.<dataset>.<table>

), but I am trying to read it with a query. The following code works

Copy code

spark.read.format("bigquery").option("query", "SELECT ph.category, MAX(trx.sales) 
     FROM {project}.{dataset}.trx_data trx
     LEFT JOIN {project}.{dataset}.prod_hierarchy ph"
).load()

Looks like

spark.SparkDataSet

does not have this functionality. Should I create a custom dataset here?

Hall

12/17/2024, 2:02 PM

Someone will reply to you shortly. In the meantime, this might help:

datajoely

12/17/2024, 2:07 PM

I think you'd have more luck with the JDBC approach https://docs.kedro.org/en/0.18.7/kedro.datasets.spark.SparkJDBCDataSet.html

Abhishek Bhatia

12/17/2024, 2:08 PM

It requires to pass a

table

init parameter but my SQL query can contain arbitrary number of tables

datajoely

12/17/2024, 2:09 PM

That sort of goes against the ergonomics Kedro is designed around

datajoely

12/17/2024, 2:09 PM

I guess you could try doing a dataset factory here

datajoely

12/17/2024, 2:09 PM

but it's a bit funky

Abhishek Bhatia

12/17/2024, 2:14 PM

Updated the post to show that I am using 2 tables in the query to BigQuery.

datajoely

12/17/2024, 2:14 PM

sure but the

spark.SparkDataSet

wasn't build around that pattern

datajoely

12/17/2024, 2:15 PM

you can extend it to work that way

datajoely

12/17/2024, 2:15 PM

but it doesn't out of the box

Abhishek Bhatia

12/17/2024, 2:17 PM

So similar to

kedro.datasets.pandas.GBQQueryDataSet

is exactly what I want

Copy code

vehicles:
  type: pandas.GBQQueryDataSet
  sql: "select shuttle, shuttle_id from spaceflights.shuttles;"
  project: my-project
  credentials: gbq-creds
  load_args:
    reauth: True

I think then creating

spark.GBQQueryDataSet

is my best bet?

datajoely

12/17/2024, 2:17 PM

Or you extend / override the existing spark.SparkDataSet to support this, if it works we'd love a PR back into Kedro

Abhishek Bhatia

12/17/2024, 2:20 PM

Sure, yes seems like a good idea! Although I must note that I am on

kedro==0.18.14

for this 🙂 Would be similar to implement for

kedro_datasets

package post

kedro>=0.19

though

datajoely

12/17/2024, 2:21 PM

because they're now decoupled they should largely be backwards compatible

✅ 1

datajoely

12/17/2024, 2:21 PM

get it working locally first and then I can help you get your contribution into

kedro-datasets

Abhishek Bhatia

12/17/2024, 2:24 PM

Very soon! Thanks @datajoely

Abhishek Bhatia

12/20/2024, 11:41 AM

Hi @datajoely + Kedro Team K Opened a PR on

kedro-plugins

to implement a new dataset

spark.GBQQueryDataset

feat(datasets): Implement `spark.GBQQueryDataset` for reading data from BigQuery as a spark dataframe using SQL query #971 Currently draft, but would be great if I can have some initial comments 🙂

❤️ 1

20 Views

Open in Slack

Previous Next