Hi Team! I am trying to read a bigquery table usi...
# questions
a
Hi Team! I am trying to read a bigquery table using the
spark.SparkDataSet
with an arbitrary query as follows
Copy code
trx_agg_data:
 type: spark.SparkDataSet
 file_format: bigquery
 load_args:
   viewsEnabled: true
   query: | 
     SELECT ph.category, MAX(trx.sales) 
     FROM {project}.{dataset}.trx_data trx
     LEFT JOIN {project}.{dataset}.prod_hierarchy ph
 filepath: <gs://my-bucket/trx_agg_data.parquet>
The dataset complains that the
filepath
is not in the correct format (BigQuery expected
<project>.<dataset>.<table>
), but I am trying to read it with a query. The following code works
Copy code
spark.read.format("bigquery").option("query", "SELECT ph.category, MAX(trx.sales) 
     FROM {project}.{dataset}.trx_data trx
     LEFT JOIN {project}.{dataset}.prod_hierarchy ph"
).load()
Looks like
spark.SparkDataSet
does not have this functionality. Should I create a custom dataset here?
h
Someone will reply to you shortly. In the meantime, this might help:
d
I think you'd have more luck with the JDBC approach https://docs.kedro.org/en/0.18.7/kedro.datasets.spark.SparkJDBCDataSet.html
a
It requires to pass a
table
init parameter but my SQL query can contain arbitrary number of tables
d
That sort of goes against the ergonomics Kedro is designed around
I guess you could try doing a dataset factory here
but it's a bit funky
a
Updated the post to show that I am using 2 tables in the query to BigQuery.
d
sure but the
spark.SparkDataSet
wasn't build around that pattern
you can extend it to work that way
but it doesn't out of the box
a
So similar to
kedro.datasets.pandas.GBQQueryDataSet
is exactly what I want
Copy code
vehicles:
  type: pandas.GBQQueryDataSet
  sql: "select shuttle, shuttle_id from spaceflights.shuttles;"
  project: my-project
  credentials: gbq-creds
  load_args:
    reauth: True
I think then creating
spark.GBQQueryDataSet
is my best bet?
d
Or you extend / override the existing spark.SparkDataSet to support this, if it works we'd love a PR back into Kedro
a
Sure, yes seems like a good idea! Although I must note that I am on
kedro==0.18.14
for this 🙂 Would be similar to implement for
kedro_datasets
package post
kedro>=0.19
though
d
because they're now decoupled they should largely be backwards compatible
1
get it working locally first and then I can help you get your contribution into
kedro-datasets
a
Very soon! Thanks @datajoely
Hi @datajoely + Kedro Team K Opened a PR on
kedro-plugins
to implement a new dataset
spark.GBQQueryDataset
feat(datasets): Implement `spark.GBQQueryDataset` for reading data from BigQuery as a spark dataframe using SQL query #971 Currently draft, but would be great if I can have some initial comments 🙂
❤️ 1