Abhishek Bhatia
12/17/2024, 2:02 PM
I am trying to use spark.SparkDataSet with an arbitrary query as follows:
trx_agg_data:
  type: spark.SparkDataSet
  file_format: bigquery
  load_args:
    viewsEnabled: true
    query: |
      SELECT ph.category, MAX(trx.sales)
      FROM {project}.{dataset}.trx_data trx
      LEFT JOIN {project}.{dataset}.prod_hierarchy ph
  filepath: gs://my-bucket/trx_agg_data.parquet
The dataset complains that the filepath is not in the correct format (BigQuery expected <project>.<dataset>.<table>), but I am trying to read it with a query.
The following code works:
spark.read.format("bigquery").option(
    "query",
    """
    SELECT ph.category, MAX(trx.sales)
    FROM {project}.{dataset}.trx_data trx
    LEFT JOIN {project}.{dataset}.prod_hierarchy ph
    """,
).load()
Looks like spark.SparkDataSet does not have this functionality. Should I create a custom dataset here?
Hall
12/17/2024, 2:02 PM
datajoely
12/17/2024, 2:07 PM
Abhishek Bhatia
12/17/2024, 2:08 PM
table init parameter, but my SQL query can contain an arbitrary number of tables
datajoely
12/17/2024, 2:09 PM
datajoely
12/17/2024, 2:09 PM
datajoely
12/17/2024, 2:09 PM
Abhishek Bhatia
12/17/2024, 2:14 PM
datajoely
12/17/2024, 2:14 PM
spark.SparkDataSet wasn't built around that pattern
datajoely
12/17/2024, 2:15 PM
datajoely
12/17/2024, 2:15 PM
Abhishek Bhatia
12/17/2024, 2:17 PM
kedro.datasets.pandas.GBQQueryDataSet is exactly what I want
vehicles:
  type: pandas.GBQQueryDataSet
  sql: "select shuttle, shuttle_id from spaceflights.shuttles;"
  project: my-project
  credentials: gbq-creds
  load_args:
    reauth: True
I think then creating spark.GBQQueryDataSet is my best bet?
datajoely
12/17/2024, 2:17 PM
Abhishek Bhatia
12/17/2024, 2:20 PM
kedro==0.18.14 for this 🙂 Would be similar to implement for the kedro_datasets package post kedro>=0.19 though
datajoely
12/17/2024, 2:21 PM
datajoely
12/17/2024, 2:21 PM
kedro-datasets
Abhishek Bhatia
12/17/2024, 2:24 PM
Abhishek Bhatia
12/20/2024, 11:41 AM
kedro-plugins to implement a new dataset spark.GBQQueryDataset
feat(datasets): Implement `spark.GBQQueryDataset` for reading data from BigQuery as a spark dataframe using SQL query #971
Currently draft, but would be great if I can have some initial comments 🙂
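For anyone following the thread, here is a minimal sketch of what a query-based Spark dataset for BigQuery could look like. It is not the code from PR #971. The class name `SparkGBQQueryDataset`, the `materialization_dataset` parameter, and the injectable `spark` argument are assumptions made for illustration. The sketch relies on the spark-bigquery-connector read options `query`, `viewsEnabled`, and `materializationDataset`. In a real Kedro project the class would subclass `AbstractDataset` (kedro>=0.19) or `AbstractDataSet` (kedro 0.18.x) rather than being a plain class.

```python
from typing import Any, Dict, Optional


class SparkGBQQueryDataset:
    """Hypothetical read-only dataset: runs a SQL query on BigQuery and
    returns the result as a Spark DataFrame. Illustration only."""

    def __init__(
        self,
        sql: str,
        materialization_dataset: str,
        load_args: Optional[Dict[str, Any]] = None,
        spark: Any = None,
    ) -> None:
        self._sql = sql
        # Assumed parameter: BigQuery dataset where the connector may
        # materialize the query result before Spark reads it.
        self._materialization_dataset = materialization_dataset
        self._load_args = dict(load_args or {})
        self._spark = spark  # injected session; created lazily when None

    def _get_spark(self) -> Any:
        if self._spark is None:
            # Imported lazily so this module loads without pyspark installed.
            from pyspark.sql import SparkSession

            self._spark = SparkSession.builder.getOrCreate()
        return self._spark

    def _reader_options(self) -> Dict[str, Any]:
        options: Dict[str, Any] = {
            "viewsEnabled": "true",
            "materializationDataset": self._materialization_dataset,
        }
        options.update(self._load_args)  # user-supplied options override defaults
        options["query"] = self._sql  # the query always comes from `sql`
        return options

    def _load(self) -> Any:
        reader = self._get_spark().read.format("bigquery")
        for key, value in self._reader_options().items():
            reader = reader.option(key, value)
        # No path or table is passed to load(): the query option drives the read.
        return reader.load()

    def _save(self, data: Any) -> None:
        raise NotImplementedError("SparkGBQQueryDataset is read-only")

    def _describe(self) -> Dict[str, Any]:
        return {
            "sql": self._sql,
            "materialization_dataset": self._materialization_dataset,
            "load_args": self._load_args,
        }
```

Because the query can reference any number of tables, the sketch takes no `filepath` or `table` argument at all, which sidesteps the `<project>.<dataset>.<table>` format check discussed above. Passing the session in through `spark` is only there to make the class easy to exercise without a cluster.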