Abhishek Bhatia
07/05/2024, 10:06 AM
fetch_datasets_from_bq and define the transcoded catalog entries my_dataset_1@bigquery (for loading from BQ) and my_dataset_1@spark (for writing as a partitioned Spark dataset) — see the catalog sketch below
3. The problem with 2. is that the BigQuery dataset in Kedro loads as a pandas dataset, i.e. kedro.extras.datasets.pandas.GBQQueryDataSet, not a Spark one, so the entire table would have to fit into driver memory, which is neither feasible nor efficient for large data. It looks like there is no kedro.extras.datasets.spark.GBQQueryDataSet
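(For illustration, a minimal sketch of the transcoded entries described in 2., written with the Python DataCatalog API rather than catalog.yml; the project, query, GCS path and partition column are placeholders.)

```python
from kedro.io import DataCatalog
from kedro.extras.datasets.pandas import GBQQueryDataSet
from kedro.extras.datasets.spark import SparkDataSet

catalog = DataCatalog(
    {
        # Loads the BQ query result as a pandas DataFrame (entirely in driver memory).
        "my_dataset_1@bigquery": GBQQueryDataSet(
            sql="SELECT * FROM `my-project.primary.my_dataset_1`",  # placeholder query
            project="my-project",  # placeholder GCP project
        ),
        # The same logical dataset, written/read as partitioned parquet via Spark.
        "my_dataset_1@spark": SparkDataSet(
            filepath="gs://my-bucket/primary/my_dataset_1",  # placeholder GCS path
            file_format="parquet",
            save_args={"mode": "overwrite", "partitionBy": ["ingestion_date"]},
        ),
    }
)
```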
The flow I am thinking is:
1. Author queries in BigQuery SQL
2. Define the DAG in Cloud Composer
3. Run the DAG to execute the BQ queries in order and then write the results to GCS as parquet (see the DAG sketch below this list)
4. Now use these in the Kedro project (with Dataproc)
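(For illustration, a rough sketch of what steps 2–3 could look like as a Cloud Composer / Airflow DAG using the Google provider operators; the project, datasets, table, bucket and SQL are placeholders, and a real DAG would chain one build/export pair per query.)

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.google.cloud.operators.bigquery import BigQueryInsertJobOperator
from airflow.providers.google.cloud.transfers.bigquery_to_gcs import BigQueryToGCSOperator

PROJECT = "my-project"  # placeholder GCP project

with DAG(
    dag_id="bq_primary_layer_to_gcs",
    start_date=datetime(2024, 7, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    # Run the authored BigQuery SQL and materialise it as a primary-layer table.
    build_my_dataset_1 = BigQueryInsertJobOperator(
        task_id="build_my_dataset_1",
        configuration={
            "query": {
                "query": "SELECT * FROM `my-project.raw.some_table`",  # placeholder SQL
                "useLegacySql": False,
                "destinationTable": {
                    "projectId": PROJECT,
                    "datasetId": "primary",
                    "tableId": "my_dataset_1",
                },
                "writeDisposition": "WRITE_TRUNCATE",
            }
        },
    )

    # Export the materialised table to GCS as parquet for the Kedro/Dataproc project.
    export_my_dataset_1 = BigQueryToGCSOperator(
        task_id="export_my_dataset_1",
        source_project_dataset_table=f"{PROJECT}.primary.my_dataset_1",
        destination_cloud_storage_uris=["gs://my-bucket/primary/my_dataset_1/*.parquet"],
        export_format="PARQUET",
    )

    build_my_dataset_1 >> export_my_dataset_1
```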
Additional Context:
1. The BigQuery tables are populated using Datastream, essentially replicating updates directly from source systems in real time.
2. Our job picks up after that: manipulating the BigQuery tables to create new tables in the Primary Layer and exporting them to GCS.
3. We want to set up a DAG for these BigQuery manipulations and write the results to GCS as Parquet.
Bonus Help:
1. Since Datastream essentially updates the BQ tables using CDC, is there a way we can do incremental + scheduled BigQuery SQL queries, so that we only manipulate the new data points and update the parquet on GCS? (Or is that additional, unnecessary complexity without much savings in time + money?)
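(For illustration, one way the bonus item could be approached: keep the scheduled Composer DAG, but make the query itself incremental, e.g. a MERGE restricted to the run's data interval on the CDC timestamp; the export step could then overwrite only the affected data (not shown). All table and column names below are placeholders, and whether the extra complexity pays off depends on table sizes and how often the export runs.)

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.google.cloud.operators.bigquery import BigQueryInsertJobOperator

# Placeholder table and column names; `updated_at` is assumed to be the CDC timestamp column.
INCREMENTAL_SQL = """
MERGE `my-project.primary.my_dataset_1` AS t
USING (
    SELECT *
    FROM `my-project.raw.some_table`
    WHERE updated_at >= TIMESTAMP('{{ data_interval_start }}')
      AND updated_at <  TIMESTAMP('{{ data_interval_end }}')
) AS s
ON t.id = s.id
WHEN MATCHED THEN UPDATE SET value = s.value
WHEN NOT MATCHED THEN INSERT ROW
"""

with DAG(
    dag_id="bq_primary_layer_incremental",
    start_date=datetime(2024, 7, 1),
    schedule_interval="@hourly",
    catchup=False,
) as dag:
    # Only touch rows whose CDC timestamp falls inside this run's data interval.
    merge_new_rows = BigQueryInsertJobOperator(
        task_id="merge_new_rows",
        configuration={"query": {"query": INCREMENTAL_SQL, "useLegacySql": False}},
    )
```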
datajoely
07/05/2024, 11:28 AM
spark.SparkJDBCDataset to talk to BQ
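(For illustration, a minimal sketch of that suggestion with kedro.extras.datasets.spark.SparkJDBCDataSet; the JDBC URL and driver class below are placeholders whose exact values depend on the BigQuery JDBC driver, e.g. Simba's, available on the Dataproc/Spark classpath.)

```python
from kedro.extras.datasets.spark import SparkJDBCDataSet

# Placeholder JDBC URL and driver class: the exact values depend on the
# BigQuery JDBC driver installed on the Spark cluster.
my_dataset_1_jdbc = SparkJDBCDataSet(
    url="jdbc:bigquery://https://www.googleapis.com/bigquery/v2:443;ProjectId=my-project;OAuthType=3",
    table="primary.my_dataset_1",  # placeholder dataset.table
    load_args={"properties": {"driver": "com.simba.googlebigquery.jdbc42.Driver"}},
)

# Returns a pyspark.sql.DataFrame on the executors rather than a pandas
# DataFrame on the driver, which avoids the driver-memory concern above.
spark_df = my_dataset_1_jdbc.load()
```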