# questions
a
Hi Kedro + GCP Experts! :kedro:

Context:
1. I have written BigQuery SQL queries to manipulate datasets and create 10 new datasets on BigQuery.
2. Some queries depend on each other and hence need to run in a specific order; others need to run in parallel (basically a DAG).
3. I now want to export these datasets as parquet files into a GCS bucket, while enforcing the schema on write and setting the partitioning columns and strategy.
4. After writing the datasets, I will be using them in a Kedro project on GCP Dataproc, using PySpark for further manipulations.

Questions:
1. How do I ensure I can run queries in order on BigQuery to create BigQuery tables, i.e. basically create a DAG of BigQuery SQL queries that run in order? (Or is it possible to use some other GCP service to build the DAG that runs the BigQuery SQL queries in order, e.g. Cloud Composer?)
2. Alternatively to 1, do you suggest I do it in BigQuery, or do I create a pipeline in the Kedro project such as `fetch_datasets_from_bq` and define transcoded catalog entries `my_dataset_1@bigquery` (for loading from BQ) and `my_dataset_1@spark` (for writing as a partitioned Spark dataset)? (A sketch of such transcoded entries follows below.)
3. The problem with 2 is that the BigQuery dataset in Kedro loads as a pandas dataset, not Spark, i.e. `kedro.extras.datasets.pandas.GBQQueryDataSet`, so the entire dataset would have to fit into driver memory, which won't be possible (and isn't efficient anyway). Looks like there is no `kedro.extras.datasets.spark.GBQQueryDataSet`.
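For reference, a minimal sketch of what those two transcoded entries could map to if defined programmatically, equivalent to what the `catalog.yml` entries would declare. The dataset names, SQL, bucket path and partition column are placeholders; the `@bigquery` side is indeed the pandas-backed `GBQQueryDataSet` mentioned above.

```python
# Hypothetical transcoded entries: the same logical dataset exposed as a
# pandas GBQ query for loading and as partitioned parquet on GCS for saving.
from kedro.io import DataCatalog
from kedro.extras.datasets.pandas import GBQQueryDataSet
from kedro.extras.datasets.spark import SparkDataSet

catalog = DataCatalog(
    {
        # Loads the query result as a pandas DataFrame (i.e. into driver memory).
        "my_dataset_1@bigquery": GBQQueryDataSet(
            sql="SELECT * FROM `my_project.primary.my_dataset_1`",  # placeholder query
            project="my_project",
        ),
        # Reads/writes the same data as partitioned parquet on GCS via Spark.
        "my_dataset_1@spark": SparkDataSet(
            filepath="gs://my-bucket/primary/my_dataset_1",
            file_format="parquet",
            save_args={"mode": "overwrite", "partitionBy": ["ingestion_date"]},
        ),
    }
)
```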
The flow I am thinking of is:
1. Author the queries in BigQuery SQL.
2. Define the DAG in Cloud Composer (sketched below).
3. Run the DAG to execute the BQ queries in order and then write the results to GCS as parquet.
4. Use these datasets in the Kedro project (with Dataproc).

Additional context:
1. The BigQuery tables are populated using Datastream, essentially to replicate updates directly from source systems in real time.
2. Our job picks up after that: manipulate the BigQuery tables to create new tables in the primary layer and export them to GCS.
3. We want to set up a DAG for these BigQuery manipulations and write the results as parquet to GCS.

Bonus help:
1. Since Datastream essentially updates the BQ tables using CDC, is there a way to do incremental + scheduled BigQuery SQL queries, so that we only manipulate the new data points and update the parquet on GCS? (Or is that additional, unnecessary complexity without much savings in time + money?)
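On steps 2 and 3 of that flow, a minimal sketch of what a Cloud Composer (Airflow) DAG could look like, assuming the `apache-airflow-providers-google` package is installed. The project, dataset, table and bucket names and the SQL are placeholders, only two of the ten queries are shown, and `bq_query_task` is just a local helper defined here.

```python
from airflow import DAG
from airflow.providers.google.cloud.operators.bigquery import BigQueryInsertJobOperator
from airflow.providers.google.cloud.transfers.bigquery_to_gcs import BigQueryToGCSOperator
import pendulum


def bq_query_task(task_id: str, sql: str) -> BigQueryInsertJobOperator:
    """Run one standard-SQL statement as a BigQuery query job."""
    return BigQueryInsertJobOperator(
        task_id=task_id,
        configuration={"query": {"query": sql, "useLegacySql": False}},
    )


with DAG(
    dag_id="bq_primary_layer",
    start_date=pendulum.datetime(2024, 1, 1, tz="UTC"),
    schedule_interval=None,
    catchup=False,
) as dag:
    # Each query materialises one table in the primary layer (placeholder SQL).
    build_1 = bq_query_task(
        "build_my_dataset_1",
        "CREATE OR REPLACE TABLE `my_project.primary.my_dataset_1` AS SELECT ...",
    )
    build_2 = bq_query_task(
        "build_my_dataset_2",
        "CREATE OR REPLACE TABLE `my_project.primary.my_dataset_2` AS SELECT ...",
    )

    # Export a finished table to GCS as parquet via a BigQuery extract job;
    # the other tables would get an equivalent export task each.
    export_1 = BigQueryToGCSOperator(
        task_id="export_my_dataset_1",
        source_project_dataset_table="my_project.primary.my_dataset_1",
        destination_cloud_storage_uris=["gs://my-bucket/primary/my_dataset_1/*.parquet"],
        export_format="PARQUET",
    )

    # my_dataset_2 depends on my_dataset_1; exporting my_dataset_1 can run
    # in parallel with building my_dataset_2.
    build_1 >> [build_2, export_1]
```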
d
I would use Ibis here
that way you can also use the same code for Spark and BQ
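For context, the appeal of that suggestion is roughly the following (a sketch assuming `ibis-framework` with the BigQuery and PySpark backends installed; project, dataset, table and column names are placeholders):

```python
import ibis
from pyspark.sql import SparkSession


def total_by_country(table):
    # The same Ibis expression, written once (placeholder column names).
    return table.group_by("country").aggregate(total_amount=table.amount.sum())


# BigQuery backend: the expression is compiled to BigQuery SQL and runs there.
bq_con = ibis.bigquery.connect(project_id="my_project", dataset_id="primary")
bq_result = total_by_country(bq_con.table("my_dataset_1")).execute()

# PySpark backend: the identical expression runs as a Spark job instead.
spark = SparkSession.builder.getOrCreate()
spark_con = ibis.pyspark.connect(spark)
spark_result = total_by_country(spark_con.table("my_dataset_1")).execute()
```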
a
Thanks @datajoely. There already is an ongoing effort to convert all the PySpark code to Ibis (but it's longer term), so I'm looking for an immediate alternative right now 🙂
d
if you’re already using Spark you can use the `spark.SparkJDBCDataset` to talk to BQ
that would be my preferred way as you’d stay in DataFrame world and be able to benefit from predicate pushdown
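To illustrate the predicate-pushdown point, a plain PySpark sketch; the JDBC URL and driver class are assumptions based on the Simba BigQuery JDBC driver and should be checked against its documentation.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Assumed Simba-style BigQuery JDBC URL -- verify the exact form in the driver docs.
jdbc_url = (
    "jdbc:bigquery://https://www.googleapis.com/bigquery/v2:443;"
    "ProjectId=my_project;OAuthType=3"
)

df = (
    spark.read.format("jdbc")
    .option("url", jdbc_url)
    .option("dbtable", "primary.my_dataset_1")
    .option("driver", "com.simba.googlebigquery.jdbc42.Driver")  # assumed driver class
    .load()
)

# A filter like this can be pushed down to the source by the JDBC reader,
# so only matching rows are transferred instead of the whole table.
recent = df.filter(df.ingestion_date >= "2024-01-01")
recent.write.mode("overwrite").partitionBy("ingestion_date").parquet(
    "gs://my-bucket/primary/my_dataset_1"
)
```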
a
Ah I see, thanks, that is interesting! Is any specific setup required for this, or is there a useful link, to make that work with BigQuery in a Kedro codebase? Essentially, I want to run BigQuery SQL queries in order, not just read BigQuery tables.
d
Just follow the JDBC instructions and it should work
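As a starting point, a minimal sketch of that setup with the dataset class (spelled `SparkJDBCDataSet` under `kedro.extras.datasets.spark`). The JDBC URL, driver class and table name are assumptions, and the Simba BigQuery JDBC driver would need to be on the Spark classpath, so treat this as a shape to adapt rather than a working config.

```python
from kedro.extras.datasets.spark import SparkJDBCDataSet

# Assumed Simba-style BigQuery JDBC URL -- verify against the driver docs.
my_dataset_1 = SparkJDBCDataSet(
    url=(
        "jdbc:bigquery://https://www.googleapis.com/bigquery/v2:443;"
        "ProjectId=my_project;OAuthType=3"
    ),
    table="primary.my_dataset_1",
    load_args={"properties": {"driver": "com.simba.googlebigquery.jdbc42.Driver"}},
)

df = my_dataset_1.load()  # returns a Spark DataFrame, not pandas
```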