Italo Sayan
04/30/2024, 6:55 PM

Deepyaman Datta
04/30/2024, 8:09 PM

Italo Sayan
04/30/2024, 11:46 PM

Italo Sayan
04/30/2024, 11:47 PM

Italo Sayan
04/30/2024, 11:49 PM

Italo Sayan
04/30/2024, 11:50 PM

Deepyaman Datta
05/01/2024, 12:30 AM
pandas.GBQDataset (likely just simplified for demo purposes); if your data resides in BigQuery to begin with, you probably want to use something like the newly released ibis.TableDataset (available since Kedro-Datasets 3.0.0) or BigFrames (the pandas API on BigQuery) to avoid I/O costs and the inefficiency of pulling the data locally. Then, you could use BigQuery ML for your modeling. Also, there's something weird going on with the input_ml dataset (I think it's not used in any way, and could cause an error if you tried to use it, since it's nesting a MemoryDataset).
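
For context, a minimal sketch of what wiring a BigQuery table through kedro-datasets' ibis.TableDataset could look like in Python. The table, project, and dataset names here are placeholders, and the exact connection keys depend on your Kedro-Datasets version:

# Sketch: declare a Kedro dataset backed by Ibis' BigQuery backend, so the
# table stays in BigQuery and is only materialized when you ask for it.
from kedro_datasets.ibis import TableDataset

model_input = TableDataset(
    table_name="model_input",          # placeholder table name
    connection={
        "backend": "bigquery",         # remaining keys are passed to ibis.bigquery.connect
        "project_id": "my-gcp-project",  # placeholder
        "dataset_id": "my_dataset",      # placeholder
    },
)

table = model_input.load()  # a lazy ibis.Table; no data has been pulled locally yet
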
Italo Sayan
05/01/2024, 7:42 AM

Matthias Roels
05/01/2024, 8:01 AM

Italo Sayan
05/01/2024, 8:02 AM

Italo Sayan
05/01/2024, 8:03 AM

Italo Sayan
05/01/2024, 8:07 AM

Italo Sayan
05/01/2024, 8:07 AM

Deepyaman Datta
05/01/2024, 11:21 AM
> @Matthias Roels let's say I create my master data using BQ. I want to do the train/test split and the training with sklearn or statsmodels. At what point do you switch to Python/Polars or Ibis as Deepyaman mentioned? What's a smooth way of moving from BQ to the disk or RAM of my node?

Not Matthias, but... 🙂 As @Matthias Roels mentioned, I would switch as late as possible. I would go at least as far as creating the master table in BigQuery before you move to something like pandas or Polars. You can use Ibis (or BigFrames) from the very beginning. Both of these options essentially generate BigQuery SQL and submit it lazily, so you can avoid additional compute costs (it should be more or less the same as using BigQuery directly). For what it's worth, the Ibis backend for BigQuery and BigFrames are both maintained by Google.* If you want to learn more about using Ibis in particular with Kedro, see https://kedro.org/blog/building-scalable-data-pipelines-with-kedro-and-ibis.

In many cases, by the time you create your master table, the data is sufficiently small that you can just use something like scikit-learn. However, if your use case is large, or you would otherwise like to do as much compute as possible on BigQuery, you could use BigQuery ML. You can also consider using IbisML** to do sklearn/tidymodels-style preprocessing directly on BigQuery (more flexible for manual preprocessing than BigQuery ML).

*The Ibis backend for BigQuery was developed by the BigQuery DataFrames team lead at Google. He is also an Ibis maintainer, but the backend is collectively maintained by the Ibis team. BigFrames uses Ibis under the hood, but provides a pandas API. This is fully maintained by the Google team.

**IbisML is an early-stage project. If you're interested in learning more, or seeing a demo of how it works with BigQuery, I can help!

Also, full disclosure: I am paid to work on Ibis. While I genuinely believe it's a very good fit given what you're trying to do (push as much computation as possible to BigQuery), and am happy to help provide guidance, you should evaluate the options yourself. 🙂
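
To make the "generate BigQuery SQL and submit it lazily" point concrete, here is a rough sketch of how that pattern can look with Ibis; the project ID is a placeholder, the public penguins table is only used as an example, and exact method availability may vary by Ibis version:

import ibis

# Placeholder billing project; the dataset below is a BigQuery public dataset.
con = ibis.bigquery.connect(
    project_id="my-gcp-project",
    dataset_id="bigquery-public-data.ml_datasets",
)

penguins = con.table("penguins")

# Build the expression lazily; nothing runs on BigQuery at this point.
expr = (
    penguins.filter(penguins.body_mass_g.notnull())
    .group_by("species")
    .aggregate(avg_mass=penguins.body_mass_g.mean())
)

print(expr.compile())   # inspect the BigQuery SQL that Ibis generates
df = expr.to_pandas()   # only now does the query run and the result come back locally
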
Italo Sayan
05/01/2024, 11:30 AM
import bigframes.pandas as bpd
# Set BigQuery DataFrames options
bpd.options.bigquery.project = your_gcp_project_id
bpd.options.bigquery.location = "us"
# Create a DataFrame from a BigQuery table
query_or_table = "bigquery-public-data.ml_datasets.penguins"
df = bpd.read_gbq(query_or_table)
Compared to the ibis approach:
execute(self, expr, params=None, limit='default', **kwargs)
Italo Sayan
05/01/2024, 11:32 AM

Italo Sayan
05/01/2024, 11:34 AM

Italo Sayan
05/01/2024, 11:34 AM

Deepyaman Datta
05/01/2024, 11:38 AM
.to_pandas() converts a BigFrames dataframe to a pandas dataframe locally: https://cloud.google.com/python/docs/reference/bigframes/latest/bigframes.dataframe.DataFrame#bigframes_dataframe_DataFrame_to_pandas
(Similarly, ibis.Table.to_pandas() is an alias for execute().)
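
In other words, the two materialization paths look roughly like this, assuming df is a BigFrames DataFrame (as in the read_gbq snippet above) and table is an Ibis table (as in the connect example below):

# BigFrames: pull the result of the BigQuery computation into local pandas
local_df = df.to_pandas()

# Ibis: same idea; to_pandas() is an alias for execute()
local_df = table.to_pandas()   # equivalent to table.execute()
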
Deepyaman Datta
05/01/2024, 11:41 AM
.execute() on Ibis converts an Ibis table back to pandas. If you want to read a BigQuery table, it would be something like:

import ibis

conn = ibis.bigquery.connect(
    project_id=YOUR_PROJECT_ID,
    dataset_id='bigquery-public-data.stackoverflow')
table = conn.table('posts_questions')

(Adapted from https://github.com/GoogleCloudPlatform/community/blob/master/archived/bigquery-ibis/index.md; note that it contains outdated references to ibis_bigquery, as it is an archived repo from when ibis-bigquery was a separate project.)
Italo Sayan
05/01/2024, 11:41 AM

Italo Sayan
05/01/2024, 11:42 AM

Italo Sayan
05/01/2024, 11:43 AM

Italo Sayan
05/01/2024, 11:45 AM

Deepyaman Datta
05/01/2024, 12:11 PM
> I see them as exclusive, to be honest. If I use BigFrames I won't need Ibis.

Sure. Also, if you use BigFrames, you will probably need to implement some custom Kedro datasets, since I don't think BigFrames datasets exist anywhere yet. However, this is quite easy to do, and we would happily welcome contributions if you do decide to go down this route!
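
As a rough idea of what such a custom dataset might look like, here is a sketch against Kedro's standard AbstractDataset interface. The class name, table path, and the read_gbq/to_gbq usage are assumptions for illustration, not an existing kedro-datasets implementation:

from typing import Any

import bigframes.pandas as bpd
from kedro.io import AbstractDataset


class BigFramesTableDataset(AbstractDataset):
    """Hypothetical Kedro dataset that reads/writes a BigQuery table via BigFrames."""

    def __init__(self, table: str):
        # e.g. "my-project.my_dataset.my_table" (placeholder)
        self._table = table

    def _load(self) -> bpd.DataFrame:
        # Returns a lazy BigFrames DataFrame backed by BigQuery
        return bpd.read_gbq(self._table)

    def _save(self, data: bpd.DataFrame) -> None:
        # Writes the DataFrame back to the configured BigQuery table
        data.to_gbq(self._table, if_exists="replace")

    def _describe(self) -> dict[str, Any]:
        return {"table": self._table}
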
Italo Sayan
05/01/2024, 1:07 PM

Italo Sayan
05/01/2024, 1:07 PM

Deepyaman Datta
05/01/2024, 1:12 PM

Italo Sayan
05/03/2024, 9:24 AM

Deepyaman Datta
05/03/2024, 9:55 PM

Italo Sayan
05/04/2024, 1:20 PM

Italo Sayan
05/04/2024, 1:20 PM

Deepyaman Datta
05/04/2024, 5:47 PM
pandas.GBQQueryDataset
Deepyaman Datta
05/04/2024, 5:48 PM