# questions
i
Is that a reasonable setup for a low-resource team deploying its first model? What are the benefits and drawbacks? My experience is with AWS SageMaker and Jenkins.
d
That seems reasonable! Something more akin to SageMaker would be to deploy via Kubeflow Pipelines/Vertex AI, but what you've described is a more minimal way to get started. If you need to have pipelines running on multiple containers with separate resources, you can always migrate as necessary in the future.
i
Yeah, I see that as overkill for now
@Deepyaman Datta what about I/O with BigQuery? I saw an article integrating BigQuery ML with Kedro.
Do you think it's a good option to avoid I/O costs?
d
I'm not previously familiar with that article. If your data resides in BigQuery, using BigQuery ML could make sense (assuming you don't need to do anything complicated; my understanding is that BigQuery ML isn't going to be as flexible as your standard data scientist toolkit, with its many libraries). The demo pipeline linked above has some weird quirks and/or inefficiencies. For one, the data engineering is done in pandas and pushed up to BigQuery with `pandas.GBQTableDataset` (likely just simplified for demo purposes); if your data resides in BigQuery to begin with, you probably want to use something like the newly-released `ibis.TableDataset` (available since Kedro-Datasets 3.0.0) or BigFrames (pandas API on BigQuery) to avoid I/O costs and the inefficiency of pulling the data locally. Then, you could use BigQuery ML for your modeling. Also, there's something weird going on with the `input_ml` dataset (I think it's not used in any way, and could cause some error if you tried to use it, since it's nesting `MemoryDataset`).
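For illustration, here's a rough sketch of what a node could look like if the catalog hands it an Ibis table via `ibis.TableDataset`; the table and column names (`orders`, `status`, `customer_id`, `amount`) are made up for the example:
```python
import ibis


def create_master_table(orders: ibis.Table) -> ibis.Table:
    # This only builds a lazy expression; Ibis compiles it to BigQuery SQL,
    # so the computation runs in BigQuery and nothing is pulled locally.
    active = orders.filter(orders.status == "active")
    return active.group_by("customer_id").aggregate(
        total_spend=active.amount.sum(),
        n_orders=active.count(),
    )
```
The returned expression can then be saved back to BigQuery by the dataset, so the whole node stays inside the warehouse.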
i
Amazing. Thanks for the info
m
Two remarks: 1) if your data sits in BQ, I would do as much as possible in BQ, avoiding additional compute costs. 2) running a pipeline as a k8s cron job requires a GKE cluster. Unless you get one for “free” (i.e. maintained by another team), I wouldn’t call that a simple/lightweight deployment solution. The cost of maintaining a GKE cluster is not small!
👍 2
i
Oh yeah. It’s for free
They use it for other jobs and I’ll just add mine with a proper PR 👌
@Matthias Roels let's say I create my master data using BQ. I want to do the train/test split and the training with sklearn or statsmodels. At what point do you switch to Python/Polars or Ibis, as Deepyaman mentioned? What's a smooth way of moving from BQ to the disk or RAM of my node?
As I understand it, I can only scale vertically, which is fine by me.
d
@Matthias Roels let's say I create my master data using BQ. I want to do the train/test split and the training with sklearn or statsmodels. At what point do you switch to Python/Polars or Ibis, as Deepyaman mentioned? What's a smooth way of moving from BQ to the disk or RAM of my node?
Not Matthias, but... 🙂 As @Matthias Roels mentioned, I would switch as late as possible. I would go at least as far as creating the master table in BigQuery before you move to something like pandas or Polars.

You can use Ibis (or BigFrames) from the very beginning. Both of these options will essentially generate BigQuery SQL and submit it lazily, so you can avoid additional compute costs (it should be more-or-less the same as using BigQuery directly). For what it's worth, the Ibis backend for BigQuery and BigFrames are both maintained by Google.* If you want to learn more about using Ibis in particular with Kedro, see https://kedro.org/blog/building-scalable-data-pipelines-with-kedro-and-ibis.

In many cases, by the time you create your master table, the data is sufficiently small that you can just use something like scikit-learn. However, if your use case is large or you would otherwise like to do as much compute as possible on BigQuery, you could use BigQuery ML. You can also consider using IbisML** to do sklearn/tidymodels-style preprocessing directly on BigQuery (more flexible for manual preprocessing than BigQuery ML).

*The Ibis backend for BigQuery was developed by the BigQuery DataFrames team lead at Google. He is also an Ibis maintainer, but the backend is collectively maintained by the Ibis team. BigFrames uses Ibis under the hood, but provides a pandas API. This is fully maintained by the Google team.

**IbisML is an early-stage project. If you're interested in learning more, or seeing a demo of how it works with BigQuery, I can help! Also, full disclosure: I am paid to work on Ibis. While I genuinely believe it's a very good fit given what you're trying to do (push as much computation as possible to BigQuery), and am happy to help provide guidance, you should evaluate the options yourself. 🙂
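To show where the switch could happen, here's a rough sketch, assuming the master table already lives in BigQuery and has a `target` column (the column name is made up):
```python
import ibis
from sklearn.model_selection import train_test_split


def split_master_table(master: ibis.Table):
    # Everything up to this point has been lazy BigQuery SQL; this is the
    # single place where data is pulled out of BigQuery into local memory.
    df = master.to_pandas()

    X = df.drop(columns=["target"])
    y = df["target"]
    return train_test_split(X, y, test_size=0.2, random_state=42)
```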
i
```python
import bigframes.pandas as bpd

# Set BigQuery DataFrames options
bpd.options.bigquery.project = your_gcp_project_id
bpd.options.bigquery.location = "us"

# Create a DataFrame from a BigQuery table
query_or_table = "bigquery-public-data.ml_datasets.penguins"
df = bpd.read_gbq(query_or_table)
```
Compared to the Ibis approach:
```python
execute(self, expr, params=None, limit='default', **kwargs)
```
At some point in the project I'll need to decide between one of those, right?
Both are engineered with BigQuery in mind, so I assume they're very similar.
Probably one calls the other under the hood.
d
You can also use `.to_pandas()` to convert a BigFrames DataFrame to a pandas DataFrame: https://cloud.google.com/python/docs/reference/bigframes/latest/bigframes.dataframe.DataFrame#bigframes_dataframe_DataFrame_to_pandas (Similarly, `ibis.Table.to_pandas()` is an alias for `execute()`.)
Ah, maybe I misunderstood your question. `.execute()` on Ibis is to convert an Ibis table back to pandas. If you want to read a BigQuery table, it would be something like:
```python
import ibis

conn = ibis.bigquery.connect(
    project_id=YOUR_PROJECT_ID,
    dataset_id='bigquery-public-data.stackoverflow',
)
table = conn.table('posts_questions')
```
(adapted from https://github.com/GoogleCloudPlatform/community/blob/master/archived/bigquery-ibis/index.md; note that there are outdated references to `ibis_bigquery`, as it is an archived repo from when `ibis-bigquery` was a separate project)
i
Makes sense.
Yeah, in general I meant that at some point I will use the APIs of one of those projects to make the query.
I see them as mutually exclusive, to be honest. If I use BigFrames I won't need Ibis.
As long as I only query one database.
d
I see them as mutually exclusive, to be honest. If I use BigFrames I won't need Ibis.
Sure. Also, if you use BigFrames, you will probably need to implement some custom Kedro datasets, since I don't think BigFrames datasets exist anywhere yet. However, this is quite easy to do, and we would happily welcome contributions if you do decide to go down this route!
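For a sense of scale, here's a rough sketch of what such a dataset could look like; the class name, constructor arguments, and the `read_gbq`/`to_gbq` round trip are my own assumptions for illustration, not an existing Kedro dataset:
```python
from typing import Any

import bigframes.pandas as bpd
from kedro.io import AbstractDataset


class BigFramesTableDataset(AbstractDataset):
    """Hypothetical dataset that reads/writes a BigQuery table as a BigFrames DataFrame."""

    def __init__(self, table_id: str, project: str | None = None):
        self._table_id = table_id
        self._project = project

    def _load(self) -> bpd.DataFrame:
        if self._project:
            bpd.options.bigquery.project = self._project
        # Returns a lazy BigFrames DataFrame; computation stays in BigQuery
        return bpd.read_gbq(self._table_id)

    def _save(self, data: bpd.DataFrame) -> None:
        data.to_gbq(self._table_id, if_exists="replace")

    def _describe(self) -> dict[str, Any]:
        return {"table_id": self._table_id, "project": self._project}
```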
👌 1
i
Any resource that can help me on that path?
Looks like my best option
i
Fun fact: BigFrames uses Ibis:
```python
# -----------------------------------------
# Create Ibis table expression and validate
# -----------------------------------------
# Use a time travel to make sure the DataFrame is deterministic, even
# if the underlying table changes.
table_expression = bf_read_gbq_table.get_ibis_time_travel_table(
    self.ibis_client,
    table_ref,
    time_travel_timestamp,
)
```
😁 1
d
Yep! BigFrames is basically a pandas API skin on top of the Ibis backend for BigQuery. The same person (Tim Swast/Sweña, BigQuery DataFrames lead) drove the development of both.
i
Nice. I also have Polars code in my pipeline.
I'm thinking of just using this and calling it a day: https://docs.pola.rs/user-guide/io/bigquery/#read
d
If you would like to do as much as possible in BigQuery, plus use a Pythonic API, I think Ibis (or BigFrames) is going to give you that more than the above solution. What you're showing in that last link is really no different from using `pandas.GBQQueryDataset`, just with Polars instead of pandas.
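One way to keep your Polars code while still pushing the heavy lifting to BigQuery (a sketch, assuming a recent Ibis version that provides `Table.to_polars()`; the dataset, table, and column names are made up):
```python
import ibis

con = ibis.bigquery.connect(project_id=YOUR_PROJECT_ID, dataset_id="my_dataset")
table = con.table("master_table")

# The filter and aggregation compile to BigQuery SQL and run in the warehouse;
# only the final result is materialized locally as a Polars DataFrame.
positive = table.filter(table.amount > 0)
result = (
    positive.group_by("customer_id")
    .aggregate(total=positive.amount.sum())
    .to_polars()
)
```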