Italo Sayan
04/30/2024, 6:55 PM

Deepyaman Datta
04/30/2024, 8:09 PM

Italo Sayan
04/30/2024, 11:46 PM

Italo Sayan
04/30/2024, 11:47 PM

Italo Sayan
04/30/2024, 11:49 PM

Italo Sayan
04/30/2024, 11:50 PM

Deepyaman Datta
05/01/2024, 12:30 AM
pandas.GBQDataset (likely just simplified for demo purposes); if your data resides in BigQuery to begin with, you probably want to use something like the newly released ibis.TableDataset (available since Kedro-Datasets 3.0.0) or BigFrames (the pandas API on BigQuery) to avoid I/O costs and the inefficiency of pulling the data locally. Then, you could use BigQuery ML for your modeling. Also, there's something weird going on with the input_ml dataset (I think it's not used in any way, and could cause an error if you tried to use it, since it's nesting a MemoryDataset).
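
For context, a minimal sketch of what wiring a BigQuery table through kedro-datasets' ibis.TableDataset could look like in Python. The table, project, and dataset names here are placeholders, and the exact connection keys depend on your Kedro-Datasets version:

# Sketch: declare a Kedro dataset backed by Ibis' BigQuery backend, so the
# table stays in BigQuery and is only materialized when you ask for it.
from kedro_datasets.ibis import TableDataset

model_input = TableDataset(
    table_name="model_input",          # placeholder table name
    connection={
        "backend": "bigquery",         # remaining keys are passed to ibis.bigquery.connect
        "project_id": "my-gcp-project",  # placeholder
        "dataset_id": "my_dataset",      # placeholder
    },
)

table = model_input.load()  # a lazy ibis.Table; no data has been pulled locally yet
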
Italo Sayan
05/01/2024, 7:42 AM

Matthias Roels
05/01/2024, 8:01 AM

Italo Sayan
05/01/2024, 8:02 AM

Italo Sayan
05/01/2024, 8:03 AM

Italo Sayan
05/01/2024, 8:07 AM

Italo Sayan
05/01/2024, 8:07 AM

Deepyaman Datta
05/01/2024, 11:21 AM
> @Matthias Roels let's say I create my master data using BQ. I want to do the train/test split and the training with sklearn or statsmodels. At what point do you switch to Python/Polars or Ibis as Deepyaman mentioned? What's a smooth way of moving from BQ to the disk or RAM of my node?

Not Matthias, but... 🙂 As @Matthias Roels mentioned, I would switch as late as possible. I would go at least as far as creating the master table in BigQuery before you move to something like pandas or Polars. You can use Ibis (or BigFrames) from the very beginning. Both of these options essentially generate BigQuery SQL and submit it lazily, so you can avoid additional compute costs (it should be more or less the same as using BigQuery directly). For what it's worth, the Ibis backend for BigQuery and BigFrames are both maintained by Google.* If you want to learn more about using Ibis in particular with Kedro, see https://kedro.org/blog/building-scalable-data-pipelines-with-kedro-and-ibis.

In many cases, by the time you create your master table, the data is sufficiently small that you can just use something like scikit-learn. However, if your use case is large, or you would otherwise like to do as much compute as possible on BigQuery, you could use BigQuery ML. You can also consider using IbisML** to do sklearn/tidymodels-style preprocessing directly on BigQuery (more flexible for manual preprocessing than BigQuery ML).

*The Ibis backend for BigQuery was developed by the BigQuery DataFrames team lead at Google. He is also an Ibis maintainer, but the backend is collectively maintained by the Ibis team. BigFrames uses Ibis under the hood, but provides a pandas API. This is fully maintained by the Google team.

**IbisML is an early-stage project. If you're interested in learning more, or seeing a demo of how it works with BigQuery, I can help!

Also, full disclosure: I am paid to work on Ibis. While I genuinely believe it's a very good fit given what you're trying to do (push as much computation as possible to BigQuery), and am happy to help provide guidance, you should evaluate the options yourself. 🙂
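
To make the "generate BigQuery SQL and submit it lazily" point concrete, here is a rough sketch of how that pattern can look with Ibis; the project ID is a placeholder, the public penguins table is only used as an example, and exact method availability may vary by Ibis version:

import ibis

# Placeholder billing project; the dataset below is a BigQuery public dataset.
con = ibis.bigquery.connect(
    project_id="my-gcp-project",
    dataset_id="bigquery-public-data.ml_datasets",
)

penguins = con.table("penguins")

# Build the expression lazily; nothing runs on BigQuery at this point.
expr = (
    penguins.filter(penguins.body_mass_g.notnull())
    .group_by("species")
    .aggregate(avg_mass=penguins.body_mass_g.mean())
)

print(expr.compile())   # inspect the BigQuery SQL that Ibis generates
df = expr.to_pandas()   # only now does the query run and the result come back locally
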
Italo Sayan
05/01/2024, 11:30 AM
import bigframes.pandas as bpd
# Set BigQuery DataFrames options
bpd.options.bigquery.project = your_gcp_project_id
bpd.options.bigquery.location = "us"
# Create a DataFrame from a BigQuery table
query_or_table = "bigquery-public-data.ml_datasets.penguins"
df = bpd.read_gbq(query_or_table)
Compared to the ibis approach:
execute(self, expr, params=None, limit='default', **kwargs)
Italo Sayan
05/01/2024, 11:32 AM

Italo Sayan
05/01/2024, 11:34 AM

Italo Sayan
05/01/2024, 11:34 AM

Deepyaman Datta
05/01/2024, 11:38 AM
.to_pandas() converts a BigFrames dataframe to a pandas dataframe locally: https://cloud.google.com/python/docs/reference/bigframes/latest/bigframes.dataframe.DataFrame#bigframes_dataframe_DataFrame_to_pandas
(Similarly, ibis.Table.to_pandas() is an alias for execute().)
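
In other words, the two materialization paths look roughly like this, assuming df is a BigFrames DataFrame (as in the read_gbq snippet above) and table is an Ibis table (as in the connect example below):

# BigFrames: pull the result of the BigQuery computation into local pandas
local_df = df.to_pandas()

# Ibis: same idea; to_pandas() is an alias for execute()
local_df = table.to_pandas()   # equivalent to table.execute()
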
Deepyaman Datta
05/01/2024, 11:41 AM
.execute() on Ibis converts an Ibis table back to pandas. If you want to read a BigQuery table, it would be something like:

import ibis

conn = ibis.bigquery.connect(
    project_id=YOUR_PROJECT_ID,
    dataset_id='bigquery-public-data.stackoverflow')
table = conn.table('posts_questions')

(Adapted from https://github.com/GoogleCloudPlatform/community/blob/master/archived/bigquery-ibis/index.md; note that it contains outdated references to ibis_bigquery, as it is an archived repo from when ibis-bigquery was a separate project.)
Italo Sayan
05/01/2024, 11:41 AM

Italo Sayan
05/01/2024, 11:42 AM

Italo Sayan
05/01/2024, 11:43 AM

Italo Sayan
05/01/2024, 11:45 AM

Deepyaman Datta
05/01/2024, 12:11 PM
> I see them as exclusive, to be honest. If I use BigFrames I won't need Ibis.

Sure. Also, if you use BigFrames, you will probably need to implement some custom Kedro datasets, since I don't think BigFrames datasets exist anywhere yet. However, this is quite easy to do, and we would happily welcome contributions if you do decide to go down this route!
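
As a rough idea of what such a custom dataset might look like, here is a sketch against Kedro's standard AbstractDataset interface. The class name, table path, and the read_gbq/to_gbq usage are assumptions for illustration, not an existing kedro-datasets implementation:

from typing import Any

import bigframes.pandas as bpd
from kedro.io import AbstractDataset


class BigFramesTableDataset(AbstractDataset):
    """Hypothetical Kedro dataset that reads/writes a BigQuery table via BigFrames."""

    def __init__(self, table: str):
        # e.g. "my-project.my_dataset.my_table" (placeholder)
        self._table = table

    def _load(self) -> bpd.DataFrame:
        # Returns a lazy BigFrames DataFrame backed by BigQuery
        return bpd.read_gbq(self._table)

    def _save(self, data: bpd.DataFrame) -> None:
        # Writes the DataFrame back to the configured BigQuery table
        data.to_gbq(self._table, if_exists="replace")

    def _describe(self) -> dict[str, Any]:
        return {"table": self._table}
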
Italo Sayan
05/01/2024, 1:07 PM

Italo Sayan
05/01/2024, 1:07 PM

Deepyaman Datta
05/01/2024, 1:12 PM

Italo Sayan
05/03/2024, 9:24 AM

Deepyaman Datta
05/03/2024, 9:55 PM

Italo Sayan
05/04/2024, 1:20 PM

Italo Sayan
05/04/2024, 1:20 PM

Deepyaman Datta
05/04/2024, 5:47 PM
pandas.GBQQueryDataset
Deepyaman Datta
05/04/2024, 5:48 PM