`bigframes`, DataFrame APIs for BigQuery :fire: <h...
# resources
j
d
It uses Ibis behind the scenes too! 🚀 nice work @Cody Peterson
j
yesssss ibis
c
not my work but super cool to see 🙂 might be adaptable more generally too? really cool that they made this open-source
d
It's interesting. I was (maybe still am) a bit confused by the `Block` concept, but looking more into it, seems like it's just their version of `NDFrame`.
might be adaptable more generally too?
I think so. Their `Session` pretty much creates your `ibis.Backend` instance, and there's not much that's backend-specific until you get to `to_pandas()`, where you use that `Session` to execute the Ibis expression. See https://github.com/googleapis/python-bigquery-dataframes/blob/main/bigframes/session.py#L299-L306 and then places like https://github.com/googleapis/python-bigquery-dataframes/blob/bf6ecb81afeb199b3dad07d1fd2057668352f939/bigframes/core/scalar.py#L57 for that. There are bits and pieces with specific BigQuery-related limitations, but they look pretty easy to pick out in making it more generic.
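For anyone skimming, here's a rough sketch of the pattern described above: a session owns the backend connection, frames only accumulate a query, and nothing hits the backend until you ask for results. This is not bigframes' code. `Session`, `LazyFrame`, `filter_gt`, and `to_rows` are hypothetical names, `sqlite3` stands in for BigQuery, and plain SQL strings stand in for Ibis expressions.

```python
import sqlite3


class Session:
    """Hypothetical session that owns the backend connection."""

    def __init__(self):
        # Assumption: sqlite3 stands in for the real backend (BigQuery).
        self.conn = sqlite3.connect(":memory:")
        self.queries_run = 0

    def read_table(self, name):
        # No query is issued here; we only record what to read.
        return LazyFrame(self, f"SELECT * FROM {name}")

    def execute(self, sql, params=()):
        # The only backend-specific step in this sketch.
        self.queries_run += 1
        return self.conn.execute(sql, params).fetchall()


class LazyFrame:
    """Hypothetical frame that builds up a query without running it."""

    def __init__(self, session, sql, params=()):
        self.session = session
        self.sql = sql
        self.params = params

    def filter_gt(self, column, value):
        # Column names are interpolated directly (fine for a sketch only);
        # the comparison value is passed as a bound parameter.
        sql = f"SELECT * FROM ({self.sql}) WHERE {column} > ?"
        return LazyFrame(self.session, sql, self.params + (value,))

    def select(self, *columns):
        cols = ", ".join(columns)
        return LazyFrame(self.session, f"SELECT {cols} FROM ({self.sql})", self.params)

    def to_rows(self):
        # Analogous to to_pandas(): the point where the session executes.
        return self.session.execute(self.sql, self.params)


def demo():
    session = Session()
    session.conn.executescript(
        "CREATE TABLE t (name TEXT, n INTEGER);"
        "INSERT INTO t VALUES ('a', 1), ('b', 5), ('c', 9);"
    )
    frame = session.read_table("t").filter_gt("n", 3).select("name")
    queries_before = session.queries_run  # still 0: nothing executed yet
    rows = frame.to_rows()
    return queries_before, session.queries_run, rows
```

Swapping the backend in this sketch only means changing what `Session.execute` talks to, which is roughly why the approach looks adaptable.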
Wonder what their operation coverage is/if they have it listed anywhere?
👀 1
d
Do we start the countdown timers for RedFrames and AzureFrames now then?
😂 1
j
in a way this might be a response to Snowpark? but I'm not very familiar with these platforms, might be spitting bullsh*t here
d
Yeah it is
without the JVM overhead too
IIRC
c
it's definitely a play around Snowflake -- we're moving toward a world where each cloud data platform has their own similar, but slightly different dataframe implementation (PySpark dataframes or pandas on PySpark for Databricks/Synapse, Snowpark for Snowflake, Bigframes for BigQuery, etc.)
our hope is for Ibis to act as a vendor-agnostic, open dataframe standard that works on (or ideally powers) those platforms, so Google building this in the open on top of Ibis is a great step IMO
💡 1
d
My hope as well!
🔥 1
d
we're moving toward a world where each cloud data platform has their own similar, but slightly different dataframe implementation (PySpark dataframes or pandas on PySpark for Databricks/Synapse, Snowpark for Snowflake, Bigframes for BigQuery, etc.)
Snowpark is a bit different, in that it isn't really a pandas API (as far as I know). `pyspark.pandas` and BigFrames both try to stay as close to the pandas API as possible. The whole `pyspark.pandas` test suite checks equivalence to pandas operations/syntax; BigFrames vendors pandas itself so it doesn't need to rewrite docstrings 😂 (`pyspark.pandas` is not as fancy about it, but most of its docstrings are copied word-for-word). But if more vendors/projects down the line want that pandas-equivalent API layer, I can definitely see them using this + Ibis, since it's a lot of extra work/maintenance that most teams would love to avoid. So I think/hope we'd see not-so-different dataframe implementations in the end 🙂
c
yeah Snowpark is more regular PySpark-like, but still has differences. pandas/PySpark are the two major flavors, then koalas -> pandas on Spark muddies that further. the big issue w/ the pandas API is it inherently doesn't scale (the index and some other reasons), so you always end up with a not-so-great support matrix of operations. I think a bunch of different API "flavors" is fine but the duplication of efforts across the ecosystem isn't great
they're very rarely truly drop-in replacements for each other
and maybe soon LLMs that can easily translate between them all make it moot? 🤷