# resources
n
https://aws.amazon.com/s3/features/tables/ S3 now supports Iceberg tables natively:
> When you create a table in your table bucket, the underlying data in S3 is stored as Parquet data. Then, S3 maintains the metadata necessary to make that Parquet data queryable by your applications. Table buckets include a client library that is used by query engines to navigate and update the Iceberg metadata of tables in your table bucket.
apache iceberg 1
j
I guess this makes a lot of sense for them because it creates lock-in with AWS Glue?
n
That too, but I think SageMaker Lakehouse is the driving force here: https://aws.amazon.com/sagemaker/lakehouse/ . They are trying to create a more complete studio experience
j
from the meltware article I read that they introduced their own alternative to the Iceberg REST standard? that's pretty significant
d
Oh and look who did exactly the same thing: https://cloud.google.com/bigquery/docs/iceberg-tables
there is a scramble for metadata going on, but all the vendors are at least backing one table format for the future
j
how come all of Big Tech is building on top of Apache Iceberg and yet PyIceberg is so lacking in terms of features? where is all that effort going? (maybe I'm just grumpy today)
yeah I think it's safe to say Iceberg "won", @Deepyaman Datta and you @datajoely were right
d
Oh I think the answer there is because the query engines are doing it from first principles
☝️ 1
☝🏼 1
Spark / JVM for now, the inevitable rust + python approaches will come shortly
j
if we can extract any lesson here, maybe we could say Iceberg won because people wanted to use their own catalog? 😬
d
I read a while back that Delta is what you’d build if you’re an analytics company, and Iceberg is what you’d build if you’re a data warehousing company. But also, try as they might, Delta always felt like a Databricks product.
d
> how come all of Big Tech is building on top of Apache Iceberg and yet PyIceberg is so lacking in terms of features? where is all that effort going? (maybe I'm just grumpy today)
Iceberg focuses on the storage layer, and query engines need some way to interact with that storage layer. No large scale query engine is Python-based. PyIceberg is for working directly with the storage layer. This isn't necessary for 90+% of cases. (Just my guess/understanding, could be wrong.)
👀 1
☝🏼 1
j
for the time being looks like my favourite query engine doesn't have write support for Iceberg, only read https://github.com/pola-rs/polars/issues/14610
d
There is also a school of thought that maybe the S3 catalog API will win in the same way the S3 API has https://bsky.app/profile/benesch.bsky.social/post/3lck4caxqsc2x
👍🏼 1
n
Probably, at least as the path of least friction to start with something that is production-ready. Many people just want to use Iceberg and have a single source (like an S3 bucket).
Worth noting that S3 (Hadoop Catalog) is not implemented in PyIceberg.
d
Interesting
Given I’m on GCP, I’m really not sure if I should go with the Google version of this or think more vendor-agnostic