<https://aws.amazon.com/s3/features/tables/> S3 now support Ic... | Kedro #resources

Nok Lam Chan

12/04/2024, 10:22 AM

https://aws.amazon.com/s3/features/tables/ S3 now supports Iceberg tables natively:

> When you create a table in your table bucket, the underlying data in S3 is stored as Parquet data. Then, S3 maintains the metadata necessary to make that Parquet data queryable by your applications. Table buckets include a client library that is used by query engines to navigate and update the Iceberg metadata of tables in your table bucket.
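To make the "Parquet data plus maintained metadata" idea a bit more concrete, here is a toy sketch in plain Python. It is not the S3 Tables API or the real Iceberg metadata format (which uses metadata files, manifest lists and manifests); `ToyTable`, `Snapshot` and the file names are hypothetical and only illustrate that a reader resolves a snapshot to a list of data files instead of listing the bucket.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Snapshot:
    """Hypothetical, heavily simplified stand-in for an Iceberg snapshot."""
    snapshot_id: int
    data_files: Tuple[str, ...]  # Parquet object keys visible in this snapshot


@dataclass
class ToyTable:
    """Toy table metadata: an append-only list of snapshots plus a pointer."""
    snapshots: List[Snapshot] = field(default_factory=list)
    current_snapshot_id: Optional[int] = None

    def append(self, new_files: List[str]) -> Snapshot:
        # A commit never rewrites old snapshots; it adds a new one that
        # references the previous files plus the newly written ones.
        previous = tuple(self.files_at())
        snap = Snapshot(len(self.snapshots) + 1, previous + tuple(new_files))
        self.snapshots.append(snap)
        self.current_snapshot_id = snap.snapshot_id
        return snap

    def files_at(self, snapshot_id: Optional[int] = None) -> List[str]:
        # Readers ask the metadata which files to scan, for the current
        # snapshot by default or for an older one ("time travel").
        target = self.current_snapshot_id if snapshot_id is None else snapshot_id
        for snap in self.snapshots:
            if snap.snapshot_id == target:
                return list(snap.data_files)
        return []


table = ToyTable()
table.append(["data/part-0001.parquet"])
table.append(["data/part-0002.parquet"])
latest_files = table.files_at()      # both Parquet files
first_commit_files = table.files_at(1)  # only the first file
```

In this simplified picture, whoever owns the snapshot list and the current pointer is the "catalog", which is the part the vendors are competing over.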

apache iceberg 1

Juan Luis

12/04/2024, 10:51 AM

I guess this makes a lot of sense for them because it creates lock-in with AWS Glue?

Nok Lam Chan

12/04/2024, 11:03 AM

That too, but I think SageMaker Lakehouse is the driving force here: https://aws.amazon.com/sagemaker/lakehouse/ . They are trying to create a more complete studio experience

datajoely

12/04/2024, 11:37 AM

https://meltware.com/2024/12/04/s3-tables

datajoely

12/04/2024, 11:39 AM

image.png

datajoely

12/04/2024, 11:40 AM

https://www.linkedin.com/posts/hugo-lu-confirmed_dataengineering-apacheiceberg-s3-act[…]515818905601-Idp7?utm_source=share&utm_medium=member_desktop

Juan Luis

12/04/2024, 11:47 AM

from the meltware article I read that they introduced their own alternative to the Iceberg REST standard? that's pretty significant

datajoely

12/04/2024, 11:48 AM

Oh and look who did exactly the same thing: https://cloud.google.com/bigquery/docs/iceberg-tables

datajoely

12/04/2024, 11:48 AM

there is a scramble for metadata going on, but all the vendors are at least backing one table format for the future

Juan Luis

12/04/2024, 11:49 AM

how come all of Big Tech is building on top of Apache Iceberg and yet PyIceberg is so lacking in terms of features? where is all that effort going? (maybe I'm just grumpy today)

Juan Luis

12/04/2024, 11:50 AM

yeah I think it's safe to say Iceberg "won", @Deepyaman Datta and you @datajoely were right

datajoely

12/04/2024, 11:50 AM

Oh I think the answer there is because the query engines are doing it from first principles

☝️ 1

☝🏼 1

datajoely

12/04/2024, 11:52 AM

Spark / JVM for now, the inevitable rust + python approaches will come shortly

Juan Luis

12/04/2024, 12:03 PM

if we can extract any lesson here, maybe we could say Iceberg won because people wanted to use their own catalog? 😬

datajoely

12/04/2024, 12:04 PM

I read a while back that Delta is what you'd build if you're an analytics company, and Iceberg is what you'd build if you're a data warehousing company. But also, try as they might, Delta always felt like a Databricks product.

Deepyaman Datta

12/04/2024, 1:40 PM

> how come all of Big Tech is building on top of Apache Iceberg and yet PyIceberg is so lacking in terms of features? where is all that effort going? (maybe I'm just grumpy today)

Iceberg focuses on the storage layer, and query engines need some way to interact with that storage layer. No large scale query engine is Python-based. PyIceberg is for working directly with the storage layer. This isn't necessary for 90+% of cases. (Just my guess/understanding, could be wrong.)

👀 1

☝🏼 1

Juan Luis

12/04/2024, 1:44 PM

for the time being looks like my favourite query engine doesn't have write support for Iceberg, only read https://github.com/pola-rs/polars/issues/14610

datajoely

12/05/2024, 9:34 AM

There is also a school of thought that maybe the S3 catalog API will win in the same way the S3 API has https://bsky.app/profile/benesch.bsky.social/post/3lck4caxqsc2x

👍🏼 1

Nok Lam Chan

12/05/2024, 2:13 PM

Probably, at least as the path of least friction to start with something that is production-ready. Many people just want to use Iceberg and have a single source (like an S3 bucket).

Nok Lam Chan

12/05/2024, 2:14 PM

Note that S3 (Hadoop catalog) is not implemented in PyIceberg

datajoely

12/05/2024, 2:14 PM

Interesting

datajoely

12/05/2024, 2:15 PM

Given I'm on GCP, I'm really not sure if I should go with the Google version of this or think more vendor-agnostic
