This is the second article in my series of thoughts on Iceberg Summit 2025 (here is Part 1); it is not limited to Iceberg but covers the Data Lake more broadly. I try to share what the creative teams in this field are doing: why each problem matters, what the solution is, and how it may develop in the future. These insights come from various talks, discussions, and debates, filtered through my personal understanding. In any case, I hope this article can inspire you.
AI + Data Lake
As I mentioned in the previous article, every company or product in the infra field should firmly grasp the AI trend. In fact, many companies have done so and, in my opinion, found great positions.
“Data lake” usually refers to a centralized system used to store, process, and manage massive data. “Storage” usually covers structured, semi-structured, and unstructured data; “processing” usually covers common business scenarios such as ETL, BI analysis, feature engineering, data preprocessing, model training, and model post-training; “management” usually covers dataset sharing, permissions, security, lineage, and other collaboration-oriented capabilities across different businesses.
Iceberg is now generally considered the de facto standard Table Format in the Data Lake field. However, a Table Format alone is not enough to cover the complex dimensions of “storage”, “processing”, and “management” in a general Data Lake. Iceberg is essentially a “storage” solution: Iceberg v1 and v2 are best suited to maintaining structured data, Iceberg v3 extends toward semi-structured data, and unstructured scenarios remain underserved. Measured against the notion of “Data Lake” in the previous paragraph, the Iceberg-centric Data Lake solution still has many gaps, and it is precisely these gaps that have inspired innovation from various teams and brought new solutions.
Storage
This part is usually divided into two layers: “Format” and “Filesystem/ObjectStorage”.
First, let’s talk about the Format part. The datasets involved in AI scenarios include structured, semi-structured, and unstructured data. As mentioned earlier, Iceberg is currently not well suited to storing unstructured data (images, audio, video, etc.), for three main reasons.
Firstly, using Parquet to read and write large columns is inefficient and prone to OOM. Imagine a column storing video data, with each video ranging from 20MB to 100MB: a 128MB Parquet RowGroup can then hold at most about six rows. So few rows per RowGroup undercuts Parquet's columnar design (compression costs a lot but gains little, and pruning rarely hits because such columns usually carry no useful column stats). Most critically, because the byte size of such columns is unpredictable, Parquet writers are extremely prone to OOM, making stability a huge challenge.
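To make the arithmetic concrete, here is a scaled-down sketch with PyArrow (file name, sizes, and settings are illustrative, not a recommendation): the writer has to buffer an entire RowGroup in memory before flushing, so multi-megabyte cells collapse the rows-per-group count and balloon memory usage.

```python
import pyarrow as pa
import pyarrow.parquet as pq

# One ~20 MB blob stands in for a video; real cells may reach 100 MB.
video = b"\x00" * (20 * 1024 * 1024)
table = pa.table({"id": list(range(6)), "video": [video] * 6})

# A 128 MB RowGroup target fits only ~6 such rows, and all of them
# (plus compression working memory) must sit in RAM before the flush.
# This buffering is what makes large-column Parquet writers OOM-prone.
pq.write_table(table, "videos.parquet", row_group_size=6, compression="zstd")
```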
Secondly, AI scenarios involve random access, but Iceberg and Parquet do not provide a simple, easy-to-use interface for it. Why does this interface matter? I have encountered two customer scenarios. The first is JOINing image datasets with text datasets to clean or reconstruct a mixed image-text dataset; the JOIN is essentially random access to the image data. The second is sample shuffling in AI training, which distributes data randomly so that each concurrent training task receives random samples and the model converges faster; shuffling is essentially random access.
Finally, as ML engineers deepen their understanding of AI datasets, they keep adding new features, which for the dataset usually means backfilling new columns in batches. Today, Iceberg has to rewrite the entire dataset to do this, which is unacceptable for huge datasets. Russell also mentioned this issue in the Iceberg Summit 2025 panel. ByteDance maintains EB-scale feature-engineering data in the Iceberg format internally, but only after changing Iceberg's internal design; see their Iceberg Summit 2024 talk [1] for details.
The above issues were keenly captured by a data start-up that iterated continuously and shipped a usable solution. The company is LanceDB, which creatively designed the Lance file format (an alternative to Parquet) and, on top of it, the semantically richer Lance dataset format (an alternative to Iceberg). In many aspects (such as the transaction mechanism and the table file layout), the LanceDB team absorbed inspiration from Iceberg and made similar designs, but with distinctive choices of their own: boldly writing the underlying unified format library in Rust so the format can iterate quickly across languages; introducing the concept of RowID to provide random-access APIs; and building secondary indexes (B+Tree index, full-text index, vector index) into the dataset format for more efficient data processing. Overall, I think their solution addresses the three problems listed above, which is very helpful for AI scenarios.

Our team ran into these scenarios last year and invested nearly a year in Lance, which earned us two Lance committers and good customer revenue; many thanks to the LanceDB team. For more on LanceDB's technical details, I suggest reading [2], [3], [4]. At Iceberg Summit 2025, I was pleased to see the LanceDB team working with the Iceberg community to integrate the Lance format into Iceberg, giving Iceberg greater leadership in unstructured scenarios. Jack Ye has joined the LanceDB team and is a member of the Apache Iceberg PMC; I believe that with his help, Iceberg and Lance will integrate even better. Finally, other projects in the AI dataset format space include Vortex [5], Nimble [6], Deep Lake [7], and MosaicML Streaming [8]; I hope to analyze them separately in the future.
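As a flavor of how the three capabilities above surface to users, here is a minimal sketch with the lance Python package (the path, column names, and backfill expression are illustrative; consult the Lance docs for the current API):

```python
import lance
import pyarrow as pa

# Write a small dataset in the Lance format.
table = pa.table({"id": list(range(5)), "image": [b"\x89PNG..."] * 5})
lance.write_dataset(table, "/tmp/images.lance")
ds = lance.dataset("/tmp/images.lance")

# 1. Random access: fetch arbitrary rows by position, no full scan.
rows = ds.take([0, 3], columns=["image"])

# 2. Feature backfill: add a derived column without rewriting the
#    existing data files (the SQL expression stands in for a real feature).
ds.add_columns({"id_doubled": "id * 2"})

# 3. Secondary index at the dataset-format level (here a scalar B-tree).
ds.create_scalar_index("id", index_type="BTREE")
```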
Now let’s talk about the “ObjectStorage/FileSystem” part. Some may say storage should simply be delegated to AWS S3. Yes, object storage is excellent: simple, easy to use, durable, and pay-as-you-go. But in fact, a Table/Dataset Format plus object storage alone is not enough for AI scenarios.
Firstly, in the pre-training stage of AI scenarios, the data to be cleaned may reach tens or even hundreds of petabytes, spread across billions of files. This demands extremely high throughput and metadata efficiency from object storage. On the one hand, the cost can be high; on the other, storage bottlenecks easily drag down GPU utilization. The GPU bill is far larger than the storage bill, and leaving GPUs idle is a waste.
Secondly, AI scenarios rely heavily on the Python ecosystem, and many Python libraries access underlying data through the POSIX API, which requires that data on object storage be reachable via POSIX. Some may say that with the Iceberg Table Format or the Lance format, Python can naturally bypass POSIX. True, but at least for now there is no guarantee that all data will be converted to Iceberg or Lance, so the POSIX path is still necessary.
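A small illustration of the gap (bucket, key, and mount point are placeholders): with only an object store, every access goes through an SDK call, whereas a POSIX layer such as a FUSE mount makes the same bytes look like ordinary files, which is what most ML libraries expect.

```python
import boto3
from PIL import Image

# Object-store path: explicit SDK calls, no file semantics.
obj = boto3.client("s3").get_object(Bucket="my-bucket", Key="img/0001.jpg")
data = obj["Body"].read()

# POSIX path: with the bucket mounted (e.g. via JuiceFS or Mountpoint
# for Amazon S3), the same image is just a local file path.
img = Image.open("/mnt/datalake/img/0001.jpg")
```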
Finally, AI training requires sub-second random-IO latency at very high IOPS. In addition, both the checkpointing phase of model training and model deployment produce huge peaks in storage bandwidth consumption; if the storage cannot keep up, checkpoints or deployments fail, wasting even more GPU time.
Among start-ups, Juicedata and Alluxio were among the earliest to tackle these issues, and Juicedata may be the start-up that has benefited most from this wave of AI: many large-model start-ups in China are its customers, and its revenue growth in recent years is said to have been very good. Davies, the founder of Juicedata, is a legendary geek engineer who worked at Databricks from 2014 to 2016. He believed Databricks needed a storage acceleration system on top of S3, but Databricks' CTO Matei (the author of Apache Spark) told him, "Storage is not something we are good at, so try not to touch it if possible." So Davies founded Juicedata in 2017. Had he stayed at Databricks he would be financially free by now, but excellent people succeed in other fields as well.

Success aside, from a technical perspective I actually prefer the "transparent acceleration" direction, which accelerates access to users' existing massive S3 data in place. Of course, this is much more complex than the "non-transparent acceleration" approach Juicedata takes, especially around semantics and performance. In November 2023, AWS released S3 Express One Zone, Mountpoint for Amazon S3, and AI/ML connectors, which essentially solve the above problems in a "transparent acceleration" way. The cost is 7x-8x that of regular S3, but for AI scenarios that is completely acceptable because GPUs are far more expensive. MinIO has an acceleration solution called AIStor cache, but it appears to be an on-premises, closed-source S3 acceleration product; what people want more is a cloud-based acceleration solution for AWS S3.
To sum up, various start-up and big-company teams have achieved varying degrees of success in "Data Lake acceleration", yet I suspect there is still opportunity and room here. From the broad perspective of "ObjectStorage/FileSystem cache acceleration", I have not yet seen a reliable, successful, cloud-vendor-neutral "transparent acceleration" solution. From the narrow perspective of "Data Lake Format acceleration", I believe that caching the manifest files of Iceberg/Lance and the hot pages inside RowGroups/Fragments can deliver fine-grained, extreme acceleration. For example, a customer with a large amount of Iceberg table data on AWS S3 should only need to enable a SaaS service to enjoy a 10x compute speedup.
Processing
In the data processing stage of AI, Spark is usually used for CPU-intensive data processing tasks, while Ray has almost become the core framework for GPU-intensive processing. Ray's advantages lie in fine-grained scheduling of mixed GPU and CPU resources and in seamlessly reusing the native Python GPU & AI ecosystem. However, Ray also has shortcomings, such as lacking built-in support for Shuffle operations and SQL.
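The fine-grained scheduling point is easiest to see in code. A minimal sketch (task bodies are placeholders; it assumes a Ray cluster with at least one GPU, otherwise the GPU tasks will wait for resources): one Ray program mixes CPU-only and fractional-GPU tasks without a second framework.

```python
import ray

ray.init()  # assumes a local or reachable Ray cluster

@ray.remote(num_cpus=2)          # CPU-intensive preprocessing
def preprocess(batch):
    return [x.lower() for x in batch]

@ray.remote(num_gpus=0.5)        # two such tasks can share one GPU
def infer(batch):
    return [len(x) for x in batch]   # placeholder for a model forward pass

batches = [["Foo", "Bar"], ["Baz", "Qux"]]
clean = ray.get([preprocess.remote(b) for b in batches])
preds = ray.get([infer.remote(b) for b in clean])
print(preds)
```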
I stopped by Daft's booth at the venue and chatted with co-founder Jay [9] for a long time, finally getting to the core problem Daft wants to solve. In multimodal ML data processing, a classic pipeline is "semantic deduplication" -> "clustering" -> "batch model inference". The first two steps are CPU-intensive and involve shuffles, which historically only Spark could handle; the third step is offline inference with small models, usually handled by Ray. Spanning Spark and Ray brings several problems: first, AI/ML developers find Spark heavyweight, with a relatively high barrier to entry; second, data processed by Spark must be persisted to object storage or a file system before downstream Ray tasks can consume it, which is costly and also requires complex workflows to schedule jobs across two engines; third, completing the task requires developers to master both Spark and Ray. Jay believes Daft solves exactly this: Daft supports SQL, Shuffle, and Join, is Python-native, and runs naturally on the Ray framework. In short, he argues that Daft = Spark + Ray, so the scenario above can be handled within a single framework, which he sees as Daft's core competitiveness.
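Here is a hedged sketch of what that single-framework pipeline might look like in Daft (the paths, the cluster_id column, and the toy UDF are hypothetical, and the API may differ across Daft versions):

```python
import daft

daft.context.set_runner_ray()  # run on an existing Ray cluster

df = daft.read_parquet("s3://my-bucket/corpus/")   # placeholder path

# Steps 1-2: CPU-bound dedup and grouping; both involve a shuffle,
# which Daft handles natively instead of delegating to Spark.
df = df.distinct()
df = df.repartition(64, daft.col("cluster_id"))    # assumes a precomputed cluster_id

# Step 3: batch inference as a Python UDF on the same engine.
@daft.udf(return_dtype=daft.DataType.string())
def classify(texts):
    return [t[:16] for t in texts.to_pylist()]     # stand-in for a model call

df = df.with_column("label", classify(df["text"]))
df.write_parquet("s3://my-bucket/labeled/")        # no cross-engine persistence step
```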
Over the past year or two, our team has been searching for the best computing framework for multimodal ML data processing, because we have seen these pain points in real customer scenarios. Daft may be a good candidate, though it will take more time to evaluate. DeepSeek open-sourced Smallpond during its Open Source Week, with a positioning somewhat similar to Daft's [10]: native to the Python ecosystem, supporting SQL and Shuffle, and reusing the Ray distributed framework for hybrid CPU & GPU scheduling. Smallpond's core weaknesses are that its Shuffle depends heavily on 3FS, DeepSeek's internal high-performance parallel file system, and that users must switch between the low-level API and the SQL API, which raises the barrier slightly. It may not be an out-of-the-box, user-friendly open-source solution; Daft does better on both counts.
Finally, I want to say that Daft is an extremely young team. Jay graduated from Cornell in 2018 and worked on Lyft's Level 5 project, which was later acquired by Toyota. He founded Daft in 2022 and has been at it ever since. The other members at the booth are also young, and their conversation was full of enthusiasm and confidence. I am amazed that such a young team has such keen insight into AI scenarios and has built the Daft project step by step in Rust + Python within two years. Daft has even landed well in internal scenarios at Amazon and Together AI. I wish Daft smooth development!
Management
Managing data in a Data Lake (sharing, permissions, security, lineage) usually relies on a Catalog. As I understand it, there are currently at least four mainstream open-source projects working on Iceberg-related Catalogs.
- Apache Polaris: an open-source project led by Snowflake that focuses on the Iceberg Catalog.
- Apache Gravitino: an open-source project created by Datastrato and donated to the Apache community. I learned that Xiaomi and Bilibili (China's YouTube) have deployed Gravitino with good results.
- Unity Catalog: the catalog open-sourced by Databricks.
- LakeKeeper: open-sourced by a start-up called Vakamo, which focuses on the open-source project and commercial products around the Iceberg Catalog.
In summary, Polaris and LakeKeeper are Catalog services focused on managing Iceberg, while Gravitino and Unity Catalog are positioned as unified Catalogs for Data + AI. Some may think a Catalog is an easy thing to build, but I disagree: a Catalog involves many thorny issues such as cross-organizational standardization, cross-project collaboration, revenue weighting, management permissions, SLAs, and so on. Fortunately, Apache Iceberg defines not only the industry's Table Format specification but also a Catalog (REST) specification. Unfortunately, the non-Iceberg data management field still lacks industry standards, and neither AI computing engines nor AI/ML developers generally perceive the concept of a Catalog. These pain points are opportunities in the Catalog field!
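The value of that Catalog specification is that one client can speak to any compliant service. A minimal sketch with PyIceberg (endpoint, token, and table name are placeholders): the same code works against Polaris, LakeKeeper, or any other service implementing the Iceberg REST catalog spec.

```python
from pyiceberg.catalog import load_catalog

# Any Iceberg REST-compliant service can sit behind this URI.
catalog = load_catalog(
    "prod",
    **{
        "type": "rest",
        "uri": "https://catalog.example.com/api/catalog",  # placeholder endpoint
        "token": "REDACTED",                               # placeholder credential
    },
)

table = catalog.load_table("analytics.events")  # placeholder namespace.table
print(table.schema())
```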
Streaming + Data Lake
In the field of Streaming + DataLake, I saw four companies at the venue: Confluent, Redpanda, StreamNative, and RisingWave.
Generally speaking, Confluent, Redpanda, and StreamNative are approaching the Data Lake from the perspective of streaming storage. Let me take Confluent as an example.
From Confluent's perspective, it manages a large amount of Kafka data, which ultimately needs to flow into AWS Redshift, Databricks Delta, Snowflake tables, and so on. From a product perspective, once data flows into any one of them, it cannot be accessed by the other services, which is a huge lock-in for users. From a business perspective, Kafka only buffers roughly the last 7 days of data, and the continuously accumulating historical data ultimately flows away from Confluent, which is unfavorable for business and revenue. A custom Table Format would not suit Confluent either, because the lack of a computing ecosystem would again make users feel locked in. Therefore, Confluent chose Iceberg and launched TableFlow [11]. This product turns a Kafka topic into an open Iceberg table with one click, managed either inside Confluent (this part is Confluent's new revenue) or in users' own buckets. "One click" resolves many pain points along the way (I described the same pain points in 2021 [12]): the cost of users running their own Spark/Flink jobs, the complexity of converting heterogeneous sources such as Avro/Parquet/JSON/Protobuf to Iceberg, pipeline breakage caused by schema evolution, Iceberg compaction and file management, and so on. Overall, I believe TableFlow is a win-win for both users and Confluent.
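To appreciate what "one click" removes, here is a sketch of the DIY pipeline users run today: a self-managed Spark structured-streaming job from Kafka into Iceberg. Broker, topic, catalog config, and table names are placeholders, and deserialization, schema evolution, and compaction are all elided here, but in practice they all land on the user.

```python
from pyspark.sql import SparkSession

# Assumes the iceberg-spark-runtime and spark-sql-kafka packages are on
# the classpath, and that catalog "lake" is fully configured elsewhere.
spark = (
    SparkSession.builder
    .config("spark.sql.catalog.lake", "org.apache.iceberg.spark.SparkCatalog")
    .getOrCreate()
)

events = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")
    .option("subscribe", "orders")
    .load()
)

query = (
    events.selectExpr("CAST(value AS STRING) AS raw")  # real jobs must parse Avro/JSON/Protobuf here
    .writeStream.format("iceberg")
    .outputMode("append")
    .option("checkpointLocation", "s3://my-bucket/checkpoints/orders")
    .toTable("lake.db.orders")
)
query.awaitTermination()
```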
Regarding RisingWave + Data Lake, I consulted Wu Yingjun and found a slightly different approach. RisingWave is essentially a streaming database, encompassing both computation and storage. Its Postgres-compatible SQL can express both streaming and batch semantics: users first create a materialized view with SQL, then query the continuously maintained view with SQL in real time, obtaining millisecond-level responses for complex data insights. The storage layer can persist either RisingWave's built-in format or the Iceberg format. The core advantage of storing Iceberg is that users can easily ingest into the lake with PG SQL, just as easily run AP analysis with PG SQL, and finally expose the same data to other computing engines in the Iceberg ecosystem. This is RisingWave's Streaming Lakehouse.
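Since RisingWave speaks the Postgres wire protocol, the workflow can be sketched with plain psycopg2 (connection details, table, and view names are placeholders, and an "orders" source is assumed to exist): one statement defines the incrementally maintained view, a second queries its always-fresh result.

```python
import psycopg2

# Default RisingWave frontend settings are assumed here.
conn = psycopg2.connect(host="localhost", port=4566, user="root", dbname="dev")
conn.autocommit = True
cur = conn.cursor()

# Stream semantics: RisingWave keeps this view fresh as events arrive.
cur.execute("""
    CREATE MATERIALIZED VIEW order_totals AS
    SELECT customer_id, SUM(amount) AS total
    FROM orders
    GROUP BY customer_id
""")

# Batch semantics: the same PG SQL reads the always-fresh result.
cur.execute("SELECT * FROM order_totals WHERE total > 1000")
print(cur.fetchall())
```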
HTAP + Data Lake
The Sun Zhou and Chen Cheng team I know is a start-up working in the HTAP + Data Lake direction; the company is called Mooncake. Before founding it, both were principal engineers at SingleStore for many years, and SingleStore has long held a leading position in the HTAP field.
The Mooncake team implemented a Postgres extension service in Rust that subscribes to CDC data from Postgres OLTP tables and periodically converts it into the Apache Iceberg table format. They also implemented pg_mooncake, a Postgres extension that lets Postgres analyze Iceberg tables quickly with a vectorized engine. By combining native Postgres capabilities, the Rust service, and pg_mooncake, they built an HTAP solution around Postgres. This solution has several advantages that distinguish it from others.
- First, it is an HTAP solution built around the native Postgres ecosystem.
- Second, both the AP and TP sides deliver millisecond-level freshness and query latency: OLAP queries join the incremental CDC data with the Iceberg table data in real time to achieve millisecond-level freshness.
- Third, Mooncake implements an open-data HTAP solution based on the Apache Iceberg format.
Mooncake's positioning is unique in the market: possibly the only open Data Lake HTAP solution built around the Postgres ecosystem. I think the Mooncake team is awesome because they solved a hard problem in a simple and elegant way.
Summary
At Iceberg Summit 2025, I saw Iceberg successfully connect diverse open-source and commercial ecosystems with its completely open Table Format, turning the world into a truly open Data Lake. From "Iceberg v3 and beyond" onward, I saw the infinite prospects of a multimodal ML Data Lake for "Data + AI" that covers structured, semi-structured, and unstructured data. From the various creative teams, I saw their enthusiasm, ideas, paths, and success. In any case, this is just my partial view; there were many more talks and scenarios at Iceberg Summit 2025. Welcome to explore together!
References
- [1] https://www.youtube.com/watch?v=UPjr0qZ0-Do
- [2] https://blog.lancedb.com/lance-v2/
- [3] https://lancedb.github.io/lance/format.html
- [4] https://github.com/lancedb/lance-research/blob/main/file_2_1/paper/paper.pdf
- [5] https://github.com/spiraldb/vortex
- [6] https://github.com/facebookincubator/nimble
- [7] https://github.com/activeloopai/deeplake
- [8] https://github.com/mosaicml/streaming
- [9] https://www.jaychia.com/
- [10] https://blog.getdaft.io/p/deepseek-smallpond-3fs-and-data-processing
- [11] https://www.confluent.io/blog/latest-tableflow/
- [12] https://openinx.github.io/ppt/2021-04-25-flink-iceberg-shanghai-meetup.pdf