I recently had the pleasure of joining Databricks, but I often reflect on my profoundly rewarding experience at ByteDance. Before joining ByteDance, I had been a software engineer for a decade. My core responsibility was system design and code development for complex software, along with keeping it running reliably in production. At ByteDance, however, I led a 16-person technical team (mostly Senior+ engineers) responsible for building the competitiveness of the open-source engines in ByteDance’s public cloud EMR product. The role was immensely challenging yet transformative for me. In this article, I want to share what I gained from working at ByteDance.
Background
Because of my contributions to the Apache community, ByteDance’s EMR director contacted me and invited me to join the EMR team to lead the open-source engine direction. In June 2021, ByteDance officially announced its entry into the public cloud market and launched Volcano Engine. Consider the timing: Alibaba Cloud started in 2009, Tencent Cloud was announced in 2013, and Huawei Cloud began external services in 2016, so the mainstream players in China’s public cloud market had all undergone roughly a decade of refinement. Given the long investment cycles of the cloud computing industry and the market dynamics, the industry generally held a pessimistic view of Volcano Engine’s prospects at the time. Of course, times have changed. With ByteDance’s heavy investments in AI, Volcano Engine’s market position today is far stronger than in its early days. The ByteDance EMR director was an exceptionally charismatic leader. He told me: “In this new team, you can lead the team to build many new projects from scratch and create massive impact for customers, products, and the team.” This reminded me of Steve Jobs’ famous question to John Sculley: “Do you want to spend the rest of your life selling sugared water, or do you want a chance to change the world?” The idea of “leading a team to build impactful new projects from scratch” fascinated me, so I chose to join.
When I joined, ByteDance EMR was still at a relatively early stage. The classic EMR product covered components like Hadoop, Hive, Spark, Presto & Trino, StarRocks & Doris, HBase, Kafka, and Flink, spanning batch processing, OLAP analytics, and streaming. This vast scope gave our team immense potential. However, because the product was at an early stage, we set aside overly ambitious and impractical “grandiose” goals and focused instead on “how to iteratively launch product commercialization from zero and incrementally shape engine competitiveness through customer scenarios and industry product reviews.” We conducted in-depth research into the core advantages and investment directions of mainstream cloud vendors and startups, including but not limited to:
Storage-Computation Separation: After 2020, migrating data from traditional HDFS to cloud object storage (e.g., AWS S3, ByteDance TOS, Aliyun OSS) became a consensus among vendors, customers, and the industry, so the whole team treated this initiative as a priority. But object storage is not a file system. In terms of semantics, S3 lacks file and directory concepts, has slow List performance, and does not support Rename. In terms of performance, QPS limits and bandwidth throttling on object storage are critical bottlenecks. The question was how to invest. The industry has “transparent acceleration” solutions like Alluxio, S3FS, and GooseFS, as well as “non-transparent acceleration” solutions like JuiceFS and OSS-HDFS. “Non-transparent acceleration” designs its own metadata layer, which in theory enables better semantics and performance, but it risks user lock-in because the data it keeps on S3 is in a closed format. “Transparent acceleration” keeps S3 data open and transparent to users and natively compatible with the entire S3 ecosystem. Matching the semantics and performance of “non-transparent acceleration” this way wasn’t impossible, just more complex. Ultimately, we chose to “keep complexity for ourselves, simplicity for customers” and developed a proprietary transparent acceleration solution: the Proton project [7]. This proved to be the right decision and rapidly accelerated our product’s revenue growth.
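To make the “object storage is not a file system” point concrete, here is a minimal sketch of why Rename is costly on a flat key namespace. The `FlatObjectStore` class is a hypothetical in-memory stand-in that I made up for illustration; it is not a real S3 or TOS SDK, and it says nothing about how Proton works internally. It only shows the common client-side emulation, where renaming a “directory” becomes one listing plus a copy and a delete per object.

```python
class FlatObjectStore:
    """Toy in-memory stand-in for a bucket (hypothetical, not a real SDK)."""

    def __init__(self):
        self._objects = {}  # key -> bytes; "/" in a key has no special meaning
        self.requests = 0   # number of simulated API requests issued so far

    def put(self, key, data):
        self.requests += 1
        self._objects[key] = data

    def list_prefix(self, prefix):
        # Only prefix listing is available; there is no directory to open.
        self.requests += 1
        return sorted(k for k in self._objects if k.startswith(prefix))

    def copy(self, src, dst):
        self.requests += 1
        self._objects[dst] = self._objects[src]

    def delete(self, key):
        self.requests += 1
        del self._objects[key]


def rename_prefix(store, src_prefix, dst_prefix):
    """Emulate a directory rename: list, then copy and delete every object."""
    keys = store.list_prefix(src_prefix)
    for key in keys:
        store.copy(key, dst_prefix + key[len(src_prefix):])
    for key in keys:
        store.delete(key)
    return len(keys)


store = FlatObjectStore()
for i in range(3):
    store.put(f"warehouse/_temporary/part-{i}.parquet", b"rows")

before = store.requests
moved = rename_prefix(store, "warehouse/_temporary/", "warehouse/table/")
rename_requests = store.requests - before
print(moved, rename_requests)  # 3 objects moved with 7 requests (1 list + 3 copies + 3 deletes)
```

On HDFS the same rename is a single metadata operation, whereas here the request count grows with the number of objects, and the move is not atomic. That gap is part of what any acceleration layer has to hide from the engines above it.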
Lakehouse & Data Lake Direction: Delta Lake [1], open-sourced by Databricks in 2019, popularized the Data Lake concept in the data field, and Iceberg and Delta became the hottest projects. We explored several Iceberg sub-directions, such as materialized views, secondary indexes, and page-level caching acceleration. However, considering the maturity of the Lakehouse concept in China and our commercialization priorities, these sub-directions were temporarily deprioritized. Still, we maintained continuous investment in open-source Lakehouse solutions to stay competitive.
Intelligent Data Optimization: Industry players like Amazon Redshift [2], Snowflake [3], and Databricks [4] have their own intelligent data optimization solutions. For example, Redshift analyzes the query workload and the physical data to automatically select Sort Keys, Distribution Keys, and column compression encodings. Unlike these single-engine solutions, EMR is a hybrid product covering batch, streaming, and OLAP. Customers might combine engines in C(20, n) ways, making it difficult to predict and quantify the benefits of single-engine intelligence. We ultimately deprioritized investment in this area for the short term.
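As a rough illustration of how quickly C(20, n) grows, the short calculation below counts the possible engine combinations. The figure of 20 components comes from the C(20, n) expression above; treating every subset as a plausible deployment is a simplifying assumption of mine.

```python
from math import comb

ENGINE_COUNT = 20  # taken from the C(20, n) figure in the text

# Number of distinct ways to pick exactly n engines out of 20.
combos_by_size = {n: comb(ENGINE_COUNT, n) for n in range(1, ENGINE_COUNT + 1)}

# Every non-empty subset of engines a customer could deploy together.
total_combos = sum(combos_by_size.values())

print(combos_by_size[3])  # 1140 three-engine combinations
print(total_combos)       # 1048575, i.e. 2**20 - 1
```

Even if only a small fraction of these combinations appear in practice, tuning each engine in isolation tells you little about the end-to-end benefit for a given customer.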
Spark Native Direction: Databricks’ 2022 Photon paper [5] was highly influential. In my view, after more than a decade of Spark’s evolution, innovations like Photon, which can improve Spark’s performance severalfold on general workloads, are exceedingly rare. After reading the Photon paper, I felt exhilarated, thinking: “This is the direction we need to invest in.” Meanwhile, ByteDance runs tens of millions of Spark cores internally, and internal teams were considering similar investments, so the two sides aligned perfectly. Today, Spark Native solutions are widely deployed in ByteDance’s internal and EMR production environments and have become a core competitive advantage.
OLAP Direction: In North America, leading products like Amazon Athena, Starburst, and Dremio are largely built on Presto/Trino/Drill analytics engines and S3. However, China’s customer market is entirely different. Initially, a team at Baidu forked Impala for secondary development, creating the MPP database that became Apache Doris. Another company, CelerData, built the StarRocks project on Apache Doris and achieved significant commercial progress. Later, the Baidu team spun off to establish SelectDB. Both SelectDB and CelerData secured tens of millions in venture funding and built strong technical competitiveness (sub-second query latency, fast row-level updates, native vectorized execution, CBO optimization) as well as commercial traction in China. Based on market and industry trends, we decided to invest in StarRocks/Doris, launching the Serverless StarRocks service [8] with storage-computation separation. Revenue scale and growth in this area have been exceptional.
In the big data era, the above strategies succeeded during the cold-start phase. Leadership clearly aligned customer, sales, product, R&D, and testing workflows to ensure high-priority initiatives received resources and execution. As an R&D team, we closely followed customer scenarios and iterated rapidly. Ultimately, Volcano EMR achieved successful commercialization: execution → customer satisfaction → revenue growth → resource allocation → stronger execution → … Once the product gained momentum, it snowballed. In short, our product revenue charted an impressive growth curve.
As the train of the era raced forward, the AI age arrived swiftly: ChatGPT in November 2022, GPT-4 in March 2023. Starting in 2023, ByteDance’s Volcano Engine fully embraced AI. Leveraging ByteDance’s intensive investments in AI infrastructure, GPU resources, and talent (¥80 billion in 2024, doubling to ¥160 billion in 2025), Volcano Engine quickly outpaced competitors, capturing 46.4% of China’s token call market share in 2024 [9], with revenue multiplying (2025 projections suggest doubling again [10]). Without doubt, ByteDance’s Volcano Engine successfully seized the AI opportunity. Within this trend, I believe EMR also capitalized on the wave, for instance through investments in Lance [11].
Setting aside details, I want to emphasize core principles I learned at ByteDance:
Customer First
Customers are paramount, a principle reiterated in many companies’ leadership tenets. Amazon’s “Customer Obsession” states: “Leaders start with the customer and work backwards. They work vigorously to earn and keep customer trust. Although leaders pay attention to competitors, they obsess over customers.” ByteDance’s teams took this to extremes. Every member—from sales to engineers—closely tracked customer needs. We even invited customers to co-review products, iterating based on their requirements.
From another angle, customers wield immense influence. Their legitimate demands often secure company-wide resource allocation. Thus, addressing customer needs equates to securing more resources, enabling teams to deliver better results.
Finally, consistent positive customer feedback creates a self-reinforcing cycle across “customer → sales → product → R&D,” amplifying the snowball effect.
Radical, Radical, Radical
ByteDance founder Zhang Yiming once said: “If I could advise my past self from five years ago, it would be: Be more radical.” This ethos is encoded in ByteDance’s DNA. It runs from Toutiao’s rise, through the 2016-2017 launch of Douyin/TikTok, to the subsequent dominance in ads, e-commerce, and local services, which together scaled ByteDance into a hundred-billion-dollar company within years. In AI, ByteDance’s aggressiveness continued: after ChatGPT’s 2022 debut, ByteDance invested heavily. Its 2023 Doubao model underperformed, but by 2024 it led China’s token call volume. The 2025 plan to double investments exemplifies extreme radicalism. When DeepSeek released its models in late 2024, many ByteDance teams studied the papers and models during holidays to take on the resulting traffic. Recognizing Doubao’s lag behind DeepSeek, ByteDance immediately recruited Wu Yonghui to lead AGI efforts. EMR’s team mirrored this radicalism, rapidly closing gaps with competitors and aggressively investing in AI for revenue growth.
Bold Hypotheses, Rigorous Validation
This complements “Radical, Radical, Radical.” Radicalism isn’t aimless: it means aggressively exploring new directions, rigorously evaluating them, then concentrating resources on core bets. ByteDance calls this “Achieving miracles through brute force,” where “brute force” means radicalism across strategy, execution, and commercialization, while “miracles” emerge only in select directions. For example, over the past year or more, we aggressively researched subfields including:
- Data Pre-training (SparkML, Ray [12], Daft [13])
- Vector Databases (Milvus, Elasticsearch)
- RAG (LlamaIndex, LangChain, Glean [14])
- Post-training Fine-tuning (Databricks ML, Amazon SageMaker)
- Hybrid Search (Elasticsearch, Rockset)
- GraphRAG [15]
- Graph Databases (Nebula Graph, Neo4j)
- LLM Data Processing (data-juicer [16])
- Multimodal Data Lakes (DeepLake [17], LanceDB [18], MosaicML Streaming [19])
- Data Version Control (lakeFS [20], DVC [21], Git LFS [22])
- Data Annotation (Scale AI [23])
After rigorous review, we cautiously invested in a few directions. Product revenue and cross-functional feedback confirmed our choices. While cross-domain leaps (e.g., Data → AI) have low success rates, radical investment paired with rational insights makes success possible. This is the power of “Bold Hypotheses, Rigorous Validation.”
Leverage Synergies
Before ByteDance, I underestimated organizational leverage. My manager taught me: “Always borrow strength.” This means identifying shared interests, building partnerships, and creating win-win outcomes. Our team maximized this—collaborating with internal/external teams to rapidly achieve product goals.
Simplify Complexity to the Extreme
“Simplify complexity to the extreme” is a principle I internalized through software design. Code often involves tangled if-else logic, loops, and nested method calls. As products evolve, complexity grows exponentially with code volume. The difference between senior and junior engineers lies in the former’s ability to manage complexity via abstraction and trade-offs—abstracting requirements, design, and code while balancing performance, features, and complexity.
During Proton’s development, I applied this mindset. I reviewed every design and code detail, relentlessly reducing complexity. The results were extraordinary: Proton, a storage middleware of several hundred thousand lines of code, served massive customer datasets (from TBs to hundreds of PBs) with zero data loss or downtime. One incident stands out: a customer found Impala + ProtonCache + TOS slower than Impala + HDFS. Three team members spent a week troubleshooting. We suspected ProtonCache’s IO optimizations until we discovered that Impala caches file descriptors (fds) specifically for HDFS. Adding similar caching for ProtonCache resolved the issue. This taught me that extreme simplification and full ownership of engineering details deliver stability beyond expectations.
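For readers unfamiliar with the idea, the sketch below shows what an fd (file handle) cache does in general: it keeps recently used handles open so that repeated reads of the same file skip the cost of reopening it. This is a simplified illustration under my own assumptions. The `FileHandleCache` class and its LRU eviction policy are hypothetical and are not the actual Impala or ProtonCache implementation.

```python
import os
import tempfile
from collections import OrderedDict


class FileHandleCache:
    """LRU cache of open file handles (illustrative; not Impala's or Proton's code)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._handles = OrderedDict()  # path -> open file object, oldest first
        self.opens = 0                 # number of real open() calls performed

    def _get(self, path):
        handle = self._handles.get(path)
        if handle is not None:
            self._handles.move_to_end(path)  # mark as most recently used
            return handle
        handle = open(path, "rb")
        self.opens += 1
        self._handles[path] = handle
        if len(self._handles) > self.capacity:
            # Evict and close the least recently used handle.
            _, evicted = self._handles.popitem(last=False)
            evicted.close()
        return handle

    def read_at(self, path, offset, length):
        handle = self._get(path)
        handle.seek(offset)
        return handle.read(length)

    def close(self):
        for handle in self._handles.values():
            handle.close()
        self._handles.clear()


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "block-0")
    with open(path, "wb") as f:
        f.write(b"0123456789")

    cache = FileHandleCache(capacity=2)
    # Three positioned reads against the same file reuse one cached handle.
    chunks = [cache.read_at(path, off, 2) for off in (0, 4, 8)]
    opens_for_three_reads = cache.opens
    cache.close()

print(chunks, opens_for_three_reads)
```

For scan-heavy engines that issue many small positioned reads per file, avoiding a reopen on every read can make a noticeable difference, which is consistent with the slowdown we observed when one storage path had such a cache and the other did not.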
Conclusion
My ByteDance experience was the fastest-growing chapter of my career. I witnessed a product’s journey from 0→1→100, a thrilling “startup” journey. I gained profound insights into industry knowledge, leadership, cross-team collaboration, methodology, and engineering rigor. I’m deeply grateful to my manager for mentoring me and to all colleagues whose collaboration made miracles possible. I wish everyone all the best!
References
- https://www.databricks.com/company/newsroom/press-releases/databricks-open-sources-delta-lake-for-data-lake-reliability
- https://docs.aws.amazon.com/redshift/latest/dg/t_Creating_tables.html
- https://www.snowflake.com/en/blog/automatic-query-optimization-no-tuning/
- https://docs.databricks.com/aws/en/delta/clustering
- https://people.eecs.berkeley.edu/~matei/papers/2022/sigmod_photon.pdf
- https://doris.apache.org/
- https://www.volcengine.com/docs/6491/149821
- https://www.volcengine.com/docs/6491/1134307
- https://finance.sina.com.cn/stock/relnews/hk/2025-04-11/doc-inesuhfw5087507.shtml
- https://finance.sina.com.cn/jjxw/2024-12-31/doc-ineciptz2736843.shtml
- https://openinx.github.io/posts/2025-04-11-iceberg-summit-2025-2/
- https://github.com/ray-project/ray
- https://www.getdaft.io/
- https://www.glean.com/
- https://arxiv.org/abs/2404.16130
- https://github.com/modelscope/data-juicer
- https://github.com/activeloopai/deeplake
- https://lancedb.github.io/lancedb/basic/
- https://github.com/mosaicml/streaming
- https://scale.com/
- https://dvc.org/
- https://git-lfs.com/