Distributed Systems | What We Can Learn from Roblox's Three-Day Outage


(Image: the Roblox game platform)

This article was first published on my LinkedIn.

From 2021 to 2024 I worked at a tech startup in the US, where my manager (who had worked at Microsoft and Affirm before our paths crossed) shared with me the highest-quality software incident analysis she had ever read: Roblox's official post-mortem, a detailed account of the full three days in October 2021 during which all of its servers were down.

The post-mortem really is a rare learning resource, and a fascinating read. On one hand, you see how a game company with an enormous market cap depended technically on the open-source database BoltDB, which began as a toy project one engineer built in his spare time out of personal interest. On the other hand, you see the company's operations team scrambling through the outage in a "hey, let's just try this" mode, much like my own inefficient debugging when I was first learning to code: busy running commands instead of stopping to think deeply.

I read the post-mortem several times, dug into its technical details, tracked down the relevant open-source code to study, and read the community's discussion (and complaints) about the incident on Reddit. Coincidentally, BoltDB's author, Ben Johnson, joined the discussion himself. What a small world the internet is.

Since most of the material I worked with, as well as my discussions with ChatGPT, were in English, the main body of this article is written in English. If you are interested in databases, distributed systems, infrastructure, time-complexity optimization, demystification, or topics like Go, site reliability, and single points of failure, I hope you enjoy it.

Bilibili also published a very high-quality incident post-mortem, 2021.07.13 我们是这样崩的 ("2021.07.13: This Is How We Crashed"), which I likewise recommend.


In October 2021, Roblox suffered the longest outage in its history—73 hours of complete downtime, affecting millions of players worldwide. The root cause? A subtle yet devastating issue buried deep inside an outdated database structure. What makes this case fascinating is that the issue wasn’t an obvious system failure or an external attack—it was a slow-burning technical debt hidden inside the database.

I recently did a deep dive into the post-mortem and gained valuable insights. This is arguably one of the most detailed and thorough outage reports available online. (Thanks Megan Keehan for the recommendation!)

🌐 Background:

Roblox employs a microservices architecture for its backend, using HashiCorp’s Consul for service discovery, enabling internal services to locate and communicate with each other. On the afternoon of October 28th, a single Consul server experienced high CPU load. The Consul cluster's performance continued to degrade, ultimately bringing down the entire Roblox system since Consul acted as a single point of failure. Engineers from both Roblox and HashiCorp collaborated to diagnose and resolve the issue.

🔧 Debugging Attempts:

  1. Suspected Hardware Failure: The team replaced one of the Consul cluster nodes, but the issue persisted.
  2. Suspected Increased Traffic: The team replaced all Consul cluster nodes with more powerful machines featuring 128 cores (2x increase) and faster NVMe SSD disks. This also did not resolve the issue.
  3. Resetting Consul’s State: The team shut down the entire Consul cluster and restored its state using a snapshot from a few hours before the outage began. Initially, the system appeared stable, but it soon degraded again, returning to an unhealthy state.
  4. Reducing Consul Usage: Roblox services that typically had hundreds of instances running were scaled down to single digits. This approach provided temporary relief for a few hours before Consul again became unhealthy.
  5. Identifying Resource Contention Issues: Upon deeper inspection of debug logs, the team discovered resource contention problems. They reverted to machines similar to those used before the outage. Eventually, they identified the issue: Consul’s new streaming feature. This feature utilized fewer concurrency control elements (Go channels), which led to excessive contention on a single Go channel under high read/write loads. Disabling streaming dramatically improved the Consul cluster’s health.
  6. Leader Election Optimization: The team observed Consul intermittently electing new cluster leaders, which is normal behavior. However, some newly elected leaders exhibited the same latency issues, so the team worked around this by preventing the problematic leaders from staying elected.

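Step 5 above is worth pausing on. The contention pattern can be sketched in a few lines of Go: when many producer goroutines funnel events through one shared channel, every send serializes on that channel's internal lock. This is a hypothetical illustration of the general pattern, not Consul's actual streaming code; the function name `fanInSingle` is mine.

```go
package main

import (
	"fmt"
	"sync"
)

// fanInSingle spawns many producers that all send events through ONE
// unbuffered channel. Under heavy load, every send contends on the same
// channel, which becomes a serialization point — the shape of the problem
// Consul's streaming feature hit by using fewer channels for coordination.
func fanInSingle(producers, eventsEach int) int {
	ch := make(chan int) // single shared channel: the contention point
	var wg sync.WaitGroup
	for p := 0; p < producers; p++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for i := 0; i < eventsEach; i++ {
				ch <- 1 // every producer blocks on the same channel lock
			}
		}()
	}
	go func() { wg.Wait(); close(ch) }()
	total := 0
	for v := range ch {
		total += v
	}
	return total
}

func main() {
	// 8 producers x 1000 events, all squeezed through one channel.
	fmt.Println(fanInSingle(8, 1000)) // 8000
}
```

The code is functionally correct either way; the cost only shows up as throughput collapse at high producer counts, which is exactly why this class of bug survives light testing and surfaces under production load.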
✅ Following these measures, the system was finally stable. The team carefully brought Roblox back online by restoring caching systems and gradually allowing randomly selected players to reconnect. After 73 hours, Roblox was fully operational.

🔍 BoltDB’s Freelist: A Silent Culprit

The root cause of the outage stemmed from two key issues: the Consul streaming feature (as mentioned above) and Consul’s underlying database BoltDB exhibiting severe performance degradation, which I find pretty interesting.

Consul uses Raft consensus for leader election to ensure data consistency in a distributed environment. For persistence, it relies on BoltDB, a popular embedded key-value store, to store Raft logs.

Like many databases, BoltDB has a freelist, which tracks free pages—disk space that was previously occupied but is now available for reuse. This mechanism is crucial for efficient database performance, preventing unnecessary disk growth and optimizing read/write operations.

However, BoltDB’s freelist implementation had a critical inefficiency (see source code here). It used an array to store the ID of each free page, meaning that every database read/write operation involved a linear scan of the freelist. As the freelist grew large, the cost of operations increased significantly.
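To make the cost concrete, here is a simplified sketch in the spirit of BoltDB's array-backed freelist: free page IDs live in a sorted slice, and allocating n contiguous pages requires a linear scan over the whole slice, so every allocation is O(size of freelist). This is an illustration, not BoltDB's actual code.

```go
package main

import "fmt"

// allocate finds n contiguous free page IDs in the sorted slice ids,
// removes them, and returns the first ID of the run plus the updated
// freelist. It returns 0 when no run of n pages exists. Note the full
// linear scan — with millions of free pages, this scan runs on every
// allocation, which is the inefficiency that bit Roblox.
func allocate(ids []uint64, n int) (uint64, []uint64) {
	var prev uint64
	start := 0
	for i, id := range ids { // O(len(ids)) scan over every free page ID
		if prev == 0 || id != prev+1 {
			start = i // contiguity broken; restart the candidate run here
		}
		if i-start+1 == n { // found n consecutive page IDs
			first := ids[start]
			ids = append(ids[:start], ids[start+n:]...)
			return first, ids
		}
		prev = id
	}
	return 0, ids
}

func main() {
	ids := []uint64{3, 4, 5, 9, 10, 11, 12}
	first, rest := allocate(ids, 3)
	fmt.Println(first, rest) // 3 [9 10 11 12]
}
```

When the freelist holds millions of page IDs, as it did on Roblox's Consul servers after months of Raft log churn, paying this scan on every write is what ground BoltDB to a halt.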

📈 Interestingly, this performance issue was first reported in 2016 (GitHub Issue), but it was never fixed. The author of BoltDB, Ben Johnson, stopped maintaining the project in 2017, stating:

"Maintaining an open-source database requires an immense amount of time and energy. Changes to the code can have unintended and sometimes catastrophic effects, so even simple changes require hours of careful testing and validation.

Unfortunately, I no longer have the time or energy to continue this work. Bolt is in a stable state and has years of successful production use. As such, I feel that leaving it in its current state is the most prudent course of action."

Although BoltDB was no longer maintained, the Go community forked it into a new project called bbolt (bbolt GitHub) to continue active maintenance and add new features. Unfortunately, Consul was still using the outdated, unmaintained version of BoltDB.

In 2019, the freelist performance issue was finally resolved in bbolt (Alibaba Cloud Blog). The fix was conceptually straightforward: use a hashmap instead of an array, eliminating the linear scan and allowing near-constant-time lookups. ( 🚀 I love how a simple idea brings a huge performance boost!)
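The shape of the fix can be sketched as follows: index free runs in maps keyed by run size, so finding a run of n contiguous pages becomes a map lookup instead of a scan. This is a heavily simplified illustration of bbolt's hashmap freelist; the type and method names are mine, and real bbolt also handles splitting larger runs, merging on free, and more.

```go
package main

import "fmt"

// hashmapFreelist indexes free runs by their size:
// run size -> set of starting page IDs of runs with that size.
type hashmapFreelist struct {
	freemaps map[int]map[uint64]struct{}
}

// addSpan records a run of `size` contiguous free pages starting at `start`.
func (f *hashmapFreelist) addSpan(start uint64, size int) {
	if f.freemaps[size] == nil {
		f.freemaps[size] = make(map[uint64]struct{})
	}
	f.freemaps[size][start] = struct{}{}
}

// allocate returns the start of a free run of exactly n pages, or 0 if none.
// (Real bbolt also splits larger runs when no exact match exists; omitted
// here for brevity.)
func (f *hashmapFreelist) allocate(n int) uint64 {
	for start := range f.freemaps[n] { // expected O(1): lookup by run size
		delete(f.freemaps[n], start)
		return start
	}
	return 0
}

func main() {
	f := &hashmapFreelist{freemaps: make(map[int]map[uint64]struct{})}
	f.addSpan(3, 3) // pages 3,4,5 are free
	f.addSpan(9, 4) // pages 9..12 are free
	fmt.Println(f.allocate(3)) // 3
}
```

The trade-off is a more complex data structure (and more bookkeeping when runs are split and merged) in exchange for dropping the per-allocation cost from O(freelist size) to roughly O(1).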

Since this fix was committed to bbolt—not BoltDB—Consul did not benefit from the improvement, ultimately leading to the three-day Roblox outage in 2021.

🤔 Unanswered Questions

This post has covered a lot, but there’s only so much that can fit into a single post. As an engineer, I find myself eager to explore further details. Several intriguing questions remain unanswered:


Why didn’t Roblox roll back Consul’s streaming feature sooner?

Given that Consul was clearly the culprit early on and a significant change had just been made to its infrastructure, rolling back should have been one of the first things attempted. What factors delayed this decision?


Why did only some Consul servers experience the BoltDB freelist performance issue?

In theory, all servers should have been in a similar state since the leader is usually ahead of its followers only by a small margin. Yet, only some instances suffered severe degradation. What caused this inconsistency?


Why didn’t restoring Consul’s state using a previous snapshot fix the issue?

My hypothesis is that restoring Consul's state did not rebuild BoltDB's underlying raft.db file on each server, meaning the bloated freelist persisted even after the rollback. If true, this suggests that restoring a snapshot recovers logical state but leaves internal database structures, such as the freelist, untouched.


Why did reducing Consul usage work temporarily before failing again?

If the freelist was already too large, reducing usage shouldn’t have provided any relief. Did scaling down slow the growth of the freelist temporarily, delaying the inevitable, or was another factor at play?


Why did the new streaming feature work for a day before the outage occurred?

If the new Consul streaming feature was inherently flawed, why didn’t the system immediately degrade? Was there an initial buffer that temporarily masked the issue, or did a specific traffic pattern trigger the breakdown?


Since the BoltDB freelist performance issue has been around for years, why didn’t Roblox experience such system performance degradation in earlier months?

BoltDB’s freelist inefficiency had been a known issue since 2016. What changed in Roblox’s workload or data structure that made this issue surface now? Did the new Consul streaming feature exacerbate the problem by dramatically increasing write operations to BoltDB?

💡 Endnotes

This post-mortem report offers invaluable lessons, and I highly encourage everyone to check it out! There are also extensive discussions about this outage on Hacker News, where BoltDB's author, Ben Johnson, joined the conversation.

As a software engineer, I firmly believe that a key trait of a great engineer is the ability to efficiently navigate large, complex systems and diagnose issues under pressure. I deeply admire and respect the engineers from Roblox and HashiCorp, who worked tirelessly under immense pressure to investigate and resolve the issue. Hats off to them for their resilience and expertise.

Thank you for reading this far! If you found this post useful, I’d truly appreciate it if you could share it with others.