A couple of weeks ago, Bungie was forced to take Destiny 2 offline for the bulk of a day to fix a bug in an update that caused widespread loss of glimmer and rare enhancement materials. Yesterday, it happened again. It didn't take quite as long to fix (because developers knew what was going on this time), but it was still several hours of downtime for a major online game.
You might wonder how this kind of thing happens in a well-established game being run by a large, experienced game studio. If so, that large, experienced game studio is here to explain: With the game back online, Bungie posted an unusually deep and detailed explanation of what went wrong, and what it's doing to avoid that kind of gong show in the future.
It's a long and complicated (but also legitimately interesting) tale. The short version, as it so often is, is that a bunch of small problems snowballed into a big problem, and then, well, mistakes were made. Then another, entirely unrelated problem caused the first one to re-emerge, which is how yesterday's mess came about.
It's all fixed up now, although not without another character rollback. (The second in Destiny 2's history, apparently—Bungie said that the rollback two weeks ago was the first.) To help prevent this particular issue (but not others, because that's the way she goes) from happening again, Bungie also specified seven "preventive measures" it's taking going forward:
- We have added further safeguards to our process for “hot-patching” our servers to ensure that they cannot start with an unexpected version. This change is in place as we spin up the game today.
- We have fixed the issue that caused a small fraction of WorldServers to crash on startup. This fix will be deployed with Season 10.
- The permanent fix for character corruption will be rolled into the next update as an executable change, removing the need for the configuration override. (Unfortunately the 18.104.22.168 Hotfix was too far along to benefit from this).
- Looking ahead, we are investigating ways to speed up our rollback and recovery mechanisms.
- In a future release, we will address the issue that can cause servers to skip loading configuration data.
- We will also add more protections to the login-account clean-up code, to help prevent future bugs from being introduced into such a critical area.
- We are updating our development methodologies to catch issues like this earlier in the release pipeline.
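For the curious, the first and fifth measures (servers that can't start with an unexpected version, and servers that can't skip loading configuration data) boil down to a startup guard. Here's a minimal sketch of the idea in Python. To be clear, this is not Bungie's code: the `ServerState` type, the `check_startup` function, the version strings, and the `character_fix` key are all hypothetical names made up for illustration.

```python
from dataclasses import dataclass
from typing import Optional


class StartupError(RuntimeError):
    """Raised when a server must not be allowed to start."""


@dataclass(frozen=True)
class ServerState:
    # Hypothetical fields, for illustration only.
    build_version: str                 # version the server binary reports
    config_overrides: Optional[dict]   # None means configuration was never loaded


def check_startup(state: ServerState, expected_version: str, required_keys: set) -> None:
    """Refuse to start unless the version and configuration are what we expect."""
    # Guard 1: never start on a version other than the one being rolled out.
    if state.build_version != expected_version:
        raise StartupError(
            f"unexpected version {state.build_version!r}, expected {expected_version!r}"
        )
    # Guard 2: fail loudly if configuration data was skipped entirely.
    if state.config_overrides is None:
        raise StartupError("configuration data was not loaded")
    # Guard 3: fail loudly if a required override is missing.
    missing = sorted(required_keys - set(state.config_overrides))
    if missing:
        raise StartupError(f"missing required configuration keys: {missing}")


# A server with the expected version and the required override starts cleanly.
check_startup(ServerState("1.0.1", {"character_fix": True}), "1.0.1", {"character_fix"})
```

The point of a guard like this is that a server in a bad state stops at startup instead of quietly running without a fix that lives in configuration.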
Downtime and rollbacks are frustrating, but the explainer gives us a little more insight than we normally get into how and why things can go so completely wrong, so quickly: from a "conceptually reasonable" fix several months ago that spun off "subtle side effects," to "what we thought was an impossible situation" that caused yesterday's problems.
"We know today’s outage and character rollback has been frustrating for you, especially with launch of Crimson Days, just as it’s been frustrating for us to realize that this is a problem we should have been able to avoid," Bungie said. "We’re sorry for the frustration and inconvenience this caused and will continue to work to prevent these kinds of things from happening again."