What happens when online games go down for maintenance

The studio goes into lockdown. The lights dim. Programmers frantically type code and run between clusters of desks. Silent warning lights pulse in the bright server room as coveralled engineers slide out slabs of silicon and perform delicate operations on them, wiping beads of sweat from their brows.

Then, with seconds to go, everyone finally completes their tasks, and the Chief Engineer pushes a large lever back to the ONLINE position. The lights flicker and everyone catches their breath, waiting for the server’s computerised voice to confirm maintenance has been successful. Then everyone cheers. The May update is complete. Well done, everybody.

That’s the scene I imagine when a game goes into maintenance. I suppose I need to feel there’s the kind of high-stakes drama going on in some distant office or server farm that justifies the fact that I can’t play the game I want to. And obviously, it’s not what really happens. Keeping online services for big games running is a slick and controlled business, honed through years of experience and best practices, and necessitated by the expectations of thousands or millions of players and millions of dollars of investment.

"In theory, the number of things that could go catastrophically wrong is rather terrifying," says Glen Miner, technical director of Warframe. But in the face of all that threat, developers work hard to minimise it. "The last update we deployed only took 26 seconds."

Warframe receives scheduled weekly updates, but new features and fixes also ship as soon as they’re completed. Overall, the game is updated several times a week, and sometimes several times a day, in a process so sharply honed that it’s almost always completed within two minutes.

"The most common thing we do is upgrade the server software to match updates to the game," says Miner. "This involves pushing server code, restarting scripts that keep the world alive, and enabling new content." Digital Extremes aims to give players new stuff as soon as possible, so the bulk of its maintenance consists of making small changes as soon as they’re tested and ready.

Less frequently, maintenance is about hardware. "Particularly SSDs," says Miner. "A few years ago we had server problems right before Christmas that we traced back to an SSD that had become exhausted by the firehose we had been blasting it with. We had to perform some emergency upgrades while the holiday load got heavier and heavier which was extremely stressful."

For Worlds Adrift, Bossa Studios’ physics sandbox MMO in which players sail airships and swing around on grapple hooks, maintenance is carried out daily, and it’s all about preventing the world from getting out of control. 

Most MMOs avoid physics and having persistent objects rolling around, because they’re incredibly difficult to govern over a network. But Worlds Adrift is not your usual MMO.

"Everything players do in the world is remembered by the game," says lead developer Tristan Cartledge. "If a player cuts down a tree or destroys a ship on an island, the remnants of their actions will persist in the world until another player or a natural phenomenon, like a storm, comes to disturb that state. Because we are storing all this information, the size of the data required to record this can grow unbounded."

The longer the game runs, the more memory its servers require to keep it going. Worlds Adrift’s regular maintenance therefore means taking the game offline for an hour, cleaning up a snapshot of the world, and running compression algorithms on its data to reduce or remove anything that isn’t important. Perhaps the game doesn’t really need to remember the exact position and rotation of a Thuntomite’s corpse or the amount of wood left in a log, and can estimate them instead. But important objects, such as ships, chests and living creatures, are left completely intact.
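To make the idea concrete, here is a minimal sketch of that kind of clean-up pass. Everything in it is an assumption for illustration: the `Entity` record, the `IMPORTANT_KINDS` set and the grid-snapping rule are hypothetical, and Bossa's actual data model and compression algorithms are not described in this article.

```python
from dataclasses import dataclass, replace
from typing import List, Tuple

# Hypothetical categories that must survive maintenance untouched.
IMPORTANT_KINDS = {"ship", "chest", "creature"}


@dataclass(frozen=True)
class Entity:
    kind: str
    position: Tuple[float, float, float]
    rotation: Tuple[float, float, float]
    amount: float = 0.0  # e.g. wood left in a log


def compact(entity: Entity, grid: float = 1.0) -> Entity:
    """Return a cheaper-to-store approximation of an unimportant entity."""
    if entity.kind in IMPORTANT_KINDS:
        return entity  # important objects are left completely intact
    # Snap the position to a coarse grid and forget the exact rotation.
    snapped = tuple(round(c / grid) * grid for c in entity.position)
    return replace(
        entity,
        position=snapped,
        rotation=(0.0, 0.0, 0.0),
        amount=float(round(entity.amount)),  # estimate instead of exact value
    )


def compact_snapshot(entities: List[Entity], grid: float = 1.0) -> List[Entity]:
    """Apply the approximation to every entity in a snapshot."""
    return [compact(e, grid) for e in entities]
```

The trade-off is the one the paragraph above describes: a log that has drifted a few centimetres is not worth the bytes, while a player's ship is.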

This process is now entirely automatic. Worlds Adrift’s system even flies bots out into the world to test things, for instance cutting down a tree to confirm that physics is active. In fact, the development team wouldn’t know anything about what’s going on unless a bot spots that something’s up and sends out an alert.
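A post-maintenance check of this sort might be structured roughly as follows. This is only a sketch under stated assumptions: the check names and the `alert` callback are hypothetical, and the article does not say how Bossa's bots are actually built.

```python
from typing import Callable, Dict, List


def run_smoke_checks(
    checks: Dict[str, Callable[[], bool]],
    alert: Callable[[str], None],
) -> List[str]:
    """Run each named check; alert once if any of them fail.

    A check that raises an exception counts as a failure.
    Returns the list of failed check names.
    """
    failures: List[str] = []
    for name, check in checks.items():
        try:
            ok = check()
        except Exception as exc:
            ok = False
            name = f"{name} ({exc})"
        if not ok:
            failures.append(name)
    if failures:
        # Humans only hear about it when something is wrong.
        alert("post-maintenance checks failed: " + ", ".join(failures))
    return failures
```

A hypothetical call might pass `{"cut_down_tree": bot.cut_down_tree}` as `checks` and a function that pages the on-call developer as `alert`. When every check passes, nobody is notified at all.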

Not that Bossa hasn’t experienced some weird problems. A while back, Worlds Adrift had a bug in which the fuel pod item wasn’t spawning into the world correctly. While they worked on a proper fix, the team spawned them manually during maintenance but failed to take into account the fact they wouldn’t all be harvested between one day and the next. As the days passed, the number of fuel pods in the world after maintenance grew and grew until they had hundreds of them on each floating island. "It made them look very much like strange sorts of hedgehog," says Cartledge.

"The only stress comes from a low-level worry that something will go wrong during maintenance which could result in a snapshot being corrupted," says Cartledge. In that case, the team has to roll the world back to the last good snapshot, which could mean anything between 10 minutes and a whole day of lost progress, depending on what happened. Not ideal.
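One common way to decide which snapshot counts as "good" is to store a checksum alongside each one and walk backwards until a snapshot verifies. The sketch below assumes that approach; the `Snapshot` record and the use of SHA-256 are illustrative choices, not something the article attributes to Bossa.

```python
import hashlib
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Snapshot:
    taken_at: float  # timestamp, in seconds
    data: bytes      # serialized world state
    checksum: str    # SHA-256 hex digest recorded when the snapshot was written


def make_snapshot(taken_at: float, data: bytes) -> Snapshot:
    """Create a snapshot and record the digest of its contents."""
    return Snapshot(taken_at, data, hashlib.sha256(data).hexdigest())


def is_intact(snapshot: Snapshot) -> bool:
    """A snapshot is intact if its data still matches the recorded digest."""
    return hashlib.sha256(snapshot.data).hexdigest() == snapshot.checksum


def last_good_snapshot(snapshots: List[Snapshot]) -> Optional[Snapshot]:
    """Return the most recent intact snapshot, or None if none verify."""
    for snap in sorted(snapshots, key=lambda s: s.taken_at, reverse=True):
        if is_intact(snap):
            return snap
    return None
```

The gap between the corrupted snapshot and the one this returns is exactly the lost progress described above, which is why frequent snapshots keep a rollback cheap.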

For Digital Extremes, updating Warframe is similarly stress-free, aside from dealing with dead hard drives at Christmas. "The most stress comes from problems that are outside our control," says Miner, remembering situations in which the whole game was at the mercy of network issues affecting their own suppliers. "In cases like that we’re practically helpless and it’s extremely stressful."

Fatshark, maker of the Vermintide series, have offloaded the stress of maintenance entirely. For the first Vermintide game, they built their own backend platform, which needed regular maintenance. "That took quite some effort from our IT team," says CEO Martin Wahlund. So for Vermintide 2, they turned to a third-party company called Playfab to take care of all the game’s online services so Fatshark can focus on development.

Playfab even performs maintenance without needing to take the game offline, so Fatshark doesn’t have to worry about keeping players abreast of day-to-day fixes.

Digital Extremes is also able to do live updates with most of Warframe’s maintenance. Some updates flow out to its datacenters ahead of release so they’re all ready for when the team flips the switch. Many updates simply happen in the background, with the only effect on players being that they can’t save until they’re complete.

A minority of software or hardware upgrades might require servers to be taken offline, but even here, players can keep playing. Since Warframe runs on clusters of servers, the team can take a node out of service, tend to it, and then add it back into the pool without players noticing.
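The node-by-node approach can be sketched as a simple loop. This is a simplified illustration, not Digital Extremes' tooling: the `service` callback and the `min_in_service` capacity floor are assumptions, and a real system would also drain players from a node before touching it.

```python
from typing import Callable, List


def rolling_maintenance(
    pool: List[str],
    service: Callable[[str], bool],
    min_in_service: int,
) -> List[str]:
    """Service nodes one at a time while keeping enough capacity online.

    `pool` is modified in place. Nodes whose servicing fails are left
    out of the pool and returned so that someone can investigate.
    """
    failed: List[str] = []
    for node in list(pool):  # iterate over a copy; the pool changes as we go
        if len(pool) - 1 < min_in_service:
            raise RuntimeError("not enough capacity to take another node out")
        pool.remove(node)      # take the node out of service
        if service(node):      # tend to it (upgrade, swap hardware, ...)
            pool.append(node)  # add it back into the pool
        else:
            failed.append(node)
    return failed
```

Because only one node is ever out at a time, total capacity dips slightly rather than dropping to zero.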

Given that these tasks reduce the capacity of the system, Digital Extremes schedules them for times of the day when there’s less activity. Bossa schedules its regular maintenance in the same way, depending on whether the servers are in Europe or the US. "We attempt to do it as close as possible to off-peak but we still have to run maintenance during office hours for Bossa, so there are team members available to intervene if anything goes wrong."

Bossa schedules its updates around staff availability, too, particularly QA, who are there to check that everything runs correctly when the game goes live again. They can’t practically perform rigorous testing because it would take too long, but they can ensure Worlds Adrift’s most basic features still function, like physics, ship building, flying and character progression.

Naturally, QA will have already tested all of a game’s new features prior to release, so the period before maintenance is usually more fevered than maintenance itself. That’s certainly true for Digital Extremes. "Since we’re always trying to cram as many improvements as we can into each update, there’s usually a frantic sprint of ‘just one more change, please,’" says Miner.

"When we start the countdown and start running the scripts to make the changes, there’s a brief window of terrified calm while we wait to see if we missed anything," he continues. The maintenance script resets a leaderboard which details all of Warframe’s crashes, and the developers’ eyes lock on to it to see if the bugs they fixed stop appearing on it.

Then the community team fires up. "No matter how big your QA team is, your playerbase is usually thousands of times bigger and players can often be extremely helpful," says Miner. "Sometimes the most rare and unusual bugs can be fixed easily when community managers can get us diagnostics from players and so they’re often busy after an update, collecting and isolating problems the players have found."

But the real challenge isn’t so much the maintenance, nor even checking that it worked. Maintenance, paradoxically, is often the calm before the storm.

"The main issues with maintenance for Vermintide 1 have been when it was over and a lot of people tried to log in to the game at the same time," says Wahlund.

It’s the same for Warframe. "One of the things that’s been a regular challenge is dealing with an ever-increasing number of players hammering our servers waiting for the maintenance to be over," says Miner. Even though the team optimised downtime to just a few minutes, the sheer volume of network connections in that time was enough to overwhelm Digital Extremes’s systems. 

"Several times this year this stampede was even bad enough to trigger problems with our network partners. Luckily, we were able to upgrade a key network device and, with some clever configuration tricks, we’ve managed to practically eliminate this problem for now."
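Miner doesn’t detail those configuration tricks, and they sit on the network side. A different, widely used way to soften this kind of stampede from the client side is exponential backoff with jitter, where each waiting client retries after a random and steadily growing delay rather than hammering the server in lockstep. The sketch below illustrates that general technique only; it is not a description of Digital Extremes’ fix, and the `base` and `cap` values are arbitrary assumptions.

```python
import random
from typing import Optional


def reconnect_delay(
    attempt: int,
    base: float = 1.0,
    cap: float = 60.0,
    rng: Optional[random.Random] = None,
) -> float:
    """Seconds to wait before reconnect attempt number `attempt` (from 0).

    The upper bound doubles with each attempt, up to `cap`. The actual
    delay is drawn uniformly between 0 and that bound, so clients that
    were disconnected at the same moment spread their retries out.
    """
    source = rng or random  # allow a seeded generator for repeatable tests
    ceiling = min(cap, base * (2 ** attempt))
    return source.uniform(0.0, ceiling)
```

With these defaults, the first retry lands somewhere within a second, while the sixth and later retries are spread across a full minute.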

Maintenance is necessary, complex and dangerous. And that’s just the kind of challenge that inspires a company to work to make it as painless as possible for players—and for themselves. There’s a lot that’s magic about how games connect players and let them play together, but updating and fixing themselves while they’re still running has to be one of their cleverest tricks.