Facebook has provided more information on the reason for its global outage yesterday, which lasted six hours and likely cost the company tens of millions of dollars in lost revenue. The root cause: a bug in a software program that was supposed to identify and block commands that could accidentally take systems offline.
In a blog post, Santosh Janardhan, Facebook’s vice president of infrastructure, said that the outage was triggered by engineers carrying out maintenance on its global backbone, which is made up of tens of thousands of miles of fiber-optic cables and numerous routers that connect the company’s data centers around the globe.
Some of those centers are tasked with linking the backbone to the wider internet. When users open one of the company’s apps, content is channeled to them via the backbone from Facebook’s largest data centers, which house millions of servers. The backbone requires frequent maintenance work, such as testing routers or replacing fiber cables.
It was during one of these routine sessions that disaster struck. According to Janardhan, a software command was issued that was meant to test the availability of global backbone capacity. Facebook has developed special audit software that checks that such commands won’t cause chaos, but this time it failed to spot that the instruction was flawed. (Facebook hasn’t yet said exactly what was wrong with it.)
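Facebook hasn’t described how that audit tool works, but the general idea is a pre-flight check that refuses to run a change deemed too risky. Here is a minimal, hypothetical sketch in Python; the function names, link names, and the rule that no single command may take more than a set fraction of backbone links offline are illustrative assumptions, not Facebook’s actual implementation.

```python
# Hypothetical pre-flight audit: block any command that would take
# too much of the backbone offline at once. Purely illustrative.

def audit_command(links_to_disable: set[str],
                  all_backbone_links: set[str],
                  max_fraction_offline: float = 0.25) -> bool:
    """Return True if the command looks safe to run, False if it should be blocked."""
    if not links_to_disable <= all_backbone_links:
        raise ValueError("command references unknown backbone links")
    remaining = all_backbone_links - links_to_disable
    # Reject the change if it would disable more than the allowed fraction of links.
    return len(remaining) / len(all_backbone_links) >= 1 - max_fraction_offline

if __name__ == "__main__":
    backbone = {"link-a", "link-b", "link-c", "link-d"}
    print(audit_command({"link-a"}, backbone))   # True: small, contained change
    print(audit_command(backbone, backbone))     # False: would isolate everything
```

The failure described above would correspond to a check like this returning the wrong answer, or not being applied to the command at all.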
The result was a cascade of failures. The rogue command took down all of the backbone’s connections, effectively isolating the company’s data centers from one another. That, in turn, triggered an issue with Facebook’s Domain Name System (DNS) servers. The DNS is like the phonebook of the internet. It translates the website names typed into a browser into numerical labels, known as IP addresses, that other computers can recognize.
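A short, runnable illustration of what that lookup step does, using Python’s standard library (the hostname is just an example):

```python
import socket

hostname = "www.facebook.com"
# Ask the operating system's resolver to translate the name into IP addresses.
addresses = {info[4][0] for info in socket.getaddrinfo(hostname, 443)}
print(f"{hostname} resolves to: {', '.join(sorted(addresses))}")
# If no DNS server answers for the domain -- as happened during the outage --
# this call raises socket.gaierror instead of returning any addresses.
```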
DNS servers for large companies are typically associated with a set of IP addresses that are advertised to the rest of the internet via a system known as Border Gateway Protocol, or BGP. This is akin to an electronic postal system that chooses the most efficient route to send messages across the many different networks that make up the internet.
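A deliberately simplified sketch of that route-choosing idea: neighboring networks advertise paths to a destination, and a router prefers the shortest one. Real BGP weighs many more attributes, and the AS paths below are made up (the prefix and AS number 32934 are shown only as examples associated with Facebook).

```python
from typing import Dict, List

# Advertised routes: destination prefix -> candidate paths (lists of network numbers).
advertisements: Dict[str, List[List[int]]] = {
    "157.240.0.0/16": [
        [64500, 64511, 32934],   # a longer detour
        [64501, 32934],          # a shorter path, which gets preferred
    ],
}

def best_path(prefix: str) -> List[int]:
    """Pick the advertised path with the fewest network hops."""
    candidates = advertisements.get(prefix)
    if not candidates:
        raise LookupError(f"no route advertised for {prefix}")
    return min(candidates, key=len)

print(best_path("157.240.0.0/16"))   # -> [64501, 32934]
```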
Going dark
If those BGP advertisements disappear, other computers can’t find the DNS servers, and thus can’t reach the networks behind them. Facebook had programmed its DNS servers to stop BGP advertisements if the servers were cut off from its data centers, which is what happened after the rogue command was issued. That effectively stopped the rest of the internet from finding Facebook’s servers and its customers from accessing its services.
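In other words, the DNS servers were built with a failsafe: if they can no longer reach any data center, they stop announcing their routes. A hedged sketch of that behavior, with made-up names and a placeholder documentation prefix rather than Facebook’s real addresses:

```python
# Illustrative failsafe: withdraw the BGP advertisement when every
# backbone health check fails. Not Facebook's actual code.

advertised_prefixes = {"192.0.2.0/24"}   # routes the DNS servers announce (placeholder)

def data_centers_reachable(health_checks: list[bool]) -> bool:
    """True if at least one backbone health check still succeeds."""
    return any(health_checks)

def update_advertisements(health_checks: list[bool]) -> None:
    # The failsafe: stop announcing the DNS prefixes when the backbone is gone.
    if not data_centers_reachable(health_checks):
        advertised_prefixes.clear()

update_advertisements([False, False, False])   # every backbone link is down
print(advertised_prefixes)                     # -> set(): the DNS servers have gone dark
```

Once the set of advertised routes is empty, the rest of the internet simply has nowhere to send its queries, which is what the outage looked like from the outside.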
Facebook engineers trying to fix the problem couldn’t connect to the data centers remotely because the backbone was no longer working, and the outage had also taken down the internal tools normally used to handle such emergencies. That meant engineers had to travel to the data centers and work on the servers in person. Because the centers and the servers in them are deliberately hard to access for security reasons, getting to them took time, which helps explain why the outage dragged on for so long.
What it doesn’t explain is how the software audit tool missed the problem in the first place, nor why Facebook’s network management strategy apparently didn’t involve segmenting at least some of its data centers with a backup backbone so they would not all go dark at the same time. There are plenty of other questions to answer, including whether engineering team members could have been located in ways that would have given them faster access to the data centers in a crisis.
Managing a global network the size and complexity of Facebook’s is undoubtedly one of the hardest technical challenges that any company has ever had to face, so any further findings and lessons that emerge from yesterday’s events will be of immense value to CIOs and businesses everywhere.