Facebook and its various social media services, including Messenger, WhatsApp and Instagram, were out of action for almost six hours today. It’s still not clear what exactly triggered the outage, or why it took so long to get them back up. But tech experts say the nub of the issue looks like it could involve two crucial elements of internet infrastructure known as the Domain Name System, or DNS, and the Border Gateway Protocol, or BGP.
The DNS is like the phonebook of the internet. When someone types facebook.com or the name of any other website into a browser, the DNS converts that name into a long, punctuated string of numbers known as an IP address that computers can recognize. The DNS tells the browser the address of the site a user is looking for, and the browser then sends an electronic message to that address requesting the images and content of the site associated with it.
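For readers who want to see that lookup in practice, here is a minimal Python sketch. It assumes nothing beyond the standard library's `socket.getaddrinfo` call, which hands the name to whatever resolver the operating system is configured to use. The function name `lookup_addresses` is made up for this illustration.

```python
import socket


def lookup_addresses(hostname: str) -> list[str]:
    """Ask the operating system's resolver for a hostname's IP addresses."""
    results = socket.getaddrinfo(hostname, None, proto=socket.IPPROTO_TCP)
    addresses = []
    # Each result is (family, type, proto, canonname, sockaddr);
    # the address itself is the first element of sockaddr.
    for _family, _type, _proto, _canonname, sockaddr in results:
        if sockaddr[0] not in addresses:
            addresses.append(sockaddr[0])
    return addresses
```

On a connected machine, `lookup_addresses("facebook.com")` should return one or more addresses. If the name cannot be resolved, the call raises `socket.gaierror`, which is roughly what a browser runs into when DNS answers are unavailable.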
The system sounds relatively straightforward, but it involves an extensive hierarchy of servers. When a web address is entered, the search is masterminded by what’s known as a “recursive resolver”, which is a DNS server that kicks off a series of communications with others as it hunts for the correct IP address. When the address is identified, it alerts the browser. Recursive resolvers often store frequently accessed DNS records so they can serve them up in the blink of an eye.
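The chain of questions a recursive resolver asks, and the shortcut its cache provides, can be sketched with a toy model. Everything below is an assumption made for illustration: the three dictionaries stand in for root, top-level-domain and authoritative servers, the server labels are invented, and the address comes from a range reserved for documentation. A real resolver sends these queries over the network and expires cached records after a time limit.

```python
# Hypothetical stand-ins for the DNS hierarchy (not real server data).
ROOT = {"com": "tld-com"}
TLD_SERVERS = {"tld-com": {"example.com": "ns-example"}}
AUTHORITATIVE = {"ns-example": {"www.example.com": "192.0.2.10"}}


class RecursiveResolver:
    def __init__(self):
        self.cache = {}            # name -> address already looked up
        self.upstream_queries = 0  # how many servers we had to ask

    def _ask(self, server, key):
        """Simulate one query to another DNS server."""
        self.upstream_queries += 1
        return server[key]

    def resolve(self, name):
        # Serve repeat requests straight from the cache.
        if name in self.cache:
            return self.cache[name]
        labels = name.split(".")
        tld = labels[-1]                 # e.g. "com"
        domain = ".".join(labels[-2:])   # e.g. "example.com"
        # Walk down the hierarchy: root, then TLD, then authoritative server.
        tld_server = self._ask(ROOT, tld)
        name_server = self._ask(TLD_SERVERS[tld_server], domain)
        address = self._ask(AUTHORITATIVE[name_server], name)
        self.cache[name] = address
        return address
```

In this sketch the first call to `resolve("www.example.com")` costs three simulated queries, and a second call for the same name costs none because the answer is already cached.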
If DNS is the internet’s phonebook, BGP is its postal service. The internet isn’t a homogeneous network, but rather a massive collection of smaller ones known in tech jargon as autonomous systems. Each of these smaller networks is made up of a group of routers, usually run by a single organization, such as an internet service provider (ISP) like Verizon or AT&T. Data needs to be sent across these smaller networks and BGP is a set of rules for determining the most efficient path for routing it.
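One of the criteria BGP routers use when comparing paths is how many autonomous systems a route passes through. The Python sketch below shows only that single criterion and is a deliberate simplification: real routers weigh operator policy and several other attributes first. The `Route` class, the `best_route` function, the private-use network numbers and the documentation address block are all assumptions chosen for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Route:
    prefix: str               # block of addresses being advertised
    as_path: tuple[int, ...]  # autonomous systems the route passes through
    next_hop: str             # neighbor that advertised the route


def best_route(routes):
    """Pick the route that crosses the fewest autonomous systems."""
    return min(routes, key=lambda route: len(route.as_path))


# Two hypothetical advertisements for the same block of addresses.
advertised = [
    Route("203.0.113.0/24", (64512, 64513, 64514), "neighbor-a"),
    Route("203.0.113.0/24", (64520, 64514), "neighbor-b"),
]
chosen = best_route(advertised)
```

Here `chosen.next_hop` is `"neighbor-b"`, because that advertisement reaches the destination through two networks rather than three.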
Routes and routers
“BGP is a system by which internet firms channel messages across the web,” says Doug Madory of Kentik, which provides analysis and data on IT networks to businesses. “There’s a constant exchange of updates [between routers] about how to reach certain blocks of internet addresses.” Companies are responsible for monitoring the routes chosen for their addresses in conjunction with web groups such as the Internet Assigned Numbers Authority, or IANA.
Johannes Ullrich, head of research at SANS Technology Institute, says BGP is a complex protocol to administer and that it’s not uncommon for tech teams to make mistakes when making updates, which need to happen reasonably often as the router setups used by autonomous systems change.
Malicious activity can also be to blame. In 2008 a Pakistani ISP used BGP to block local users from visiting YouTube by deliberately directing traffic to a dead end. The move triggered a domino effect that ended up causing a widespread outage of YouTube for several hours. There have been other examples of “BGP hijacking”, including one in 2018 in which hackers deliberately diverted traffic meant for Amazon’s DNS to themselves, stealing a reported $100,000 in cryptocurrency as a result.
Initial reports of Facebook’s problems zeroed in on DNS issues, but Cricket Liu, an executive at network-monitoring company Infoblox, notes that it’s easy to conflate DNS issues with BGP ones. DNS servers may appear to be down, but that could in fact be the result of BGP routes blocking access to certain servers.
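To see why a routing problem can look like a DNS failure from the outside, consider this small Python sketch. It is a toy model built on assumptions: the `RoutingTable` class and `query_dns_server` function are invented for illustration, the addresses come from a range reserved for documentation, and it does not reflect any company's actual configuration. The DNS server in the model never stops working. It simply becomes unreachable once the route to it is taken away.

```python
import ipaddress


class RoutingTable:
    def __init__(self):
        self.routes = {}  # network -> next hop

    def announce(self, prefix, next_hop):
        """Record a route to a block of addresses."""
        self.routes[ipaddress.ip_network(prefix)] = next_hop

    def withdraw(self, prefix):
        """Remove a previously announced route."""
        self.routes.pop(ipaddress.ip_network(prefix), None)

    def next_hop_for(self, address):
        """Return the next hop for the most specific matching route, if any."""
        ip = ipaddress.ip_address(address)
        matches = [network for network in self.routes if ip in network]
        if not matches:
            return None
        most_specific = max(matches, key=lambda network: network.prefixlen)
        return self.routes[most_specific]


def query_dns_server(table, server_address):
    """Simulate a DNS query; it can only succeed if a route exists."""
    if table.next_hop_for(server_address) is None:
        return "timeout: no route to DNS server"
    return "answer received"


table = RoutingTable()
table.announce("198.51.100.0/24", "peer-a")
before = query_dns_server(table, "198.51.100.53")
table.withdraw("198.51.100.0/24")
after = query_dns_server(table, "198.51.100.53")
```

In this model `before` is `"answer received"` and `after` is `"timeout: no route to DNS server"`, even though nothing about the DNS server itself changed.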
It’s not clear yet what caused Facebook’s widespread and prolonged outage, but Cloudflare, a company that monitors and manages internet sites, reported early Monday that it had seen signs that Facebook’s BGP routes had been withdrawn from the internet, meaning that the directions for how to get to its DNS servers’ addresses were not available. Whatever the reason for this, the consequences caused big problems for the social media giant. “The duration of the outage [at Facebook] is arguably unprecedented for a company of this scale,” says Kentik’s Madory.