Words by Dominic West, Staff Writer
What happened with Facebook a fortnight ago? What is a DNS error? Why did it take so long to fix? Let’s take a look.
Facebook (the company) owns Facebook (the website), Instagram, Facebook Messenger, WhatsApp, Oculus and a few other smaller brands. To understand the magnitude of the outage, it’s important to first note that it was Facebook Inc. that went down, not simply the Facebook website. That’s why all the Facebook Inc. brands went down as well, and why they all went down globally.
Some internet basics: The internet is comprised of servers organised into clusters, and these servers each have their own IP addresses. When you visit a website, you connect to an endpoint server by hopping via various servers along the way (it’s more complicated than this, but that’s the basic idea). Because servers aren’t always online and the internet is a dynamic system, it’s important that some fancy computing magic takes you by the hand and leads you via the current correct route.
Facebook has a huge backbone network comprised of many thousands of servers and fibre optic cables. In addition to its various subsidiary companies, such as Instagram and WhatsApp, the data centres for its internal communications (company messages, access passes for the building, troubleshooting tools etc.) also rely on this backbone network.
To understand what went wrong, you’ll need to know two acronyms: DNS (Domain Name System) and BGP (Border Gateway Protocol). In layman’s terms, DNS converts a human-readable address (‘www.facebook.com’) into a server address (184.108.40.206), and BGP – using some computing magic – finds the best route via multiple other servers into the Facebook network to get to that destination server. In short, DNS provides the location and BGP provides the route.
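For readers who like to see ideas as code, here is a minimal Python sketch of that two-step process. It is a toy model, not how real DNS or BGP software works: the record table, the prefixes and the next-hop names (‘transit-provider-a’, ‘facebook-edge’) are all made up for illustration, and real BGP weighs many more factors than the single ‘most specific match’ rule used here.

```python
import ipaddress

# Hypothetical DNS records: name -> address (the address quoted in the article).
DNS_RECORDS = {"www.facebook.com": "184.108.40.206"}

# Hypothetical routing table: advertised address block -> where to send traffic next.
ROUTES = {
    ipaddress.ip_network("184.108.0.0/16"): "transit-provider-a",
    ipaddress.ip_network("184.108.40.0/24"): "facebook-edge",
}


def resolve(name):
    """DNS step: turn a name into an address, or None if the name is unknown."""
    return DNS_RECORDS.get(name)


def find_route(address, routes):
    """Routing step: pick the most specific advertised block containing the address."""
    ip = ipaddress.ip_address(address)
    matches = [net for net in routes if ip in net]
    if not matches:
        return None  # nobody has advertised a way to reach this address
    best = max(matches, key=lambda net: net.prefixlen)
    return routes[best]


address = resolve("www.facebook.com")   # DNS provides the location
next_hop = find_route(address, ROUTES)  # routing provides the way there
print(address, "->", next_hop)
```

If either step returns nothing, the connection fails: no address means nowhere to go, and no route means no way to get there.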
So, here’s what happened. Facebook, like other Silicon Valley companies, uses programs to audit parts of its infrastructure. In this case, for reasons that are currently unclear, a command issued to test the capacity of the backbone system inadvertently took down the entire network. To make matters worse, once the backbone network went down, Facebook’s DNS servers stopped advertising themselves via BGP. The DNS servers saw that Facebook’s data centres were unreachable, so instead of continuously advertising their addresses for BGP to provide the route, the DNS servers withdrew those advertisements and simply went quiet. Without those advertisements, there were now no BGP routes into the Facebook network. This effectively removed Facebook and all its subsidiaries from the internet. While the physical connections to the servers still existed, from the perspective of your computer or phone, the entire Facebook network did not exist.
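The chain reaction described above can be sketched in a few lines of Python. Again, this is a toy model under stated assumptions: the class name, the ‘dns-prefix-example’ label and the use of a plain set as a stand-in for the internet’s routing table are all hypothetical, and the sketch only illustrates the idea that a DNS site advertises itself while it can reach the data centres and withdraws when it cannot.

```python
class DnsSite:
    """Toy DNS server location that advertises its address block only while healthy."""

    def __init__(self, prefix):
        self.prefix = prefix

    def update_advertisement(self, backbone_up, routing_table):
        # Healthy: keep telling the world how to reach us.
        # Unhealthy: withdraw, so traffic is not sent to a site that cannot answer.
        if backbone_up:
            routing_table.add(self.prefix)
        else:
            routing_table.discard(self.prefix)


def reachable(prefix, routing_table):
    """An address block is reachable only if someone is advertising it."""
    return prefix in routing_table


routing_table = set()                 # stand-in for the routes other networks know
site = DnsSite("dns-prefix-example")  # hypothetical label, not a real prefix

site.update_advertisement(backbone_up=True, routing_table=routing_table)
before = reachable(site.prefix, routing_table)

site.update_advertisement(backbone_up=False, routing_table=routing_table)
after = reachable(site.prefix, routing_table)

print("reachable before outage:", before)
print("reachable after outage:", after)
```

Withdrawing is a sensible safety feature when one site fails, because traffic moves to the healthy ones. When every site withdraws at once, there is nowhere left for traffic to go.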
Additionally, Facebook engineers couldn’t access the usual methods for fixing the problem, because they couldn’t access their own network. You can’t fix something if you can’t find it, and without BGP providing a route in, the network could not be found. If the Facebook network were a car, the keys would have been locked in the boot. The car also contained the phone needed to call for help, and all the tools needed to retrieve the keys or make new ones.
Eventually, Facebook engineers managed to break into their own system (by physically gaining access to the data centres) and slowly got things back online. It took an unusually long time because data centres are designed to be difficult to physically access and manipulate. You don’t want such a vital system to be taken online or offline by simply opening a door and flicking a switch.
It was a very bad week for Mark Zuckerberg, with a whistle-blower testifying in Congress and on prime-time television about Instagram’s awareness of its social harms, and his net worth plummeting as a result of the network outage. Although there’s little doubt the Zuck will recover financially, this incident was for many a stark reminder of our dependence on Silicon Valley and the power it exerts on the daily lives of so many people.