What happened to Facebook yesterday?

Oct 06, 2021 16:06


If you're sensible enough not to use Facebook, WhatsApp, or Instagram, or to have set up "log in with Facebook" on any site you use regularly, you might not have noticed that they all disappeared from the internet for about six hours yesterday. Or if you noticed, you might not have cared. But you might have read some of the news about it, and wondered what the heck BGP and DNS are, and what they had to do with it all.
And if not, I'm going to tell you anyway.
You're more likely to have heard of DNS: that's the Internet's phone book. Your web browser, and every other program that connects to anything over the Internet, uses the Domain Name System to look up a "domain name" like, say, "www.facebook.com", and find the numerical IP address that it refers to. DNS works by splitting the name into parts, and looking them up in a series of "name servers". First it looks in a "root server" to find the address of the Top-Level Domain (TLD) server that holds the lookup table for the last part of the name, e.g., "com". From the TLD server it gets the address of the "authoritative name server" that holds the lookup table for the next part of the name, e.g., "facebook", and looks there for any subdomains (e.g., "www").
(When you buy a "domain name", what you're actually buying is a line in the TLD servers that points to the DNS server for your domain. You also have to get somebody to "host" that server; that's usually also the company that hosts your website, but it doesn't have to be.)
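If you'd like to see that chain of lookups as code, here's a toy model in Python. It's a sketch, not a real resolver: each "server" is just a dictionary, the addresses are made-up ones from the ranges reserved for documentation, and `SERVERS`, `query`, and `resolve` are names I've invented for illustration. In this sketch every server is asked about the full name and replies with either the final answer or a referral to the next server down, which is the "line in the TLD servers" from the parenthetical above. Real resolvers send actual DNS messages over the network and handle many more record types.

```python
# Toy model of iterative DNS resolution (a sketch, not a real resolver).
# Each "server" is a dict keyed by name; all addresses are made up, taken
# from the ranges reserved for documentation (198.51.100.0/24, 203.0.113.0/24).
from typing import Dict, List, Tuple

Entry = Tuple[str, str]  # ("referral", next_server_ip) or ("answer", final_ip)

SERVERS: Dict[str, Dict[str, Entry]] = {
    "198.51.100.1": {  # root server: knows where each TLD's server is
        "com": ("referral", "198.51.100.2"),
    },
    "198.51.100.2": {  # "com" TLD server: knows each domain's name server
        "facebook.com": ("referral", "198.51.100.3"),
    },
    "198.51.100.3": {  # authoritative name server for facebook.com
        "www.facebook.com": ("answer", "203.0.113.35"),
    },
}
ROOT_SERVER = "198.51.100.1"


def query(server_ip: str, name: str) -> Entry:
    """Ask one server about a name; it replies with its most specific match."""
    table = SERVERS[server_ip]
    labels = name.split(".")
    for i in range(len(labels)):  # "www.facebook.com", "facebook.com", "com"
        suffix = ".".join(labels[i:])
        if suffix in table:
            return table[suffix]
    raise LookupError(f"{server_ip} has no entry for {name}")


def resolve(name: str, max_steps: int = 10) -> Tuple[str, List[str]]:
    """Start at the root and follow referrals until some server answers.
    Returns the address and the list of servers that were asked."""
    server, asked = ROOT_SERVER, []
    for _ in range(max_steps):
        asked.append(server)
        kind, value = query(server, name)
        if kind == "answer":
            return value, asked
        server = value  # referral: go one level down
    raise LookupError(f"too many referrals for {name}")


ip, asked = resolve("www.facebook.com")
print(ip, asked)
```

Looking up "www.facebook.com" asks the root, then the "com" server, then Facebook's own name server, in that order. Take that last server off the map and every lookup dead-ends at the TLD step.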
All this takes a while, so the network stack on your computer passes the whole process off to a "caching name server", which remembers every domain name it looks up for a length of time called the name's "time to live" (TTL). Your ISP has a caching name server they would like you to use, but I'd recommend telling your router (if you have full control over it) to use Cloudflare's or Google's name server, at the IP address 1.1.1.1 or 8.8.8.8 respectively. Your router will also keep track of the names of the computers attached to your local network.
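Here's roughly what a caching name server does, again as a toy Python sketch rather than real resolver code. `CachingResolver` and the `upstream` function it calls are hypothetical: `upstream` stands in for the full root-to-authoritative lookup and returns an address plus a TTL in seconds. The clock can be swapped out so that expiry can be tested without actually waiting.

```python
import time
from typing import Callable, Dict, Tuple


class CachingResolver:
    """Toy caching name server: remembers each answer until its TTL runs out."""

    def __init__(self, upstream: Callable[[str], Tuple[str, int]],
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.upstream = upstream  # hypothetical: full lookup, returns (ip, ttl_seconds)
        self.clock = clock        # swappable so expiry can be tested without waiting
        self.cache: Dict[str, Tuple[str, float]] = {}  # name -> (ip, expires_at)

    def lookup(self, name: str) -> str:
        now = self.clock()
        cached = self.cache.get(name)
        if cached is not None and now < cached[1]:
            return cached[0]           # still fresh: no upstream query needed
        ip, ttl = self.upstream(name)  # missing or expired: do the slow lookup
        self.cache[name] = (ip, now + ttl)
        return ip


# Usage with a fake upstream and a fake clock:
calls = []
fake_now = [0.0]


def fake_upstream(name: str) -> Tuple[str, int]:
    calls.append(name)
    return "203.0.113.35", 300  # made-up address, 5-minute TTL


resolver = CachingResolver(fake_upstream, clock=lambda: fake_now[0])
resolver.lookup("www.facebook.com")  # goes upstream
resolver.lookup("www.facebook.com")  # answered from the cache
fake_now[0] = 301.0                  # TTL has expired
resolver.lookup("www.facebook.com")  # goes upstream again
print(len(calls))                    # 2
```

In this sketch the cache only helps until an entry's TTL runs out. After that, the next lookup has to go all the way back to the authoritative server, and if that server can't be reached, the lookup fails.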
Finally, we get to the Border Gateway Protocol (BGP). If DNS is the phone book where you look up street addresses, BGP is the road map that tells your packets how to get there from your house, and in particular what route to take.
The Internet is a network of networks, and it's split up into "autonomous systems" (ASes), each of which is a collection of networks and routers run by a single organization, such as an ISP or a company like Facebook. Each AS exchanges messages with its neighbors, using BGP to determine the "best" route between itself and every other AS on the Internet. (The best route isn't always the shortest; each AS can also apply its own policies, such as preferring a neighbor that's cheaper to send traffic through.) BGP isn't entirely automatic; there's some manual configuration involved.
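Here's a very simplified Python sketch of how a router might choose among several routes to the same block of addresses. It keeps only the first two steps of BGP's usual decision process. First it prefers the highest "local preference", a policy number the AS's operators set themselves. Then it prefers the shortest AS path. Real routers go through several more tie-breakers after that. The `Route` class and `best_route` function are made up for illustration, and the AS numbers come from the range reserved for documentation.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Route:
    prefix: str            # block of addresses, e.g. "192.0.2.0/24"
    as_path: List[int]     # ASes the announcement passed through, nearest first
    local_pref: int = 100  # local policy knob: higher means "prefer this"


def best_route(candidates: List[Route]) -> Optional[Route]:
    """Simplified BGP choice: highest local_pref, then shortest AS path.
    Real routers apply several more tie-breakers after these two."""
    if not candidates:
        return None  # no route at all: the prefix is unreachable
    return max(candidates, key=lambda r: (r.local_pref, -len(r.as_path)))


# Two hypothetical routes to the same prefix (documentation AS numbers).
via_transit = Route("192.0.2.0/24", as_path=[64496, 64500])
via_peer = Route("192.0.2.0/24", as_path=[64497, 64498, 64500], local_pref=200)
chosen = best_route([via_transit, via_peer])
print(chosen.as_path)  # [64497, 64498, 64500]: longer, but preferred by policy
```

The longer path wins here because the operators gave that neighbor a higher preference (say, because it's cheaper). And when the list of candidates is empty, there's nothing to choose: the addresses simply can't be reached.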
What happened yesterday was that somebody at Facebook accidentally gave a command that resulted in all the routes leading to Facebook's data centers being withdrawn. In less than a minute Facebook's DNS servers noticed that their network was "unhealthy" and took themselves offline by withdrawing the BGP routes that pointed to them. At that point Facebook had basically shot themselves in the foot with a cannon.
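To make the DNS servers' self-sabotage concrete, here's a hypothetical Python sketch of that kind of health check. `AnnouncedRoutes`, `dns_health_check`, and `DNS_PREFIX` are invented for illustration, and the prefix is a documentation range, not one of Facebook's real address blocks. The point is just the shape of the rule: keep announcing the route to yourself only while you can reach the data centers behind you.

```python
from typing import Set

# Hypothetical prefix for the DNS servers (a documentation range),
# not one of Facebook's real address blocks.
DNS_PREFIX = "192.0.2.0/24"


class AnnouncedRoutes:
    """Toy model of the prefixes one AS is currently announcing via BGP.
    If a prefix isn't announced, the rest of the Internet has no route to it."""

    def __init__(self) -> None:
        self.prefixes: Set[str] = set()

    def announce(self, prefix: str) -> None:
        self.prefixes.add(prefix)

    def withdraw(self, prefix: str) -> None:
        self.prefixes.discard(prefix)

    def reachable(self, prefix: str) -> bool:
        return prefix in self.prefixes


def dns_health_check(routes: AnnouncedRoutes, can_reach_data_centers: bool) -> None:
    """Hypothetical health check run by a DNS server: keep announcing the
    route to itself only while it can reach the data centers behind it."""
    if can_reach_data_centers:
        routes.announce(DNS_PREFIX)
    else:
        routes.withdraw(DNS_PREFIX)


routes = AnnouncedRoutes()
dns_health_check(routes, can_reach_data_centers=True)   # a normal day
print(routes.reachable(DNS_PREFIX))                      # True
dns_health_check(routes, can_reach_data_centers=False)  # backbone gone
print(routes.reachable(DNS_PREFIX))                      # False
```

A rule like this is reasonable when only one site is broken, since other sites announcing the same prefix can pick up the queries. It backfires when every site fails the check at once, which is what happened here.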
Normally, engineers can fix server configuration problems like this by connecting to the servers over the internet. But Facebook's servers weren't connected to the internet anymore. To make matters worse, the computers that control access to Facebook's buildings -- offices as well as data centers -- weren't able to connect to the database that told them whose badges were valid.
Meanwhile, computers that wanted to look up Facebook or any of its other domains (like WhatsApp and Instagram) kept getting DNS failures. There isn't a good way for an app or a computer to determine whether a DNS lookup failure is temporary or permanent, so they keep retrying, sometimes (as Cloudflare's blog post puts it) "aggressively". Users don't usually take an error for an answer either, so they keep reloading pages, restarting their browsers, and so on, "sometimes also aggressively." DNS resolvers (the caching name servers described above) ended up handling about 30 times their usual number of queries, and traffic to alternatives like Signal, Twitter, Telegram, and TikTok nearly doubled.
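Some back-of-the-envelope arithmetic shows how retrying "aggressively" adds up. The `queries_during_outage` function below is a hypothetical model of a single client whose every lookup fails. It either retries after a fixed delay or, if `factor` is greater than 1, uses exponential backoff (each wait multiplied by `factor`, up to a cap), which is a common way for well-behaved clients to ease off. The parameters are illustrative, not measurements from the outage.

```python
def queries_during_outage(outage_s: float, first_delay_s: float,
                          factor: float = 1.0,
                          max_delay_s: float = float("inf")) -> int:
    """Count the lookups one client sends while every lookup fails.

    factor == 1 means a fixed wait between retries; factor > 1 means
    exponential backoff, each wait multiplied by factor up to max_delay_s.
    """
    if first_delay_s <= 0 or factor < 1:
        raise ValueError("delays must be positive and must not shrink")
    t, delay, count = 0.0, first_delay_s, 0
    while t < outage_s:
        count += 1  # send a query; it fails
        t += delay  # wait before the next try
        delay = min(delay * factor, max_delay_s)
    return count


SIX_HOURS = 6 * 60 * 60  # 21,600 seconds
fixed = queries_during_outage(SIX_HOURS, first_delay_s=1)
backoff = queries_during_outage(SIX_HOURS, first_delay_s=1, factor=2, max_delay_s=300)
print(fixed, backoff)  # 21600 80
```

Over a six-hour outage, retrying every second sends 21,600 queries, while starting at one second, doubling each time, and capping the wait at five minutes sends 80. Multiply the first figure by every phone and browser that was trying to reach Facebook, and a 30-fold jump stops looking surprising.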
Altogether a nice demonstration of Facebook's monopoly power, and great fun to read about if you weren't relying on it.

Another fine post from The Computer Curmudgeon (also at computer-curmudgeon.com).
Donation buttons in profile.

[Crossposted from mdlbear.dreamwidth.org, where it has comments. You can comment here, or there with openID, but wouldn't you really rather be on Dreamwidth?]

