A casual stroll through LiveJournal's backend...

May 28, 2003 23:53

I haven't been posting here enough lately, mostly because it gets embarrassing and boring to repeatedly write: "Site's slow. We know. We're working on it."

Usually when things are slow, it's because of some big, known problem, and nobody has time to offer a good explanation that would make sense to the average reader except "Site's slow. We know. We're working on it.", which likely reads about the same to everybody as "We're stupid. We know. It'll be slow forever."

But it won't be slow forever. In fact, it's poised to become faster than ever... ridiculously fast, actually. I just haven't had time to explain what we're up to.

A good explanation is a technical explanation, but most people wouldn't understand it, so I'm going to try to write this starting generally and increasing in complexity. Maybe I'll throw in some lame analogies to make things clearer.

Site architecture
Okay... let's start with a description of the current site architecture. I'm sure this has been written about a dozen times, but it doesn't hurt to repeat every so often.

Load balancers
A web request (from your browser) comes in from the internet and goes to one of our two load balancers. Actually, only one is ever active at a time. The other is a hot stand-by, meaning it'll kick in automatically should the first one die. If things were to get really chaotic, they could also work together, but that should never be necessary: just one alone can handle many hundreds of times more traffic than LiveJournal generates.

Anyway, the load balancer looks at the request, decides if it's mail delivery, a web request, what type of web request, who it's from, what it's for, etc., and then delivers it to one of the multiple mail servers or one of the many webservers.
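
If you want the gist in code, here's a toy sketch (in Perl, since that's what LJ is written in) of the kind of decision the load balancer makes for each incoming connection. The pool names and rules are made up for illustration; this isn't our actual configuration.

    # Toy sketch of the routing decision; pool names and rules are invented.
    sub pick_pool {
        my ($port, $path) = @_;
        return "mail_pool"  if $port == 25;          # SMTP delivery -> mail servers
        return "image_pool" if $path =~ m!^/img/!;   # static-ish content
        return "proxy_pool";                         # everything else -> web proxies
    }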

The proxy web servers
Actually, we have two layers of web servers. The first layer, which the load balancer hands requests to, is the "proxy" servers. The proxy servers pass the web request on to the real servers (the int_web pool), via the load balancers again. The load balancer is constantly monitoring all the proxies and web machines, making sure they're running and available before sending them traffic.

Anyway, the proxy servers "buffer" the responses from the int_web machines. That is, they take the real response, hold it in memory, and slowly feed it to your slow Internet connection. And when I say slow, I mean slower than 50 MB/s, which includes you cable/DSL people. Basically, we can't bog down our int_web machines doing trivial tasks like slowly pushing numbers onto the Internet when they could otherwise be working on making pages.

The proxy servers are really simple and dumb. Because their logic is so small (read from here, write to there), each one can serve thousands of concurrent users. We have nine machines doing this, though some of those nine are doing other things as well. We basically have way more than we need, just in case machines die.
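
In spirit, here's a toy sketch of what one proxy does for a single request (not the real proxy software; $backend and $client stand in for sockets that are already open): slurp the whole response off the fast internal network, then dribble it out to the client at whatever pace the client can manage.

    # Toy sketch of buffering one response; not the actual proxy code.
    sub relay_response {
        my ($backend, $client) = @_;

        # Slurp the whole response over the fast internal network...
        my ($response, $buf) = ('', '');
        while (sysread($backend, $buf, 65536)) {
            $response .= $buf;
        }
        close $backend;   # the int_web process is now free to go build another page

        # ...then feed it to the (possibly very slow) client at its own pace.
        my $off = 0;
        while ($off < length $response) {
            my $wrote = syswrite($client, $response, length($response) - $off, $off)
                or last;
            $off += $wrote;
        }
        close $client;
    }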

The real web servers
Now, once a real web server gets a request from the proxy, it has to look at the URL, and anything you posted to it with a form, and figure out what to do with it. I'll skip over the details because you can go read the code if you're curious, but the gist of it is that the webservers usually have to get information from the databases, build the page, then send it off.

We have 20 machines or so doing this. All are dual-processor P3s or P4s with 512 MB to 1 GB of RAM. Some are doing other tasks as well.
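
For the curious, the rough shape of what an int_web process does per request looks like this. It's only a sketch: the URL pattern is real enough, but load_user(), recent_entries(), render_journal(), and error_page() are hypothetical stand-ins for the actual LJ code.

    # Rough, hypothetical shape of handling one request on an int_web machine.
    sub handle_request {
        my ($url, $form) = @_;

        my ($username) = $url =~ m!^/users/([0-9a-z_]+)/!
            or return error_page("unknown URL");

        my $u       = load_user($username);      # global cluster (or cache) lookup
        my $entries = recent_entries($u, 20);    # hits that user's own cluster
        return render_journal($u, $entries, $form);   # build the HTML and hand it back
    }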

Database servers
We have a bunch of database servers which are grouped into clusters. Each cluster has a master and 1 or more slaves, which replicate from their master. Replication usually happens in under a second, but when things get really bad, replication gets behind, and then users start to notice. (entries temporarily disappearing and then showing up again randomly, etc...)

There are two types of clusters: the global cluster, and user clusters.

The global cluster stores small, generic, non-user-specific stuff like friend relationships, permissions, and system styles. There's only one global cluster.

On the other hand, we have 4 user clusters; a syndication (RSS) user cluster (without a slave, so not really a cluster); a 6th user cluster on its way (the hardware is here, but we discovered during burn-in that the slave has a faulty drive); and a 7th cluster we've built for inactive users, to get their data off the fast, expensive 15K SCSI disks. (When users have been gone for six months or a year, we'll move them to the slow, inactive cluster.)

Each user is assigned to one of the clusters. The web servers know how to find a suitable database given a user object.
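
The mapping is roughly this shape (a sketch: the cluster table, hostnames, credentials, and function names are invented, not the real LJ API). The user row carries a cluster id, and the web code turns that into a connection to that cluster's master for writes or one of its slaves for reads.

    # Illustrative sketch of turning a user object into a database handle.
    use DBI;

    my %CLUSTERS = (
        1 => { master => 'dbi:mysql:lj;host=db1a', slaves => ['dbi:mysql:lj;host=db1b'] },
        2 => { master => 'dbi:mysql:lj;host=db2a', slaves => ['dbi:mysql:lj;host=db2b'] },
    );

    sub cluster_writer {    # writes always go to the user's cluster master
        my ($u) = @_;
        return DBI->connect($CLUSTERS{ $u->{clusterid} }{master}, 'lj', 'secret');
    }

    sub cluster_reader {    # reads can go to any slave in that cluster
        my ($u) = @_;
        my @slaves = @{ $CLUSTERS{ $u->{clusterid} }{slaves} };
        return DBI->connect($slaves[ rand @slaves ], 'lj', 'secret');
    }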

Why things can be slow
Things can be slow for two different reasons:

1) the webservers are blocking.
2) the database servers are blocking.

Blocking means "waiting for something".

(oh, and lisa wanted me to mention there are numerous 3rd reasons, like major net outages, but they're rare....)

Why the webservers block, reason #1
The web servers generally block when there's no process available to handle the request (so the request is sitting at the load balancer, waiting for somebody to open up).

Now, that problem's easy. As long as people buy paid accounts we have the money to buy new webservers and run more processes.

Until a few weeks ago we were artificially limiting the number of web server processes we were running, trying to limit the number of open connections between web servers and database servers.

See, we were under the false impression that it was beneficial to hold open connections to the databases, rather than closing them down when we're done. And because database connections take memory on the database side, we can only have so many open at a time.

But... the database software we use makes connections virtually instantaneous. Historically, database servers have taken a long time to connect to. That's no longer the case.

Sure, if you have one webserver and one database, or both on the same machine, caching database connections may get you a 1-2% speed increase, but it's just plain stupid for a site as big as ours. It's more important to run tons of web processes and have them connect to the databases only as needed, rather than keeping open connections to tons of machines you don't need all the time. This is all especially necessary in our case, where we have so many different types of database servers and any given web process is very unlikely to need the same databases it just used when it handles its next request. It's best to just shut them down and waste a few microseconds reconnecting later.
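
For the curious, the change boils down to the difference between these two patterns. This is just a sketch, not the actual LJ code; the DSN and credentials are placeholders.

    # Sketch of the two connection styles.
    use DBI;

    # Old style: hold the handle open across requests (the sort of thing
    # Apache::DBI-style connection caching gives you), tying up a connection
    # slot on the database whether or not this process needs it again.
    our %CACHED;
    sub get_dbh_cached {
        my ($dsn) = @_;
        return $CACHED{$dsn} ||= DBI->connect($dsn, 'lj', 'secret');
    }

    # New style: connect when needed, disconnect when done.  MySQL connections
    # are cheap enough that the few microseconds of setup don't matter.
    sub with_dbh {
        my ($dsn, $code) = @_;
        my $dbh = DBI->connect($dsn, 'lj', 'secret') or die DBI->errstr;
        my @ret = $code->($dbh);
        $dbh->disconnect;
        return @ret;
    }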

I take all the blame for not realizing this a year ago. I'd let you kick me for it, but I'm kicking myself enough lately.

All the popular codepaths have been cleaned up to use the new database connection style. We'll be working on the remaining low-traffic codepaths over the next few weeks.

Why the webservers block, reason #2
The database servers are blocking.

Why the database servers block, reason #1
Their disks can't move fast enough. Even with 15K SCSI disks (the best you can get), disks are just incredibly lame and slow, which is why sites like Google refuse to use them, preferring instead to keep everything in memory, which is thousands upon thousands of times faster.

Within a cluster, all database servers do the same writes. If you post a comment, it gets written to 2 or 3 database servers. And since those database servers also have RAID (redundant array of independent disks), you're actually writing to 4-6 disks. Writes are slow.

On the other hand, database reads (viewing journals and comments and whatnot) can be split between any machine in the cluster.

This is why users are on different clusters. We have to split the writes up. If we had one huge cluster, it'd always be doing writes and it'd never be able to read. (this is officially called "horizontal partitioning", I found out a year or so after we started doing it...)
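
To put made-up numbers on it: every box in a cluster has to replay every write for that cluster, so adding slaves only helps reads; only adding clusters spreads the writes out. A back-of-the-envelope sketch (the write rate here is invented):

    # Back-of-the-envelope sketch with an invented site-wide write rate.
    my $writes_per_sec = 200;
    for my $clusters (1, 2, 4, 8) {
        printf "%d cluster(s): every db in a cluster replays ~%d writes/sec\n",
            $clusters, $writes_per_sec / $clusters;
    }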

Why the database servers block, reason #2
MySQL supports multiple backends. The old, tried-and-true one is MyISAM. MyISAM is fast as hell but pretty braindead. It doesn't support transactions or any of the real things databases are supposed to do. The best part about it is that its data files are really small and that it's really fast, when all you're doing is reading.

The very bad thing about MyISAM is that a single write blocks all other writes and reads. That means if you're posting a comment, nobody else can read a comment on your cluster for the disk rotation or so it takes to post that comment. This is totally lame. With enough user clusters, however, we spread the writes out so they're minimal and the reads have enough time to catch up.

(BTW, replication lag happens when the slaves are getting too many reads and can't get around to their writes.)

Another popular table handler for MySQL is InnoDB which is essentially an entirely separate database wrapped inside MySQL. InnoDB does everything a real database is supposed to. Unfortunately, it's a lot slower than MyISAM and takes up a lot more space on disk. We've been using it for a long time for our directory search, but not for anything else. (the LJ code's recently been ported to work with either MyISAM or InnoDB though)

So, our main database problem is that when we get behind on buying new database clusters, our existing user clusters get overloaded and too many writes happen on them, making them unable to do anything else usefully.

Like I said earlier, we have a new database cluster available, but we're waiting on a replacement disk before we can put it live and start moving users to it, lessening the load on all the other databases.

Summary of problems
Now that we don't have to artificially limit the number of web servers we have running and we can just keep buying more as needed (as long as paid account revenue keeps up), the performance bottleneck shifts forever to the databases.

So, how to solve the database problem? Tons of ways!

Database fix #1: more machines
We could keep buying more database clusters, but they're really, really damn expensive, they take up a lot of space (we have 3 cabinets, but they're always filling up), and they require extra administration, so this option is the least desirable. (I'm already overworking lisa as it is!) Nevertheless, a new database cluster is going live soon, and buying more will be inevitable as we grow. We just want to minimize how often we have to do it.

Database fix #2: smarter software
The annoying thing about our databases lately is that they're blocking against themselves more often than they're doing real work. They have plenty of spare CPU and their disks aren't even working that hard, yet they're not accomplishing much. This is because of MyISAM.

We could move to InnoDB, and we're considering it, having ported all the LJ code to run on either MyISAM or InnoDB, but its track record isn't as good as MyISAM's, it takes up a ton of disk space, and it's not as fast, even under its ideal conditions.

An alternative is to keep using MyISAM and partition users up into different database names within a cluster. Basically, this is a lame hack to trick MyISAM into doing multiple things at once: we're telling it where the logical boundaries are.

More than likely our new database cluster will run partitioned MyISAM and we'll be converting all our old database clusters to use that method as well.
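
The idea looks something like this (a sketch; the partition count and naming scheme are hypothetical): hash each user to one of a handful of database names on the same MySQL server, so a write only locks the tables in that user's partition rather than the whole cluster's.

    # Hypothetical sketch of spreading users across several database names on
    # one MySQL box so MyISAM's table locks don't all pile up in one place.
    my $PARTITIONS = 10;    # invented number of partitions per cluster

    sub user_db_name {
        my ($userid) = @_;
        return sprintf "lj_p%02d", $userid % $PARTITIONS;
    }

    # A comment post for user 12345 then writes into that partition's tables
    # (e.g. something like lj_p05.talk), blocking only other lj_p05 users.
    print user_db_name(12345), "\n";    # lj_p05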

Database fix #3: don't use the databases!
If MyISAM can't do multiple things at once, and if the databases are so slow when they do get around to things (what with having slow moving parts like hard drives), why not just ignore the database and not use it at all?

Introducing the memory cache...

Memory Cache Daemon
What we're working on lately (this past week) is a hard-core distributed memory caching system. Basically, we're putting up a bunch of servers that do nothing but keep frequently-used LiveJournal objects in memory. Objects can be users, logins, colors, styles, journal entries, comments, site text, anything...

Any program that runs on a 32-bit computer (like our servers) can only address (or see) 4 gigabytes of memory, even if the server has more than that. But a server with more memory can run multiple processes, each using up to 4 GB. And we can run this memory cache on any number of machines.

Basically, given a unique object name or its corresponding owner's journal id, we run it through a hash function which consistently maps it to one of an essentially unlimited number of memory caches, each with its own 512 MB to 3 GB of memory, depending on the machine and its resources.
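
In miniature, the mapping works like this (a toy sketch; the server list is made up and the real client's hash function may differ). The point is that every web process independently agrees on which cache holds a given object.

    # Toy sketch of hashing a cache key onto one of N memcache instances.
    use String::CRC32 ();    # any stable hash would do; crc32 is just an example

    my @SERVERS = ('10.0.0.10:11211', '10.0.0.11:11211', '10.0.0.12:11211');

    sub server_for_key {
        my ($key) = @_;
        return $SERVERS[ String::CRC32::crc32($key) % @SERVERS ];
    }

    print server_for_key("user:bradfitz"), "\n";   # always the same server for this key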

We've been running this for the past few days and when it's working, it's so effective that the database servers are sitting around doing almost nothing.

Our goal is for most page views to be served from memory and we've been converting all the LJ code to use the memcache whenever possible.
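
The pattern in the web code is basically "check memory first, hit the database only on a miss, and stash the result for next time." Here's a sketch using the Perl client; the key name, the expiration time, and load_entry_from_db() are made up for the example.

    # Sketch of the cache-then-database pattern; helper and key names invented.
    use Cache::Memcached;

    my $memc = Cache::Memcached->new({
        servers => [ '10.0.0.10:11211', '10.0.0.11:11211' ],
    });

    sub get_entry {
        my ($journalid, $itemid) = @_;
        my $key = "entry:$journalid:$itemid";

        my $entry = $memc->get($key);
        return $entry if $entry;     # cache hit: no database touched at all

        $entry = load_entry_from_db($journalid, $itemid);   # miss: the slow way
        $memc->set($key, $entry, 3600);    # remember it for next time (an hour here)
        return $entry;
    }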

The problem is, we've still got some bugs in the memcache server. The first version, written in Perl, was really more of a prototype for testing the client APIs. The second version, written by avva in C, is insanely fast (more info). This new version works fine until it has allocated all its memory, at which point the memory space gets too fragmented and the memory allocator can't efficiently find places for things. We suspected this problem would come up, though, and we're rewriting the allocator using modern OS allocator techniques. (props to ff, jeffr, and taral for their advice)

In the next day or so the new memory cache server should be live and stable, at which time we'll be putting it on a lot more machines than we are now.

Once that's live, the site will be so fast the bottleneck will again be webservers, but we'll just add a ton more, now that we don't have to worry about maxing out database connections. (which won't even be used on most pages, with the distributed memory cache)

Conclusion
I know there are a lot of people out there saying we're lazy bums who don't know what we're doing. We kinda deserve it. The site has been slow on and off a lot lately. But, we are working hard on it, and hardly sleeping.

Although it may seem like no progress is being made, I haven't been this excited since we originally started clustering users. This is a major new step in our architecture.

With luck, everything should be flying soon and I can get back to working on all the fun stuff which I always put off when the site architecture needs work.

I apologize for all the confusion, slowness, and lack of information. Hopefully that'll all get better very soon here.

[Comments being left on, but I probably won't be able to read/reply to everything.... Also, please stay on topic.]