The "multiget hole" and how none of this is new

Oct 27, 2009 00:11

http://highscalability.com/blog/2009/10/26/facebooks-memcached-multiget-hole-more-machines-more-capacit.html

This has been making the rounds today... In my usual fashion I'm going to write an overly complicated post in response.

The basic claims of the memcached "multiget hole" are thus:

- If you are primarily using multigets to batch requests.
- Memcached is out of CPU.
- Adding a second memcached instance will split the batch request across both hosts.
- This will make things slower, since your multiget request gets split in half and hits both servers, instead of just one.

Let's break down this last claim in some detail, then discuss potential workarounds! Everybody put on your bathrobe and thinking cap.

Claim: A multiget, split into two, will be slower than a single multiget against a single server.

What's really happening: A multiget, as referenced here, is when you combine a fetch for several keys into a single request. Let's say in this exercise you are trying to fetch keys 'foo1' through 'foo100', in one single request. The process for a typical memcached client and server instance is (a rough sketch follows the list):

- Take the full list of keys requested.
- Hash each key individually against the list of memcached servers. If you have one server, they all go to the same place; if you have two, they are split between them.
- For each server that will get keys, issue a special multiget request against that server. For the ASCII protocol this looks like: `get foo1 foo2 foo3` to the first instance, and `get foo4 foo5 foo6` sent to the second instance. A single write will get multiple responses back. This is faster than doing them one at a time, since you would be waiting for a response between each get: "get foo1" (wait for response) "get foo2" (wait for response) etc.
- Wait for each server to respond, collect keys, return to caller.
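
Here's a rough, hypothetical sketch of that client-side batching in Python. The CRC32-modulo hash and the server list are just stand-ins for whatever your client actually does (consistent hashing, ketama, etc.), not any particular library's API:

```python
import zlib

servers = ["10.0.0.1:11211", "10.0.0.2:11211"]
keys = ["foo%d" % i for i in range(1, 101)]

# 1. Hash each key individually to pick the server that owns it.
batches = {}
for key in keys:
    server = servers[zlib.crc32(key.encode()) % len(servers)]
    batches.setdefault(server, []).append(key)

# 2. One ASCII-protocol multiget per server that owns any of the keys.
for server, server_keys in batches.items():
    request = "get " + " ".join(server_keys) + "\r\n"
    print(server, "<-", request.strip())
```

With one server in the list, this collapses back into a single `get` with all 100 keys; with two servers, you get two roughly half-sized requests.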

Let's break down the steps even more!

- For each multiget request issued, a *client* may either use a *blocking* or *non blocking* mode.
- In an optimized case, the client will issue a multiget against *both* servers *in parallel* and then call poll(2) (or similar) and wait for the responses (sketched below).
- In a non-optimized case, the client will issue multigets to each server in turn and wait for each response. libmemcached did this until recently, so you might be surprised if you look!
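
Here's a hedged sketch of what the optimized, parallel case looks like: write every per-server request up front, then wait on all of the responses at once. This isn't how any specific client is implemented, just the shape of it:

```python
import selectors
import socket

def parallel_multiget(batches):
    # batches maps (host, port) -> a ready-to-send ASCII "get k1 k2 ..." request.
    sel = selectors.DefaultSelector()
    responses = {}
    pending = set(batches)
    for addr, request in batches.items():
        sock = socket.create_connection(addr)
        sock.sendall(request.encode())          # issue every request first...
        sock.setblocking(False)
        sel.register(sock, selectors.EVENT_READ, addr)
        responses[addr] = b""
    while pending:                              # ...then wait on all servers at once
        for sel_key, _ in sel.select():
            sock, addr = sel_key.fileobj, sel_key.data
            chunk = sock.recv(65536)
            responses[addr] += chunk
            # Good enough for a sketch; a real client parses the protocol properly.
            if not chunk or responses[addr].endswith(b"END\r\n"):
                pending.discard(addr)
                sel.unregister(sock)
                sock.close()
    return responses
```

The non-optimized client does the same writes, but blocks reading each server's full response before talking to the next one, so it eats a full round-trip per server.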

On the server end (roughly sketched below):

- Read all keys requested.
- For each key, hash the key and look it up against the internal hash table.
- Load any valid items for return and...
- ... write them to the socket.

The binary protocol more closely combines all these steps, but the idea is the same.
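
To make those steps concrete, here's a toy sketch of the server's side of the exchange, with a plain dict standing in for memcached's internal hash table. The `VALUE <key> <flags> <bytes>` / `END` response format is the real ASCII protocol; everything else is simplified:

```python
cache = {"foo1": b"hello", "foo3": b"world"}    # stand-in for the internal hash table

def handle_get(request_line):
    keys = request_line.split()[1:]             # "get foo1 foo2 foo3" -> the key list
    out = []
    for key in keys:                            # one hash lookup per key
        value = cache.get(key)
        if value is not None:                   # only valid items get written back
            out.append("VALUE %s 0 %d\r\n" % (key, len(value)))
            out.append(value.decode() + "\r\n")
    out.append("END\r\n")                       # one terminator for the whole multiget
    return "".join(out)

print(handle_get("get foo1 foo2 foo3"), end="")
```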

What the hell are you getting at?

Well, my point is that slicing a multiget *shifts a tradeoff* more than it simply becomes more or less efficient. There is a certain amount of overhead for a *server* to read from a client and respond, but there is also a certain amount of effort for that server to look up each key in its hash table and build a response. Issuing a smaller multiget against a particular server will take *less* CPU time than a larger one, so adding servers does reduce CPU time on each server.

However, when the client has to issue separate writes to more servers, it is doing more complex work and will thus take longer and use more CPU time than if all of the requests were in a single write.

Hence, adding servers to a cluster *will* reduce the CPU usage on the cluster. The improvement is non-linear, but it will not make things *worse*. It can, however, negatively affect clients, and a bad client is especially affected if it has to wait for responses serially.

Part the next: The subtle issue

Depending on how large your multigets are, it may take less time to split them and issue them against multiple servers. How you handle this is *entirely up to you*; it requires testing, and can be affected by kernel tunables.

If you are issuing a multiget with many keys, or with very large responses, you will be more likely to run up against the (I hope I'm quoting this right) TCP window. After so many bytes, TCP needs to roundtrip an ACK packet to confirm that the remote end has received the preceding data. This window starts at a certain size, then expands or contracts depending on how you're using the connection.

This is why some downloads or uploads will start out a little slow, then rapidly speed up. It's also why connections over a laggy or long link might not go above, say, 40k per second, but you can open multiple connections and run them all at 40k/sec to the same server (see also: download accelerators).

This last example should illustrate what I mean here. Stuffing too much down the pipe at once will cause more roundtrips to the remote server for the TCP acks. If you split a large list and run the data nonblocking, in parallel, to multiple servers, it might take less time to issue the request, but will use more CPU on the client.

Part the next to last: The workarounds

With the above in mind, the typical workaround has been in use for ages. A long time ago, in a galaxy far far away, brad fitz (or someone over there, I'm not sure who) realized that fetching all cache keys for a livejournal profile is a trivial multiget of 10+ keys, and that it was stupid to issue this (relatively small) multiget across all of the memcached hosts.

So he added a set and fetch by master key mode to the perl client (Cache::Memcached at the time). When you issue a set or a get request in "by key" mode, you give any given key a second key. That is, the key your client uses to hash your data out to the list of memcached servers is different from the key you hand to memcached for storage. So:

- You assign a master key "dormando", to keys "dormando-birthday", "dormando-website", etc. This is bad key naming, but bear with me.
- Your client, instead of using "dormando-birthday" to decide where to store the keys, uses the key "dormando".
- Your client then sends *just* "dormando-birthday" and "dormando-website" to whatever server "dormando" hashed to.
- Memcached happily stores those keys, without any idea of what the master key was (you can't get it back).

Then you issue a multiget back, with the master key of "dormando". Both keys resolve to the *same* server, and the multiget hits a single host. With a single write, and ideally a single roundtrip.
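
Here's a hypothetical sketch of the "by key" idea, with made-up function names (`set_by_key`, `multiget_by_key`) and a fake hash, just to show that only the master key decides placement:

```python
import zlib

servers = ["10.0.0.1:11211", "10.0.0.2:11211"]

def server_for(master_key):
    # Only the master key is hashed; the real key never influences placement.
    return servers[zlib.crc32(master_key.encode()) % len(servers)]

def set_by_key(master_key, key, value):
    server = server_for(master_key)
    print("%s <- set %s = %r" % (server, key, value))

def multiget_by_key(master_key, keys):
    server = server_for(master_key)     # every key lands on the same server
    print("%s <- get %s" % (server, " ".join(keys)))

set_by_key("dormando", "dormando-birthday", "not telling")
set_by_key("dormando", "dormando-website", "http://example.com")
multiget_by_key("dormando", ["dormando-birthday", "dormando-website"])
```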

If I have a metric pantload of keys to fetch and don't want to issue them all to the same server (noting the above subtle issue), I can semi-intelligently split the master keys into "dormando-chunk1", "dormando-chunk2", etc. - it depends on what your app can handle.
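
One hypothetical way to pick the chunk deterministically, so the set side and the multiget side always agree on which master key owns a given key:

```python
import zlib

def chunked_master_keys(prefix, keys, chunks=4):
    # Hash each real key to a chunk number, so placement is stable across
    # sets and gets without coordinating anything.
    plan = {}
    for key in keys:
        master = "%s-chunk%d" % (prefix, zlib.crc32(key.encode()) % chunks)
        plan.setdefault(master, []).append(key)
    return plan  # {"dormando-chunk0": [...], "dormando-chunk1": [...], ...}

print(chunked_master_keys("dormando", ["foo%d" % i for i in range(1, 101)]))
```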

This is a simple and elegant way of avoiding having multigets spread thinly across your memcached cluster.

You could use UDP!

Yeah I guess you could. What about keys with larger responses? This does have a lot of the same issues, but in a different flavor. Could be faster or slower depending on what you're doing.

You could REPLICATE!

annnnnnnnnd do something really complicated where you have to store all of your keys in two places (halving the effective size of your cache!) and have your client randomly pick where each key goes to or comes from each time it's fetched? When issuing against a cluster of more than *two* machines, this isn't going to help nearly as much as cutting 50 separate fetches down to a single request, deterministically, by clustering the keys intelligently.

Note that replication adds a lot more failure scenarios. Network blips can lead to inconsistent cache data, among other things.

Doesn't it sound simpler to use a feature that already exists (look for "mget_by_key")?

But you could make either work.

Fortunately, there's also a really short answer to all of this.

memcached
