Facebook's Memcached Multiget Hole: More machines != More Capacity
When you are on the bleeding edge of scale like Facebook is, you run into some interesting problems. As of 2008 Facebook had over 800 memcached servers supplying over 28 terabytes of cache. With numbers that staggering it's a fair bet they've seen their share of Dr. House-worthy memcached problems.
Jeff Rothschild, Vice President of Technology at Facebook, describes one such problem they've dubbed the Multiget Hole.
You fall into the multiget hole when your memcached servers are CPU bound. Adding more memcached servers seems like the right way to add capacity so more requests can be served, but against all logic adding servers doesn't help serve more requests. This puts you in a hole that simply adding more servers can't dig you out of. What's the treatment?
Dr. House would immediately notice the hidden clue: we are talking requests, not memory. We aren't running out of memory to store stuff; we are running out of CPU power to process requests.
What happens when you add more servers is that the number of requests is not reduced, only the number of keys in each request is reduced. The number of keys returned in a request only matters if you are bandwidth limited. Each server is still on the hook for processing the same number of requests. Adding more machines doesn't change the number of requests a server has to process, and since these servers are already CPU bound they simply can't handle more load. So adding more servers doesn't help you handle more requests. Not what we usually expect. This is another example of why architecture matters.
Understanding Multiget
To understand why we keep serving the same number of requests we have to understand memcached's multiget request. A multiget allows multiple keys to be retrieved in one request. If a user has 100 friends, for example, the changes for each of those friends can be retrieved by making one request. This is far more efficient than making 100 individual requests.
Multiget allows library makers to transparently use two classic scalability tactics: batching and parallelization.
Let's say there's a memcached pool containing two servers and 50 friends are stored on each server. (Memcached works by mapping keys to memcached servers; in practice there's no reason keys would split exactly 50/50, but this is just an example.) What a smart library implementation can do is batch up the requests destined for each memcached server and run those requests in parallel. Instead of sending a request per friend we are just sending one request per server. The power of batching is to radically reduce request latency by reducing the number of requests.
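To make the mechanics concrete, here's a minimal sketch of the fan-out a client library might do (the modulo hashing, server names, and function names are illustrative, not any real client's code):

    from collections import defaultdict

    # A minimal sketch of multiget batching: hash each key to a server,
    # then send one request per server instead of one request per key.
    def group_keys_by_server(keys, servers):
        batches = defaultdict(list)
        for key in keys:
            # Real clients use consistent hashing; modulo is just for show.
            server = servers[hash(key) % len(servers)]
            batches[server].append(key)
        return batches

    servers = ["cache1:11211", "cache2:11211"]
    keys = [f"friend:{i}" for i in range(100)]
    for server, batch in group_keys_by_server(keys, servers).items():
        print(f"{server}: {len(batch)} keys in one multiget")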
Now let's say 50 such requests are made for all 100 friends of our moderately popular user (50 page views, say). Each server will see all 50 requests, because half the friends live on each server, so every multiget has to touch both servers. If we see that the pool's servers are running out of CPU our most likely reaction is to add another server to the pool.
What does adding another server to the pool accomplish? It means 33ish friends will be stored on each server. When we send out 50 requests to gather info for the 100 friends, each server still sees all 50 requests, because to collect all 100 friends we have to hit every server. We've done absolutely nothing to reduce the usage of our scarce resource, which is CPU. True, we'll use less bandwidth per server, but that doesn't matter because we have enough bandwidth.
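A toy calculation makes the hole visible. Assuming every multiget touches every server (true whenever the friend count is large relative to the pool size), requests per server never drop as the pool grows:

    # Keys per request shrink as the pool grows, but each server still
    # sees every user-level request -- the CPU load stays put.
    def per_server_load(user_requests, friends_per_user, pool_size):
        keys_per_request = friends_per_user / pool_size  # shrinks...
        requests_per_server = user_requests              # ...this doesn't
        return requests_per_server, keys_per_request

    for pool_size in (2, 3, 4):
        reqs, keys = per_server_load(50, 100, pool_size)
        print(f"{pool_size} servers: {reqs} requests/server, ~{keys:.0f} keys each")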
The astounding result of this exercise is that adding more machines does not add more capacity. Mr. Rothschild said this isn't a problem they sat down and reasoned through from first principles. It's a problem they saw in the field and learned about from experience. They added more machines expecting to increase capacity and had to work out what the heck was happening when it didn't.
How do you solve the multiget hole problem?
One solution to the multiget hole problem is replication. Since the problem is a lack of CPU power more CPU needs to be applied.
One classic technique for bringing more CPU to bear on data is to replicate the data and load balance requests between the replicas. In our example we would create two pools of two servers each. Each pool would get half the requests, so each does half the work and is no longer CPU bound. Now you've doubled the capacity of your system and avoided stepping into the hole.
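Here's a minimal sketch of what that replication might look like at the client layer, assuming pool objects that expose set()/get_multi() like common Python memcached clients (the class and wiring are made up for illustration):

    import itertools

    # Spread read load across N identical memcached pools: writes fan
    # out to every replica, reads rotate between them, so each pool
    # handles 1/N of the get traffic.
    class ReplicatedPools:
        def __init__(self, pools):
            self.pools = pools
            self._reader = itertools.cycle(pools)  # round-robin reads

        def set(self, key, value):
            for pool in self.pools:   # keep replicas in sync
                pool.set(key, value)

        def get_multi(self, keys):
            # Each pool sees only a fraction of the requests, which is
            # exactly the CPU relief the replication fix is after.
            return next(self._reader).get_multi(keys)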
Reader Comments (14)
Does MC support a client cache module with automatic updates from the main cache, a la Coherence or Gigaspaces? And given FB's access patterns, would that help?
Seems like a prime case of needing a process switched to cGPU
-bob rizzle.
Won't replication result in a higher probability of cache inconsistencies? And what if, like a traditional CPU cache, one request returned not just one specific (friend) entry but a list of results (based on cache locality, or in this case friend locality)? Combined not with a multiget but with a traditional single-get request that loads all this 'additional' friend data as well, this should reduce the number of requests. No?
Much of the multi-get hole can be fixed by using appropriate hints when you get/set data.
Instead of letting your memcache client compute the hash key, you can specify a natural partitioning key. This allows you to group logically related items on the same memcache server.
For example if you have the following keys describing inbox messages inside folder ID #111:
msg:111:1, msg:111:2, msg:111:3
You would force the hash to '111' for these items, which would ensure that all of these related items are stored on the same memcache server.
Obviously you have to choose a good natural key as your hash -- its cardinality must be greater than the number of servers in the pool and it should distribute well across the address space.
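A sketch of what such hint-based placement could look like (the key format follows the example above; the routing code is hypothetical):

    # Hash on the folder ID, not the full key, so related items share
    # a server and a single multiget can fetch them all.
    def pick_server(key, servers):
        folder_id = key.split(":")[1]        # "msg:111:2" -> "111"
        return servers[hash(folder_id) % len(servers)]

    servers = ["cache1:11211", "cache2:11211", "cache3:11211"]
    for key in ("msg:111:1", "msg:111:2", "msg:111:3"):
        print(key, "->", pick_server(key, servers))   # all the same server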
Something smells like bad design. Wouldn't it have been better to push events to the friends instead of polling?
Rather odd post. Should have been titled something more like "Trying to optimize Memcached performance". Not to mention the problem was that throwing hardware at the issue didn't scale. So the solution was to throw twice as much hardware at the problem. :)
But the general idea is correct: you have to set up a form of load balancing if you want to surpass a certain level of performance from memcached (and just about everything else on the planet). This means you need a wrapper layer over your client that can change the pool configurations easily.
In our moderate setup of 45 servers with 1.35T of storage we use a wrapper library for determining the pool based on the site/role of the request. I also added a "redundant" flag to the wrapper, which isn't perfect but better than nothing. Basically, when you set something it gets written to a primary and a backup server. Then, when it gets read, if it's missing from the primary it checks the backup and repopulates the primary if it finds it. It doesn't check the backup when it gets a hit from the primary, since that is more overhead than I'm willing to accept. It could be easily modified to set on multiple servers and read from any of them.
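A rough sketch of that read-repair behavior, assuming two standard client objects with get()/set(); this is illustrative, not the commenter's actual wrapper library:

    # Write to a primary and a backup; on a primary miss, fall back to
    # the backup and repopulate the primary (read repair).
    class RedundantCache:
        def __init__(self, primary, backup):
            self.primary = primary
            self.backup = backup

        def set(self, key, value):
            self.primary.set(key, value)   # write both copies
            self.backup.set(key, value)

        def get(self, key):
            value = self.primary.get(key)
            if value is None:              # primary miss: try the backup
                value = self.backup.get(key)
                if value is not None:
                    self.primary.set(key, value)  # repopulate the primary
            return value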
In terms of performance, we too have had issues with a couple of very popular keys landing on the same server and it getting overloaded with requests. Most of the time, though, our issue was with not allowing enough connections on the box rather than with CPU. If we have the sockets available we usually get the data no questions asked and with essentially no lag.
I think if you're trying to get the best bang for the buck you'd have to come up with a very fast lookup system or a different hashing system to balance out server load, since hashing keys may distribute them somewhat evenly but doesn't take into account the popularity of a key. But at that point you're closer to a commercial managed memory storage system than to Memcached.
salm, not that I've ever seen. Best thing I can think of is Tokyo Tyrant with replication and memory tables. But that's a basic Master->Slave setup. Should give you write once read from many. But then you have to manage relationships like you would MySQL configurations. It would just be a lot faster in terms of key lookup. You could also change one of the slaves to be disk bound for backup I suppose. We've done tricks like that with MySQL.
Wouldn't a simple solution be just to have less memory in each server, given that replication would require more servers anyway?
Also, does Facebook use the binary protocol yet? I thought they were pushing that to reduce CPU load.
As usual I give a lengthy answer to a simple question: http://dormando.livejournal.com/521163.html :)
The simple takeaway is that multiget is an optimization that works well for small memcached server clusters, but its value diminishes as you scale.
We've never used multiget at Tagged -- serial gets are fast enough, and it's dangerous to rely on multiget to make your application fast because the advantage will disappear when you grow and spread memcached across many servers.
I did not get the multiget issue.
We have 100 keys and want to get 100 values.
Our client library knows which key is where, so it splits the 100 keys into 2 sets and issues 2 concurrent requests of 50 keys each. So far so good. Each server gets 50 keys in a request.
Now we add a new MC node: ~33 keys end up on each server. Our client library knows which key is where, so it splits the 100 keys into 3 sets and executes 3 concurrent requests of ~33 keys each. Each server only gets 33 keys in a request! I do see a decrease in CPU usage here...
What is it that I got wrong?
ikatkov, you are on the right path.
The article isn't about how a small cluster does so well with multiget, but about how your cluster loses its efficiency as it grows. So with <6 servers and 100+ keys you still get a pretty good ratio of keys to servers (assuming an even spread, 20+ keys per single connection to a server). But in the case of companies like Facebook and my company, we have many more machines. So if you have 50+ servers and you're looking for 5 keys, the likelihood of getting multiple keys in a single request is very slim and you'll still end up having to make 5 connections.
That's where you have to work on partitioning. Just like our MySQL clusters, you have to set up application-level partitioning of the pool of servers to get the best usage.
Hope that helped.
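Jon's ratio argument can be checked with a back-of-envelope calculation (a hypothetical sketch, assuming keys hash uniformly): with k keys over n servers, a multiget touches about n * (1 - (1 - 1/n)^k) distinct servers.

    # Expected number of distinct servers one multiget must touch.
    def expected_servers_hit(n_servers, n_keys):
        return n_servers * (1 - (1 - 1 / n_servers) ** n_keys)

    print(expected_servers_hit(5, 100))   # ~5.0 -> batches of ~20 keys each
    print(expected_servers_hit(50, 5))    # ~4.8 -> batches of ~1 key each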
@Jon Stephens
Thanks a lot. Clear now.
Following up on @ikatkov's question and @JonStephens's answer.
"Memcached works by mapping keys to memcached servers" - author
I don't understand why 50 requests need to be made to every server when we already have a key-to-server map.