Troubles with Sharding - What can we learn from the Foursquare Incident?
Friday, October 15, 2010 at 8:15AM
HighScalability Team in Strategy, postmortem, sharding

For everything given something seems to be taken. Caching is a great scalability solution, but caching also comes with problems. Sharding is a great scalability solution, but as Foursquare recently revealed in a post-mortem about their 17 hours of downtime, sharding also has problems. MongoDB, the database Foursquare uses, also contributed a post-mortem of what went wrong.

Now that everyone has shared and resharded, what can we learn to help us skip these mistakes and quickly move on to a different set of mistakes?

First, as with Facebook before them, huge props to Foursquare and MongoDB for being upfront and honest about their problems. This helps everyone get better and is a sign we work in a pretty cool industry.

Second, overall, the fault didn't flow from evil hearts or gross negligence. As usual the cause was more mundane: a key system that could have been a little more robust, combined with a very popular application built by a small group of people under immense pressure, trying to get a lot of work done while growing really fast (3 million users, 200 million checkins, 18,000 new per day). A few deprioritized steps eventually turned into a cascading series of failures.

Was it preventable? Yes, but it was also understandable. Current systems from the dominant paradigm are stable precisely because they have been through years and years of these kinds of war stories and have been hardened over time into an enviable robustness. Whatever weaknesses they do have, everyone just "knows" how to route around them using tools and tricks that have by now become second nature. NoSQL hasn't gone through this maturation process yet. Tools, tricks, and best practices are still being developed. NoSQL and other new approaches will have to follow the same lifecycle, but hopefully through this kind of public disclosure and discussion we can really jump down that learning curve.

What Happened?

The problem went something like this: Foursquare's check-in data was sharded across two MongoDB nodes on EC2. Growth wasn't evenly distributed, so one shard outgrew the RAM available to it and the system began paging to disk, sending latency and request queues through the roof. Adding a third shard and migrating data to it didn't free memory on the overloaded node because of fragmentation, so the site had to be taken down while the data was compacted and resharded.

The number of issues this scenario brings up is astonishing. They range from low level cache and system design, to questioning the ability of cloud IO to support high performance applications, to deciding what role key infrastructure software should play in its own management, to arguing over what standards of professionalism should be applied to fast-moving startups, to proposing how severe system problems should be handled in this litigious age, to whether handling a problem well should outweigh, in a user's mind, the problem in the first place. A lot to consider, but we'll stick to the more technical aspects of sharding.

What is a Shard?

If we are going to get a better feel for how shards work and how they can fail, we first need to know what they are. I'm going to take the traditional definition of shard and generalize it a bit:

  1. shard: a subset of a data set that is stored and accessed on a particular node. 
  2. sharding: the process of creating and managing shards to properly match data size to resource usage and availability. Resources can include: CPU, memory, disk, network, performance and reliability. 

Some miscellaneous notes and elaborations:

  * With sharding an entire data set is split into chunks and spread horizontally across a set of nodes; this is horizontal partitioning. The data set can be fixed in size or grow continually.
  * Data and indexes can be completely in-memory, primarily on disk, or indexes in-memory with data on disk; alternatively, the goal can be to keep the working set in memory and the rest on disk. If your algorithms walk across a lot of pages or use a lot of historical data, then a RAM-only or working-set approach may not work well.
  * Databases can use the operating system to manage paging to disk when the data is larger than available RAM, or the database can manage on-disk storage itself.
  * The format of the data and the operations allowed on the data are system specific. Systems vary a lot in the data types they support: BLOBs, documents, JSON, columns, and references as first class citizens. Systems vary a lot in the operations they support: key-value lookups, increment operators, map-reduce operations, queries, limited transactions, set operations, searching, secondary indexes, auto-sharding, monitoring, a REST API, and a management UI.
  * The sharding process can be completely manual or under different degrees of automation. Some systems take a few configuration parameters at startup and are under manual control after that. Some systems offer a high degree of automation that takes much of the responsibility off of developers. Usually it's somewhere in between.

Data is typically sharded when it no longer "fits" on one node. We generally think of sharding as a means to get around RAM or disk limitations, but it's a more general principle than that: it's a way to get around all types of limitations by matching the data size to a node such that there are enough resources to do whatever needs to be done with the data. Limitations can be any combination of CPU, memory, disk, network, performance, and reliability.

Some examples. The common case is for a shard to grow and use too much RAM, so the shard must be managed (split, moved, etc) to remove the RAM constraint. It's also possible for all your data to fit in RAM, but to not have enough available CPU to operate on the data. In that case a shard would have to be moved, split, etc so that the CPU constraint is removed. It's possible to have enough RAM, but not a reliable enough network to perform the required number of IO operations. In this case the data must be moved, split, or replicated such that the IOPS target can be reached. And so on. The key is to move the data around in such a way that constraints are removed and performance targets are met.

The goal of sharding is to continually structure a system to ensure data is chunked in small enough units and spread across enough nodes that data operations are not limited by resource constraints. 

A good product match for you is one that accomplishes this goal in a way that matches your expectations, capabilities, and budget.
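
To make the mechanics concrete, here's a minimal sketch of the simplest key-to-shard mapping and of why growing a cluster forces data to move. It's purely illustrative (the hashing scheme, key names, and shard counts are assumptions, not how any particular product works):

```python
import hashlib

def shard_for(key, num_shards):
    """Map a key to a shard with a stable hash. Simple and stateless, but
    changing the shard count changes the mapping for most keys, which is
    why resharding usually means moving a lot of data around."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards

keys = ["user:1001", "user:1002", "user:1003", "user:1004", "user:1005"]

before = {k: shard_for(k, 2) for k in keys}   # data spread across 2 shards
after = {k: shard_for(k, 3) for k in keys}    # cluster grown to 3 shards

moved = [k for k in keys if before[k] != after[k]]
print("before:", before)
print("after: ", after)
print("keys that must be migrated:", moved)
```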

A Trouble Typology for Shards

Sharding is often seen as the secret sauce of scaling, horizontally inclined nirvana, but there are possible problems in paradise not everyone may have fully considered. We need to build up the same sort of "knowing" about potential problems with sharding that we have built up for RDBMSs. Here's a first cut at a typology of sharding problems to look out for. It's in no way meant to be an exhaustive list or a great explanation of all the issues. It's just a start.

I'm sure there are a lot more potential troubles out there. What do you think?

And what can you do with all this information? Check how your system handles these issues and whether you are happy with how they are handled. If not, what can you do about it?

Possible Responses to Troubled Shards

What can be done about the troubles? Some possible options are:

  1. Use more powerful nodes. Obvious, but scaling-up is often the best solution.
  2. Use a larger number of nodes. Using fewer nodes makes you much more susceptible to problems on those nodes.
  3. Move a shard to a different node. If that node has some headroom then your performance should improve. How easy is it to move shards around with your system? Is the bucket model virtualized so it's relatively easy to move shards around?
  4. Move keys from a shard to another shard or into its own shard. A very hot key needs to be placed into a shard where it can get the resources. Can your system move individual keys around? What happens if an individual key requires sharding itself?
  5. Add more shards and move existing data to those shards. Can the data be moved fast enough to fix the overload problems? Are requests queuing up? Will enough memory be freed on the overloaded system or will fragmentation issues prevent the memory from being reclaimed?
  6. Condition traffic. This is just smarter handling when problems occur. Prioritize requests so management traffic can be received by nodes even when the system is thrashing. Requests can be dropped, prioritized, load balanced, etc rather than being allowed to take down an entire system. 
  7. Dark-mode switches. Have the ability to turn off parts of the system so load can be reduced enough that the system can recover (a minimal sketch of this kind of switch, along with a read-only mode, follows this list).
  8. Read-only disaster mode. Have a read-only option so the system can fail to a mode where your system is accessible for reads, even if it can't handle writes, while repairs are going on.
  9. Reload from a backup.
  10. Key choice. Use a uniform distribution mechanism to better spread keys across shards. Select a better key and reshard.
  11. Key mapping. Mapping data to a shard via hashing is quick and efficient, but it's not the only way to map keys to shards. Flickr, for example, uses an index server that directly maps individual users to a shard. Should developers ever even need to know about shards? (A sketch contrasting the two mapping approaches follows this list.)
  12. Replicate the data and load balance requests across replicas. Does your database support selectable consistency policies? Can some of your data use a more lenient policy? Does your code support using read replicas when the main servers are down?
  13. Monitor. Monitor resource usage so you can get a heads up and take preventative action. This is obvious and was hammered on pretty hard by commenters. Foursquare knew they should monitor, but didn't have the time. A good question is whether this is something they needed to build from scratch. Isn't there a service that connects MongoDB and EC2 together without a lot of fuss and bother? MongoDB does provide such statistics. Look for a spike in disk operations per second, which would indicate increased disk activity. Look for increased request queue sizes and request latency, both of which indicate something is going wrong in the core processing engine (a rough monitoring sketch follows this list).
  14. Automation. Use a system that has more automated elasticity, sharding, and failure handling. Can it connect into systems like RightScale or Amazon's AutoScale?
  15. Background reindexing and defragmentation. Give consideration to query latency during these operations.
  16. Capacity plan. Better estimate your system requirements. Continually capacity plan to continually adjust resources to fit projected needs. The growth rate for Foursquare was predictable, so it shouldn't have been a complete surprise that at some point in the growth curve more resources would be needed. This is where capacity planning comes in. Difficult for a startup, but it would be nice if this functionality was part of a general dashboard system.
  17. Selection. Select a system with memory management policies and other features that fit your use case. If you know you won't have time to manage a system then select a system that will do more of that for you. If you aren't sure about key distribution pick a system that helps you with that. If you aren't sure about your working set size then pick a system that manages that in a robust fashion.
  18. Test. Test your system under realistic conditions. Test failure and recovery scenarios. Test rapid allocation/delete scenarios. Test mixes of allocation sizes. Test query response times during failure scenarios, garbage collection, and management operations like adding shards. Does everything still work as promised? Integrate testing into your build system so it's possible to know when any key metric changes. Again, this advice is from Captain Obvious, but in the cloud this is all doable. If you don't test your customers will be testing for you. 
  19. Figure a way to get out of technical debt. Startups are short on time and people. They are busy. Shortcuts are taken and technical debt is incurred. How can you get out of debt so that features aren't dropped on the floor? Look for services that can be integrated via a simple API. Hire contractors. Don't be so picky about who you hire. Think about it at least. Maybe there's a creative solution other than pushing it out to a much later release.
  20. Better algorithm selection. Use, when possible, incremental algorithms that just need an event and a little state to calculate the next state. A common example: when calculating an average, use a running average algorithm rather than one that requires all values for every calculation (see the running-average sketch after this list). One of the problems Foursquare had was that they used a person's entire check-in history to award badges. This means they have to store a lot of "useless" data that will definitely kick in paging while walking these data structures. It may be better to drop a feature if it can't be calculated efficiently.
  21. Separate historical and real-time data. Historical data could be kept in a different datastore so that it doesn't interfere with real-time operations. Background calculations can then be run against the historical store and their results fed back without impacting the real-time system.
  22. Use more compact data structures. Compact data structures will use less memory which means more data can fit in memory.
  23. Faster IO. If you expect that not all of your data will fit in RAM, then make sure your IO system isn't the bottleneck.
  24. Think about SPOF. Single points of failure are not the end of the world if you know about them and work them into your MTBF calculations. The point is to know they could be a problem and determine how the database you are using deals with them.
  25. Figure out how you will handle downtime. Let people know what's happening and they will still love you. Foursquare now tweets status messages, provides regular updates during outages, has added a status blog, and has created a more useful error page with more information on it. All good.
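
For items 7 and 8, the mechanics can be as simple as a couple of flags checked at the top of each request handler. Here's a hypothetical sketch (the switch names and handler are invented for illustration; in a real system the flags would live in shared configuration so they can be flipped at runtime):

```python
# Hypothetical switches; in a real system they'd live in a shared config
# store (a database row, ZooKeeper, etc.) so they can be flipped at
# runtime without a deploy.
SWITCHES = {
    "badges_enabled": True,   # dark-mode switch: turn off an expensive feature
    "read_only": False,       # disaster mode: keep serving reads, reject writes
}

def handle_checkin(user_id, venue_id, store):
    """Illustrative request handler that sheds load via the switches."""
    if SWITCHES["read_only"]:
        return {"status": 503, "error": "check-ins temporarily disabled"}
    store.setdefault(user_id, []).append(venue_id)
    if SWITCHES["badges_enabled"]:
        # Expensive work (e.g. walking check-in history) sits behind a
        # switch so it can be shed first when the system is overloaded.
        badge = "explorer" if len(store[user_id]) >= 3 else None
        return {"status": 200, "badge": badge}
    return {"status": 200}

if __name__ == "__main__":
    store = {}
    print(handle_checkin("user:1", "venue:42", store))
    SWITCHES["read_only"] = True   # flipped during an incident
    print(handle_checkin("user:1", "venue:43", store))
```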
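
For items 10 and 11, the difference between hashed key mapping and a directory (index server) style mapping shows up in just a few lines. Again a hypothetical sketch, not Flickr's or anyone else's actual scheme:

```python
import hashlib

SHARDS = ["shard-a", "shard-b", "shard-c"]

def hashed_shard(user_id):
    """Hash mapping: cheap and stateless, but a single hot user can't be
    moved elsewhere without changing the function for everyone."""
    h = int(hashlib.md5(user_id.encode("utf-8")).hexdigest(), 16)
    return SHARDS[h % len(SHARDS)]

# Directory mapping: an explicit lookup table (in real systems a small,
# replicated "index server"). Any single user can be reassigned.
directory = {uid: hashed_shard(uid) for uid in ("u1", "u2", "u3", "u4")}

def directory_shard(user_id):
    # Fall back to the hash for users the directory hasn't seen yet.
    return directory.get(user_id) or hashed_shard(user_id)

directory["u3"] = "shard-hot"   # move one hot user to its own shard
print("hash says:", hashed_shard("u3"), "| directory says:", directory_shard("u3"))
```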
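
For item 13, MongoDB's serverStatus command exposes counters that make a crude early-warning script possible. The sketch below uses pymongo and is an assumption-heavy starting point: the connection address and alert thresholds are made up, and the exact fields available vary by MongoDB version and platform.

```python
import time
from pymongo import MongoClient   # pip install pymongo

client = MongoClient("mongodb://localhost:27017")   # assumed address

def sample():
    """Pull a few load-related counters out of serverStatus."""
    s = client.admin.command("serverStatus")
    return {
        "ops": sum(s.get("opcounters", {}).values()),
        "queued": s.get("globalLock", {}).get("currentQueue", {}).get("readers", 0)
                  + s.get("globalLock", {}).get("currentQueue", {}).get("writers", 0),
        "page_faults": s.get("extra_info", {}).get("page_faults", 0),
    }

prev = sample()
while True:
    time.sleep(10)
    cur = sample()
    ops_per_sec = (cur["ops"] - prev["ops"]) / 10.0
    faults_per_sec = (cur["page_faults"] - prev["page_faults"]) / 10.0
    # Thresholds are invented; rising page faults and queue depth are the
    # kind of signal that precedes a shard thrashing to disk.
    if faults_per_sec > 100 or cur["queued"] > 50:
        print("WARNING: shard looks overloaded:", ops_per_sec, "ops/s,",
              faults_per_sec, "faults/s,", cur["queued"], "queued requests")
    prev = cur
```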
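
And for item 20, the difference between recomputing from the full history and carrying a little incremental state looks like this (a generic running-average sketch, not Foursquare's badge logic):

```python
class RunningAverage:
    """Incremental mean: each new value updates O(1) state, so there's no
    need to store or re-walk the full history."""
    def __init__(self):
        self.count = 0
        self.mean = 0.0

    def add(self, value):
        self.count += 1
        self.mean += (value - self.mean) / self.count
        return self.mean

def naive_average(history):
    """Needs every value stored and walked on every update -- the pattern
    that keeps 'useless' historical data hot."""
    return sum(history) / len(history)

ra = RunningAverage()
history = []
for v in (10, 20, 30, 40):
    history.append(v)
    assert abs(ra.add(v) - naive_average(history)) < 1e-9
print("running mean:", ra.mean)
```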

Again, I'm sure there are lots more possible solutions. What are yours?

Clearly no system is perfect or will be used perfectly. The point here is really just to consider what issues are out there, bring up some points you may need to think about, and get you thinking about how you might fix them.

