Thursday, August 6, 2009

An Unorthodox Approach to Database Design: The Coming of the Shard

Update 4: Why you don't want to shard by Morgon on the MySQL Performance Blog. Optimize everything else first, and then if performance still isn't good enough, it's time to take a very bitter medicine.
Update 3: Building Scalable Databases: Pros and Cons of Various Database Sharding Schemes by Dare Obasanjo. Excellent discussion of why and when you would choose a sharding architecture, how to shard, and problems with sharding.
Update 2: Mr. Moore gets to punt on sharding by Alan Rimm-Kaufman of 37signals. Insightful article on design tradeoffs and the evils of premature optimization. With more memory, more CPU, and new tech like SSD, problems can be avoided before more exotic architectures like sharding are needed. Add features, not infrastructure. Jeremy Zawodny says he's wrong, wrong, wrong: we're running multi-core CPUs at slower clock speeds, so Moore won't save you.
Update: Dan Pritchett shares some excellent Sharding Lessons: Size Your Shards, Use Math on Shard Counts, Carefully Consider the Spread, Plan for Exceeding Your Shards

Once upon a time we scaled databases by buying ever bigger, faster, and more expensive machines. While this arrangement is great for big iron profit margins, it doesn't work so well for the bank accounts of our heroic system builders who need to scale well past what they can afford to spend on giant database servers. In an extraordinary two-article series, Dathan Pattishall explains his motivation for a revolutionary new database architecture--sharding--that he began thinking about even before he worked at Friendster, and fully implemented at Flickr. Flickr now handles more than 1 billion transactions per day, responding in less than a few seconds, and can scale linearly at a low cost.

What is sharding and how has it come to be the answer to large website scaling problems?

Information Sources

What is sharding?

While working at Auction Watch, Dathan got the idea to solve their scaling problems by creating a database server for a group of users and running those servers on cheap Linux boxes. In this scheme the data for User A is stored on one server and the data for User B is stored on another server. It's a federated model. Groups of 500K users are stored together in what are called shards.
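
To make the idea concrete, here's a minimal sketch of that kind of block-based user-to-shard routing. This is not Dathan's actual code; the host names and the 500K block size are stand-ins for whatever your deployment uses:

```python
# Minimal sketch of block-based user-to-shard routing. Hypothetical hosts;
# the 500K-users-per-shard grouping mirrors the scheme described above.
SHARD_HOSTS = [
    "shard1.db.example.com",
    "shard2.db.example.com",
    "shard3.db.example.com",
]

USERS_PER_SHARD = 500_000

def shard_for_user(user_id):
    """Route a user to a shard by grouping contiguous blocks of user IDs."""
    block = user_id // USERS_PER_SHARD
    return SHARD_HOSTS[block % len(SHARD_HOSTS)]

print(shard_for_user(200_000))    # shard1.db.example.com
print(shard_for_user(1_200_000))  # shard3.db.example.com
```

All of one user's data lives on one cheap box, and adding capacity means adding boxes, not buying a bigger one.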

The advantages are:

  • High availability. If one box goes down the others still operate.
  • Faster queries. Smaller amounts of data in each user group mean faster querying.
  • More write bandwidth. With no master database serializing writes you can write in parallel, which increases your write throughput. Writing is a major bottleneck for many websites.
  • You can do more work. A parallel backend means you can do more work simultaneously. You can handle higher user loads, especially when writing data, because there are parallel paths through your system. You can load balance web servers, which access shards over different network paths, which are processed by separate CPUs, which use separate caches of RAM and separate disk IO paths to process work. Very few bottlenecks limit your work.

How is sharding different from traditional architectures?

Sharding is different from traditional database architecture in several important ways:

  • Data are denormalized. Traditionally we normalize data. Data are splayed out into anomaly-less tables and then joined back together again when they need to be used. In sharding the data are denormalized. You store together data that are used together.

    This doesn't mean you don't also segregate data by type. You can keep a user's profile data separate from their comments, blogs, email, media, etc., but the user profile data would be stored and retrieved as a whole. This is a very fast approach. You just get a blob and store a blob. No joins are needed and it can be written with one disk write. (A quick sketch of this blob-in, blob-out style follows this list.)

  • Data are parallelized across many physical instances. Historically database servers are scaled up. You buy bigger machines to get more power. With sharding the data are parallelized and you scale by scaling out. Using this approach you can get massively more work done because it can be done in parallel.

  • Data are kept small. The larger a set of data a server handles the harder it is to cache intelligently because you have such a wide diversity of data being accessed. You need huge gobs of RAM that may not even be enough to cache the data when you need it. By isolating data into smaller shards the data you are accessing is more likely to stay in cache.

    Smaller sets of data are also easier to backup, restore, and manage.

  • Data are more highly available. Since the shards are independent a failure in one doesn't cause a failure in another. And if you make each shard operate at 50% capacity it's much easier to upgrade a shard in place. Keeping multiple data copies within a shard also helps with redundancy and making the data more parallelized so more work can be done on the data. You can also set up a shard to have a master-slave or dual master relationship within the shard to avoid a single point of failure within the shard. If one server goes down the other can take over.

  • It doesn't use replication. Replicating data from a master server to slave servers is a traditional approach to scaling. Data is written to a master server and then replicated to one or more slave servers. At that point read operations can be handled by the slaves, but all writes happen on the master.

    Obviously the master becomes the write bottleneck and a single point of failure. And as load increases the cost of replication increases. Replication costs in CPU, network bandwidth, and disk IO. The slaves fall behind and have stale data. The folks at YouTube had a big problem with replication overhead as they scaled.

    Sharding cleanly and elegantly solves the problems with replication.
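
    As promised above, here's a minimal sketch of the blob-in, blob-out approach to denormalized data. It assumes a DB-API style MySQL connection and a hypothetical user_profile table with a user_id key and a single data column; both are illustrative, not a prescribed schema:

```python
import json

def save_profile(conn, user_id, profile):
    """Store the whole profile together as one blob: one row, one write, no joins."""
    cur = conn.cursor()
    cur.execute(
        "REPLACE INTO user_profile (user_id, data) VALUES (%s, %s)",
        (user_id, json.dumps(profile)),
    )
    conn.commit()

def load_profile(conn, user_id):
    """Fetch and deserialize the blob; the profile comes back whole, no joins."""
    cur = conn.cursor()
    cur.execute("SELECT data FROM user_profile WHERE user_id = %s", (user_id,))
    row = cur.fetchone()
    return json.loads(row[0]) if row else None
```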

    Some Problems With Sharding

    Sharding isn't perfect. It does have a few problems.

  • Rebalancing data. What happens when a shard outgrows your storage and needs to be split? Let's say some user has a particularly large friends list that blows your storage capacity for the shard. You need to move the user to a different shard.

    On some platforms I've worked on this is a killer problem. You had to build out the data center correctly from the start because moving data from shard to shard required a lot of downtime.

    Rebalancing has to be built in from the start. Google's shards automatically rebalance. For this to work data references must go through some sort of naming service so they can be relocated. This is what Flickr does. And your references must be invalidatable so the underlying data can be moved while you are using it. (A toy sketch of such a naming service follows this list.)

  • Joining data from multiple shards. To create a complex friends page, or a user profile page, or a thread discussion page, you usually must pull together lots of different data from many different sources. With sharding you can't just issue a query and get back all the data. You have to make individual requests to your data sources, get all the responses, and then build the page. Thankfully, because of caching and fast networks this process is usually fast enough that your page load times can be excellent. (A sketch of this scatter-gather step also follows this list.)

  • How do you partition your data in shards? What data do you put in which shard? Where do comments go? Should all user data really go together, or just their profile data? Should a user's media, IMs, friends lists, etc go somewhere else? Unfortunately there are no easy answers to these questions.

  • Less leverage. People have experience with traditional RDBMS tools so there is a lot of help out there. You have books, experts, tool chains, and discussion forums when something goes wrong or you are wondering how to implement a new feature. Eclipse won't have a shard view and you won't find any automated backup and restore programs for your shard. With sharding you are on your own.

  • Implementing shards is not well supported. Sharding is currently mostly a roll your own approach. LiveJournal makes their tool chain available. Hibernate has a library under development. MySQL has added support for partitioning. But in general it's still something you must implement yourself.
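
    Here's a toy sketch of the naming-service idea from the rebalancing item above. It's an in-memory stand-in; a real directory would live in a replicated store or a cache like memcached, and the class and method names are made up:

```python
class ShardDirectory:
    """Hypothetical lookup service: every data reference resolves through here,
    so a user can be rebalanced by updating a single mapping."""

    def __init__(self):
        self._map = {}  # user_id -> shard host

    def assign(self, user_id, shard):
        self._map[user_id] = shard

    def locate(self, user_id):
        return self._map[user_id]

    def move(self, user_id, new_shard):
        # In practice: copy the user's rows to new_shard first (not shown),
        # then flip the pointer and invalidate any cached references.
        self._map[user_id] = new_shard

directory = ShardDirectory()
directory.assign(42, "shard1.db.example.com")
directory.move(42, "shard7.db.example.com")  # rebalanced; callers are unchanged
print(directory.locate(42))
```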
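
    And a hedged sketch of the scatter-gather page assembly described in the multi-shard join item. The fetchers here are stand-ins for real per-shard queries:

```python
from concurrent.futures import ThreadPoolExecutor

# Stand-in fetchers; in reality each one queries the shard that owns
# that slice of the user's data.
def fetch_profile(user_id):
    return {"user_id": user_id, "name": "..."}

def fetch_comments(user_id):
    return ["..."]

def fetch_friends(user_id):
    return ["..."]

def build_profile_page(user_id):
    """Scatter requests to the shards in parallel, then gather the results."""
    with ThreadPoolExecutor() as pool:
        futures = {
            "profile": pool.submit(fetch_profile, user_id),
            "comments": pool.submit(fetch_comments, user_id),
            "friends": pool.submit(fetch_friends, user_id),
        }
        # The page waits only as long as the slowest shard, not the sum of all.
        return {name: f.result() for name, f in futures.items()}

print(build_profile_page(42))
```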

    See Also

  • The Flickr Architecture for more interesting ideas on how to implement sharding.
  • The Google Architecture.
  • The LiveJournal Architecture. They talk quite a bit about their sharding approach and give a lot of helpful details.
  • The Shard category.
Reader Comments (54)

    This is probably the most interesting post I've read in a long long time. Thanks for sharing the advantages and drawbacks of sharding.... and thanks for putting together all these resources/info about scaling... it's really really interesting.

    December 31, 1999 | Unregistered CommenterTim

    Thanks for this info. Helped me understand about what the heck to do with all this user data coming my way!!!

    December 31, 1999 | Unregistered CommenterVinit

    I dislike how your link of livejournal does not actually go to a livejournal website, or information about their toolchain.

    December 31, 1999 | Unregistered CommenterRyan T Mulligan

    I am not sure what you mean about live journal. It goes to a page on this site which references two danga.com sites. Oh I see, memcached goes to a category link which doesn't include memcached. The hover text does include the link, but I'll add it in. Good catch. Thanks.

    December 31, 1999 | Unregistered CommenterTodd Hoff

    "Sharding cleanly and elegantly solves the problems with replication."

    Is this true? You do need to replicate still, right? You need duplication and a copy of the data that is not too stale in case one of your shards goes down? So you still need to replicate, correct?

    December 31, 1999 | Unregistered Commentertim wee

    > Is this true? You do need to replicate still right?

    You won't have the problems with replication overhead and lag because you are writing to an appropriately sized shard rather than a single master that must replicate to all its slaves, assuming you have a lot of slaves.

    You will still replicate within the shard, but that will be more of a fixed reasonable cost because the number of slaves will be small.

    Google replicates content 3 times, so even in that case it's more of a fixed overhead than chaining a lot of slaves together.

    That's my understanding at least.

    December 31, 1999 | Unregistered CommenterTodd Hoff

    We've been building out a sharding framework for use with Spinn3r. It's about to be deployed into production this week.

    We're very happy with the way things have moved forward and will probably be OSSing it.

    We aggregate a LOT of data on behalf of our customers so have huge data storage requirements.

    Kevin

    December 31, 1999 | Unregistered CommenterKevin Burton

    Do you still need a master lookup table? How do you know which shard has the data you need to access?

    December 31, 1999 | Unregistered CommenterAnonymous

    I think a lookup table is the most flexible option. It allows for flexible shard assignment algorithms and you can change the mapping when you need to. Here are a few other ideas. I am sure there are more.

    Flickr (http://highscalability.com/flickr-architecture) talks about a "global ring" that is like DNS that maps a key to a shard ID. The map is kept in memcached for a 1/2 hour or so. I bought Cal Henderson's book and it should arrive soon. If he has more details I'll do a quick write up.
    Users could be assigned to shards initially on a round robin basis or through some other whizzbang algorithm.

    You could embed shard IDs into your assigned row IDs so the mapping is obvious from the ID.

    You could hash on a key to a shard bucket.

    You could partition based on key values. So users with names starting with A-C go to shard1, that sort of thing. MySQL has a number of different partition rules as examples. A few of these are sketched below.
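
    Quick illustrative sketches of a few of these schemes; the shard count and the row ID layout are made up:

```python
NUM_SHARDS = 12

# Hash on a key to a shard bucket.
def shard_by_hash(user_id):
    return user_id % NUM_SHARDS

# Embed the shard ID into assigned row IDs so the mapping is obvious from the ID.
def make_row_id(shard_id, sequence):
    return shard_id * 10**12 + sequence

def shard_from_row_id(row_id):
    return row_id // 10**12

# Partition on key ranges: names starting A-C go to one shard, and so on.
def shard_by_range(name):
    return (ord(name[0].upper()) - ord("A")) // 3 % NUM_SHARDS
```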

    December 31, 1999 | Unregistered CommenterTodd Hoff

    It's helpful to know how the big players handle their scaling issues.

    Thanks for sharing!

    December 31, 1999 | Unregistered CommenterFrank

    Attacking sharding from the application layer is simply wrong. This functionality should be left to the DBMS. Access from the app layer would be transparent and it would be up to the DB admin to configure the data servers correctly for sharding to automatically and transparently scale out across them.

    If you are implementing sharding from the app layer you are getting yourself into a very tight corner and one day will find out how stuck you are there. This is a cornerstone example of improper delegation of functionality in a multi-tier system.

    December 31, 1999 | Unregistered CommenterAnonymous

    >How do you partition your data in shards? What data do you put in which shard? Where do comments go? Should all user data really go together, or just their profile data? Should a user's media, IMs, friends lists, etc go somewhere else? Unfortunately there are no easy answers to these questions.

    I have exactly this question to ask.
    I've referred to the architectures of LiveJournal and Mixi, both of which introduce shards.
    However, I saw a "Global Cluster" which stores meta information for the other clusters. By doing this we get an extremely heavy cluster; it must handle all the cluster_id <-> user_id mappings and loses the advantage of sharding, doesn't it?
    The other way, partitioning by algorithms on keys, is difficult in transition.

    So, could you give me some advice? Thank you very much for sharing experiences :)

    December 31, 1999 | Unregistered CommenterDiogin

    > By doing this we get an extreme heavy cluster, it must handle
    > all the cluster_id <-> user_id metas

    I think the idea is that because these mappings are so small they can all be cached in RAM and thus their resolution can be very, very fast. Do you think that would be too slow?

    December 31, 1999 | Unregistered CommenterTodd Hoff

    Yeah, you reminded me! I had some doubts about this before, when I thought the table would be terribly huge, as all the table records from the other clusters are gathered in this table and all queries must first refer to this cluster.
    Maybe I can use memcached clusters to cache them. Thank you :)

    December 31, 1999 | Unregistered CommenterDiogin

    If you look at the architectures of all the major internet-scale sites (such as eBay, Amazon, Flickr, etc.) you'd see they've come to the same conclusions and same patterns.

    I've also published an article on InfoQ (http://www.infoq.com/news/2007/08/denormalization) discussing this topic yesterday (I only found this site now or I would have included it in the article)

    Arnon

    December 31, 1999 | Unregistered CommenterArnon Rotem-Gal-Oz

    Jeremy Cole and Eric Bergen have an excellent section on database partitioning starting on about page 14 of MySQL Scaling and High Availability Architectures (http://jcole.us/mysqluc/2007/MySQL%20Scaling%20and%20High%20Availability%20Architectures.pdf). They talk about different partitioning models, difficulties with partitioning, HiveDB, Hibernate Shards, and server architectures.

    I'll definitely do a write up on this document a little later, but if interested dive in now.

    December 31, 1999 | Unregistered CommenterTodd Hoff

    I agree with the above post questioning the placement of partitioning logic in the application layer. Why not write the application layer against a logical model (NOT a storage model!) and then just engineer the existing data storage abstraction mechanism (DBMS engine) such that it will handle the partitioning functionality in a parallel manner? I would be very interested to see a study done comparing the architectures of this sharding concept against a federated database design such as what is described on this site http://www.sql-server-performance.com/tips/federated_databases_p1.aspx

    December 31, 1999 | Unregistered CommenterNorman Harebottle

    > I would be very interested to see a study done comparing the
    > architectures of this sharding concept against a federated database design

    Most of us don't have access to a real affordable federated database (parallel queries), so it's somewhat a moot point :-) And even these haven't been able to scale at the highest levels anyway.

    The advantage of partitioning in the application layer is that you are not bottlenecked on the traffic cop server that must analyze SQL and redistribute work to the proper federations and then integrate the results. You go directly to where the work needs to be done.

    I understand the architectural drive to push this responsibility to a logical layer, but it's hard to ignore the advantages of using client side resources to go directly to the right shard based on local context. What is the cost? A call to library code that must be kept in sync with the partitioning scheme. Is that really all that bad?

    December 31, 1999 | Unregistered CommenterTodd Hoff

    Sorry, this isn't new, except maybe to younger programmers. I've been using this approach for many years. Federated databases with horizontally partitioned data are old news. It's just not taught as a standard technique for scaling, mostly because scaling isn't taught as a core subject. (Why is that, do you suppose? Too hard to cover? Practical examples not practical in an academic setting?)

    The reason this is getting attention now is a perfect storm of cheap hardware, nearly free network connectivity, free database software, and beaucoup users (testers?) over the 'net.

    December 31, 1999 | Unregistered CommenterAnonymous

    Great content and great site... except for the worshipful adoration of these young teams who seem to think they've each discovered America.

    I could go on at length about Flickr's performance troubles and extremely slow changes after their early success, Twitter's meltdowns and indefensible defense of the slowest language in wide use on the net, and old-school examples of horizontal partitioning (AKA "sharding" heh), but I'll spare you. This cotton candy sugary sweet lovin' of Web 2.0 darlings really is a bit tiresome.

    Big kudos though on deep coverage of the subject matter between the cultic chants. :-D

    December 31, 1999 | Unregistered CommenterMarcus

    Very good post.

    I've been wondering how all of this works out with database connection pooling mechanisms?

    Hypothetical case:
    ---------------------
    100 application servers
    12 physical MySQL databases (1 shard per physical host to make it simple).

    Would all 100 application servers hold their own connection pools to the 12 shards? What would happen if I add another application server to the app server cluster? How does the pooling mechanism scale? Can the 12 databases handle the ever-growing number of connections as I keep adding application servers?

    Can someone please help with my questions?

    November 29, 2009 | Unregistered CommenterPeter

    You could add SQL Azure Federations to your discussion now too; early stages, but coming up for release very soon: http://social.technet.microsoft.com/wiki/contents/articles/sharding-with-sql-azure.aspx

    March 28, 2011 | Unregistered CommenterNikolai Blackie

    This article has more logical holes than I can count on one hand. Many were correctly pointed out in the previous posts. There is an old principle behind why experienced professionals oppose new terminology: it obscures well-known principles. So please drop this "sharding" mumbling, use the proper term, horizontal partitioning, and use your brains when designing an RDBMS for a large volume of data. Vertical scaling, as well as horizontal, has been well-known practice for a dozen years.

    June 28, 2011 | Unregistered CommenterAZ

    A lot has changed since this article was written. Sharding became much easier with off-the-shelf tools like ScaleBase (disclosure: I work there) that solve most of the issues raised in this article and let you scale existing relational databases.

    September 8, 2011 | Unregistered CommenterLiran Zelkha

    Why are all the comments dated Nov 29, 1990?

    September 29, 2014 | Unregistered CommenterJay
