Entries by Todd Hoff (380)

Wednesday
Aug262009

Hot Links for 2009-8-26

  • I'm Going To Scale My Foot Up Your Ass - Shut up about scalability, no one is using your app anyway.
  • Multi-Tenant Data Architecture - Microsoft's take on different approaches to multitenancy.
  • Cloud computing rides on spiraling Energy costs - A report by US researchers has shown the increasing cost of power and cooling in the data centre is a driver towards cloud computing.
  • Interview: Apple’s Gigantic New Data Center Hints at Cloud Computing - Companies building centers this big are getting into cloud computing. Running apps in the cloud requires massive infrastructure: Google-size infrastructure.
  • What Does Cloud Computing Actually Cost? An Analysis of the Top Vendors - Amazon is currently the lowest cost cloud computing option overall. At least for production applications that need more than 6.5 hours of CPU/day, otherwise GAE is technically cheaper because it's free until this usage level.
  • no:sql(east) - October 28–30, 2009, Atlanta, GA. Very cute page playing off of SQL syntax.

    New Products and Updates

  • Gear6 Web Cache Virtual Appliance - a feature complete virtual machine (VM) of the Gear6 Web Cache software. It includes all the functionality of the Gear6 Web Cache including simulating Gear6 high density RAM-flash architecture.
  • Seamlessly Extending the Data Center - Introducing Amazon Virtual Private Cloud (VPC) - We have developed Amazon VPC to allow our customers to seamlessly extend their IT infrastructure into the cloud while maintaining the levels of isolation required for their enterprise management tools to do their work.
  • NetApp reveals cloud computing plan, new Data OnTap OS - "Our research shows users are very interested in scale-out technology," she said. "What's nice about it is as you add processor and storage resources, you get much higher storage utilization rates and the new scale-out system grows up to 14 petabytes, but it can still be managed in a single array."
  • The Big Cheese: Powerful Version Of Google Search Appliance Can Grow Exponentially.

    Updates to Articles on High Scalability

  • Streamy Explains CAP and HBase's Approach to CAP - We plan to employ inter-cluster replication, with each cluster located in a single DC. Remote replication will introduce some eventual consistency into the system, but each cluster will continue to be strongly consistent. Updated: How Google Serves Data from Multiple Datacenters.

    The fantasy sponsor for this post is those little food kiosks outside Home Depot stores. I love their Fire Dogs. Hot and yummy. I bet most home improvement projects in America are inspired by cravings for one of these little beauties.
  • Monday
    Aug242009

    How Google Serves Data from Multiple Datacenters

    Update: Streamy Explains CAP and HBase's Approach to CAP. We plan to employ inter-cluster replication, with each cluster located in a single DC. Remote replication will introduce some eventual consistency into the system, but each cluster will continue to be strongly consistent.

    Ryan Barrett, Google App Engine datastore lead, gave this talk Transactions Across Datacenters (and Other Weekend Projects) at the Google I/O 2009 conference.

    While the talk doesn't necessarily break new technical ground, Ryan does an excellent job explaining and evaluating the different options you have when architecting a system to work across multiple datacenters. This is called multihoming, operating from multiple datacenters simultaneously.

    As multihoming is one of the most challenging tasks in all computing, Ryan's clear and thoughtful style comfortably leads you through the various options. On the trip you learn:

  • The different multi-homing options are: Backups, Master-Slave, Multi-Master, 2PC, and Paxos. You'll also learn how they each fare on support for consistency, transactions, latency, throughput, data loss, and failover.
  • Google App Engine uses master/slave replication between datacenters. They chose this approach in order to provide:
    - lowish latency writes
    - datacenter failure survival
    - strong consistency guarantees.
  • No solution is all win, so a compromise must be made depending on what you think is important. A major Google App Engine goal was to provide a strong consistency model for programmers. They also wanted to be able to survive datacenter failures. And they wanted write performance that wasn't too far behind a typical relational database. These priorities guided their architectural choices.
  • In the future they hope to offer optional models so you can select Paxos, 2PC, etc for your particular problem requirements (Yahoo's PNUTS does something like this).

    There's still a lot more to learn. Here's my gloss on the talk:

    Consistency - What happens when you read after a write?

    Read/write data is one of the hardest kinds of data to run across datacenters. Users expect a certain level of reliability and consistency.

  • Weak - it might be there, might not. Best effort. Like memcached. It's OK to drop for some applications like VoIP, live video, and multiplayer games. You care more about where things are now, not where they were. For data this is not good.
  • Eventual - You eventually see the stuff you wrote, just not right away. Email is a good example. You send it but it doesn't arrive right away, but it gets there, eventually. DNS change propagation, SMTP, Amazon S3, SimpleDB, search engine indexing are all of this type. There's a delay after a write when a read won't see what was written, but the writes eventually push through. Still not ideal for data.
  • Strong - The ideal solution for a structured data system. You get what you put in. Simplest to program against and think about. Any read after a write will return what was written. AppEngine, file systems, Microsoft Azure, and RDBMSes work this way.
  • Once we move data across datacenters what consistency guarantees do we have? We can give up some guarantees, but we should know what we are getting.

    Transactions - Extended form of consistency across multiple operations.

  • Transaction Properties: Correctness, consistency, enforcing invariants, ACID.
  • Example: bank transaction. Transfer money from A to B. Subtract money from A and add it to B. These happen at different times. What happens if another transfer happens for A in between? What happens if there's a failure? What happens if a program reads from A or B mid-transfer? You want guarantees. On a crash, will money added to B still be added to B? Will money taken from A still be taken from A? You don't want to lose or create money. (A minimal single-node sketch of these guarantees follows this list.)
  • When you start operating across datacenters it's even harder to enforce transactions because more things can go wrong and operations have high latency.
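    Here's a minimal, single-node sketch of the bank-transfer guarantees described above, using Python's built-in sqlite3. The table and amounts are invented for illustration; the cross-datacenter versions of these guarantees are what the rest of the talk is about.

        import sqlite3

        conn = sqlite3.connect(":memory:")
        conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
        conn.executemany("INSERT INTO accounts VALUES (?, ?)", [('A', 100), ('B', 0)])
        conn.commit()

        def transfer(conn, src, dst, amount):
            # "with conn" opens a transaction: both updates commit together,
            # or neither does, so money is never lost or created.
            with conn:
                cur = conn.execute(
                    "UPDATE accounts SET balance = balance - ? WHERE name = ? AND balance >= ?",
                    (amount, src, amount))
                if cur.rowcount != 1:
                    raise ValueError("insufficient funds")
                conn.execute(
                    "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                    (amount, dst))

        transfer(conn, 'A', 'B', 40)
        print(conn.execute("SELECT * FROM accounts ORDER BY name").fetchall())
        # [('A', 60), ('B', 40)]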

    Why Operate in Multiple Datacenters?

  • Sh*t happens - datacenters fail for any number of reasons.
  • Performance - geolocality allows operations to be moved closer to the user. The speed of light limits how fast data can be transferred and becomes significant when operating across the world. Going through multiple router hops also slows traffic. So closer is better, and you can only be closer if your data is near the user, which requires operating in multiple datacenters. CDNs do this for you, especially for more static data. They put data everywhere.

    Why Not Operate in Multiple Datacenters?

  • Operating in a single datacenter is easy: Low cost bandwidth. Low latency. High bandwidth. Easy operations. Easier code.
  • Operating in multiple datacenters is hard: high cost bandwidth, high latency, low bandwidth, difficult operations, harder code.
  • It's especially hard if you have a read/write structured data system where you accept writes from more than one location. You have consistency problems. Maintaining consistency in the face of the distances and failures is non-trivial.

    Your Different Architecture Options

  • Single Datacenter. Don't bother operating in multiple datacenters. This is the easiest option and is what most people do. But datacenters fail, you could lose data, and your site could go down.
  • Bunkerize. Create a Maginot Line for the Ultimate Datacenter. Make sure your datacenter doesn't ever go down. SimpleDB and Azure use this strategy.
  • Single Master. Pick a master datacenter that writes go to; the other sites replicate from it and offer read-only services.
    - Better, but not great.
    - Data are usually replicated asynchronously so there's a window of vulnerability for loss.
    - Data in your other datacenters may not be consistent on failure.
    - Popular with financial institutions.
    - You get geolocation to serve reads. Consistency depends on the technique. Writes are still limited to one datacenter.
  • Multi-Master. True multihoming. The Holy Grail. All datacenters are serving reads and writes. All data is consistent. Transactions just work. This is really hard.
    - So some choose to do it with just two datacenters. NASDAQ has two datacenters close together (low latency) and performs a two-phase commit on every transaction, but they have very strict latency requirements.
    - Using more than two datacenters is fundamentally harder. You pay for it with queuing delays, routing delays, and the speed of light. You have to talk between datacenters. It's just fundamentally slower with a smaller pipe. You may pay for it with capacity and throughput, but you'll definitely pay in latency.

    How Do You Actually Do This?

    What are the techniques and tradeoffs of different approaches? Here's the evaluation matrix:

                  Backups   M/S         MM          2PC         Paxos
    Consistency   Weak      Eventual    Eventual    Strong      Strong
    Transactions  No        Full        Local       Full        Full
    Latency       Low       Low         Low         High        High
    Throughput    High      High        High        Low         Medium
    Data loss     Lots      Some        Some        None        None
    Failover      Down      Read-only   Read/Write  Read/Write  Read/Write


    - M/S = master/slave, MM = multi-master, 2PC = two-phase commit
    - What kind of consistency, transactions, latency, and throughput do we get for a particular approach? Will we lose data on failure? How much will we lose? When we fail over for maintenance, or we want to move things, say decommissioning a datacenter, how well do the techniques support it?

  • Backups - Make a copy of your data that's secret and safe. Generally weak consistency. Usually no transactions. Used for the first internal datastore launch. Not good enough for a production system. Lose data since last backup. You are down while restoring a backup to another datacenter.

  • Master/Slave Replication - Writes to a master are also written to one or more slaves.
    - Replication is asynchronous so good for latency and throughput.
    - Weak/eventual consistency unless you are very careful.
    - You have multiple copies in the datacenters, so you'll lose a little data on failure, but not much. Failover can go read-only until the master has been moved to another datacenter.
    - Datastore currently uses this mechanism. True multihoming adds latency because you have to add the extra hop between datacenters. App Engine is already slow on writes so this extra hit would be painful. M/S gives you most of the benefits of better forms while still offering lower latency writes.

  • Multi-Master Replication - support writes from multiple datacenters simultaneously.
    - You figure out how to merge all the writes later when there's a conflict. It's like asynchronous replication, but you are serving writes from multiple locations.
    - The best you can do is eventual consistency. Writes don't immediately go everywhere. This is a paradigm shift. Backups and M/S assumed a strongly consistent system and didn't change that; they are just techniques to help us multihome. Multi-master literally changes how the system runs because the multiple writes must be merged.
    - To do the merging you must find a way to serialize, to impose an ordering on all your writes. There is no global clock. Things happen in parallel. You can never know what happened first. So you make it up using timestamps, local timestamps + skew, local version numbers, or a distributed consensus protocol. This is the magic and there are a number of ways to do it. (A tiny last-write-wins sketch follows this list.)
    - There's no way to do a global transaction. With multiple simultaneous writes you can't guarantee transactions. So you have to figure out what to do afterward.
    - AppEngine wants strong consistency to make building applications easier, so they didn't consider this option.
    - Failover is easy because each datacenter can handle writes.
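    Here's a tiny, generic sketch of one of the ordering tricks mentioned above (timestamps plus a tie-breaker, i.e. last-write-wins). It is only an illustration of merging concurrent writes, not App Engine's or anyone else's actual mechanism.

        def merge(replica_a, replica_b):
            """Merge two per-key maps of (timestamp, node_id, value); highest (timestamp, node_id) wins."""
            merged = dict(replica_a)
            for key, candidate in replica_b.items():
                current = merged.get(key)
                # ties on timestamp are broken by node id, so every replica picks the same winner
                if current is None or candidate[:2] > current[:2]:
                    merged[key] = candidate
            return merged

        a = {"profile:1": (1000, "dc-east", "alice@old.example")}
        b = {"profile:1": (1000, "dc-west", "alice@new.example")}
        print(merge(a, b))  # both datacenters converge on the dc-west write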

  • Two Phase Commit (2PC) - protocol for setting up transactions between distributed systems.
    - Semi-distributed because there's always a master coordinator for a given 2PC transaction. Because there are so few datacenters you tend to go through the same set of master coordinators.
    - It's synchronous. All transactions are serialized through that master which kills your throughput and increases latency.
    - They never seriously considered this option because write throughput is very important to them. A single point of failure or serialization point wouldn't work for them. Latency is high because of the extra coordination. Writes can be in the 200 msec area.
    - This option does work though. You write to all datacenters or nothing. You get strong consistency and transactions.
    - Need N+1 datacenters. If you take one down then you still have N to handle your load.
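    To make the prepare/commit flow concrete, here's a toy two-phase commit coordinator in Python. Real implementations add persistent logs, timeouts, and recovery, none of which this sketch attempts.

        class Participant:
            def __init__(self, name, will_commit=True):
                self.name, self.will_commit, self.state = name, will_commit, "init"

            def prepare(self):  # phase 1: vote yes/no
                self.state = "prepared" if self.will_commit else "aborted"
                return self.will_commit

            def commit(self):   # phase 2: coordinator decided commit
                self.state = "committed"

            def abort(self):    # phase 2: coordinator decided abort
                self.state = "aborted"

        def two_phase_commit(participants):
            if all(p.prepare() for p in participants):  # every datacenter must vote yes
                for p in participants:
                    p.commit()
                return True
            for p in participants:  # any "no" vote (or, in real life, a timeout) aborts everyone
                p.abort()
            return False

        dcs = [Participant("dc1"), Participant("dc2"), Participant("dc3", will_commit=False)]
        print(two_phase_commit(dcs), [p.state for p in dcs])  # False ['aborted', 'aborted', 'aborted']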

  • Paxos - A consensus protocol where a group of independent nodes reach a majority consensus on a decision.
    - Protocol: there's a propose step and then an agree step. You only need a majority of nodes to agree to say something is persisted for it to be considered persisted.
    - Unlike 2PC it is fully distributed. There's no single master coordinator.
    - Multiple transactions can be run in parallel. There's less serialization.
    - Writes are high latency because of the two extra coordination round trips required by the protocol.
    - They wanted to do this, but they didn't want to pay the 150 msec latency hit on writes, especially when competing against 5 msec writes for RDBMSes.
    - They tried using physically close datacenters, but the built-in multi-datacenter overhead (routers, etc.) was too high. Even within the same datacenter it was too slow.
    - Paxos is still used a ton within Google, especially for lock servers and for coordinating anything they do across datacenters, particularly when state is moved between datacenters. If your app is serving data in one datacenter and it should be moved to another, that coordination is done through Paxos. It's also used in managing memcache and offline processing.

    Miscellaneous

  • Entity Groups are the unit of consistency in AppEngine. Operations are serialized on Entity Groups. The log for each commit to an entity group is replicated. This maintains consistency and provides transactions. Entity Groups are essentially shards. Sharding enables scaling because it allows you to handle a lot of writes. Datastore shards in entity group size chunks. BuddyPoke has 40 million users, each of which has an entity group. That's 40 million different shards. (A small entity-group sketch follows this list.)
  • Eating your own dog food is a strategy used a lot at Google. Iterate and make people use new features internally. They use a ton of stuff that's very early. You can iterate many times, which improves it before you are ready to launch.
  • They see relational databases in the datacenter as their competition as much as Azure and SimpleDB. Inserts into an RDBMS are in the low milliseconds. Writes into AppEngine are 30-40 msecs. Reads are fast. They like this trade-off because on the web reads vastly outnumber writes.
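    For readers who haven't used the datastore, here's a small sketch of what an entity group looks like with the 2009-era App Engine Python API. The BuddyPoke-style counter model is hypothetical; only the general pattern of parent keys and db.run_in_transaction comes from the SDK.

        from google.appengine.ext import db

        class User(db.Model):
            name = db.StringProperty()

        class PokeCount(db.Model):        # giving it a User parent puts it in that
            count = db.IntegerProperty()  # user's entity group, the unit of transactions

        def increment_pokes(user_key):
            def txn():
                counter = PokeCount.all().ancestor(user_key).get()
                if counter is None:
                    counter = PokeCount(parent=user_key, count=0)
                counter.count += 1
                counter.put()
            # serialized against every other write to this entity group (shard)
            db.run_in_transaction(txn)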

    Discussion

    A few things I wondered about during the talk. Did they ever consider a distributed MVCC approach? That might be interesting and wasn't addressed as an option. Clearly at Google scale an in-memory data grid isn't yet appropriate.

    A preference for the strong consistency model was repeatedly specified as a major design goal because this makes the job of the programmer easier. A counter to this is that the programming model for Google App Engine is already very difficult. The limits and the lack of traditional relational database programming features put a lot of responsibility back on the programmer to write a scalable app. I wonder if giving up strong consistency would have been such a big deal in comparison?

    I really appreciated the evaluation matrix and the discussion of why Google App Engine made the choices they did. Since writes are already slow on Google App Engine they didn't have a lot of headroom to absorb more layers of coordination. These are the kinds of things developers talk about in design meetings, but they usually don't make it outside the cubicle or conference room walls. I can just hear Ryan, with voice raised, saying "Why can't we have it all!" But it never seems we can have everything. Many thanks to Ryan for sharing.

    Related Articles

  • Slides for the Talk
  • ZooKeeper - A Reliable, Scalable Distributed Coordination System
  • Yahoo!'s PNUTS Database: Too Hot, Too Cold or Just Right?
  • Paper: Consensus Protocols: Paxos by Henry Robinson
  • Paper: Consensus Protocols: Two-Phase Commit by Henry Robinson
  • Paper: Dynamo: Amazon’s Highly Available Key-value Store
  • Are Cloud Based Memory Architectures the Next Big Thing?
  • Tuesday
    Aug182009

    Real World Web: Performance & Scalability

    We've referenced this 189 slide masterpiece by Ask Bjorn Hansen before, but it was hidden without its own first class link. He describes his presentation as 3 hours of 5 minute lightning talks and that sounds about right.

    The presentation covers the overall platform and architecture considerations involved in tuning applications from a holistic perspective. You'll be shown how to design scalable architectures for dynamic, high-volume web sites. Topics covered include caching, scalable database design, replication architecture, load-balancing, and architectural decisions derived from many years of experience.

    His prime directive of scaling: Think Horizontally at every point in your architecture, not just at the web tier.

    You may not agree with everything, but there's a lot of useful advice. Here's a summary of some of what is covered:

  • Benchmarking
  • Vertical scaling sucks.
  • Horizontal scaling rocks.
  • Run many application servers
  • Don't keep state in the app server
  • Be stateless
  • Optimization is necessary, but is different than scalability.
  • Cache things you hit all the time.
  • Measure, don't assume, check.
  • Make pages static.
  • Caching is a trade-off.
  • Cache full pages.
  • Cache partial pages.
  • Cache complex data.
  • MySQL query cache is flushed on update.
  • Cache invalidation is hard.
  • Replication scales reads, not writes.
  • Partition to scale writes. 96% of applications can skip this step.
  • Master-master setup facilitates on-line schema changes.
  • Create summary tables and summary databases rather than do COUNT and GROUP-BY at runtime.
  • Make code idempotent. If it fails you should be able to just run it again (see the sketch below).
  • Load data asynchronously. Aggregate updates into batches.
  • Move processing to application and out of the database as much as possible.
  • Stored procedures are dangerous.
  • Add more memory.
  • Enable query logging and take a look at what your app is doing.
  • Run different MySQL instances for different work loads.
  • Config tuning helps, query tuning works.
  • Reconsider persistent DB connections.
  • Don't overwork the database. It's hard to scale.
  • Work in parallel.
  • Use a job queuing system.
  • Log http requests.
  • Use light processes for light tasks.
  • Build on APIs internally. Clean loosely coupled APIs are easy to scale.
  • Don't incur technical debt.
  • Automatically handle failures.
  • Make services that always work.
  • Load balancing is the key to horizontal scaling.
  • Redundancy is not load-balancing. Always have n+1 capacity.
  • Plan for disasters.
  • Make backups.
  • Keep software deployments easy.
  • Have everything scripted.
  • Monitor everything. Graph everything.
  • Run one service per server.
  • Don't ever swap memory for disk.
  • Run memcached if you have extra memory.
  • Use memory to save CPU or IO. Balance memory vs CPU vs IO.
  • Netboot your application servers.
  • There are a lot of good slides on what to graph.
  • Use a CDN.
  • Use YSlow to find client side problems.

    This is just a high level blitz through the presentation. Topics are given a lot more detail in the presentation. Audio of Ask's dulcet tones would be nice, but there's still a lot to learn here.
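    As a hedged aside on two of the bullets above (summary tables and idempotent code), here is a minimal Python/sqlite3 sketch of a batch job that rebuilds a summary table and is safe to re-run after a failure. The table and column names are invented for illustration.

        import sqlite3

        def rebuild_daily_summary(conn, day):
            # One transaction: a crashed or repeated run converges to the same rows.
            with conn:
                conn.execute("DELETE FROM page_views_daily WHERE day = ?", (day,))
                conn.execute(
                    "INSERT INTO page_views_daily (day, page, views) "
                    "SELECT ?, page, COUNT(*) FROM access_log WHERE day = ? GROUP BY page",
                    (day, day))

        conn = sqlite3.connect("stats.db")
        conn.execute("CREATE TABLE IF NOT EXISTS access_log (day TEXT, page TEXT)")
        conn.execute("CREATE TABLE IF NOT EXISTS page_views_daily (day TEXT, page TEXT, views INTEGER)")
        rebuild_daily_summary(conn, "2009-08-18")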
  • Sunday
    Aug162009

    ThePort Network Architecture

    ThePort Network's Director of Engineering, TJ Muehleman, was kind enough to share some of the architectural details for their white label social media system. It currently runs about 50 social networks varying in size from less than 1000 members to more than 300,000 members, all on a Microsoft stack. In addition to their social networking platform, they offer Javascript APIs and web service APIs (both REST and SOAP) which account for a significant percentage of overall system usage.

    ThePort is an excellent example of a real world in-the-trenches product offering real value to customers. One of the most interesting problems they have to solve is multi-tenancy. How do you provide good performance, complete customization, support, develop new features, and provide individual search indexes for each customer? It's not an easy problem to solve.

    How did they solve their problems and build a successful system? 

    Site: http://theport.com

    Platform

  • Microsoft.NET 3.5
  • C# / VB.NET
  • SQL Server 2005
  • Visual Studio 2008 Pro Edition
  • Prototype
  • Subversion
  • TortoiseSVN
  • Trac (for internal defect tracking. Will possibly move all internal and external issue tracking to it)
  • Beyond Compare 3
  • Web Tier
    * 6 x Dell blade servers running windows 2008 / IIS 7
  • Data Tier
    * 1 r/w SQL Cluster – Dell 6850s (6 single core processors, 32 GB RAM)
    * 2 read-only Dell 2950s (2 quad core processors, 16 GB RAM)
    * 1 distribution server – Dell 2950 (2 quad core processors, 16 GB RAM)
  • We also use SQL Server Service Broker as a queuing system for some of our saves. It's an alternative to MSMQ that uses the DB for persistence in case of failure. We will most likely be moving to MSMQ in the near future to remove us from SQL dependence.
  • Caching
    * 2 Dell blade servers 8 GB RAM each to total 16 GB of available RAM
    * Running SharedCache (Basically an open source .NET port of MemCacheD. We initially looked at MemCacheD but our internal benchmarking indicated SharedCache had better performance – at least w/in a Microsoft environment. We may still investigate Microsoft's Velocity cache platform when it goes live)
  • Search
    * 2 Dell 2950s with 725 GB Storage
    * Running Lucene + SOLR
    * We chose Lucene over Lucene.NET because Lucene.NET's wildcard search was a little buggy in our initial beta testing. SQL Full Text wasn't a viable option because there was no clear and easy way to split indexes between customers. SOLR cores make this part easy. Above and beyond that, Lucene is lightning fast and is available with features we couldn't turn down (proximity search, searching w/in documents, and built-in RESTful APIs to name a few)

    How do you handle multi-tenancy?

    A multi-tenant platform has two primary hurdles to overcome:

    1. Preventing a single, large customer from overwhelming the system.

    The primary bottleneck for this is in the data layer. Our current DB architecture has helped mitigate this problem. The read-only servers help offset most of this by absorbing the bulk of the data calls. We did have to beef up the distribution server because latency between the r/w server and the read only servers had crept too high. Getting a new machine (2 quad cores with 16 GB of RAM) helped reduce the latency to less than a second.

    However robust the cluster is, we've concluded that we will eventually have to move to a sharded architecture with MySQL. MS SQL licensing fees makes both continuing to enhance the cluster and scaling out to multiple machines prohibitive. Additionally, sharding allows us to scale either by customer (because some may be more active than others) or by functional area (photos, comments, etc).

    2. Allowing clients to have total control over the look, feel, and user experience of their sites.


    Allowing CSS control isn't enough; we needed a templating system that allows total control over the site. We looked at using .NET master pages and user controls to accomplish this. But that assumes a level of knowledge in .NET for outside developers. We built a proprietary templating system that unfortunately became too limiting and would one day lead to a drag on performance.

    So we settled on using XML / XSLT. All of our business / entity objects are serializable to XML. This made XSLT a natural choice from the templating angle. We've seen a considerable boost in performance from this upgrade and an even greater increase in flexibility in terms of what our designers can do. Once the learning curve is overcome, the web designers love the amount of control they get.

    What did you do that was especially cool that people could learn from?

    XSLT as Custom Templating System

    Building a templating system in XSLT that actually allows the template author to make a web service call to our internal web service layer (or external web services) straight from the templating system. This allowed the development team to build a flexible, powerful system that allows a web designer to embed real-time calls into a given template. We accomplish this using XSLT Extension Objects. What we've found in our internal testing is that these extension objects scale way better than our previous templating system (a homegrown proprietary system). We've used ANTS profiler to compare the two and the difference is in orders of magnitude.

    Obviously we have to cache the hell out of this or the performance of the pages the calls are embedded in would suffer. For now, we make the internal web services calls via HTTP, but we will soon be moving this to a TCP call to take advantage of the better connection pooling offered by TCP. We're most likely to use WCF because of its native support of TCP bindings. However, we haven't yet benchmarked that so it's possible it could change.

    Not Using the Database to Build Collections

    Another cool thing we've done is to move strongly away from using the database to retrieve collections of 'things'. For instance, if we needed a collection of comments, previously we'd hit the database for the 5, 10, 100, etc comments we wanted, do the sorting / filtering in the DB, return a single dataset, cache that, and then display.

    However, this is a database intensive operation, especially if you're going to join against user data (which you inevitably will). What we've started doing recently is caching the recent comment objects, and using our cache provider's MultiGet ability to retrieve all the comments at the same time. We then sort / filter in memory in the application tier, discard whatever comments we don't need, and then display. We found that doing it this way saves lots of hits to our database and, in fact, we saw a considerable performance gain from it.

    Our tests (on a developer laptop) fetched 10,000 objects from cache in about 1 second, then sorted them by date time in about 0.015 seconds.
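    Here's the same pattern sketched in Python, using python-memcached's get_multi as a stand-in for their .NET SharedCache client; the key names and comment fields are invented for illustration.

        import memcache

        mc = memcache.Client(["127.0.0.1:11211"])

        def recent_comments(comment_ids, limit=10):
            keys = ["comment:%d" % cid for cid in comment_ids]
            cached = mc.get_multi(keys)          # one round trip fetches every comment
            comments = [cached[k] for k in keys if k in cached]
            # sort and filter in the application tier instead of in the database
            comments.sort(key=lambda c: c["created_at"], reverse=True)
            return comments[:limit]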

    What prompted you to move to a SOA architecture?

    To better compartmentalize our code.

    Given the growth of our templating system mentioned above, we realized it was best to truly separate the tiers into discrete areas. Since our application is easily accessed via a set of REST APIs and our own internal skinning system (and who knows what in the future), dividing the application like this gives us a lot of leeway in being able to swap out components. Additionally, we're doing more and more queuing which lines up nicely with SOA.

    Performance

    Since modern web apps deal with complex data, breaking the work into more discrete operations handled by offline processes on their own infrastructure makes a lot of sense from a performance point of view.

    How do you handle consistency between the database and the search engine?

    We have a multi-threaded windows service that scans our database once every 5 minutes looking for new data. The service then adds the new items to the Lucene index. We keep audit columns on all our database tables so capturing new data is pretty simple. Once a night, we purge the Lucene index and run a full rescan of the database. We think this system will work for the near to mid term but long term, we'll take advantage of a queuing system to keep the index in sync.
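    A rough Python analogue of that sync loop (their version is a multi-threaded Windows service feeding Lucene/SOLR) might look like the following. It assumes pysolr and sqlite3 as stand-ins, and the table and column names are invented.

        import time, sqlite3, pysolr

        solr = pysolr.Solr("http://localhost:8983/solr/")
        conn = sqlite3.connect("theport.db")
        last_seen = "1970-01-01 00:00:00"

        while True:
            # audit columns (updated_at) make finding new data simple
            rows = conn.execute(
                "SELECT id, title, body, updated_at FROM posts "
                "WHERE updated_at > ? ORDER BY updated_at",
                (last_seen,)).fetchall()
            if rows:
                solr.add([{"id": r[0], "title": r[1], "body": r[2]} for r in rows])
                last_seen = rows[-1][3]
            time.sleep(300)  # scan every 5 minutes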

    How do you handle releases, support, bug fixing, development, etc.?

    We have a decent sized dev team. 1 platform architect responsible for overall system architecture (selecting which systems to use, tuning them), 1 lead software architect, and 3 senior – mid level developers. Since we're a start-up in a fast evolving market (social media) we find that we're constantly having to adjust to market demands and the latest in social functionality. So we have a 2 month build cycle which is pretty aggressive.

    In terms of actual development, we've found the following to be keys to success:
    1. Daily stand-ups: it's absolutely necessary for everyone on the team to know what the others are doing. With a code base as large as ours, it's very likely I'm writing a function someone has already written or solving a problem someone has solved previously. Daily stand-ups help with that.
    2. Iterate: Build the core functionality, get it into QA and / or beta, beat the bugs out of it, move to the next piece. We've found this to be easier said than done. Market pressures sometimes dictate you roll with something more feature rich than you'd like. Sticking to an iterative cycle creates better code and more market ready products.
    3. Beta test: This goes hand-in-hand w/ #2 above. Get something done and get it in the hands of actual users. This is the best way to find where your app falls down.

    With regard to support / bug fixing, we're moving to a forums based support model for many of our customers. We've found the same problems, especially in an app as configurable as ours, occur over and over. Getting those answers into an open, searchable format should hopefully cut down on confusion and get developers talking directly to developers.

    Internally we use Trac for bug tracking and devote roughly 20% of our week maintaining, supporting, and fixing issues. That may seem like a lot but given how configurable our system is, we're essentially running 50 heavily data driven websites.

    WCF sounds like a buggy, underperforming mess. How is it working?

    So far we have no complaints with WCF. I think baking it directly into .NET 3.5 helped iron a lot of the big kinks out. It does come with its quirks, no doubt. We built our REST libraries on top of it and found that posting XML is not exactly the easiest thing in the world. But it was more than made up for by the ease of deploying all our GET operations with REST. Our next step will be to set up TCP and MSMQ bindings with WCF to handle our internal service requests and queuing, respectively. Since WCF exposes all of these bindings natively, we think we will see a lot of effective code re-use out of this.

    I'd like to thank TJ for taking the time and making the effort to write up their architecture for people to learn from. I'm sure it will help others when they are trying to build their own systems.

    You too can share the architecture for your amazing system. Come on, you've learned a lot from others, it's time to return the favor and give back. It's not that hard, really. If interested please contact me and we can get started.
  • Thursday
    Aug132009

    Reconnoiter - Large-Scale Trending and Fault-Detection

    One of the top recommendations from the collective wisdom contained in Real Life Architectures is to add monitoring to your system. Now! Loud is the lament for not adding monitoring early and often. The reason is easy to understand. Without monitoring you don't know what your system is doing which means you can't fix it and you can't improve it. Feedback loops require data.

    Some popular monitoring options are Munin, Nagios, Cacti and Hyperic. A relatively new entrant is a product called Reconnoiter from Theo Schlossnagle, President and CEO of OmniTI, leading consultants on solving problems of scalability, performance, architecture, infrastructure, and data management. Theo's name might sound familiar. He gives lots of talks and is the author of the very influential Scalable Internet Architectures book.

    So right away you know Reconnoiter has a good pedigree. As Theo says, their products are born of pain, from the fire of solving real-life problems and that's always a harbinger of good things to come.

    The problem Reconnoiter is trying to solve is monitoring thousands of nodes across many datacenters where the nodes can vary widely in power, architecture, and software configuration. With that kind of problem what they really want is the ability to:

     

  • Configure everything from one place.
  • Cheap checks that run on the specified time interval, aren't late, and don't cause a heavy load on the machine.
  • Change the configuration from any datacenter without coordination.
  • Add checks in the field.
  • Separate data collection from visualization and fault-detection.
  • Analyze trends for long-term capacity planning and postmortem analysis.
  • Detect when faults have happened and when they are about to happen.
  • Support trending: intelligent data correlation, regression analysis/curve fitting, and looking into the past to see how you got where you are now so you can do better next time.
  • Create a monitoring system that doesn't require a separate powerful network and its own set of hosts on which to run.

    If you've ever used or written a distributed stats collection system the architecture of Reconnoiter will look somewhat familiar:


    Some of the more interesting bits of the architecture are:
  • PostgreSQL stores all the data. The data isn't stuck in funky little files.
  • Fault-detection is based on Esper, a streaming complex event processing system. It's not clear how well this approach will work but the hooks are there.
  • A Comet-style web server is used to feed real-time updates. Much better than your traditional polling cycle.
  • Although the web console is PHP based, PHP is used mainly to execute JSON calls. Rendering happens in the browser in an AJAX client.
  • Canvas is used for real time graphics. No images are created on the fly.
  • Data is transferred securely over SSL.
  • The system is robust against failures.
  • Data is not thrown away as it is with some systems so you can check against history.

    Reconnoiter isn't completely pain free. Lua for an extension language is an interesting choice. The installation and configuration process is very complex. There are a lot of separate steps and bits to configure. Another potential problem is that monitoring produces a lot of real-time data. I have to wonder if PostgreSQL can handle that flow for very large systems. The data is partitioned by month, but a large number of machines and a large number of events can be crushing. And I wasn't sure if graph data could be correlated with released features or other system changes. In the video Theo mentions seeing in the graphs that using deflate improved performance, but I'm not sure just looking at the graph how you would be able to correlate system data with system changes.

    It's droolingly clear that where Reconnoiter shines is in creating complex graphs, charts, and other visualizations. The graphs look useful and quick to render. The real time visualizations are spectacular and extremely difficult to do in other systems.

    Related Articles



  • OmniTI Reconnoiter: Web Management and Analysis by Eric J. Bruno
  • Reconnoiter Update by Theo Schlossnagle
  • Reconnoiter Project Home Page
  • Video: Reconnoiter: a whirlwind tour
  • Big Picture of the Overall System
  • Reconnoiter: Monitoring and Trend Analysis from OSCON
  • OmniTI Unveils Open Source Monitoring Tool, Reconnoiter by Jayashree Adkoli
  • The sad state of open source monitoring tools by Grig Gheorghiu
  • How to Succeed at Capacity Planning Without Really Trying : An Interview with Flickr's John Allspaw on His New Book.
  • New open source IT management tool: Lighter-weight than Nagios, more granular than Cacti by Matt Stansberry
  • Tuesday
    Aug112009

    13 Scalability Best Practices

    AKF Partners has released what they feel are the Best Practices for Scalability:

    1. Asynchronous - Use asynchronous communication when possible. 
    2. Swim Lanes – Create fault isolated “swim lanes” of hardware by customer segmentation.
    3. Cache - Make use of cache at multiple layers.
    4. Monitoring - Understand your application’s performance from a customer’s perspective. 
    5. Replication - Replicate databases for recovery as well as to off load reads to multiple instances.
    6. Sharding - Split the application and databases by service and / or by customer using a modulus. 
    7. Use Few RDBMS Features – Use the OLTP database as a persistent storage device as much as possible. 
    8. Slow Roll – Roll out new code versions slowly, to a small subset of your servers without bringing the entire site down.
    9. Load & Performance Testing – Test the performance of the application version before it goes into production. 
    10. Capacity Planning / Scalability Summits – Know how much capacity you have on all tiers and services in your system. 
    11. Rollback – Always have the ability to rollback a code release.
    12. Root Cause Analysis - Ensure you have a learning culture that is evident by utilizing Root Cause Analysis to find and fix the real cause of issues.
    13. Quality From The Beginning – Quality can’t be tested into a product, it must be designed in from the beginning.
    This is just a quick summary; more details are on their site.

    Saturday
    Aug082009

    Yahoo!'s PNUTS Database: Too Hot, Too Cold or Just Right?

    So far every massively scalable database is a bundle of compromises. For some the weak guarantees of Amazon's eventual consistency model are too cold. For many the strong guarantees of standard RDBMS distributed transactions are too hot. Google App Engine tries to get it just right with entity groups. Yahoo! is also trying to get it just right by offering per-record timeline consistency, which hopes to serve up a heaping bowl of rich database functionality and low latency at massive scale:

    We describe PNUTS [Platform for Nimble Universal Table Storage], a massively parallel and geographically distributed database system for Yahoo!'s web applications. PNUTS provides data storage organized as hashed or ordered tables, low latency for large numbers of concurrent requests including updates and queries, and novel per-record consistency guarantees. It is a hosted, centrally managed, and geographically distributed service, and utilizes automated load-balancing and failover to reduce operational complexity. The first version of the system is currently serving in production. We describe the motivation for PNUTS and the design and implementation of its table storage and replication layers, and then present experimental results.

    Some of the cool things about PNUTS are:

  • They actually talk about the hard problem of how to scale a system to 10 different data centers (each with 1,000+ servers) while supporting secondary indexes, materialized views, the ability to create multiple tables, and hash-organized tables. Multi-datacenter operation is so difficult it's usually ignored. PNUTS is designed specifically to operate in many datacenters with a strongish consistency model, which makes it a very interesting design point.
  • You can subscribe to a reliable ordered stream of updates on a table. This is massively convenient. For many applications numerous processes are tied to data changes and this is normally a pain to implement.
  • The consistency model is per-record timeline consistency: all replicas of a given record apply all updates to the record in the same order. This provides a consistency model that is between the two extremes of serialized transactions and eventual consistency. Unlike Dynamo, conflicting versions of a record can't exist at the same time. (A tiny sketch of this ordering rule follows this list.)
  • Supports records, but accepts queries only on individual tables. There's no fixed schema for records and columns can be typed or be blobs. Transactions exist only at the record level. This means that, as with other NoSQL databases, denormalization is the modeling strategy since there's no way to have transactions across tables.
  • The degree of read consistency desired can be specified. Records are versioned. You can ask for the latest record version or allow for potentially stale records.
  • Asynchronous replication is used to ensure low write latency while providing geographic replication. Reads should be fast everywhere and may return older versions, writes should be fast locally (in the same datacenter). [3]
  • A message broker that serves both as the replication mechanism and redo log of the database. The message broker guarantees no replica can receive updates out of order because it provided a reliable, totally ordered message channel. This also means you can't have transactions across tables, consistent joins, or foreign keys. They chose this approach over a gossip mechanism (like Dynamo) because it can be optimized for geographically distant replicas and because replicas do not need to know the location of other replicas.
  • PNUTS is hosted and centrally managed. It was built to reduce the overhead of creating and maintaining new applications rather than every property creating their own system. Failover, adding capacity, resharding, performance isolation and supporting applications with different usage profiles are all completely automated.
  • Predicate queries are supported using a scatter-gather mechanism which sends the query to every relevant storage tablet at once. The results are gathered and sent back to the client.
  • Performance is 1-10ms/request when caching layers are in place.
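    Here's a tiny, generic sketch of the per-record timeline idea: a replica applies updates to a record only in the order the record's master assigned them. This is an illustration of the rule, not PNUTS code; the redelivery behavior is assumed to come from the reliable, ordered message broker described above.

        class Replica:
            def __init__(self):
                self.values = {}    # record key -> current value
                self.applied = {}   # record key -> last applied sequence number

            def apply(self, key, seq, value):
                # Per-record timeline: updates must arrive in the master-assigned order.
                expected = self.applied.get(key, 0) + 1
                if seq != expected:
                    return False    # out of order; the broker redelivers until the gap fills
                self.applied[key] = seq
                self.values[key] = value
                return True

        r = Replica()
        print(r.apply("user:42", 1, "v1"))  # True
        print(r.apply("user:42", 3, "v3"))  # False: update 2 hasn't arrived yet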

    From a system perspective PNUTS offers a lot of the good things: hosted, reliability, lowish latency, automation, scalability, supports many application models, and there's a lot of room to improvement that all applications will be able to take advantage of when available.

    From a programmer perspective there are also a lot of good things: it's hosted so fewer worries, notifications, flexible schemas, ordered records, secondary indexes, lowish latency, strong consistency on a single record, scalability, high write rates, reliability, and range queries over a small set of records.

    Unfortunately Goldilocks still needs to keep searching for just right, though she may be getting closer. From a system perspective Yahoo!'s ideas are good, but they don't help you as the system isn't available for you to use. From a programmer perspective the programmer's job is still way too hard. To be just right, programmers need low latency aggregate operators, complex transactions, scalable counters, automatic relationship management, and all the other features that will help them just buy instant porridge and be done with it.

    Related Articles

  • Anti-RDBMS: A list of distributed key-value stores
  • Details on Yahoo's distributed database by Greg Linden
  • Thoughts on Yahoo's PNUTS distributed database by Marton Trencseni
  • Data Challenges at Yahoo! - Ricardo Baeza-Yates & Raghu Ramakrishnan
    Yahoo! Research
  • PNUTS: Yahoo!'s Hosted Data Serving Platform by Lucian
  • Yahoo’s PNUTS by Henry Robinson. A very thoughtful and informative overview of the paper.
  • How robust are gossip-based communication protocols? by Lorenzo Alvisi et al.
  • Trading consistency for scalability in distributed architectures.
  • Asynchronous View Maintenance for VLSD Databases by Parag Agrawal et al.
  • BigTable
  • Dynamo
  • Are Cloud Based Memory Architectures the Next Big Thing?
  • The Story of Goldilocks and the Three Bears
  • PNUTS - Platform for Nimble Universal Table Storage
  • Friday
    Aug072009

    Strategy: Break Up the Memcache Dog Pile 

    Update: Asynchronous HTTP cache validations. A proposed HTTP caching extension: if your application can afford to show slightly out of date content, then stale-while-revalidate can guarantee that the user will always be served directly from the cache, hence guaranteeing a consistent response-time user-experience.

    Caching is like aspirin for headaches. Head hurts: pop a 'sprin. Slow site: add caching. Facebook must have a lot of headaches because they popped 805 memcached servers between 10,000 web servers and 1,800 MySQL servers and they reportedly have a 99% cache hit rate. But what's the best way for you to cache for your application? It's a remarkably complex and rich topic. Alexey Kovyrin talks about one common caching problem called the Dog Pile Effect in Dog-pile Effect and How to Avoid it with Ruby on Rails. Glenn Franxman also has a Django solution in MintCache.

    Data is usually cached because it's too expensive to calculate for every hit. Maybe it's a gnarly SQL query you want to avoid and a little stale data is OK. Or maybe the amount of data you have is simply larger than physical memory on any one machine. Or maybe you have the temerity to write to your database and cause its cache to flush so database caching isn't sufficient at a certain level of scale.

    Typical examples are caching article vote counts, comment threads, and event streams. One familiar example that bit me hard is displaying the top N blog articles. Do you want to scan through your entire access log table for every page display? Absolutely not. Especially when the nightly backups are going on and the network is very slow. Not good :-) Yet you still want to update the results every X minutes so the stats stay fresh.

    Data freshness requires a refrigeration truck or an expiry time on your cache entry that causes stats to be periodically recalculated. Now, what happens when your cached data expires and 1000 requests simultaneously try to recalculate the expensive-to-calculate data? Database load spikes and the world nearly ends. And since memcached operations are not atomic it's possible stale data could be cached and you'll serve stale data. Which kind of defeats the purpose of taking load off the database while providing accurate data. So, how do you unpile the dogs?

    No Expire Solution

    If cache items never expire then there can never be a recalculation storm. Then how do you update the data? Use cron to periodically run the calculation and populate the cache. Take the responsibility for cache maintenance out of the application space. This approach can also be used to pre-warm the cache so a newly brought up system doesn't peg the database.
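    A minimal sketch of the cron-driven approach, assuming python-memcached and a placeholder sqlite3 query; the web app only ever reads the key, and the cron job owns freshness.

        import memcache, sqlite3

        mc = memcache.Client(["127.0.0.1:11211"])

        def refresh_top_articles():
            conn = sqlite3.connect("blog.db")
            rows = conn.execute(
                "SELECT article_id, COUNT(*) AS hits FROM access_log "
                "GROUP BY article_id ORDER BY hits DESC LIMIT 10").fetchall()
            # time=0 means no expiry; only eviction under memory pressure can remove it
            mc.set("top_articles", rows, time=0)

        if __name__ == "__main__":   # run from cron every few minutes
            refresh_top_articles()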

    The problem is the solution doesn't always work. Memcached can still evict your cache item when it starts running out of memory. It uses an LRU (least recently used) policy, so your cache item may not be around when a program needs it, which means it will have to go without, use a local cache, or recalculate. And if we recalculate we still have the same piling-on issues.

    This approach also doesn't work well for item specific caching. It works for globally calculated items like top N posts, but it doesn't really make sense to periodically cache items for user data when the user isn't even active. I suppose you could keep an active list to get around this limitation though.

    Stale Date Solution

    This solution introduces a stale date in addition to the expiration date. Glen describes it as:


    The first client to request data past the stale date is asked to refresh the data,
    while subsequent requests are given the stale but not-yet-expired data as if it
    were fresh, with the understanding that it will get refreshed in a 'reasonable'
    amount of time by that initial request


    In the memcached FAQ a one key approach is described:

  • Set the cache item expire time way out in the future.
  • Embed the "real" timeout serialized with the value. For example, set the item to timeout in 24 hours, but the embedded timeout might be five minutes in the future.
  • On a get from the cache, determine if the embedded stale timeout has expired. On expiry, immediately set a new time in the future and re-store the data as is. This closes down the window of risk.
  • Fetch data from the DB and update the cache with the latest value.
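    Here's a hedged Python sketch of that single-key approach with python-memcached; the key name, TTLs, and recalculate hook are placeholders.

        import time, memcache

        mc = memcache.Client(["127.0.0.1:11211"])
        HARD_TTL = 24 * 3600   # what memcached is told
        SOFT_TTL = 5 * 60      # the embedded, "real" timeout
        WINDOW   = 60          # grace period while one client recalculates

        def get_stats(recalculate):
            entry = mc.get("stats")
            if entry is None:   # nothing cached at all: compute and store
                value = recalculate()
                mc.set("stats", (time.time() + SOFT_TTL, value), time=HARD_TTL)
                return value
            stale_at, value = entry
            if time.time() >= stale_at:
                # push the embedded timeout out first so other clients keep serving
                # the old value while this one does the expensive recalculation
                mc.set("stats", (time.time() + WINDOW, value), time=HARD_TTL)
                value = recalculate()
                mc.set("stats", (time.time() + SOFT_TTL, value), time=HARD_TTL)
            return value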

    Alexey describes a different two key approach:
  • Create two keys in memcached: MAIN key with expiration time a bit higher than normal + a STALE key which expires earlier.
  • On a get read STALE key too. If the stale has expired, re-calculate and set the stale key again.
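    And a similar sketch of Alexey's two-key variant: the MAIN key outlives the STALE marker, so a missing marker means "someone should refresh" while everyone else keeps serving the slightly old MAIN value. Key names and TTLs are illustrative.

        import memcache

        mc = memcache.Client(["127.0.0.1:11211"])
        MAIN_TTL, STALE_TTL = 600, 300

        def get_counts(recalculate):
            value = mc.get("counts:main")
            if value is not None and mc.get("counts:stale") is not None:
                return value   # still fresh enough
            # Stale marker expired (or nothing cached yet): re-set it right away so
            # other clients keep using MAIN, then do the expensive recalculation.
            mc.set("counts:stale", 1, time=STALE_TTL)
            value = recalculate()
            mc.set("counts:main", value, time=MAIN_TTL)
            return value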

    I dislike embedding meta data with data so I like Alexey's approach a bit better, even though it doubles the key space.

    None of these options prevent the problem from ever happening, but they do greatly reduce the failure window for relatively little cost.

    Related Articles

  • Memcached Tag at High Scalability
  • Caching Makes Your Brain Explode by Craig Ambrose.
  • The Secret to Memcached by Tobias Lütke.
  • Memcached FAQ.
  • Dog-pile Effect and How to Avoid it with Ruby on Rails memcache-client Patch by Alexey Kovyrin.
  • MintCache by Glenn Franxman.
  • Advanced Rails Caching.. on the Edge by Aaron Batalion.
  • Friday
    Aug072009

    The Canonical Cloud Architecture 

    Update 2: Elastic Load Balancer and EC2 instance bandwidth. It turns out we are limited by bandwidth and not by CPU. Solution: use DNS Round Robin for two to three HighCPU medium instances.
    Update: The Skinny Straw: Cloud Computing's Bottleneck and How to Address It. For cloud computing, bandwidth to and from the cloud provider is a bottleneck. Solution: Evaluate application architecture and consider application partitioning.

    I'm writing this post as a sort of penance. My sin was getting involved in another multi-threaded mess of a program that was rife with strange pauses and unexpected errors. I really should have known better. But when APIs choose to make callbacks from some mystery thread pool it's hard to keep things straight. I eventually sobered up and posted all events to a queue so I could make sure the program would work correctly. Doh. I may never know why the .Net console output stopped working, but I'll live with it.

    And that reminded me that I've been meaning to write a post on the standard Cloud Architecture. I've tried to hit all the common architectures at one time or another, but there have been some excellent sources lately on structuring programs in a cloud that people may "know" in the same way I knew what not to do, but when the code hits the editor those thoughts may have hidden like a kid next to a broken cookie jar.

    The easiest way to create a scalable service is to compose the service from other scalable services. This is how Google AppEngine works and is largely how AWS works as well (EC2, S3, SQS, SimpleDB, etc), though AWS also functions as a blank canvas on which you can draw your own designs.

    The canonical cloud architecture that has evolved revolves around dynamically scalable CPUs consuming asynchronous, persistently queued events. We talked about this idea already in Flickr - Do the Essential Work Up-front and Queue the Rest. The cloud is just another way of implementing the same idea.

    Amazon suggests a few applications of the Cloud Architecture as:

  • Processing Pipelines
    - Document processing pipelines – convert hundreds of thousands of documents from Microsoft Word to PDF, OCR millions of pages/images into raw searchable text
    - Image processing pipelines – create thumbnails or low resolution variants of an image, resize millions of images
    - Video transcoding pipelines – transcode AVI to MPEG movies
    - Indexing – create an index of web crawl data
    - Data mining – perform search over millions of records
  • Batch Processing Systems
    - Back-office applications (in financial, insurance or retail sectors)
    - Log analysis – analyze and generate daily/weekly reports
    - Nightly builds – perform nightly automated builds of source code repository every night in parallel
    - Automated Unit Testing and Deployment Testing – Test and deploy and perform automated unit testing (functional, load, quality) on different deployment configurations every night
  • Websites
    - Websites that “sleep” at night and auto-scale during the day
    - Instant Websites – websites for conferences or events (Super Bowl, sports tournaments)
    - Promotion websites
    - “Seasonal Websites” - websites that only run during the tax season or the holiday season (“Black Friday” or Christmas)

    A good list, but having worked on a seasonal tax website, I can say AWS is a horrible match for that case. AWS only works at the instance level, so you need a whole instance turned on all the time even when there's no demand. This is a complete waste of money. An AWS model truly based on use, combined with an SLA driven dashboard, would be very convenient. But on to cases where AWS is a good fit.

    SmugMug's Cloud Architecture

    AWS pioneer Don MacAskill of SmugMug details how they process high-resolution photos and high-definition video using a cloud hosted queuing architecture in SkyNet Lives! (aka EC2 @ SmugMug).

    SkyNet, as you might expect, operates completely without human minders and automatically scales up and down in relation to the work load. Their system has several components:
  • Work Initiators - Work comes in from your website and/or other software subsystems and is queued up for processing in the Queue Service. Work doesn't have to be large requests either. Work can be small independent parts of an overall pipeline. Don't keep state in the Workers. Bundle what you need done into a work request and shoot it back into the Queuing Service for processing.
  • Provisioning Service - This is Amazon's infrastructure that allows instances to be automatically scaled up and down in relation to the work load. This will be the major difference between your VPS or typical datacenter setup. There's an API for starting and stopping AMIs and
    mechanisms for automatically configuring and running VMs.
  • Workers - These are the guys that continually pull work off queues and do something interesting with it. For SmugMug the results are stored on S3 but the results could be put in your own database, SimpleDB or whatever.
  • Queuing Service - This is where work is queued for consumption by the workers. SmugMug built their own queuing service, but you could just as easily use Amazon's own SQS. Creating a scalable, distributed, performant, highly available queue service is not easy, so you may want to take a look at a number of different queue product suggestions in Flickr - Do the Essential Work Up-front and Queue the Rest.
  • Controller - This component monitors many variables related to the work flow and decides how many EC2 instances are necessary based on optimizing a small set of goals. Instances are added and removed as needed.

    Don shares a lot of practical detail on how to efficiently use AWS, how their queue service works, and how their controller manages to balance minimizing cost with staying responsive to users. Achieving fairness and balance in a queue system can be difficult, but SmugMug appears to have done a good job of that.

    What rocks about queuing architectures is that they are just so damn robust. Work is safe in the queues. A random reboot won't cause a loss. If one component is producing events too fast the queue will buffer up events until they can be processed. New components can be cleanly added and removed from the system at any time. Timing isn't critical. Work is processed when someone gets around to it. Timeouts and retries are unnecessary. Programs are simple loops that block on the queue, do something, persist results, and feed more parallelizable work requests back into the queue. Very hard to screw up. Compare and contrast that with a complex multi-threaded system with shared state.
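    As a generic sketch of that loop, here's the pattern using Python's standard-library queue in place of SQS or SmugMug's home-grown queue service; process and persist are placeholders for your own work and storage.

        import queue, threading

        work = queue.Queue()

        def process(job):
            # placeholder: do the expensive piece of work, return results and new jobs
            return "done:%s" % job, []

        def persist(result):
            # placeholder: store results somewhere durable (S3, a database, ...)
            print(result)

        def worker():
            while True:
                job = work.get()                    # block until there's something to do
                results, follow_ups = process(job)
                persist(results)
                for item in follow_ups:             # feed parallelizable work back in
                    work.put(item)
                work.task_done()                    # acknowledge only after results are safe

        for _ in range(4):
            threading.Thread(target=worker, daemon=True).start()

        work.put("resize photo 123")
        work.join()                                 # wait until all queued work is done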

    Building GrepTheWeb in the Cloud

    Amazon has published a great couple of articles on building a canonical Cloud Architecture: Building GrepTheWeb in the Cloud, Part 1: Cloud Architectures and Building GrepTheWeb in the Cloud, Part 2: Best Practices.

    These are really tight and well written articles so I'll just hit certain high points. The example used is an application called GrepTheWeb. GrepTheWeb searches using a regular expression across millions of web documents. So it's a grep for the web, ah got it now. The idea is to take an unpredictable but possibly large number of search requests, apply the search expression to hundreds of terabytes of documents, and return the results in a reasonable period of time.

    How exactly would you do such a thing? Here's how you do it in the cloud:
  • Amazon S3 for retrieving input datasets and for storing the output dataset
  • Amazon SQS for durably buffering requests, acting as the “glue” between controllers
  • Amazon SimpleDB for storing intermediate status, logs, and user data about tasks
  • Amazon EC2 for running a large distributed processing Hadoop cluster on-demand
  • Hadoop for distributed processing, automatic parallelization, and job scheduling

    Clearly these are all (except for Hadoop) built on Amazon services, but the general ideas apply anywhere. For storing large amounts of data and accessing it efficiently in parallel you need a distributed file system like S3. To coordinate and dispatch work you need a queuing service like SQS. For keeping intermediate state you need a scalable database store like SimpleDB, though you could also imagine using S3. For dynamically scaling processing nodes something like EC2 is necessary. And for actually carrying out the document search a framework like Hadoop provides a lot of features, though you can imagine using other compute grid products.
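
    As a rough illustration of how these pieces glue together, here's a sketch of submitting a job: the input and job spec live in S3, SimpleDB records the status the controllers poll, and SQS carries the request to the launch controller. None of this is Amazon's actual GrepTheWeb code; the bucket, queue, and domain names are invented.

    import json
    import uuid
    import boto3

    s3 = boto3.client("s3")
    sqs = boto3.client("sqs")
    sdb = boto3.client("sdb")    # SimpleDB

    BUCKET = "grep-the-web-input"                                               # hypothetical
    QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/grep-jobs"    # hypothetical
    STATUS_DOMAIN = "grep-job-status"                                           # hypothetical

    def submit_job(regex, document_list_key):
        job_id = str(uuid.uuid4())

        # The job spec is stored in S3 next to the input dataset it references.
        s3.put_object(Bucket=BUCKET, Key=f"jobs/{job_id}.json",
                      Body=json.dumps({"regex": regex, "documents": document_list_key}))

        # SimpleDB holds the intermediate status that controllers and users can poll.
        sdb.put_attributes(DomainName=STATUS_DOMAIN, ItemName=job_id,
                           Attributes=[{"Name": "status", "Value": "QUEUED", "Replace": True}])

        # SQS is the glue: a launch controller picks this up and spins up the Hadoop cluster on EC2.
        sqs.send_message(QueueUrl=QUEUE_URL, MessageBody=json.dumps({"job_id": job_id}))
        return job_id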

    Their fabulous picture of what the system looks like, along with all the parts and linkages, is in the paper. What's important to note is that even though there are a lot of independently moving parts, all the boundaries are clear and well described. In your typical program few will have any idea how it works. Using Cloud Architecture principles it's possible to create a system that both scales and is easy to understand and explain.

    The paper makes several key architectural recommendations:
  • Use Scalable Ingredients - Ensure that your application is scalable by designing each component to be scalable on its own. If every component implements a service interface, responsible for its own scalability in all appropriate dimensions, then the overall system will have a scalable base.
  • Have Loosely Coupled Systems - For better manageability and high-availability, make sure that your components are loosely coupled. The key is to build components without having tight dependencies between each other, so that if one component were to die (fail), sleep (not respond) or remain busy (slow to respond) for some reason, the other components in the system are built so as to continue to work as if no failure is happening.
  • Think Parallel - Implement parallelization for better use of the infrastructure and for performance. Distributing the tasks on multiple machines, multithreading your requests and effective aggregation of results obtained in parallel are some of the techniques that help exploit the infrastructure.
  • Utilize On-Demand Requisition and Relinquishment - Don’t forget the cost factor. The key to building a cost-effective application is using on-demand resources in your design. It’s wasteful to pay for infrastructure that is sitting idle.
  • Use Designs that Are Resilient to Reboot and Re-Launch - After designing the basic functionality, ask the question “What if this fails?” Use techniques and approaches that will ensure resilience. If any component fails (and failures happen all the time), the system should automatically alert, failover, and re-sync back to the “last known state” as if nothing had failed. A sketch of one such checkpoint-and-resume pattern follows this list.
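
    Here's a minimal sketch of what “resilient to reboot and re-launch” can look like in practice: a generic checkpoint-and-resume pattern, not code from the paper. The SimpleDB domain, item layout, and process function are assumptions made up for the example.

    import boto3

    sdb = boto3.client("sdb")               # SimpleDB
    CHECKPOINT_DOMAIN = "job-checkpoints"   # hypothetical

    def process(chunk):
        # Placeholder for the real work on one chunk (e.g. grepping one batch of documents).
        # The work must be idempotent for re-launch to be safe.
        pass

    def last_known_state(job_id):
        # A freshly (re)launched worker asks where the previous incarnation left off.
        resp = sdb.get_attributes(DomainName=CHECKPOINT_DOMAIN, ItemName=job_id,
                                  ConsistentRead=True)
        attrs = {a["Name"]: a["Value"] for a in resp.get("Attributes", [])}
        return int(attrs.get("last_completed_chunk", -1))

    def checkpoint(job_id, chunk_index):
        # Record progress after every chunk so a reboot repeats at most one chunk of work.
        sdb.put_attributes(DomainName=CHECKPOINT_DOMAIN, ItemName=job_id,
                           Attributes=[{"Name": "last_completed_chunk",
                                        "Value": str(chunk_index), "Replace": True}])

    def run(job_id, chunks):
        # Resume from the last known state as if nothing had failed.
        for i in range(last_known_state(job_id) + 1, len(chunks)):
            process(chunks[i])
            checkpoint(job_id, i)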

    All good stuff, which is why I like this paper so much. There's a big conceptual shift here, especially if you are used to relatively simple client-server and N-tier systems. It's like simulating in your mind how to keep an army of ants all working independently while still communicating, coordinating, and making progress on a goal. We implemented similar architectures in datacenters long before the cloud; it was just a lot harder because everything was roll your own. The cloud makes all the necessary components standard, featureful, and relatively inexpensive. This opens applications to completely different ways of structuring their backends than in the past.

    Related Articles

  • SkyNet Lives! (aka EC2 @ SmugMug).
  • Flickr - Do the Essential Work Up-front and Queue the Rest
  • Hadoop
  • GridGain: One Compute Grid, Many Data Grids
  • Building GrepTheWeb in the Cloud, Part 1: Cloud Architectures
  • Building GrepTheWeb in the Cloud, Part 2: Best Practices.

    Thursday
    Aug062009

    An Unorthodox Approach to Database Design : The Coming of the Shard

    Update 4: Why you don’t want to shard, by Morgon on the MySQL Performance Blog. Optimize everything else first, and then if performance still isn’t good enough, it’s time to take a very bitter medicine.
    Update 3: Building Scalable Databases: Pros and Cons of Various Database Sharding Schemes by Dare Obasanjo. Excellent discussion of why and when you would choose a sharding architecture, how to shard, and problems with sharding.
    Update 2: Mr. Moore gets to punt on sharding by Alan Rimm-Kaufman of 37signals. Insightful article on design tradeoffs and the evils of premature optimization. With more memory, more CPU, and new tech like SSD, problems can be avoided before more exotic architectures like sharding are needed. Add features, not infrastructure. Jeremy Zawodny says he's wrong wrong wrong: we're running multi-core CPUs at slower clock speeds, so Moore won't save you.
    Update: Dan Pritchett shares some excellent Sharding Lessons: Size Your Shards, Use Math on Shard Counts, Carefully Consider the Spread, Plan for Exceeding Your Shards

    Once upon a time we scaled databases by buying ever bigger, faster, and more expensive machines. While this arrangement is great for big iron profit margins, it doesn't work so well for the bank accounts of our heroic system builders who need to scale well past what they can afford to spend on giant database servers. In an extraordinary two-article series, Dathan Pattishall explains his motivation for a revolutionary new database architecture--sharding--that he began thinking about even before he worked at Friendster, and fully implemented at Flickr. Flickr now handles more than 1 billion transactions per day, responding in less than a few seconds, and can scale linearly at a low cost.

    What is sharding and how has it come to be the answer to large website scaling problems?

    Information Sources

    What is sharding?

    While working at Auction Watch, Dathan got the idea to solve their scaling problems by creating a database server for a group of users and running those servers on cheap Linux boxes. In this scheme the data for User A is stored on one server and the data for User B is stored on another server. It's a federated model. Groups of 500K users are stored together in what are called shards.
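
    To make the federated model concrete, here's a minimal sketch of directory-based shard lookup: a global table maps each user to a shard, and every read or write for that user goes to that shard. This is illustrative only, not Flickr's or Auction Watch's actual scheme; the table layout and connection strings are invented, and SQLite stands in for the lookup store.

    import sqlite3

    # A tiny "shard directory": a global lookup table mapping each user to a shard.
    # In production this would live in its own small, highly available database.
    directory = sqlite3.connect(":memory:")
    directory.execute("CREATE TABLE user_shard (user_id INTEGER PRIMARY KEY, shard_id INTEGER)")

    SHARD_DSNS = {
        1: "mysql://shard1.example.com/app",   # hypothetical connection strings
        2: "mysql://shard2.example.com/app",
    }

    def assign_user(user_id):
        # New users get a shard; here we round-robin for brevity, but a real system
        # would place them by capacity (e.g. groups of ~500K users per shard).
        shard_id = (user_id % len(SHARD_DSNS)) + 1
        directory.execute("INSERT INTO user_shard VALUES (?, ?)", (user_id, shard_id))
        return shard_id

    def shard_for(user_id):
        row = directory.execute(
            "SELECT shard_id FROM user_shard WHERE user_id = ?", (user_id,)).fetchone()
        return SHARD_DSNS[row[0]]

    assign_user(42)
    print(shard_for(42))   # every query for user 42 goes to this shard

    Because the mapping goes through the directory rather than being hard-coded, a user can later be moved to a different shard by copying their data and updating one row, which is exactly the kind of indirection the rebalancing discussion below depends on.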

    The advantages are:

  • High availability. If one box goes down the others still operate.
  • Faster queries. Smaller amounts of data in each user group mean faster querying.
  • More write bandwidth. With no master database serializing writes you can write in parallel, which increases your write throughput. Writing is the major bottleneck for many websites.
  • You can do more work. A parallel backend means you can do more work simultaneously. You can handle higher user loads, especially when writing data, because there are parallel paths through your system. You can load balance web servers, which access shards over different network paths, which are processed by separate CPUs, which use separate caches of RAM and separate disk IO paths to process work. Very few bottlenecks limit your work.

    How is sharding different than traditional architectures?

    Sharding is different than traditional database architecture in several important ways:

  • Data are denormalized. Traditionally we normalize data. Data are splayed out into anomaly-less tables and then joined back together again when they need to be used. In sharding the data are denormalized. You store together data that are used together.

    This doesn't mean you don't also segregate data by type. You can keep a user's profile data separate from their comments, blogs, email, media, etc, but the user profile data would be stored and retrieved as a whole. This is a very fast approach. You just get a blob and store a blob. No joins are needed and it can be written with one disk write (see the small sketch after this list).

  • Data are parallelized across many physical instances. Historically database servers are scaled up. You buy bigger machines to get more power. With sharding the data are parallelized and you scale by scaling out. Using this approach you can get massively more work done because it can be done in parallel.

  • Data are kept small. The larger the set of data a server handles, the harder it is to cache intelligently, because you have such a wide diversity of data being accessed. You need huge gobs of RAM that may not even be enough to cache the data when you need it. By isolating data into smaller shards the data you are accessing is more likely to stay in cache.

    Smaller sets of data are also easier to backup, restore, and manage.

  • Data are more highly available. Since the shards are independent a failure in one doesn't cause a failure in another. And if you make each shard operate at 50% capacity it's much easier to upgrade a shard in place. Keeping multiple data copies within a shard also helps with redundancy and making the data more parallelized so more work can be done on the data. You can also setup a shard to have a master-slave or dual master relationship within the shard to avoid a single point of failure within the shard. If one server goes down the other can take over.

  • It doesn't use replication. Replicating data from a master server to slave servers is a traditional approach to scaling. Data is written to a master server and then replicated to one or more slave servers. At that point read operations can be handled by the slaves, but all writes happen on the master.

    Obviously the master becomes the write bottleneck and a single point of failure. And as load increases the cost of replication increases. Replication costs in CPU, network bandwidth, and disk IO. The slaves fall behind and have stale data. The folks at YouTube had a big problem with replication overhead as they scaled.

    Sharding cleanly and elegantly solves the problems with replication.
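
    Here's a tiny illustration of the “get a blob, store a blob” idea mentioned above: the whole profile is serialized and written as a single row, so reading it back needs no joins. This is just a sketch, not Flickr's schema; SQLite and JSON stand in for whatever store and format you'd actually use.

    import json
    import sqlite3

    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE user_profile (user_id INTEGER PRIMARY KEY, blob TEXT)")

    def save_profile(user_id, profile):
        # The whole denormalized profile (name, settings, friends list, ...) goes
        # to disk in a single write.
        db.execute("INSERT OR REPLACE INTO user_profile VALUES (?, ?)",
                   (user_id, json.dumps(profile)))

    def load_profile(user_id):
        # One key lookup, no joins.
        row = db.execute("SELECT blob FROM user_profile WHERE user_id = ?",
                         (user_id,)).fetchone()
        return json.loads(row[0]) if row else None

    save_profile(42, {"name": "Alice", "friends": [7, 9, 23], "theme": "dark"})
    print(load_profile(42))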

    Some Problems With Sharding

    Sharding isn't perfect. It does have a few problems.

  • Rebalancing data. What happens when a shard outgrows your storage and needs to be split? Let's say some user has a particularly large friends list that blows your storage capacity for the shard. You need to move the user to a different shard.

    On some platforms I've worked on this is a killer problem. You had to build out the data center correctly from the start because moving data from shard to shard required a lot of downtime.

    Rebalancing has to be built in from the start. Google's shards automatically rebalance. For this to work data references must go through some sort of naming service so they can be relocated. This is what Flickr does. And your references must be invalidateable so the underlying data can be moved while you are using it.

  • Joining data from multiple shards. To create a complex friends page, or a user profile page, or a thread discussion page, you usually must pull together lots of different data from many different sources. With sharding you can't just issue a query and get back all the data. You have to make individual requests to your data sources, get all the responses, and then build the page (a small scatter-gather sketch follows this list). Thankfully, because of caching and fast networks this process is usually fast enough that your page load times can be excellent.

  • How do you partition your data in shards? What data do you put in which shard? Where do comments go? Should all user data really go together, or just their profile data? Should a user's media, IMs, friends lists, etc go somewhere else? Unfortunately there are no easy answers to these questions.

  • Less leverage. People have experience with traditional RDBMS tools so there is a lot of help out there. You have books, experts, tool chains, and discussion forums when something goes wrong or you are wondering how to implement a new feature. Eclipse won't have a shard view and you won't find any automated backup and restore programs for your shard. With sharding you are on your own.

  • Implementing shards is not well supported. Sharding is currently mostly a roll your own approach. LiveJournal makes their tool chain available. Hibernate has a library under development. MySQL has added support for partitioning. But in general it's still something you must implement yourself.
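
    The page-building problem above is essentially a scatter-gather. Here's a minimal sketch of that pattern; the shard list and fetch_comments function are hypothetical stand-ins for real per-shard queries.

    from concurrent.futures import ThreadPoolExecutor

    SHARDS = ["shard1.example.com", "shard2.example.com", "shard3.example.com"]  # hypothetical

    def fetch_comments(shard, thread_id):
        # Placeholder for a real query against one shard's database.
        return []

    def build_thread_page(thread_id):
        # Scatter: issue the same query against every shard in parallel.
        with ThreadPoolExecutor(max_workers=len(SHARDS)) as pool:
            partials = pool.map(lambda shard: fetch_comments(shard, thread_id), SHARDS)
        # Gather: merge the partial results into one page, hiding the sharding from the caller.
        comments = [c for partial in partials for c in partial]
        return sorted(comments, key=lambda c: c.get("posted_at", 0))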

    See Also

  • The Flickr Architecture for more interesting ideas on how to implement sharding.
  • The Google Architecture.
  • The LiveJournal Architecture. They talk quite a bit about their sharding approach and give a lot of helpful details.
  • The Shard category.