Entries in Paper (127)

Wednesday
Jul092008

Federation at Flickr: Doing Billions of Queries Per Day

Flickr's lone database guy Dathan Pattishall made his excellent presentation available on how on how Flickr scales its backend to handle tremendous loads. Some of this information is available in Flickr Architecture, but the paper is so good it's worth another read. If you want to see sharding done right, at scale, take a look.

Click to read more ...

Tuesday
Mar252008

Paper: On Designing and Deploying Internet-Scale Services

Greg Linden links to a heavily lesson ladened LISA 2007 paper titled On Designing and Deploying Internet-Scale Services by James Hamilton of the Windows Live Services Platform group. I know people crave nitty-gritty details, but this isn't a how to configure a web server article. It hitches you to a rocket and zooms you up to 50,000 feet so you can take a look at best web operations practices from a broad, yet practical perspective. The author and his team of contributors obviously have a lot of in the trenches experience. Many non-obvious topics are covered. And there's a lot to learn from.

The paper has too many details to cover here, but the big sections are:

  • Recommendations
  • Automatic Management and Provisioning
  • Dependency Management
  • Release Cycle and Testing
  • Operations and Capacity Planning
  • Graceful Degradation and Admission Control
  • Customer Self-Provisioning and Self-Help
  • Customer and Press Communication Plan

    In the recommendations we see some of our old favorites:
  • Expect failure and design for failure.
  • Implement redundancy and fault recovery.
  • Depend upon a commodity hardware slice.
  • Keep things simple and robust.
  • Automate everything.

    Personally, I'm still trying to figure out how to make something simple.

    Next are some good thoughts on how to design operations friendly software:
  • Quick service health check. This is the services version of a build verification test.
  • Develop in the full environment.
  • Zero trust of underlying components.
  • Do not build the same functionality in multiple components.
  • One pod or cluster should not affect another pod or cluster.
  • Allow (rare) emergency human intervention.
  • Enforce admission control at all levels.
  • Partition services.
  • Understand the network design.
  • Analyze throughput and latency.
  • Treat operations utilities as part of the service.
  • Understand access patterns.
  • Version everything.
  • Keep the unit/functional tests from the last release.
  • Avoid single points of failure.
  • Support single-version software. Have all your customers run the same version.
  • Implement multi-tenancy. Apparently a lot of software requires cloning hardware installations to support multiple customers. Don't do that. Have your software work for multiple customers all on the same hardware.

    And the paper continues along the same lines in each section. Good detailed advice on lots of different topics.

    You'll undoubtedly agree with some of the advice and disagree with some. Greg wants faster release cycles, thinks having server affinity for some things is OK, and thinks the advice on allowing humans to throttle load won't work in a crisis. Perfectly valid points, but what's fun is to consider them. Some companies, for example, have a dead-man's switch that must be thrown before one master can failover to another in a multi-datacenter situation. Is that wrong or right? Only the shadow knows.

    The advice to "document all conceivable component failures and modes and combinations" sounds good but is truly difficult to do in practice. I went through this process once on a telco project and it took months just to cover all the failure scenarios on a few cards. But the spirit is right I think.

    My favorite part of the whole paper is:
    We have long believed that 80% of operations issues originate in design and development, so this section
    on overall service design is the largest and most important. When systems fail, there is a natural tendency
    to look first to operations since that is where the problem actually took place. Most operations issues,
    however, either have their genesis in design and development are best solved there.

    Understand this and I think much of the rest follows naturally.
  • Thursday
    Mar202008

    Paper: Asynchronous HTTP and Comet architectures

    Comet has popularized asynchronous non-blocking HTTP programming, making it practically indistinguishable from reverse Ajax, also known as server push. This JavaWorld article takes a wider view of asynchronous HTTP, explaining its role in developing high-performance HTTP proxies and non-blocking HTTP clients, as well as the long-lived HTTP connections associated with Comet.

    Click to read more ...

    Monday
    Mar172008

    Paper: Consistent Hashing and Random Trees: Distributed Caching Protocols for Relieving Hot Spots on the World Wide Web

    Consistent hashing is one of those ideas that really puts the science in computer science and reminds us why all those really smart people spend years slaving over algorithms. Consistent hashing is "a scheme that provides hash table functionality in a way that the addition or removal of one slot does not significantly change the mapping of keys to slots" and was originally a way of distributing requests among a changing population of web servers. My first reaction to the idea was "wow, that's really smart" and I sadly realized I would never come up with something so elegant. I then immediately saw applications for it everywhere. And consistent hashing is used everywhere: distributed hash tables, overlay networks, P2P, IM, caching, and CDNs. Here's the abstract from the original paper and after the abstract are some links to a few very good articles with accessible explanations of consistent hashing and its applications in the real world. Abstract: We describe a family of caching protocols for distributed networks that can be used to decrease or eliminate the occurrence of hot spots in the network. Our protocols are particularly designed for use with very large networks such as the Internet, where delays caused by hot spots can be severe, and where it is not feasible for every server to have complete information about the current state of the entire network. The protocols are easy to implement using existing network protocols such as TCP/IP, and require very little overhead. The protocols work with local control, make efficient use of existing resources, and scale gracefully as the network grows. Our caching protocols are based on a special kind of hashing that we call consistent hashing. Roughly speaking, a consistent hash function is one which changes minimally as the range of the function changes. Through the development of good consistent hash functions, we are able to develop caching protocols which do not require users to have a current or even consistent view of the network. We believe that consistent hash functions may eventually prove to be useful in other applications such as distributed name servers and/or quorum systems. Other excellent resources for learning more about consistent hashing are at:

  • Consistent Hashing and Random Trees: Distributed Caching Protocols for Relieving Hot Spots on the World Wide Web
  • Consistent Hashing by Tom White. A good explanation and some actual Java code as an implementation.
  • Programmer’s Toolbox Part 3: Consistent Hashing by Tom Kleinpeter. Another good explanation with an emphasis on useful applications: load distribution on failure, load tuning by capacity, method for bringing servers on line, redundant caching to protect the database in case of failure.
  • Distributed Hash Tables: an infrastructure that can be used to build more complex services, such as distributed file systems, peer-to-peer file sharing and content distribution systems, cooperative web caching, multicast, anycast, domain name services, and instant messaging. Notable distributed networks that use DHTs include BitTorrent (with extensions), eDonkey network, YaCy, and the Coral Content Distribution Network.
  • Chord - a peer-to-peer lookup algorithm. It allows a distributed set of participants to agree on a single node as a rendezvous point for a given key, without any central coordination.
  • Dynamo, Amazon's database uses consistent hashing.
  • Replication Under Scalable Hashing: A Family of Algorithms for Scalable Decentralized Data Distribution

    Click to read more ...

  • Friday
    Nov092007

    Paper: Container-based Operating System Virtualization: A Scalable, High-performance Alternative to Hypervisors

    One stumbling block of the the great march towards virtualization is the relatively poor performance of resource hungry applications like databases. We are told to develop and test using VMs, but deploy without them. Which kind of sucks IMHO. Maybe better virtualization technology can remove this split. This paper talks about a different approach to virtualization called "container-based" virtualization that can reportedly double the performance of traditional hypervisor systems like Xen. It does this by trading isolation for efficiency. Rather than maintaining complete isolation between VMs the container approach shares resources between VMs and thus gives higher performance while still guaranteeing strong fault, resource, and security isolation. It's yet another battle in computing's endless war of creating and destroying abstraction layers. I learned a lot from from this paper because of how it compared and contrasted traditional hypervisor and container based virtualization strategies. Good job.

    Click to read more ...

    Tuesday
    Oct302007

    Paper: Dynamo: Amazon’s Highly Available Key-value Store

    Update 2: Read/WriteWeb has a good article talking about the scalability issues of relational databases and how Dynamo solves them: Amazon Dynamo: The Next Generation Of Virtual Distributed Storage. But since Dynamo is just another frustrating walled garden protected by barbed wire and guard dogs, its relevance is somewhat overstated. Update: Greg Linden has a take on the paper where he questions some of Amazon's design choices: emphasizing write availability over fast reads, a lack of indexing support, use of random distribution for load balancing, and punting on some scalability issues. Werner Vogels, Amazon's avuncular CTO, just announced a new paper on the internal database technology Amazon uses to handle tens of millions customers. I'll dive into more details later, but I thought you'd want to read it hot off the blog. The bad news is it won't be a service. They are keeping this tech not so secret, but very safe. Happily, it's another real-life example to learn from. As many top websites use a highly tuned key-value database at their core instead of a RDBMS, it's an important technology to understand. From the abstract you can get a feel for what the paper is about:

    Reliability at massive scale is one of the biggest challenges we face at Amazon.com, one of the largest e-commerce operations in the world; even the slightest outage has significant financial consequences and impacts customer trust. The Amazon.com platform, which provides services for many web sites worldwide, is implemented on top of an infrastructure of tens of thousands of servers and network components located in many datacenters around the world. At this scale, small and large components fail continuously and the way persistent state is managed in the face of these failures drives the reliability and scalability of the software systems. This paper presents the design and implementation of Dynamo, a highly available key-value storage system that some of Amazon’s core services use to provide an “always-on” experience. To achieve this level of availability, Dynamo sacrifices consistency under certain failure scenarios. It makes extensive use of object versioning and application-assisted conflict resolution in a manner that provides a novel interface for developers to use.
    My first impressions after reading the paper:
  • Wow. But crap, I'll never be able to build anything like that. This is really competition through better infrastructure. Take that Google :-)
  • Their purposeful embracing of probability and manged centers of uncertainty must be dizzying for those from a RDBMS background. In a RDBMS it's all right angles. You write something and it's assumed consistent, correct, and durable. Now, how do you do this at scale across multiple data centers under failure conditions? There's the rub. So Amazon says writes must go through and we will deal with the complexities that model generates. They version objects and merge them later. Who does that? I love it, because when delve into these problems you realize you need this type of functionality, but it's too complex, so you back away and continue trying to force a square peg in a round whole. To have no fear to go where your requirements leads you is real engineering.
  • Can you imagine finding a problem in that system? I'd love to be a fly in those debugging sessions. But infrastructure takes on self-consciousness of its own when dealing with complex problems, so you just have to deal with knowing you don't know anymore. A lot of this thinking is driven by the CAP conjecture which states it's impossible for a web service to simultaneously guarantee consistency, availability, and partition-tolerance. When you get over your initial "that can't be true" reaction and embrace it, you get something like Dynamo. I'd really love to hear what you guys think about Dynamo.

    Click to read more ...

  • Friday
    Oct262007

    Paper: Wikipedia's Site Internals, Configuration, Code Examples and Management Issues

    Wikipedia and Wikimedia have some of the best, most complete real-world documentation on how to build highly scalable systems. This paper by Domas Mituzas covers a lot of details about how Wikipedia works, including: an overview of the different packages used (Linux, PowerDNS, LVS, Squid, lighttpd, Apache, PHP5, Lucene, Mono, Memcached), how they use their CDN, how caching works, how they profile their code, how they store their media, how they structure their database access, how they handle search, how they handle load balancing and administration. All with real code examples and examples of configuration files. This is a really useful resource.

    Related Articles

  • Wikimedia Architecture
  • Domas Mituzas' Blog

    Click to read more ...

  • Sunday
    Oct212007

    Paper: Standardizing Storage Clusters (with pNFS)

    pNFS (parallel NFS) is the next generation of NFS and its main claim to fame is that it's clustered, which "enables clients to directly access file data spread over multiple storage servers in parallel. As a result, each client can leverage the full aggregate bandwidth of a clustered storage service at the granularity of an individual file." About pNFS StorageMojo says: pNFS is going to commoditize parallel data access. In 5 years we won’t know how we got along without it. Something to watch.

    Click to read more ...

    Monday
    Oct082007

    Paper: Understanding and Building High Availability/Load Balanced Clusters

    A superb explanation by Theo Schlossnagle of how to deploy a high availability load balanced system using mod backhand and Wackamole. The idea is you don't need to buy expensive redundant hardware load balancers, you can make use of the hosts you already have to the same effect. The discussion of using peer-based HA solutions versus a single front-end HA device is well worth the read. Another interesting perspective in the document is to view load balancing as a resource allocation problem. There's also a nice discussion of the negative of effect of keep-alives on performance.

    Click to read more ...

    Sunday
    Oct072007

    Paper: Architecture of a Highly Scalable NIO-Based Server

    The article describes the basic architecture of a connection-oriented NIO-based java server. It takes a look at a preferred threading model, Java Non-blocking I/O and discusses the basic components of such a server.

    Click to read more ...