Entries in operations (13)

Friday
Sep 11, 2009

The interactive cloud

How many times have you been called in the middle of the night by your operations guys telling you that your application is throwing odd red alerts? How many times did you find out that when those issues happen you don't have enough information to analyze the incident? Have you tried to increase the log level, only to find that your problem became even worse: now your application throws out tons of information on a continuous basis, most of which is complete garbage...

The current separation between the way we implement our applications and the way we manage them leads to many of these ridiculous situations. The cloud makes those things even worse.

In this post I suggest an alternative approach. Why don't we run our application the way we run our business? I refer to this approach as the "interactive cloud", where our application behaves just like our project team and operations behaves just like our managers. As with our business, our application would need to take more responsibility for the way it runs and take corrective actions, such as balancing its own resources and re-assigning tasks to the available resources in case of failure. It would need to involve its manager only when it runs out of resources, and it would need to provide reports in a way that makes sense to our managers.

The first part of this post describes the general concept behind this model, and the second part provides technical background, including code snippets based on our experience at GigaSpaces.
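
To make "taking more responsibility" a little more concrete, here is a minimal, purely illustrative sketch in plain Python (not the GigaSpaces API) of a service that rebalances its own work when a worker fails and only calls its "manager" when it genuinely runs out of capacity:

# Hypothetical sketch of the "interactive cloud" idea: the application
# rebalances its own work and escalates to operations only when it is
# truly out of resources. Names and thresholds are illustrative.

class Worker:
    def __init__(self, name, capacity):
        self.name = name
        self.capacity = capacity   # tasks this worker can hold
        self.tasks = []
        self.alive = True

class SelfManagingService:
    def __init__(self, workers, notify_operations):
        self.workers = workers
        self.notify_operations = notify_operations  # the "manager" callback

    def _available(self):
        return [w for w in self.workers if w.alive and len(w.tasks) < w.capacity]

    def submit(self, task):
        candidates = self._available()
        if not candidates:
            # Only now involve the "manager": we are out of resources.
            self.notify_operations("out of capacity, please add workers")
            return False
        # Corrective action: place the task on the least loaded worker.
        target = min(candidates, key=lambda w: len(w.tasks))
        target.tasks.append(task)
        return True

    def handle_failure(self, dead_worker):
        # Corrective action: re-assign the failed worker's tasks ourselves,
        # without waking anyone up.
        dead_worker.alive = False
        orphans, dead_worker.tasks = dead_worker.tasks, []
        for task in orphans:
            self.submit(task)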

Monday
Jun 29, 2009

How to Succeed at Capacity Planning Without Really Trying : An Interview with Flickr's John Allspaw on His New Book


Update 2: Velocity 09: John Allspaw, 10+ Deploys Per Day: Dev and Ops Cooperation at Flickr. Insightful talk. Some highlights: Change is good if you can build tools and culture to lower the risk of change. Operations and developers need to become of one mind and respect each other. An automated infrastructure is the one tool you need most. Common source control. One step build. One step deploy. Don't be a pussy, deploy. Always ship trunk. Feature flags - don't branch code, make features runtime configurable in code. Dark launch - release data paths early without UI component. Shared metrics. Adaptive feedback to prioritize important features. IRC for communication for human context. Best solutions occur when dev and op work together and trust each other. Trust is earned by helping each other solve their problems. Look at what new features imply for operations, what can go wrong, and how to recover. Provide knobs and levers to help operations. Devs should have access to production machines. Fire drills to train. No finger pointing - fix stuff first. Design like you'll get woken up first when there's a problem. Say you're sorry. Not easy - like any relationship.
Update: Operational Efficiency Hacks Web 2.0 Expo 2009 by John Allspaw. 131 picture perfect slides on operations porn. If you're interested in that kind of thing.

Dream with me a little bit. Your startup becomes wildly successful. Hard work and random chance have smiled on you. To keep flirting with lady luck your system must scale. But how much stuff (space, hardware, software, etc.) will you need to handle the growth, when will you need it, and when will you need more?

That's what Flickr's John Allspaw helps you figure out in his groundbreaking new book on capacity planning: The Art of Capacity Planning: Scaling Web Resources.

When I read statements about The Art of Capacity Planning like "capacity planning is a term that to me means paying attention," "all the information you need to make an educated forecast is in your historical metrics," and "startups that are going to experience massive growth simply don't have time for anything but a 'steering by your wake' approach," I get the same sea change feeling I felt when the industry ran from waterfall design and embraced agile design. Current capacity planning is heavy. All up-front. Too analytical and too divorced from real life.

Other capacity planning books assault you with models, math, and simulations. Who has the time? John has developed a common sense, low math approach to capacity planning that works using the system you already have. John's goal is to have you say: Oh, right, duh. That's common sense, not voodoo.

Here's my email interview with John Allspaw on The Art of Capacity Planning. Enjoy.

Please tell us who you are and what you've brought to show and tell today?

I'm John Allspaw. I manage the Operations team at Flickr.com, and I've written a book (The Art of Capacity Planning: Scaling Web Resources) about capacity planning for growing websites.

After spending a good chunk of your life writing this book, can you summarize it in just a few sentences so people will know why it should matter to them?


This book is basically a guide to adaptive capacity planning for growing websites. It's an approach that relies much less on benchmarking and simulation than on close observation of production loads to guide future decisions. It's not rocket science, and I'm hoping people can use it to justify the what, why, and when of getting more resources to allow them to grow as fast as they need to. It's worked really well for me at Flickr and other organizations.

Give me your ripple of evil. What happens without capacity planning?

Capacity planning is a term that to me means paying attention. Web applications can fail in all sorts of dramatic ways, and you're not going to foresee all of them. What you can do, however, is make use of what you do know about what happens in your real world on a regular basis. Things like: my database can do X queries per second before it keels over. Or my cache can only keep Y minutes worth of changing objects. You're not going to predict every failure mode of the whole system, but knowing the failure modes of individual pieces should be considered mandatory. Armed with that, you can make decent forecasts about the future.

I'm a guy or gal at a startup who's freaking out because my boss asked me how much hardware we need for the next quarter/year. What do I do now?

Buy my book? ☺ All the information you need to make an educated forecast is in your historical metrics. You do have system and application-level statistics, right? Tying your system level stats (CPU, memory, network, storage, etc.) to application-level metrics (users, posts, photos, videos, page views, widgets sold, etc.) is key, because then you have history to back up the guesstimates. Your business, product or marketing team also has their own guesses for some of those application-level metrics, so you should get the two forecasts together and see where/how they match up or differ. Capacity should enable the business, not hinder it.
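
As a minimal illustration of tying system-level stats to application-level metrics, here is a hedged sketch (the numbers are invented) that fits a simple linear trend between users and database CPU from your own history, then extrapolates it against the business forecast:

# Minimal forecasting sketch in the spirit of the book: tie an
# application-level metric (active users) to a system-level metric
# (database CPU%) using nothing but your own history. Numbers are made up.

def linear_fit(xs, ys):
    """Ordinary least squares for y = slope*x + intercept, no libraries."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / \
            sum((x - mean_x) ** 2 for x in xs)
    return slope, mean_y - slope * mean_x

users = [100_000, 150_000, 200_000, 250_000]    # app-level history
db_cpu = [22.0, 31.0, 41.0, 50.0]               # system-level history (%)

slope, intercept = linear_fit(users, db_cpu)

# Marketing forecasts 400k users next quarter; where does that put the DB?
predicted_cpu = slope * 400_000 + intercept
print(f"predicted db cpu at 400k users: {predicted_cpu:.0f}%")   # roughly 78%

If the predicted number is past the ceiling you've measured for that database, the history has just justified the purchase order for you.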

So your book, is it the best book on Capacity Planning or the greatest book on Capacity Planning?

My book is the best book on capacity planning. If it were the 'greatest' book, it'd be a lot bigger than it is. It's good for scaling your hardware. It's not good for flattening out posters.

My site is gonna explode and I don't know at what point it's going to die. What do I do now?

Argh! Panic! Handle the current explosion, and make it priority one to find out the limits of your capacity when the emergency is over.

Try to panic gracefully. If it's dying right now, and you can't easily add any more capacity, then try some of the tried and trusted WebOps 101 tricks mentioned everywhere, including this blog:
- disable features (preferably the heavier load-causing ones)
- cache previously dynamic content into static bits
- avoid loaded backend calls by serving stale content
Of course it's easier to do those things when you have easy config flags to turn things on or off, and a list to run through of what things are acceptable to serve stale and static. We currently have about 195 'features' we can turn off at Flickr in dire circumstances. And we've used those flags when we needed to.
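
For readers who haven't built one, here is a hypothetical sketch of what such config flags can look like in code; the flag names and the db/cache objects are invented, not Flickr's actual flag system:

# Hypothetical config-flag sketch of the "turn heavy features off" trick.
# Flag names and the db/cache interfaces are invented for illustration.

FLAGS = {
    "related_photos": True,        # heavy DB joins: first to go under load
    "realtime_activity": True,     # chatty backend calls
    "serve_stale_profiles": False, # OK to flip on in an emergency
}

def related_photos(photo_id, db, cache):
    if not FLAGS["related_photos"]:
        return []                   # feature disabled in dire circumstances
    if FLAGS["serve_stale_profiles"]:
        cached = cache.get(("related", photo_id))
        if cached is not None:
            return cached           # stale is better than down
    rows = db.query("SELECT ... WHERE photo_id = %s", photo_id)
    cache.set(("related", photo_id), rows)
    return rows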

Having said all of that, knowing when your resources are going to die should be mandatory, not optional. Know how many qps your databases, webservers, caching systems, and storage can handle before degradation. Test that stuff. With production traffic. If at all possible, with live production traffic, not just recorded and replayed production loads.

How do you compare your approach with well known approaches like Guerrilla Capacity Planning? Isn't focusing on queuing theory, Little's Law, and Poisson arrival rates misguided? What will really help people on the ground?

My approach is a bit different from Mr. Gunther's, although his book was an inspiration for mine. I do think that having a general understanding of queuing theory and the mathematics of open and closed systems can be important, don't get me wrong. But many of the startups that are going to experience massive growth simply don't have time for anything but a 'steering by your wake' approach, and done right, I think that approach will serve them well. I think some people recognize this, but in my experience, I'd say the development timelines are even tighter than most people realize.

Almost any time and effort in constructing and running a simulation, model, or benchmark that involves the main moving parts of a web site's back-end is pretty much wasted due to how quickly application logic, use cases, and even hardware configurations can change. For example, by the time I could construct a useful model to capture the webserver-database interactions for Flickr, the results won't resemble production load, since the development cycle we have is so tight. As I write this, in the last week there were 50 code deploys of 550 changes by 19 Flickr staff. I realize that's a lot more than a lot of organizations, but that rate of change requires that the capacity planning process is easily adjustable.

I also found the existing books on the topic to be pretty dry with the math. The foundations of queuing theory are interesting, but a proof of Little's Law isn't going to quickly justify to my finance guy that we need 11 more webservers.

I have a cloud, do I really need to capacity plan anymore? That's so old school. Can't I decrease my administrator-to-server ratio and reduce the cost of managing systems by removing capacity planning resources?

Nope, just trust The Cloud™ that everything will be all right. No need to pay any attention at all. Oh wait, that's a magic unicorn talking. Cloud services are a resource just like any in-house capacity: they have limits that should be paid attention to. The best use cases of cloud computing both recognize the benefits (shrinking 'procurement' times, for example) and the limitations. And those limitations are going to vary from application to application, organization to organization.

You can build a real brand around a name like "Guerrilla Capacity Planning." Don't you think you could have come up with a better name for your approach?

Yeah, I guess I should have a name for it. How about "Paying Attention Capacity Planning"?

You recommend testing the limits of your hardware and software on a production site. Are you crazy?

It's very possible that I'm crazy. But using production traffic to define your resource ceilings in a controlled setting allows you to see firsthand what would happen when you run out of capacity in a particular resource. Of course I'm not suggesting that you run your site into the ground, but it's better to know what your real (not simulated) loads are while you're watching, than to find out the hard way. In addition, a lot of unexpected systemic things can happen when load increases in a particular cluster or resource, and playing "find the butterfly effect" is a worthwhile exercise.

Where does capacity planning sit in the stack? It seems to impact the entire application stack, but capacity planning is more of an operations thing and that may not have much impact in engineering.

Capacity planning is a process, and should sit in the stack along with the other things that happen on a regular basis, like bug scrubs, like weekly meetings, etc. I'm spoiled as an Ops manager because we've got developers at Flickr who very much think like operations people. We're all addicted to graphs, and we're all pretty aware of how close each piece of the infrastructure is to becoming too 'hot', and we act accordingly. Planning, procurement, and measurement of capacity might lie on operations' shoulders, but intelligent use of it is everyone's business. I'm also blessed to have product management and customer care teams who are also aware of the state of capacity. As projects pop up that warrant capacity to be part of the considerations, it is. You simply can't launch things like FlickrVideo and the Yahoo Photos migration without capacity being part of the requirements, so we've got a good feedback loop going with respect to operations and capacity.

Studies say 4/5 of people like a trip to the dentist more than they like capacity planning. Why does capacity planning have such a bad rep?

Because no one wants to guess and be wrong? Because most of the cap planning literature out there is filled with boring math?

Does the move to Web 2.0 make traditional approaches to capacity planning more difficult?

I would say yes, but of course I'm biased. In many ways, use of the site is simply out of our hands. We gather metrics on as many aspects of the site as we can get our hands on, and add more all the time. Having an open API makes things interesting, because the use cases can vary wildly. Out of nowhere, a simple application can reveal an edge case that we hadn't foreseen, so we might have to adjust forecasts quickly. As to it being more difficult: I don't know. When I'm wrong, people can't see photos. If the guy at wellsfargo.com is wrong, then people can't get their money. What's more annoying? :)

I really like your dashboard. Most dashboards say what happened, yours has an estimated days left before capacity runs out, which makes it actionable. How accurate have those numbers been?

Well that particular dashboard is still being built here, but the general design is to capture those metrics on a regular basis, and to use it as a guide. The numbers for some clusters have been pretty trustworthy. But they do fluctuate in how accurate they are, since "days left" are obviously affected by things outside of the forecasting process. Sometimes, a new feature is planned that requires pounding the database shards, or bumps CPU on the webservers by a noticeable amount. It's because of these things that the dashboard is only a piece of the puzzle. Awareness and communications with product management and development have to inform planning decisions, as well as the historic metrics. Again: no crystal balls here.
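
A back-of-the-envelope version of that "days left" number might look like the sketch below; the ceiling and the samples are invented, and a real dashboard would use a better trend estimate than first-to-last sample:

# Sketch of a "days left until we hit the ceiling" dashboard number:
# extrapolate the recent growth rate of a resource against its known
# ceiling. Ceiling and samples are illustrative.

def days_left(daily_samples, ceiling):
    """daily_samples: one measurement per day, oldest first."""
    if len(daily_samples) < 2:
        return None
    growth_per_day = (daily_samples[-1] - daily_samples[0]) / (len(daily_samples) - 1)
    if growth_per_day <= 0:
        return float("inf")          # flat or shrinking: no deadline
    return (ceiling - daily_samples[-1]) / growth_per_day

# e.g. disk consumed on a storage cluster, in TB, over the last week
samples = [210.0, 211.4, 212.9, 214.1, 215.8, 217.0, 218.6]
print(f"days left: {days_left(samples, ceiling=260.0):.0f}")   # about 29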

One of the equations you use is a "UR LIMITZ = Ceiling * Factor of Safety". Theo Schlossnagle has noted the evolution of phenomenal spikes in traffic, even for large sites. How do you have enough capacity as a safety factor if traffic is so spikey? What are the implications for forecasting?

Start with the assumption that no one can accurately predict the future. Capacity planning isn't the only part of successful web operations, and not everything is a capacity issue. Theo nails what happens in the real-world, for sure. The answer is: you estimate the best you can with the history that you have. Obviously, one mistake is to make forecasts ignoring the spikes you've experienced in the past. Something important to consider is how expected (or not) your spikes are. If you sell things, you might have a seasonal holiday spike or plateau in traffic. If you're a content site that frequently gains traction with news-related sites (like Theo's example) and from time to time, you get massive unexpected spikes, then your factor of safety should include considerations for those spikes. But of course the flipside of that might mean that you have a boatload of servers doing nothing except wasting money while waiting for massive spikes that may or may not come. So it's a balance. Be reasonable with planning, follow the four guidelines that Theo points out in his post, and you've got yourself a strategy to deal with those unexpected spikes.
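
One common reading of that equation, with the safety factor expressed as the fraction of the measured ceiling you're willing to actually use, works out like this (illustrative numbers only):

# Worked example of the "ceiling * factor of safety" rule of thumb.
measured_ceiling_qps = 4000        # where one web server starts degrading
factor_of_safety = 0.7             # keep 30% headroom for spikes
usable_qps_per_server = measured_ceiling_qps * factor_of_safety   # 2800

peak_traffic_qps = 45_000
servers_needed = -(-peak_traffic_qps // int(usable_qps_per_server))  # ceiling division -> 17
print(servers_needed)

How much headroom you buy for spikes is exactly the balance John describes: too little and the news-driven spike takes you down, too much and the extra servers sit there wasting money.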

At what point does it make sense for me to buy my own equipment?

It's a difficult question. I've seen groups go from self-run colocation to managed hosting and cloud services, and others go the opposite way, for all legit reasons. Just as there are limitations with managed hosting, those limitations might not matter until you're massive. I do believe that there is a point at which owning your own equipment makes a lot of sense, because you've blown past the average needs of a managed hosting or cloud customer. Your TCO might be a lot different than mine, so to state that there's a single point for everyone would just be dumb.

Quick fire round. What's your quick reaction to:

1. Let's just go with a working prototype for now. We can change it when we grow big.

Fine, for some definitions of "working", "change", and "big". I can't complain too much about that approach in some sense because that's how a lot of Flickr's backend evolved. But on the other hand, all you need is to fail a couple of times with prototypes to figure out what sort of homework needs to be done before launching something that is potentially explosive. So again, there's a balance. Don't be lazy, but don't be rigid and too fearful.

2. Our VC told us that we're worrying about scalability too early. They don't want us to blow our scarce resources on preparing for success.

Your VC is smart. If they're really smart, they'll also suggest worrying about scalability before it's too late. This answer isn't a cop-out, it's just reality. Realize that being scalable means being able to easily add capacity wherever you need it, whenever you need it. Buying too much equipment too soon is just as insane as hiring people that sit around and do nothing. The trick is knowing what "too much" and "too soon" is, and it's going to differ from company to company, from product to product.

3. Premature optimization is the root of all evil.

I've not made it a secret that I think this quote has given many an engineer (both dev and ops) reasoning to make dumb decisions. I actually agree with the idea behind the quote, but I also think that it's another one of those balancing acts. Knuth (or Hoare) also said "We should forget about small efficiencies, say about 97% of the time…" Another good question might be: which 3% are you going to pay attention to?

4. We plan to throw hardware at the problem.

Ok. When does that hardware need to come, how much of it, and what will you do when you realize more hardware won't fix a particular problem? The now classic limitations of the single-write-master, many-read-slaves database architecture is a great example of this. Experienced devs and ops people consider the possibility that hardware can't solve all problems.

Note: these questions were adapted from How Important Is Being Scalable?.

Now that you have some free time again, what are you going to do with your life?

Eat. Sleep. Pay attention to my family. ☺

Related Articles

  • Capacity Management for Web Operations by John Allspaw

  • Thursday
    Feb 12, 2009

    MySpace Architecture

    Update: Presentation: Behind the Scenes at MySpace.com. Dan Farino, Chief Systems Architect at MySpace, shares details of some of MySpace's cool internal operations tools. MySpace.com is one of the fastest growing sites on the Internet, with 65 million subscribers and 260,000 new users registering each day. Often criticized for poor performance, MySpace has had to tackle scalability issues few other sites have faced. How did they do it? Site: http://myspace.com

    Information Sources

  • Presentation: Behind the Scenes at MySpace.com
  • Inside MySpace.com

    Platform

  • ASP.NET 2.0
  • Windows
  • IIS
  • SQL Server

    What's Inside?

  • 300 million users.
  • Pushes 100 gigabits/second to the internet. 10Gb/sec is HTML content.
  • 4,500+ web servers running Windows 2003/IIS 6.0/ASP.NET.
  • 1,200+ cache servers running 64-bit Windows 2003. 16GB of objects cached in RAM.
  • 500+ database servers running 64-bit Windows and SQL Server 2005.
  • MySpace processes 1.5 billion page views per day and handles 2.3 million concurrent users during the day.
  • Membership Milestones: - 500,000 Users: A Simple Architecture Stumbles - 1 Million Users:Vertical Partitioning Solves Scalability Woes - 3 Million Users: Scale-Out Wins Over Scale-Up - 9 Million Users: Site Migrates to ASP.NET, Adds Virtual Storage - 26 Million Users: MySpace Embraces 64-Bit Technology
  • 500,000 accounts was too much load for two web servers and a single database.
  • At 1-2 Million Accounts - They used a database architecture built around the concept of vertical partitioning, with separate databases for parts of the website that served different functions such as the log-in screen, user profiles and blogs. - The vertical partitioning scheme helped divide up the workload for database reads and writes alike, and when users demanded a new feature, MySpace would put a new database online to support it. - MySpace switched from using storage devices directly attached to its database servers to a storage area network (SAN), in which a pool of disk storage devices are tied together by a high-speed, specialized network, and the databases connect to the SAN. The change to a SAN boosted performance, uptime and reliability.
  • At 3 Million Accounts - the vertical partitioning solution didn't last because they replicated some horizontal information like user accounts across all vertical slices. With so many replications one would fail and slow down the system. - individual applications like blogs on sub-sections of the Web site would grow too large for a single database server - Reorganized all the core data to be logically organized into one database - split its user base into chunks of 1 million accounts and put all the data keyed to those accounts in a separate instance of SQL Server
  • 9 Million–17 Million Accounts - Moved to ASP.NET which used less resources than their previous architecture. 150 servers running the new code were able to do the same work that had previously required 246. - Saw storage bottlenecks again. Implementing a SAN had solved some early performance problems, but now the Web site's demands were starting to periodically overwhelm the SAN's I/O capacity—the speed with which it could read and write data to and from disk storage. - Hit limits with the 1 million-accounts-per-database division approach as these limits were exceeded. - Moved to a virtualized storage architecture where the entire SAN is treated as one big pool of storage capacity, without requiring that specific disks be dedicated to serving specific applications. MySpace now standardized on equipment from a relatively new SAN vendor, 3PARdata
  • Added a caching tier—a layer of servers placed between the Web servers and the database servers whose sole job was to capture copies of frequently accessed data objects in memory and serve them to the Web application without the need for a database lookup.
  • 26 Million Accounts - Moved to 64-bit SQL server to work around their memory bottleneck issues. Their standard database server configuration uses 64 GB of RAM.
  • Horizontally Federated Database. Databases are partitioned by purpose: there are profile databases, email databases, etc. Partitioning is based on user range; 1 million users live in each database. So you have Profile1, Profile2, all the way up to Profile300, since they have 300 million users.
  • Doesn't use ASP cache because they don't have a high enough hit rate on the front-end. The middle tier cache does have a high hit rate.
  • Failure isolation. Segment requests into web servers by database. Allow only 7 threads per database. So if a database is slow, only those threads will slow down and the traffic in the other threads will keep flowing (see the sketch after this list for how the sharding and the per-database thread cap fit together).
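
    The following sketch, with invented names and connection details rather than MySpace's actual code, shows how the user-range sharding (Profile1 through Profile300) and the 7-threads-per-database cap described above might fit together:

    # Sketch of two ideas from the list above: range partitioning users into
    # one-million-user database shards (Profile1 ... Profile300), and capping
    # concurrent requests per shard so one slow database can't drag down the
    # rest. Names and connection details are illustrative.
    import threading

    USERS_PER_SHARD = 1_000_000
    MAX_THREADS_PER_SHARD = 7

    def profile_shard(user_id):
        """Map a user id to its profile database, e.g. user 2_500_000 -> 'Profile3'."""
        return f"Profile{(user_id - 1) // USERS_PER_SHARD + 1}"

    # One semaphore per shard enforces the per-database concurrency cap.
    _shard_gates = {}
    _gates_lock = threading.Lock()

    def query_profile(user_id, run_query):
        shard = profile_shard(user_id)
        with _gates_lock:
            gate = _shard_gates.setdefault(
                shard, threading.BoundedSemaphore(MAX_THREADS_PER_SHARD))
        # If this shard is slow, at most 7 threads wait here; traffic to the
        # other shards keeps flowing.
        with gate:
            return run_query(shard, user_id)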

    Operations

  • PerfCollector. Centralized collection of performance data via UDP. More reliable than Windows and allows any client to connect and see stats (a minimal UDP sketch follows this list).
  • Web Based Stack Dump Tool. Can right-click on a problem server and get a stack dump of the .NET managed threads. Used to have to RDC into the system, attach a debugger, and get an answer a 1/2 hour later. Slow, nonscalable, and tedious. Not just a stack dump, it gives a lot of context about what the thread is doing. Troubleshooting is easier because you can see that 90 threads are blocked on a database, so the database may be down.
  • Web Based Heap Dump Tool. Dumps all memory allocations. Very useful for developers. Saves hours of doing it by hand.
  • Profiler. Traces a request from start to finish and produces a report. See the URL, methods, status, everything that will help you identify a slow request. Looks at lock contention, whether a lot of exceptions are being thrown, anything that might be interesting. Very lightweight. It runs on one box in every VIP (group of 100 servers) in production. Samples 1 thread every 10 seconds. Always tracing in the background.
  • PowerShell. Microsoft's new shell that runs in process and passes objects between commands instead of parsing text output. MySpace has developed a lot of cmdlets to support operations.
  • Developed their own asynchronous communication technology to get around Windows networking problems and treat servers as a group. Can ship a .cs file, compile it, run it, and ship the response back.
  • Codespew. Pushes code updates over their communication technology. Used to do 5 code pushes a day, now down to 1 a week.
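
    In the spirit of PerfCollector (the wire format, collector host, and port below are invented, not MySpace's), a fire-and-forget UDP stats sender can be as small as this:

    # Sketch of centralized UDP performance collection: every server fires
    # samples at a collector and never blocks or retries. Format and
    # endpoint are invented for illustration.
    import json
    import socket
    import time

    COLLECTOR = ("stats.example.internal", 8125)   # hypothetical collector

    def send_counter(name, value, hostname):
        payload = json.dumps({"host": hostname, "metric": name,
                              "value": value, "ts": time.time()}).encode()
        # UDP: no connection, no retries. Losing a sample is acceptable;
        # blocking the web server is not.
        sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        sock.sendto(payload, COLLECTOR)
        sock.close()

    # Example: send_counter("requests_per_sec", 1240, "web042")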

    Lessons Learned

  • You can build big websites using Microsoft tech.
  • A cache should have been used from the beginning.
  • The cache is a better place to store transitory data that doesn't need to be recorded in a database, such as temporary files created to track a particular user's session on the Web site.
  • Built in OS features to detect denial of service attacks can cause inexplicable failures.
  • Distribute your data to geographically diverse data centers to handle power failures.
  • Consider using virtualized storage/clustered file systems from the start. It allows you to massively parallelize IO access while being able to add disk as needed without any reorganization needed.
  • Develop tools that work in a production environment. You can't simulate everything in a test environment. The scale and variety of uses the APIs are put to can't be simulated in QA during testing. Legitimate users and hackers will run into corner cases that weren't hit in testing, though QA will find most of the problems.
  • Throw hardware at problems. Easier than changing their backend software to a new way of doing things. The example is they add a new database server for every million users. It might be more efficient to change their approach to more efficiently use the database hardware, but it's easier just to add servers. For now.


  • Friday
    Dec 5, 2008

    Sprinkle - Provisioning Tool to Build Remote Servers

    At 37 Signals Joshua Sierles describes how 37 Signals uses Sprinkle to configure their servers within EC2. Sprinkle defines a domain specific meta-language for describing and processing the installation of software. You can find an interesting discussion of Sprinkle's creation story by the creator himself, Marcus Crafter, in Sprinkle Some Powder!. Marcus divides provisioning tools into two categories:

  • Task Based - the tool issues a list of commands to run on the remote system, either remotely via a network connection or smart client.
  • Policy/state Based - the tool determines what needs to be run on the remote system by examining its current and final state.

    Sprinkle combines both models together in a chocolate-in-my-peanut-butter approach, using normal Ruby code as the DSL (domain specific language) to declaratively describe remote system configurations. 37 Signals likes the use of Ruby as the DSL because it makes learning a separate syntax unnecessary. I've successfully done similar things in Perl. You already have a scripting language, why layer another one on top? One reason not to is that you've now tied configuration and execution together so that only one tool can control the process, but the leverage is so high with this approach it's hard to ignore. There are all the usual bits about defining packages, dependencies, installation logic, pre and post actions, etc. The format is compact and clear because that's how Ruby is, and the operations are task specific so there's no fluff. Capistrano is used to communicate with remote systems, though that is pluggable. 37 Signals uses the EC2 security group as a way to specify the role an instance should take on when it boots. A configuration script that can handle all roles is shipped with a nearly complete functional base image. Sprinkle then configures the system the rest of the way based on the passed-in role. Joshua says they like this approach better than Puppet because it doesn't rely on a centralized configuration server or "pushing large sets of commands over SSH manually." There's always more than one way to do "it", and Sprinkle carves out an interesting niche in the provisioning space. The 37 Signals approach doesn't scale to a large organization with many different flavors of servers, but for a specific set of tightly cooperating servers it's a very simple, clean, and robust way of doing business. Related Articles: Product: Puppet the Automated Administration System.


  • Monday
    Nov 24, 2008

    Product: Scribe - Facebook's Scalable Logging System


    In Log Everything All the Time I advocate applications shouldn't bother logging at all. Why waste all that time and code? No, wait, that's not right. I preach logging everything all the time. Doh. Facebook obviously feels similarly, which is why they open sourced Scribe, their internal logging system, capable of logging tens of billions of messages per day. These messages include access logs, performance statistics, actions that went to News Feed, and many others.

    Imagine hundreds of thousands of machines across many geographically dispersed datacenters just aching to send their precious log payload to the central repository of all knowledge. Because really, when you combine all the metadata with all the events you pretty much have a complete picture of your operations. Once in the central repository, logs can be scanned, indexed, summarized, aggregated, refactored, diced, data cubed, and mined for every scrap of potentially useful information.

    Just imagine the log stream from all of Facebook's Apache servers alone. Brutal. My guess is these are not real-time feeds so there are no streaming query issues, but the task is still daunting. Let's say they log 10 billion messages a day. That works out to well over 100,000 messages per second!

    When no off the shelf products worked for them they built their own. Scribe can be downloaded from Sourceforge. But the real action is on their wiki. It's here you'll find some decent documentation and their support forums. Not much activity on the site so you haven't missed your chance to be a charter member of the Scribe guild.

    A logging system has three broad components:

  • Client Code Interface - How does your code interact with the log system? Scribe doesn't do much for you here. There's a simple Thrift interface for logging from a large set of languages, but the bulk of the work is still up to you.
  • Distribution System - This is where Scribe fits. It reliably (mostly) moves large numbers of messages around. A few error cases lead to data loss: 1) if a client can't connect to either the local or central scribe server, the message will be lost; 2) if a scribe server crashes, it could lose a small amount of data that's in memory but not on disk; 3) some multiple component failure cases, such as a resender that can't connect to any central server while its local disk fills up; 4) some rare timeout conditions can lead to duplicate messages.
  • Do Something Usefullizer - How do you do anything useful with over 100,000 messages per second? Good question. Scribe doesn't help here. But Scribe will get your data there.

    I browsed around the source and it's a well crafted, straightforward socket server that forwards messages to other servers and can write messages to disk. Nothing fancy, which is probably why it works for them. Its basic function is:

    Scribe is a server for aggregating streaming log data. It is designed to scale to a very large number of nodes and be robust to network and node failures. There is a scribe server running on every node in the system, configured to aggregate messages and send them to a central scribe server (or servers) in larger groups. If the central scribe server isn't available the local scribe server writes the messages to a file on local disk and sends them when the central server recovers. The central scribe server(s) can write the messages to the files that are their final destination, typically on an nfs filer or a distributed file system, or send them to another layer of scribe servers.
    In some ways it could be fancier. For example, there's no throttle on incoming connections, so a server can chew up memory. And there is a max_msg_per_second throttle on message processing, but this is really too simple. Throttling needs to be adaptive based on local conditions and the conditions of downstream servers. Under load you want to push flow control back to the client so the data stays there until resources become available. Simple configuration file settings rarely work when the world starts getting weird.

    Client Code Interface

    Here's what the Thrift interface looks like:

    enum ResultCode
    {
      OK,
      TRY_LATER
    }

    struct LogEntry
    {
      1: string category,
      2: string message
    }

    service scribe extends fb303.FacebookService
    {
      ResultCode Log(1: list<LogEntry> messages);
    }
    I know, I thought the same thing. Thank God there's another IDL syntax. We simply did not have enough of them. Thrift translates this IDL into the glue code necessary for making cross-language calls (marshalling arguments and responses over the wire). The Thrift library also has templates for servers and clients.

    Here's what a call looks like in PHP:

    // $conn is assumed to be an open Thrift client connection to a scribe server
    $messages = array();
    $entry = new LogEntry;
    $entry->category = "buckettest";
    $entry->message = "something very interesting happened";
    $messages[] = $entry;
    $result = $conn->Log($messages);


    Pretty simple. Usually in C++, for example, there's an elaborate set of macros for logging that provide sophisticated control of log generation. It might look something like:

    MSG(msg) - a simple message. It only prints out msg. None of the other information is printed out.
    NOTE(const char* name, const char* reason, const char* what, Module* module, msg) - something to take note of.
    WARN(const char* name, const char* reason, const char* what, Module* module, msg) - a warning.
    ERR(const char* name, const char* reason, const char* what, Module* module, msg) - an error occurred.
    CRIT(const char* name, const char* reason, const char* what, Module* module, msg) - a critical error occurred.
    EMERG(const char* name, const char* reason, const char* what, Module* module, msg) - an emergency occurred.


    There's lots more to handle streams and behind-the-scenes things like time stamps, thread ids, function names, and line numbers. Scribe has wisely not done any of that. It has an RPC-like interface to send a list of messages and that's it. It's up to you to write the wrappers.

    You'll no doubt have noticed Scribe only logs a category and message, both strings:

    Scribe is unique in that clients log entries consisting of two strings, a category and a message. The category is a high level description of the intended destination of the message and can have a specific configuration in the scribe server, which allows data stores to be moved by changing the scribe configuration instead of client code. The server also allows for configurations based on category prefix, and a default configuration that can insert the category name in the file path. Flexibility and extensibility is provided through the "store" abstraction. Stores are loaded dynamically based on a configuration file, and can be changed at runtime without stopping the server. Stores are implemented as a class hierarchy, and stores can contain other stores. This allows a user to chain features together in different orders and combinations by changing only the configuration.

    Distribution System

    The payload has whatever structure you give it. Scribe is policy neutral and doesn't push a logging model on you.

    The configuration file looks something like this:

    # BUCKETIZER TEST
    <store>
      category=buckettest
      type=buffer
      target_write_size=20480
      max_write_interval=1
      buffer_send_rate=2
      retry_interval=30
      retry_interval_range=10
      <primary>
        type=bucket
        num_buckets=6
        bucket_subdir=bucket
        bucket_type=key_hash
        delimiter=1
        <bucket>
          type=file
          fs_type=std
          file_path=/tmp/scribetest
          base_filename=buckettest
          max_size=1000000
          rotate_period=hourly
          rotate_hour=0
          rotate_minute=30
          write_meta=yes
        </bucket>
      </primary>
      <secondary>
        type=file
        fs_type=std
        file_path=/tmp
        base_filename=buckettest
        max_size=30000
      </secondary>
    </store>
    The types of stores currently available are:
  • file - writes to a file, either local or nfs.
  • network - sends messages to another scribe server.
  • buffer - contains a primary and a secondary store. Messages are sent to the primary store if possible, and otherwise the secondary. When the primary store becomes available the messages are read from the secondary store and sent to the primary.
  • bucket - contains a large number of other stores, and decides which messages to send to which stores based on a hash.
  • null - discards all messages.
  • thriftfile - similar to a file store but writes messages into a Thrift TFileTransport file.
  • multi - a store that forwards messages to multiple stores.

    Certainly a flexible and useful set of logging capabilities. You can build a hierarchy of log servers to do pretty much anything you want. You could imagine having a log server on each server with a file store to handle upstream server failures. This log server forwards messages on to a centralized server for a datacenter. And all the datacenter servers forward their logs on to the centralized data warehouse. To scale, adjust fan-in and fan-out as necessary.

    Do Something Usefullizer

    You may not have over 100,000 log messages a second to process, but you are likely to have your own tanker truck full of log messages. How do you do something useful with them?
  • Log messages stored in log files are next to useless. Grep'ing on a terabyte of logs to answer simple questions about your data just doesn't work.
  • You may have a sharded data warehouse you can pump log messages into and do a reasonably effective job of querying.
  • Or you can set up a Hadoop/HDFS-style system. The idea here is you need a distributed file system to handle the continual stream of log messages. And once you have all the data stored safely away, you'll need to use map-reduce to do anything with such a large amount of data.

    If you want to ask, for example, how many of your users are from Asia, log files won't work. It's likely your data warehouse can't handle it. Hadoop/HDFS is a practical option.
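
    To make the map-reduce shape of that question concrete, here is a toy, single-machine sketch; real volumes would run this on Hadoop, and the log line format is invented:

    # Toy illustration of the map/reduce shape of "how many of our users are
    # from Asia": a mapper emits (region, 1) per log line, a reducer sums per
    # region. The log format is hypothetical.
    from collections import defaultdict

    def mapper(log_line):
        # hypothetical format: "timestamp user_id region action"
        parts = log_line.split()
        if len(parts) >= 3:
            yield (parts[2], 1)

    def reducer(pairs):
        counts = defaultdict(int)
        for region, one in pairs:
            counts[region] += one
        return dict(counts)

    logs = [
        "1228089600 1001 asia view_photo",
        "1228089601 1002 europe view_photo",
        "1228089602 1003 asia upload",
    ]
    print(reducer(pair for line in logs for pair in mapper(line)))
    # {'asia': 2, 'europe': 1}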

    If that's the direction you are going, what does it imply about your log system? I would say it makes even the simple category-payload system of Scribe overkill. The idea with a scalable backend is to move log payloads from applications to the centralized store as quickly as possible. By definition the central store can handle the load, so there's no reason to use intermediate servers to scale. Applications write directly to the central store, even from multiple datacenters. The payload structure is unimportant until it hits the central store. If the application can't hit the central store, then it queues into the file system until it can. Ideally log messages never hit the file system until HDFS is writing them to their final destination. This makes for low latency and high throughput logging, and it's even simpler than Scribe.
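
    A minimal sketch of that design, assuming a hypothetical CentralStore client that exposes an append() call, might look like this:

    # Sketch of the design described above: the application writes log
    # payloads straight to the central store and only spools to local disk
    # while the store is unreachable. The store interface is hypothetical.
    import json
    import os
    import time

    SPOOL_DIR = "/var/spool/applog"      # local queue used only during outages

    class DirectLogger:
        def __init__(self, central_store):
            self.store = central_store    # assumed to expose .append(bytes)
            os.makedirs(SPOOL_DIR, exist_ok=True)

        def log(self, payload):
            record = json.dumps({"ts": time.time(), "payload": payload}).encode()
            try:
                self.store.append(record)
            except OSError:
                # Central store unreachable: queue on the local file system.
                spool_file = os.path.join(SPOOL_DIR, f"{int(time.time())}.spool")
                with open(spool_file, "ab") as f:
                    f.write(record + b"\n")

        def drain_spool(self):
            # Called periodically: replay anything queued during an outage.
            for name in sorted(os.listdir(SPOOL_DIR)):
                path = os.path.join(SPOOL_DIR, name)
                with open(path, "rb") as f:
                    for line in f:
                        self.store.append(line.rstrip(b"\n"))
                os.remove(path)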

    If you don't have a scalable central store, then Scribe is a good option. It gives you all the flexibility you need to compose your logging system in a way that is mostly reliable and scalable.
  • Wednesday
    Oct 29, 2008

    CTL - Distributed Control Dispatching Framework 

    CTL is a flexible distributed control dispatching framework that enables you to break management processes into reusable control modules and execute them in distributed fashion over the network. From their website: What does CTL do? CTL helps you leverage your current scripts and tools to easily automate any kind of distributed systems management or application provisioning task. It's good for simplifying large-scale scripting efforts or as another tool in your toolbox that helps you speed through your daily mix of ad-hoc administration tasks. What are CTL's features? CTL has many features, but the general highlights are:

  • Execute sophisticated procedures in distributed environments - Aren't you tired of writing and then endlessly modifying scripts that loop over nodes and invoke remote actions? CTL dispatches actions to remote controllers with network transparency (over SSH), parallelism, and error handling already built in.
  • Comes with pre-built utilities - CTL comes with pre-built utilities so you don't have to script actions like file distribution or process and port checking.
  • Define your own automation using the tools/languages you already know - New controller modules are defined in XML, and your scripting can be done in multiple scripting languages (Perl, Python, etc.), *nix shell, Windows batch, and/or Ant.
  • Cross platform administration - CTL is Java-based and works on *nix and Windows.

    Related Articles

  • Blog.Control.Tier
  • Introduction to CTL - a very nice overview of features.
  • Interesting Puppet Thread Mentioning CTL


  • Saturday
    Oct 25, 2008

    Product: Puppet the Automated Administration System

    Update: Digg on their choice and use of Puppet. They chose Puppet over cfengine and bcfg2 because they liked Puppet's resource abstraction layer (RAL), the ability to implement configuration management incrementally, support for bundles, and the overall design philosophy. Puppet implements a declarative (what, not how) configuration language for automating common administration tasks. It's the system every large site writes for themselves, and it's already made for you! iLike was able to "easily" scale from 0 to hundreds of servers using Puppet. I can't believe I've never seen this before. It looks really cool. What is Puppet and how can it help you scale your website operations? From the Puppet website: Puppet has been developed to help the sysadmin community move to building and sharing mature tools that avoid the duplication of everyone solving the same problem. It does so in two ways:

  • It provides a powerful framework to simplify the majority of the technical tasks that sysadmins need to perform.
  • The sysadmin work is written as code in Puppet's custom language, which is shareable just like any other code.

    This means that your work as a sysadmin can get done much faster, because you can have Puppet handle most or all of the details, and you can download code from other sysadmins to help you get done even faster. The majority of Puppet implementations use at least one or two modules developed by someone else, and there are already tens of recipes available in Puppet's CookBook. This sounds good. But does it work in the field? HJK Solutions' Adam Jacob says it does: Puppet enables us to get a huge jump-start on building automated, scalable, easy to manage infrastructures for our clients. Using Puppet, we:

    1. Automate as much of the routine systems administration tasks as possible.
    2. Get 10 minute unattended build times from bare metal, most of which is data transfer. Puppet takes it the rest of the way, getting the machines ready to have applications deployed on them. It's down to two and a half minutes for Xen.
    3. Bootstrap our clients' production environments while building their development environment. I can't stress how cool this really is. Because we are expressing the infrastructure at a higher level, when it comes time to deploy your production systems, it's really a non-event. We just roll out the Puppet Master and an Operating System auto-install environment, and it's finished.
    4. Cross-pollinate between clients with similar architectures. We work with several different shops using Ruby on Rails, all of whom have very similar infrastructure needs. By using Puppet in all of them, when we solve a problem for one client, we've effectively solved it for the others. I love being able to tell a client that we solved a problem for them, and all it's going to cost is the time it takes for us to add the recipe.

    Puppet, today, is a tool that is good enough to handle the vast majority of issues encountered in building scalable infrastructures. Even the places where it falls short are almost always just a matter of it being less elegant than it could be, and the entire community is working on making those parts better.

    Related Articles

  • Operations is a competitive advantage... (Secret Sauce for Startups!) by Jesse Robbins
  • Infrastructure 2.0 by John Willis
  • Puppet, iLike and Infrastructure 2.0 by John Willis
  • Why are people paying 3 to 5 million for configuration management software? by Adam Jacob


  • Tuesday
    Sep 16, 2008

    Product: Func - Fedora Unified Network Controller

    Func is used to manage a large network using bash or Python scripts. It targets easy and simple remote scripting and one-off tasks over SSH by creating a secure (SSL certificates) XMLRPC API for communication. Any kind of application can be written on top of it. Other configuration management tools specialize in mass configuration: they say here's what the machine should look like and keep it that way. Func allows you to program your cluster. If you've ever tried to securely remote script a gang of machines using SSH keys, you know what a total nightmare that can be. Some example commands:

    Using the command line:
    func "*.example.org" call yumcmd update
    
    Using the Python API:
    import func.overlord.client as fc
    client = fc.Client("*.example.org;*.example.com")
    client.yumcmd.update()
    client.service.start("acme-server")
    print client.hardware.info()
    
    Func may certainly overlap in functionality with other tools like Puppet and cfengine, but as programmers we always need more than one way to do it, and I can definitely see how I could have used Func on a few projects.

    Related Articles
  • High Scalability Operations Tag
  • Open source project: Func, the Fedora Unified Network Controller by Michael DeHaan.
  • Func, the Fedora Unified Network Controller by Luca Foppiano
  • Multi-system administration with Func by Jake Edge.


  • Wednesday
    Sep 3, 2008

    Some Facebook Secrets to Better Operations

    Kim Nash in an interview with Jonathan Heiliger, Facebook VP of technical operations, provides some juicy details on how Facebook handles operations. Operations is one of those departments everyone runs differently as it is usually an ontogeny recapitulates phylogeny situation. With 2,000 databases, 25 terabytes of cache, 90 million active users, and 10,000 servers you know Facebook has some serious operational issues. What are some of Facebook's secrets to better operations?

  • Frequent Releases. A major release once a week and a minor releases every few days.
  • Create a Cyber Liability Group. At one time operations was distributed amongst several groups. A permanent operations group was created to isolate problems and revert problem software components back to previously known good states. The ability of a separate team to handle rollbacks speaks to a great deal of standardization and advanced tool building.
  • Distribute Team Across Time Zones. Split the operations team across different time zones so no one has to work the graveyard shift. Facebook has 20 people in their team located in Palo Alto, California and London, England.
  • Be Innovative, Not Safe. Fear of failure often shuts down the organizational brain and makes it hide behind excessive rules and regulations. A technology company should have a bias towards action and innovation. Release software. Don't stifle genius. Rely on your tools and processes to recover from problems.
  • Expect Problems. Software pushed to production will have problems. Expect problems, but don't let that stop you from innovating.
  • Roll Backward. When a problem is detected in a release the changes can either be rolled forward or backward. Rolling back is going to a previously good release. Rolling forward is fixing problems in the new release rather than rolling back. Bugs in production are fixed in production. Roll forward ends up being covered in the press, so prefer roll backs over roll forwards.
  • Roll Out Massive Changes Slowly. Turn on features gradually, for a few percent of users at a time. Use the slow rollout to fix problems that can only be found under real user conditions. This approach gives operations and development a lot of confidence in changes (see the rollout sketch after this list).
  • Encourage Openness and Information Sharing. Design reviews, PR strategy, which servers to buy, etc. are often open for informal debate among employees. Facebook has created an Ideas system where employees can create an Idea by category. There's a discussion tool for discussing the idea and a rating system for rating the idea. The tools are built on top of the Facebook platform so they are available to everyone.
  • Live-blog Key Events. Large company meetings, monthly presentations and weekly Q&As with the management team are transcribed live.
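
    The usual way "a few percent of users at a time" is implemented is a deterministic hash of the user against the feature, so the audience grows without churning as the percentage rises. The sketch below is illustrative, not Facebook's actual gatekeeping code:

    # Sketch of a gradual rollout: hash the user id with the feature name so
    # the same users stay in the rollout as the percentage grows. Feature
    # names and percentages are invented.
    import hashlib

    ROLLOUT_PERCENT = {"new_news_feed": 2}      # start with 2% of users

    def feature_enabled(feature, user_id):
        percent = ROLLOUT_PERCENT.get(feature, 0)
        digest = hashlib.md5(f"{feature}:{user_id}".encode()).hexdigest()
        bucket = int(digest, 16) % 100           # stable bucket 0..99 per user
        return bucket < percent

    # Raising ROLLOUT_PERCENT["new_news_feed"] to 10, then 50, then 100
    # widens the audience without ever kicking earlier users back out.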

    It sounds like a relatively fun environment for pushing software live. Getting software moved into production is often harder than the original coding and testing. Now I know what you are thinking. You somehow managed to procure the ssh login. So just login remotely and do the install yourself! Nobody will know. Oh so tempting. But it's not really good corporate citizenship. And you just might screw up, then there will be some esplaining to do.

    Emphasizing frequent releases and gutsy release policies makes it actually seem like someone is supporting developers instead of treating them like their software carries the plague. Data centers are often treated like quarantine stations and developers are treated like asymptomatic carriers of some unknown virulent disease. To be safe nothing should ever change, but that's not an attitude that makes things better. Nice to see that recognized.

    To set up or not to set up a separate operations group? Facebook says "to be" and creates a separate group. Amazon says "not to be" and has developers support their own software. Secretly I think Amazon gets better results by requiring developers to support their own software. Knowing it may be you getting the "It's Down!" call gives one proper perspective. But I like not being on call, and I think most developers agree. Plus the idea of "following the sun" to get 24 hour support is a smart one.

    Related Articles

  • HighScalability Operations
  • On Designing and Deploying Internet-Scale Services
  • Amazon Architecture
  • Tuesday
    Mar 25, 2008

    Paper: On Designing and Deploying Internet-Scale Services

    Greg Linden links to a heavily lesson-laden LISA 2007 paper titled On Designing and Deploying Internet-Scale Services by James Hamilton of the Windows Live Services Platform group. I know people crave nitty-gritty details, but this isn't a how-to-configure-a-web-server article. It hitches you to a rocket and zooms you up to 50,000 feet so you can take a look at best web operations practices from a broad, yet practical perspective. The author and his team of contributors obviously have a lot of in-the-trenches experience. Many non-obvious topics are covered. And there's a lot to learn from it.

    The paper has too many details to cover here, but the big sections are:

  • Recommendations
  • Automatic Management and Provisioning
  • Dependency Management
  • Release Cycle and Testing
  • Operations and Capacity Planning
  • Graceful Degradation and Admission Control
  • Customer Self-Provisioning and Self-Help
  • Customer and Press Communication Plan

    In the recommendations we see some of our old favorites:
  • Expect failure and design for failure.
  • Implement redundancy and fault recovery.
  • Depend upon a commodity hardware slice.
  • Keep things simple and robust.
  • Automate everything.

    Personally, I'm still trying to figure out how to make something simple.

    Next are some good thoughts on how to design operations friendly software:
  • Quick service health check. This is the services version of a build verification test (a minimal sketch follows this list).
  • Develop in the full environment.
  • Zero trust of underlying components.
  • Do not build the same functionality in multiple components.
  • One pod or cluster should not affect another pod or cluster.
  • Allow (rare) emergency human intervention.
  • Enforce admission control at all levels.
  • Partition services.
  • Understand the network design.
  • Analyze throughput and latency.
  • Treat operations utilities as part of the service.
  • Understand access patterns.
  • Version everything.
  • Keep the unit/functional tests from the last release.
  • Avoid single points of failure.
  • Support single-version software. Have all your customers run the same version.
  • Implement multi-tenancy. Apparently a lot of software requires cloning hardware installations to support multiple customers. Don't do that. Have your software work for multiple customers all on the same hardware.
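
    As a concrete example of the first recommendation, a quick service health check can be as small as the sketch below; the dependency checks are placeholders, and the path and port are invented:

    # Minimal sketch of a "quick service health check": a cheap endpoint that
    # exercises the service's critical dependencies and reports pass/fail,
    # suitable for load balancers and deploy scripts.
    from http.server import BaseHTTPRequestHandler, HTTPServer

    def check_database():
        return True       # placeholder: e.g. run "SELECT 1" with a short timeout

    def check_cache():
        return True       # placeholder: e.g. set/get a sentinel key

    class HealthHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            if self.path != "/healthz":
                self.send_error(404)
                return
            healthy = check_database() and check_cache()
            self.send_response(200 if healthy else 503)
            self.end_headers()
            self.wfile.write(b"OK\n" if healthy else b"UNHEALTHY\n")

    if __name__ == "__main__":
        HTTPServer(("0.0.0.0", 8080), HealthHandler).serve_forever()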

    And the paper continues along the same lines in each section. Good detailed advice on lots of different topics.

    You'll undoubtedly agree with some of the advice and disagree with some. Greg wants faster release cycles, thinks having server affinity for some things is OK, and thinks the advice on allowing humans to throttle load won't work in a crisis. Perfectly valid points, but what's fun is to consider them. Some companies, for example, have a dead-man's switch that must be thrown before one master can failover to another in a multi-datacenter situation. Is that wrong or right? Only the shadow knows.

    The advice to "document all conceivable component failures and modes and combinations" sounds good but is truly difficult to do in practice. I went through this process once on a telco project and it took months just to cover all the failure scenarios on a few cards. But the spirit is right I think.

    My favorite part of the whole paper is:
    We have long believed that 80% of operations issues originate in design and development, so this section
    on overall service design is the largest and most important. When systems fail, there is a natural tendency
    to look first to operations since that is where the problem actually took place. Most operations issues,
    however, either have their genesis in design and development or are best solved there.

    Understand this and I think much of the rest follows naturally.