Paper: On Designing and Deploying Internet-Scale Services

Tuesday

Mar252008

Paper: On Designing and Deploying Internet-Scale Services

Tuesday, March 25, 2008 at 5:58AM

Greg Linden links to a heavily lesson ladened LISA 2007 paper titled On Designing and Deploying Internet-Scale Services by James Hamilton of the Windows Live Services Platform group. I know people crave nitty-gritty details, but this isn't a how to configure a web server article. It hitches you to a rocket and zooms you up to 50,000 feet so you can take a look at best web operations practices from a broad, yet practical perspective. The author and his team of contributors obviously have a lot of in the trenches experience. Many non-obvious topics are covered. And there's a lot to learn from.

The paper has too many details to cover here, but the big sections are:

Recommendations

Automatic Management and Provisioning

Dependency Management

Release Cycle and Testing

Operations and Capacity Planning

Graceful Degradation and Admission Control

Customer Self-Provisioning and Self-Help

Customer and Press Communication Plan

In the recommendations we see some of our old favorites:

Expect failure and design for failure.

Implement redundancy and fault recovery.

Depend upon a commodity hardware slice.

Keep things simple and robust.

Automate everything.

Personally, I'm still trying to figure out how to make something simple.

Next are some good thoughts on how to design operations friendly software:

Quick service health check. This is the services version of a build verification test.

Develop in the full environment.

Zero trust of underlying components.

Do not build the same functionality in multiple components.

One pod or cluster should not affect another pod or cluster.

Allow (rare) emergency human intervention.

Enforce admission control at all levels.

Partition services.

Understand the network design.

Analyze throughput and latency.

Treat operations utilities as part of the service.

Understand access patterns.

Version everything.

Keep the unit/functional tests from the last release.

Avoid single points of failure.

Support single-version software. Have all your customers run the same version.

Implement multi-tenancy. Apparently a lot of software requires cloning hardware installations to support multiple customers. Don't do that. Have your software work for multiple customers all on the same hardware.

And the paper continues along the same lines in each section. Good detailed advice on lots of different topics.

You'll undoubtedly agree with some of the advice and disagree with some. Greg wants faster release cycles, thinks having server affinity for some things is OK, and thinks the advice on allowing humans to throttle load won't work in a crisis. Perfectly valid points, but what's fun is to consider them. Some companies, for example, have a dead-man's switch that must be thrown before one master can failover to another in a multi-datacenter situation. Is that wrong or right? Only the shadow knows.

The advice to "document all conceivable component failures and modes and combinations" sounds good but is truly difficult to do in practice. I went through this process once on a telco project and it took months just to cover all the failure scenarios on a few cards. But the spirit is right I think.

My favorite part of the whole paper is:
We have long believed that 80% of operations issues originate in design and development, so this section
on overall service design is the largest and most important. When systems fail, there is a natural tendency
to look first to operations since that is where the problem actually took place. Most operations issues,
however, either have their genesis in design and development are best solved there.

Understand this and I think much of the rest follows naturally.

Todd Hoff |

3 Comments |

Permalink |

Print Article

Email Article

Paper,

deployment,

management,

operations

References (1)

References allow you to track sources for this article, as well as articles that were written in response to this article.

Source: On Designing and Deploying Internet-Scale Services

Reader Comments (3)

Eureka! This is a great paper, it is certainly worth a proper read! If you're stuck trying to make SaaS/high scalability solutions in a traditional development/operational organisation this could prove to be an eye- opener to your organisation. It also certainly raises a question I have touted for a long time; can traditional ISV's create internet scale services?

One of my favourite bits from the abstract must be:

Make the development team responsible. Amazon is perhaps the most aggressively down this path with their slogan ‘‘you built it, you manage it.’’ That position is perhaps slightly stronger than the one we would take, but it’s clearly the right general direction. If development is frequently called in the middle of the night, automation is the likely outcome. If operations is frequently called, the usual reaction is to grow the operations team.

December 31, 1999 |

jab

This is a great find and a well worth the read. Thanks for finding that one for us Todd. I will most certainly be blogging about this one soon as well.

If you liked this article then here are two more you will probably like. I have written about them both already, here are the links if you are interested.

Release IT! Book Recommendation
http://www.productionscale.com/home/2008/1/27/book-recommendation-release-it.html

Harvard Business Review Article Review
http://www.productionscale.com/home/2008/3/18/hbr-article-read-recommendation-radically-simple-it.html

Regards,
Kent Langley

December 31, 1999 |

Kent Langley

That bit is probably my least favorite bits from the paper, because it implies that operations personnel aren't interested in preventing future wake-ups. :) Nothing could be further from the truth, in my experience.

December 31, 1999 |

john allspaw

Post a New Comment

Enter your information below to add a new comment.

Author:

Author Email (optional):

Author URL (optional):

Post:

↓ | ↑

Some HTML allowed: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <code> <em> <i> <strike> <strong>