Tuesday, Mar 25, 2008
Paper: On Designing and Deploying Internet-Scale Services

Greg Linden links to a lesson-laden LISA 2007 paper titled On Designing and Deploying Internet-Scale Services by James Hamilton of the Windows Live Services Platform group. I know people crave nitty-gritty details, but this isn't a how-to-configure-a-web-server article. It hitches you to a rocket and zooms you up to 50,000 feet so you can look at web operations best practices from a broad, yet practical, perspective. The author and his team of contributors obviously have a lot of in-the-trenches experience. Many non-obvious topics are covered, and there's a lot to learn from them.
The paper has too many details to cover here, but the big sections are:
In the recommendations we see some of our old favorites:
Personally, I'm still trying to figure out how to make something simple.
Next are some good thoughts on how to design operations friendly software:
And the paper continues along the same lines in each section. Good detailed advice on lots of different topics.
You'll undoubtedly agree with some of the advice and disagree with some. Greg wants faster release cycles, thinks having server affinity for some things is OK, and thinks the advice on allowing humans to throttle load won't work in a crisis. Perfectly valid points, but what's fun is to consider them. Some companies, for example, have a dead-man's switch that must be thrown before one master can fail over to another in a multi-datacenter situation. Is that wrong or right? Only the Shadow knows.
The advice to "document all conceivable component failures and modes and combinations" sounds good but is truly difficult to do in practice. I went through this process once on a telco project and it took months just to cover all the failure scenarios on a few cards. But the spirit is right, I think.
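To get a feel for why this takes months, here's a rough Python sketch that enumerates the failure combinations for a tiny system. The component names and failure modes are made up for illustration; they are not taken from the paper or from any real project.

```python
from itertools import combinations, product

# Hypothetical components and failure modes, invented purely for illustration.
FAILURE_MODES = {
    "disk": ["full", "slow", "dead"],
    "network": ["partition", "packet_loss"],
    "power": ["brownout", "outage"],
    "database": ["replica_lag", "primary_down", "corrupt_index"],
}


def failure_scenarios(modes, max_simultaneous):
    """Yield every scenario in which 1..max_simultaneous components fail at once.

    Each scenario is a tuple of (component, mode) pairs, with exactly one
    failure mode chosen for each failed component.
    """
    components = sorted(modes)
    for k in range(1, max_simultaneous + 1):
        # Pick which k components are failing together...
        for failed in combinations(components, k):
            # ...then pick one failure mode for each of them.
            for chosen in product(*(modes[c] for c in failed)):
                yield tuple(zip(failed, chosen))


def count_scenarios(modes, max_simultaneous):
    """Count the scenarios without keeping them all in memory."""
    return sum(1 for _ in failure_scenarios(modes, max_simultaneous))
```

With only these four components, single failures give 10 scenarios, allowing two simultaneous failures gives 47, and allowing all four gives 143. The count grows multiplicatively with every component or mode you add, which is roughly why a few telco cards can eat months.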
My favorite part of the whole paper is:
We have long believed that 80% of operations issues originate in design and development, so this section on overall service design is the largest and most important. When systems fail, there is a natural tendency to look first to operations since that is where the problem actually took place. Most operations issues, however, either have their genesis in design and development or are best solved there.
Understand this and I think much of the rest follows naturally.
Reader Comments (3)
Eureka! This is a great paper, and it is certainly worth a proper read! If you're stuck trying to build SaaS/high-scalability solutions in a traditional development/operational organisation, this could prove to be an eye-opener for your organisation. It also raises a question I have asked for a long time: can traditional ISVs create internet-scale services?
One of my favourite bits from the abstract must be:
Make the development team responsible. Amazon is perhaps the most aggressively down this path with their slogan “you built it, you manage it.” That position is perhaps slightly stronger than the one we would take, but it’s clearly the right general direction. If development is frequently called in the middle of the night, automation is the likely outcome. If operations is frequently called, the usual reaction is to grow the operations team.
This is a great find and well worth the read. Thanks for finding that one for us, Todd. I will most certainly be blogging about this one soon as well.
If you liked this article, here are two more you will probably like. I have written about both already; here are the links if you are interested.
Release IT! Book Recommendation
http://www.productionscale.com/home/2008/1/27/book-recommendation-release-it.html
Harvard Business Review Article Review
http://www.productionscale.com/home/2008/3/18/hbr-article-read-recommendation-radically-simple-it.html
Regards,
Kent Langley
That bit is probably my least favorite part of the paper, because it implies that operations personnel aren't interested in preventing future wake-ups. :) Nothing could be further from the truth, in my experience.