The email sent will contain a link to this article, the article title, and an article excerpt (if available). For security reasons, your IP address will also be included in the sent email.

This is a guest post by Steve Newman, co-founder of Writely (Google Docs), tech lead on the Paxos-based synchronous replication in Megastore, and founder of cloud service provider Scalyr.com.
Microsoft’s Azure service suffered a widely publicized outage on February 28th / 29th. Microsoft recently published an excellent postmortem. For anyone trying to run a high-availability service, this incident can teach several important lessons.
The central lesson is that, no matter how much work you put into redundancy, problems will arise. Murphy is strong and, I might say, creative; things go wrong. So preventative measures are important, but how you react to problems is just as important. It’s interesting to review the Azure incident in this light...