Monday, May 2, 2011

The Updated Big List of Articles on the Amazon Outage

Since The Big List Of Articles On The Amazon Outage was published we've had a few updates that people might not have seen. Amazon, of course, released their Summary of the Amazon EC2 and Amazon RDS Service Disruption in the US East Region. Netflix shared their Lessons Learned from the AWS Outage, as did Heroku (How Heroku Survived the Amazon Outage), SmugMug (How SmugMug survived the Amazonpocalypse), and SimpleGeo (How SimpleGeo Stayed Up During the AWS Downtime).

The curious thing from my perspective is the general lack of response to Amazon's explanation. I expected more discussion, but there's been almost none that I've seen. My guess is that very few people understand what Amazon was talking about well enough to comment, whereas almost everyone feels qualified to talk about the event itself.

Lesson for crisis handlers: deep dive post-mortems that are timely, long, honestish, and highly technical are the most effective means of staunching the downward spiral of media attention. 

Amazon's Explanation of What Happened

Experiences from Specific Companies, Both Good and Bad

Amazon Web Services Discussion Forum 

A fascinating peek into the experiences of people who were dealing with the outage while they were experiencing it. Great real-time social archeology in action.

There were also many, many instances of support and help in the thread.

In Summary

Taking Sides: It's the Customer's Fault

Taking Sides: It's Amazon's Fault

Lessons Learned and Other Insight Articles

Vendor's Vent

Reader Comments (7)

Hmph. I'm not usually into this kind of self-promotion, but that sure doesn't seem to be disqualifying even more vapid articles from the list.

http://pl.atyp.us/wordpress/?p=3237 Amazon’s Outage (April 21)
http://pl.atyp.us/wordpress/?p=3242 More Fallout from the AWS Outage (April 27)
http://pl.atyp.us/wordpress/?p=3247 Amazon’s Own Post Mortem (April 29)

May 2, 2011 | Unregistered CommenterJeff Darcy

Thanks Jeff, that's what I was hoping for: to get some more coverage.

Here's a funny thing :) AWS outage explained: http://yfrog.com/hs9pwtp

May 2, 2011 | Unregistered CommenterAhmet Alp Balkan

In regards to Twilio's post:

We've been a slow adopter of EBS for core parts of our persistence infrastructure because it doesn't satisfy the "unit-of-failure is a single host" principle. If EBS were to experience a problem, all dependent services could also experience failures.

I couldn't get any answer about how they came to that conclusion about EBS. I'm also a bit confused because they say they also use SQS, which in my mind doesn't satisfy that principle either?

May 2, 2011 | Unregistered CommenterRob Olmos

@Rob

I've been running a few postgres clusters on EBS for three years now. We were already in the process of eliminating EBS from our critical infrastructure before this outage took out our primary database cluster.

EBS is a brittle service. Drives fail, they are slow to create, slow to restore, slow to attach, and slow to detach. Most importantly, their performance is horrifically inconsistent. One day they perform fine, the next day they don't (usually more of the latter).

Those are reasons we were already ditching EBS. We assumed that EBS was better partitioned, but this outage proves that assumption was wrong and bad on our part.

May 3, 2011 | Unregistered CommenterBryan Murphy
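
To put rough numbers on the provisioning slowness Bryan describes, here is a minimal sketch of the kind of timing check one might run. It assumes the boto Python library with AWS credentials already configured; the volume size, availability zone, instance ID, and device name are all placeholders, not real values.

    # A rough sketch: time how long an EBS volume takes to become usable.
    # Assumes boto is configured with AWS credentials; the size, zone,
    # instance ID, and device below are placeholders.
    import time
    import boto

    conn = boto.connect_ec2()

    def wait_until(done, timeout=600, interval=5):
        """Poll done() until it returns True, or give up after timeout seconds."""
        start = time.time()
        while time.time() - start < timeout:
            if done():
                return
            time.sleep(interval)
        raise RuntimeError('timed out after %d seconds' % timeout)

    # How long until a freshly created 10 GB volume is 'available'?
    t0 = time.time()
    vol = conn.create_volume(10, 'us-east-1a')
    def volume_available():
        vol.update()
        return vol.status == 'available'
    wait_until(volume_available)
    print('volume %s available after %.1f seconds' % (vol.id, time.time() - t0))

    # How long until it is attached to a (placeholder) instance?
    t1 = time.time()
    conn.attach_volume(vol.id, 'i-00000000', '/dev/sdf')
    def volume_attached():
        vol.update()
        return vol.attachment_state() == 'attached'
    wait_until(volume_attached)
    print('attached after %.1f seconds' % (time.time() - t1))

Run something like this a few times on different days and the spread in those two numbers is exactly the inconsistency being complained about above.
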

I think Amazon survived this with its rep intact less 'cause of what it did during/after than because of the huge amount of credit it'd built up up front. For years they've been providing really complicated services nobody can match, priced to let anybody get a toe in the water. Quora's "We'd point fingers, but we wouldn't be where we are without Amazon EC2" isn't entirely backhanded in this reading. Also, good sysadmins and Amazon are about preparing for failures and occasional catastrophes -- AZs, regions, S3 snapshots, etc. Downtime sucks, but it's not an existential crisis.

I sort of hope Amazon offers an alternative to EBS, like an instance type with a standardized direct-attached disk array or something. Failures are inherently contained, it's tested tech, and you can push the performance further. (You're never going to max out an SSD array over GigE.)

May 4, 2011 | Unregistered CommenterRandall
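
For context on that last parenthetical, the back-of-envelope arithmetic looks roughly like this; the drive throughput figures are rough 2011-era assumptions, not measurements.

    # Why a local SSD array outruns a 1 Gb/s network link.
    # Drive numbers are rough assumptions for 2011-era SATA SSDs.
    gige_mb_per_s = 1000 / 8.0 * 0.95     # ~119 MB/s of usable GigE bandwidth
    ssd_mb_per_s = 250.0                  # one SSD, sequential reads (assumed)
    array_mb_per_s = 4 * ssd_mb_per_s     # a modest 4-drive array (assumed)
    print('GigE ceiling: ~%d MB/s' % gige_mb_per_s)
    print('4-SSD array : ~%d MB/s (about %.0fx the network link)'
          % (array_mb_per_s, array_mb_per_s / gige_mb_per_s))
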

Great list. Adding one more lesson to it: enterprises must also design their operations management solutions to proactively and holistically detect and isolate the probable root causes of problems in cloud applications. This will ultimately drive the resiliency of your cloud services. See a new blog post on this at http://cloudopsmanagement.wordpress.com/2011/05/05/amazon-outage-reminded-proactive-monitoring/

May 5, 2011 | Unregistered CommenterHarry
