The Updated Big List of Articles on the Amazon Outage
Since The Big List Of Articles On The Amazon Outage was published we've had a few updates that people might not have seen. Amazon of course released their Summary of the Amazon EC2 and Amazon RDS Service Disruption in the US East Region. Netflix shared their Lessons Learned from the AWS Outage, as did Heroku (How Heroku Survived the Amazon Outage), SmugMug (How SmugMug survived the Amazonpocalypse), and SimpleGeo (How SimpleGeo Stayed Up During the AWS Downtime).
The curious thing from my perspective is the general lack of response to Amazon's explanation. I expected more discussion, but there's been almost none that I've seen. My guess is that very few people understand what Amazon was talking about well enough to comment, whereas almost everyone feels qualified to talk about the event itself.
Lesson for crisis handlers: deep-dive post-mortems that are timely, long, honest-ish, and highly technical are the most effective way to stanch the downward spiral of media attention.
Amazon's Explanation of What Happened
- Summary of the Amazon EC2 and Amazon RDS Service Disruption in the US East Region
- Hacker News thread on AWS Service Disruption Post Mortem
- Quite Funny Commentary on the Summary
- AWS outage follow-up: if you wanted details, you got details! by RightScale
- Amazon’s Own Post Mortem by Jeff Darcy
Experiences from Specific Companies, Both Good and Bad
- Lessons Netflix Learned from the AWS Outage by several Netflixians on the Netflix Tech Blog
- How Heroku Survived the Amazon Outage on the Heroku status page
- How SimpleGeo Stayed Up During the AWS Downtime by Mike Malone
- How SmugMug survived the Amazonpocalypse by Don MacAskill (Hacker News discussion)
- How Bizo survived the Great AWS Outage of 2011 relatively unscathed... by Someone at Bizo
- Joe Stump's explanation of how SimpleGeo survived
- How Netflix Survived the Outage
- Why Twilio Wasn’t Affected by Today’s AWS Issues on Twilio Engineering's Blog (Hacker News thread)
- On reddit's outage
- What caused the Quora problems/outage in April 2011?
- Availability, redundancy, failover and data backups at LearnBoost
- How our small startup survived the Amazon EC2 Cloud-pocalypse by a mobile app developer
- Recovering from Amazon cloud outage by Drew Engelson of PBS (a sketch of his cross-region move follows this list). From his comment:
- "PBS was affected for a while primarily because we do use EBS-backed RDS databases. Despite being spread across multiple availability zones, we weren't easily able to launch new resources ANYWHERE in the East region since everyone else was trying to do the same. I ended up pushing the RDS stuff out West for the time being."
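For anyone curious what "pushing the RDS stuff out West" actually involves, below is a minimal sketch using the modern boto3 SDK (which did not exist in 2011, and this is not PBS's actual tooling). The snapshot ARN, identifiers, and instance class are hypothetical placeholders: copy the latest snapshot into a second region, wait for the copy, then restore an instance from it.

```python
# Hypothetical sketch of a cross-region RDS move with boto3; all identifiers
# are placeholders, and this is not PBS's actual procedure.
import boto3

SOURCE_REGION = "us-east-1"
TARGET_REGION = "us-west-2"
SNAPSHOT_ARN = "arn:aws:rds:us-east-1:123456789012:snapshot:mydb-latest"

rds_west = boto3.client("rds", region_name=TARGET_REGION)

# Copy the snapshot into the target region (the call is made against the
# target region and references the source snapshot by ARN).
rds_west.copy_db_snapshot(
    SourceDBSnapshotIdentifier=SNAPSHOT_ARN,
    TargetDBSnapshotIdentifier="mydb-latest-west",
    SourceRegion=SOURCE_REGION,
)

# Block until the copied snapshot is available.
waiter = rds_west.get_waiter("db_snapshot_available")
waiter.wait(DBSnapshotIdentifier="mydb-latest-west")

# Restore a fresh instance from the copy in the healthy region.
rds_west.restore_db_instance_from_db_snapshot(
    DBInstanceIdentifier="mydb-west",
    DBSnapshotIdentifier="mydb-latest-west",
    DBInstanceClass="db.m5.large",
    MultiAZ=True,
)
```

The obvious caveat is that a snapshot restore only recovers data as of the last snapshot, so the recovery point depends entirely on how often you snapshot.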
Amazon Web Services Discussion Forum
A fascinating peek into the experiences of people who were dealing with the outage while they were experiencing it. Great real-time social archeology in action.
- Amazon Web Services Discussion Forum
- Cost-effective backup plan from now on?
- Life of our patients is at stake - I am desperately asking you to contact
- Why did the EBS, RDS, Cloudformation, Cloudwatch and Beanstalk all fail?
- Moved all resources off of AWS
- Any success stories?
- Is the mass exodus from East going to cause demand problems in the West?
- Finally back online after about 71 hours
- Amazon EC2 features vs windows azure
- Aren't Availability Zones supposed to be "insulated from failures"?
- What a lot of people aren't realizing about the downtime:
- ELB CNAME
- Availability Zones were used in a misleading manner
- Tip: How to recover your instance
- Crying in Forum Gets Results, Silver-level AWS Premium Support Doesn't
- Well-worth reading: "design for failure" cloud deployment strategy
- New best practice
- Don't bother with Premium Support
- Best practices for multi-region redundancy
- "Postmortum"
- Learning from this case
- Amazon, still no instructions what to do?
- Anyone else prepared for an all-nighter?
- Is Jeff Bezos going to give a public statement?
- Rackspace, GoGrid, StormonDemand and Others
- Jeff Barr, Werner Vogels and other AWS persons - where have you been???
- After you guys fix EBS do I have do anything on my side?
- Need Help!!! Lives of people and billions in revenue are at risk now!!!
- I've Got A Suspicion
- Farewell EC2, Farewell
There were also many, many instances of support and help offered in the forum threads.
In Summary
- Amazon EC2 outage: summary and lessons learned by RightScale
- AWS outage timeline & downtimes by recovery strategy by Eric Kidd
- The Aftermath of Amazon’s Cloud Outage by Rich Miller
Taking Sides: It's the Customer's Fault
- So Your AWS-based Application is Down? Don’t Blame Amazon by The Storage Architect
- The Cloud is not a Silver Bullet by Joe Stump (Hacker News thread)
- The AWS Outage: The Cloud's Shining Moment by George Reese (Hacker News discussion)
- Failing to Plan is Planning to Fail by Ted Theodoropoulos
- Get a life and build redundancy/resiliency in your apps on the Cloud Computing group
Taking Sides: It's Amazon's Fault
- Stop Blaming the Customers - the Fault is on Amazon Web Services by Klint Finley
- AWS is down: Why the sky is falling by Justin Santa Barbara (Hacker News thread)
- Amazon Web Services are down - Huge Hacker News thread
- The EC2/EBS outage: What Amazon didn’t tell you by Jeremy Gaddis
Lessons Learned and Other Insight Articles
- Amazon’s EBS outage by Robin Harris of StorageMojo
- People Using Amazon Cloud: Get Some Cheap Insurance At Least by Bob Warfield
- Basic scalability principles to avert downtime by Ronald Bradford
- Amazon crash reveals 'cloud' computing actually based on data centers by Kevin Fogarty
- Seven lessons to learn from Amazon's outage By Phil Wainewright
- The Cloud and Outages : Five Key Lessons by Patrick Baillie (Cloud Computing Group discussion)
- Some thoughts on outages by Till Klampaeckel
- Amazon.com’s real problem isn’t the outage, it’s the communication by Keith Smith
- How to work around Amazon EC2 outages by James Cohen (Hacker News thread)
- Today’s EC2 / EBS Outage: Lessons learned on Agile Sysadmin
- Amazon EC2 has gone down - what would a preferred hosting platform be? on Focus
- Single Points of Failure by Mat
- Coping with Cloud Downtime with Puppet
- Amazon Outage Concerns Are Overblown by Tim Crawford
- Where There Are Clouds, It Sometimes Rains by Clay Loveless
- Availability, redundancy, failover and data backups at LearnBoost by Guillermo Rauch
- Cloud hosting vs colocation by Chris Chandler (Hacker News thread)
- Amazon’s EC2 & EBS outage by Arnon Rotem-Gal-Oz
- Complex Systems Have Complex Failures. That’s Cloud Computing by Greg Ferro
- Amazon Web Services, Hosting in the Cloud and Configuration Management by Ian Chilton
- Lessons learned from deploying a production database in EC2 by Grig Gheorghiu of Agile Testing
- Bezos on Amazon as a technology and invention company by John Gruber on Daring Fireball.
- On Importance of Planning for Failure by Dmitriy Samovskiy
Vendor's Vent
- Amazon Outage Proves Value of Riak’s Vision by Basho
- Magical Block Store: When Abstractions Fail Us by Mark Mayo of Joyent (Hacker News discussion)
- On Cascading Failures and Amazon’s Elastic Block Store by Jason
- An unofficial EC2 outage postmortem - the sky is not falling from CloudHarmony
- Cloudfail: Lessons Learned from AWS Outage by Jyoti Bansal
Reader Comments
Hmph. I'm not usually into this kind of self-promotion, but that sure doesn't seem to be disqualifying even more vapid articles from the list.
http://pl.atyp.us/wordpress/?p=3237 Amazon’s Outage (April 21)
http://pl.atyp.us/wordpress/?p=3242 More Fallout from the AWS Outage (April 27)
http://pl.atyp.us/wordpress/?p=3247 Amazon’s Own Post Mortem (April 29)
Thanks Jeff, that's what I was hoping for: to get some more coverage.
Here's a funny thing (: AWS outage explained http://yfrog.com/hs9pwtp.
In regards to Twilio's post:
I couldn't get any answer about how they came to that conclusion about EBS. I'm also a bit confused because they claim they also use SQS, which in my mind doesn't satisfy the principle either.
@Rob
I've been running a few postgres clusters on EBS for three years now. We were already in the process of eliminating EBS from our critical infrastructure before this outage took out our primary database cluster.
EBS is a brittle service. Drives fail, they are slow to create, slow to restore, slow to attach, and slow to detach. Most importantly, their performance is horrifically inconsistent. One day they perform fine, the next day they don't (usually more of the latter).
Those are the reasons we were already ditching EBS. We had assumed that EBS was better partitioned; this outage proved that assumption wrong on our part.
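For what it's worth, the kind of inconsistency described above is easy to observe yourself. Here is a generic probe (my sketch, not the commenter's methodology) that times a series of synced 4 KB writes against a file on the volume in question; the mount point is a placeholder:

```python
# Crude latency probe: time small synced writes to a file on the volume
# under test and report a few percentiles. Generic sketch, placeholder path.
import os
import time

PATH = "/mnt/ebs-volume/latency-probe.bin"   # hypothetical mount point
BLOCK = b"\0" * 4096
samples = []

with open(PATH, "wb") as f:
    for _ in range(1000):
        start = time.perf_counter()
        f.write(BLOCK)
        f.flush()
        os.fsync(f.fileno())                 # force the write to the device
        samples.append(time.perf_counter() - start)

samples.sort()
print(f"p50: {samples[len(samples) // 2] * 1000:.2f} ms")
print(f"p99: {samples[int(len(samples) * 0.99)] * 1000:.2f} ms")
print(f"max: {samples[-1] * 1000:.2f} ms")
```

Run it a few times over the course of a day and the spread between p50 and max tells the story.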
I think Amazon survived this with its reputation intact less because of what it did during and after the outage than because of the huge amount of credit it had built up beforehand. For years they've been providing really complicated services nobody can match, priced to let anybody get a toe in the water. Quora's "We'd point fingers, but we wouldn't be where we are without Amazon EC2" isn't entirely backhanded in this reading. Also, good sysadmins and Amazon are both in the business of preparing for failures and occasional catastrophes: AZs, regions, S3 snapshots, etc. Downtime sucks, but it's not an existential crisis.
I sort of hope Amazon offers an alternative to EBS, like an instance type with a standardized direct-attached disk array or something. Failures are inherently contained, it's tested tech, and you can push the performance further. (You're never going to max out an SSD array over GigE.)
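The parenthetical bandwidth claim is easy to sanity-check. With rough, assumed numbers (roughly 250 MB/s per 2011-era SATA SSD and a small four-drive array, neither figure from the commenter), a single gigabit link is the bottleneck by a wide margin:

```python
# Back-of-the-envelope check of the GigE vs. SSD-array claim above.
# The drive throughput and array size are assumptions, not measured values.
gige_bytes_per_sec = 1e9 / 8        # 1 Gbit/s is about 125 MB/s
ssd_seq_read = 250e6                # assume ~250 MB/s per 2011-era SATA SSD
array_size = 4                      # assume a small 4-drive array

array_throughput = ssd_seq_read * array_size
print(f"GigE link:  {gige_bytes_per_sec / 1e6:.0f} MB/s")
print(f"SSD array:  {array_throughput / 1e6:.0f} MB/s")
print(f"The network leaves ~{array_throughput / gige_bytes_per_sec:.0f}x on the table")
```

In other words, a local array delivers several times what the wire can carry, which is the commenter's point about direct-attached storage containing failures while also winning on performance.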
Great list. Adding one more lesson to it: enterprises must also design their operations management so they can proactively and holistically detect and isolate the probable root cause of problems in their cloud applications. This will ultimately drive the resiliency of your cloud services. See a new blog post on this at http://cloudopsmanagement.wordpress.com/2011/05/05/amazon-outage-reminded-proactive-monitoring/