Monday, August 15, 2011

Should any cloud be considered one availability zone? The Amazon experience says yes.

Amazon has a very well written account of their 8/8/2011 downtime: Summary of the Amazon EC2, Amazon EBS, and Amazon RDS Service Event in the EU West Region. Power failed, backup generators failed to kick in, there weren't enough resources for EBS volumes to recover, API servers were overwhelmed, a DNS failure caused failovers to alternate availability zones to fail, and a double fault occurred as the power event interrupted the repair of a different bug. All kinds of typical stuff that just seems to happen.

Considering the previous outage, the big question for programmers is: what does this mean? What does it mean for how systems should be structured? Have we learned something that can't be unlearned?

The Amazon post has lots of good insights into how EBS and RDS work, plus lessons learned. The short of the problem is large + complex = high probability of failure. The immediate fixes are adding more resources, more redundancy, more isolation between components, and more automation; reducing recovery times; and building software that is more aware of large-scale failure modes. All good, solid, professional responses. Which is why Amazon has earned a lot of trust.

We can predict, however, problems like this will continue to happen, not because of any incompetence by Amazon, but because: large + complex make cascading failure an inherent characteristic of the system. At some level of complexity any cloud/region/datacenter could be reasonably considered a single failure domain and should be treated accordingly, regardless of the heroic software infrastructure created to carve out availability zones.

Viewing a region as a single point of failure implies that to be really safe you would need to be in multiple regions, which is to say multiple locations. Diversity as mother nature's means of robustness suggests that using different providers is a good strategy. A lot of people have been saying this for a while, but with more evidence coming in, that conclusion is even stronger now. We can't have our cake and eat it too.
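As a rough sketch of what that means in practice, imagine a small watchdog that flips DNS to a standby stack at a second provider when the primary stops answering health checks. The health URL, the standby IP, and the update_dns_record stub are placeholders for whatever your DNS provider's API actually looks like; this is a sketch of the idea, not a recipe:

import time
import urllib.request

PRIMARY_HEALTH_URL = "https://app.primary-provider.example/health"
STANDBY_IP = "203.0.113.10"          # warm standby at a second provider
FAILURES_BEFORE_FAILOVER = 3

def healthy(url, timeout=5):
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        return False

def update_dns_record(name, ip):
    # Placeholder: call your DNS provider's API here.
    print("pointing %s at %s" % (name, ip))

def watchdog():
    failures = 0
    while True:
        if healthy(PRIMARY_HEALTH_URL):
            failures = 0
        else:
            failures += 1
            if failures >= FAILURES_BEFORE_FAILOVER:
                update_dns_record("www.example.com", STANDBY_IP)
                return
        time.sleep(30)

if __name__ == "__main__":
    watchdog()

Even the toy version shows where the real cost lives: the standby stack has to exist, stay patched, and hold a current enough copy of the data for the failover to be worth anything.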

For most projects this conclusion doesn't really matter all that much. 100% uptime is extremely expensive and Amazon will usually keep your infrastructure up and working. Most of the time multiple Availability Zones are all you need. And you can always say, hey, we're on Amazon, what can I do? It's the IBM defense.

All this diversity of course is very expensive and very complicated. Double the budget. Double the complexity. The problem of synchronizing data across datacenters. The problem of failing over and recovering properly. The problem of multiple APIs. And so on.
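The multiple-API problem, at least, can be contained by hiding each provider behind a single small interface of your own, so the deployment logic doesn't care which cloud is underneath. A minimal sketch, with the provider classes as illustrative stand-ins rather than real SDK calls:

from abc import ABC, abstractmethod

class ComputeProvider(ABC):
    """The only surface the rest of the system is allowed to touch."""

    @abstractmethod
    def launch_instance(self, image: str, size: str) -> str:
        """Start a VM and return its provider-specific id."""

    @abstractmethod
    def terminate_instance(self, instance_id: str) -> None:
        """Stop and release the VM."""

class ProviderA(ComputeProvider):
    def launch_instance(self, image, size):
        # provider A's SDK call would go here
        return "a-12345"

    def terminate_instance(self, instance_id):
        pass

class ProviderB(ComputeProvider):
    def launch_instance(self, image, size):
        # provider B's SDK call would go here
        return "b-67890"

    def terminate_instance(self, instance_id):
        pass

def deploy(provider: ComputeProvider) -> str:
    # Same deployment logic regardless of which cloud is underneath.
    return provider.launch_instance(image="app-v42", size="medium")

It doesn't make the data synchronization or failover problems go away, but it keeps the provider-specific code in one thin layer instead of smeared through the whole system.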

Another option is a retreat into radical simplicity. Complexity provides a lot of value, but it also creates fragility. Is there a way to become radically simpler?


Reader Comments (5)

This is why I think that the real value of the cloud paradigm is in the notion of services and APIs, not the specifics of the implementation or so-called elasticity or scalability or cost savings. The Amazon API has become the de facto standard, because they were first, and it's good enough. As time passes, we'll see it be adopted much more broadly. Eucalyptus stands a bit alone in their field right now, but that'll become a crowded market in the not-too-distant future.

August 15, 2011 | Unregistered CommenterBaron Schwartz

I recently presented an HA deployment to a big client for their website. It had no SPOFs down to the level of data center providers. That is to say I had designed an active-passive deployment scenario w/ async replication of data and simultaneous deployment of all assets to keep things reasonably in sync and provide a reasonable RPO and RTO. There were even multiple CDNs in this case w/ dynamic DNS for management of that and other related issues.

Client took a look, evaluated, made some calls, realized they could save some money each year by going w/ one data center provider and forced my hand on that issue.

I also explained that one of the reasons was price negotiating since they had a preference for fixed contracts over purely on-demand. When it only takes a couple of hours to be up and running on another provider and you can turn off one or the other DC's w/ little business impact it puts you in a more reasonable negotiating position with your infrastructure provider if they start monkeying around with pricing structures in an unfair way. Remember on demand can often mean less control over pricing changes.

They just could not really agree with me that it was important and worth the money to have two DC providers, so it's deployed in one with the rest of the details still intact, at least.

I'll have to send them a link to this post....

August 15, 2011 | Unregistered CommenterKent Langley

Todd,
Great post. I'm in full agreement with your position. We're incrementally moving the intelligence related to scaling & availability out of the individual platform and into the cloud fabric. As you noted, this has a huge impact. The upside is that we can quickly make just about any platform/application highly resilient with little effort/cost. The downside is what Noah indicates: "The greater the degree of heterogeneity in an interactive system, the more resistant it is to collapse." Conversely, a largely homogeneous cloud fabric will, imho, face a greater degree of cascading failure.

Baron's comments are spot-on. The only thing I'd add is that we need to make sure that behind our 'as a service' interfaces, we don't overly rely on homogeneous components. I'll suggest that alternative implementations behind the aaS interface will decrease the chance of cascading collapse (e.g., infectious algorithms spilling across availability zones, etc.). We don't necessarily need to go to multiple providers to get diversification of implementation. As 'collapse dynamics' enters the mainstream, best of breed providers might be forced to embrace the 'single interface + diverse implementation' strategy.
Jeff

August 16, 2011 | Unregistered CommenterJeff Schneider

As you discussed, and as recent events have borne out, the availability zone segmentation is a great idea and in practice it almost always works. The problem is the *almost* part. As you mention, a region can be viewed as a single point of failure, and as such it is a necessity to have a DR environment in another region or with another infrastructure provider. While a true multi-region architecture looks good on paper, in practice it is complicated by bandwidth costs, latency between regions, and security issues (once your bits and bytes leave the region, they are in the Wild West of the public Internet). It is for this reason that a lot of customers we talk to come into a conversation convinced they need a multi-region/cloud environment, but usually leave with a plan to make the best use of availability zones and have a solid DR solution in place.
Your comment on the complexity of multiple APIs is true, and while the Amazon API may be seen as the de facto standard as Baron points out, it is currently far from being universally accepted, as seen by Cloud.com, OpenStack, Rackspace, et al. As such, a cloud management platform that insulates the end user from having to deal with the complexities of the underlying APIs and instead presents a “single pane of glass” to allow them to manage all of their infrastructure, regardless of which cloud it resides in or which vendor provides that cloud, can prove invaluable.

August 19, 2011 | Unregistered CommenterBrian Adler

> "Is there way to become radically simpler?"

Break down the IaaS offerings to just that, at their lowest common denominator.

A host.

You get a host and work (build) from there.

Building solely with any non-redundant feature sets of an IaaS provider is building for failure. AWS says it themselves. AWS EC2 has become a PaaS; it long ago left the realm of IaaS. Vendor tie-in, anyone?

And ensure that if you do build something custom with vendor feature tie-in, you have a live copy of your data and customised setup at providerB mimicking that specific vendor tie-in feature. "The network is the only source of truth" and the data is the truth.
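Something as crude as shipping periodic dumps to providerB already gets you a long way. A minimal sketch, assuming a Postgres-style dump pushed over SSH, with hosts, paths and the dump command as placeholders for whatever your stack actually uses:

import datetime
import subprocess

PROVIDER_B_HOST = "standby.provider-b.example"
DUMP_CMD = ["pg_dump", "--format=custom", "appdb"]

def ship_snapshot():
    stamp = datetime.datetime.utcnow().strftime("%Y%m%dT%H%M%SZ")
    local_path = "/var/backups/appdb-%s.dump" % stamp
    with open(local_path, "wb") as out:
        subprocess.run(DUMP_CMD, stdout=out, check=True)   # dump the primary
    # copy to the standby provider; restoring and verifying it is a separate job there
    subprocess.run(["scp", local_path,
                    "backup@%s:/srv/backups/" % PROVIDER_B_HOST], check=True)

if __name__ == "__main__":
    ship_snapshot()

Not pretty, but it means that on the day providerA has its bad day, the truth still exists somewhere else.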

'one vendor' == 'spf'

Hopefully 2013 will be the year of zpf thinking.

People forget how much hassle it was to just change ISP on a single host/user back in the day. The crowd today has not really figured that one out yet. Is it going to be painful??? Refactor that "infrastructure as code" a.k.a. "migration" only in code with dependencies this time....

Have we not been down these routes many times before?

What is the next URL of the page with the roll call of all those that are DOWN because of the latest in-progress ec2 failure going to be? I forget the last 2 I saw. But hey #devops ec2 heroku go go go...

It is hard and complex, but evolution does tend to favour complexity over simplicity. Trust me, breaking it down to just the lowest common denominator, the host, makes that route complex as well, but stability favours complexity so... I am here now, no turning back, and hey, it's 2013 - zero points of failure, hybrid clouds/physical infrastructures, raindrops - distributed risk.

> "but because: large + complex make cascading failure an inherent characteristic of the system"

The evolution in our universe disagrees - complexity and seeming chaos produce long-term stability, but every system is destined to fail.

Failure is an inherent characteristic of all systems.

February 19, 2013 | Unregistered CommenterGary Wilson (@earthgeko)
