Thursday, Aug 02, 2012
Strategy: Use Spare Region Capacity to Survive Availability Zone Failures

In the wake of the recent Amazon problems, Ryan Lackey offers some practical first-responder cloud survival advice:
If you're a large site (particularly a PaaS) on AWS and care about availability, you need to have spare capacity in your region (using Reserved Instances, like Netflix does) to cover when a single AZ disappears, and your own external-to-AWS load balancing (not DNS based), with your own per-AZ subsidiary load balancers (nginx or whatever) running within EC2.
You need a robust database layer, ideally multi-region or AWS + non-AWS, but that's more site specific.
Going multiregion is the next step, and the above is an essential part of getting to that point.
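To make the layering concrete, here is a minimal sketch of the routing decision an external balancer would make: send each request to a per-AZ balancer, skipping any AZ currently marked unhealthy. This is an illustration under assumptions, not Ryan's setup: the `AZBackend` type, the `pick_backend` function, the example.com hostnames, and the round-robin choice are all hypothetical, and a real deployment would drive the `healthy` flag from active health checks.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class AZBackend:
    """One per-AZ subsidiary load balancer (hypothetical model)."""
    name: str             # availability zone label, e.g. "us-east-1a"
    address: str          # placeholder hostname of the per-AZ balancer
    healthy: bool = True  # assumed to be updated by an external health checker


def pick_backend(backends: List[AZBackend], request_number: int) -> AZBackend:
    """Round-robin over the AZs that are currently healthy."""
    live = [b for b in backends if b.healthy]
    if not live:
        raise RuntimeError("no healthy availability zone left in this region")
    return live[request_number % len(live)]


backends = [
    AZBackend("us-east-1a", "lb-a.example.com"),
    AZBackend("us-east-1b", "lb-b.example.com"),
    AZBackend("us-east-1c", "lb-c.example.com"),
]

# Simulate losing one AZ: traffic is spread over the remaining two.
backends[1].healthy = False
chosen = [pick_backend(backends, i).name for i in range(4)]
print(chosen)
```

Because the decision is made outside AWS and does not depend on DNS propagation, an AZ can be taken out of rotation as soon as the health flag flips.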
Reader Comments (3)
I would agree, but would extend this a bit. While having a pool of reserved instances within an AZ (or AZs) of a particular region is a good idea, it is also a very expensive configuration. But as with any insurance policy, you are paying for something that you hope you never need, so what cost is too high? That is for each organization to determine based on their own cost-benefit analysis. While we recommend this setup to our customers (full disclosure: I am an Architect in the PS group at RightScale), very few of them implement such an environment due to the cost involved. What we see more commonly is an array of application servers distributed across all AZs of a region such that a loss of an AZ results in a fraction (25% as an example in the case of US-East) of these servers being affected, with the plan that the remaining servers can pick up the load as they are not running at full capacity in normal operation. Although performance may be diminished, the application continues to run.
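The headroom arithmetic behind that plan can be sketched in a few lines. This is a rough sketch that assumes servers are spread evenly across AZs and that load redistributes evenly over the survivors; the function names `max_safe_utilization` and `servers_needed` are made up for illustration.

```python
import math


def max_safe_utilization(num_azs: int, azs_lost: int = 1) -> float:
    """Highest normal-operation utilization per server that still lets the
    surviving AZs absorb the full load (assumes an even spread)."""
    if not 0 <= azs_lost < num_azs:
        raise ValueError("azs_lost must be between 0 and num_azs - 1")
    return (num_azs - azs_lost) / num_azs


def servers_needed(peak_load_servers: int, num_azs: int, azs_lost: int = 1) -> int:
    """Total servers to provision so the survivors alone can carry a peak
    load that requires `peak_load_servers` fully loaded servers."""
    if not 0 <= azs_lost < num_azs:
        raise ValueError("azs_lost must be between 0 and num_azs - 1")
    surviving_azs = num_azs - azs_lost
    per_az = math.ceil(peak_load_servers / surviving_azs)
    return per_az * num_azs


# Four AZs, one lost: 25% of servers disappear, so each server should run
# at no more than 75% utilization in normal operation.
print(max_safe_utilization(4))   # 0.75
# A peak load worth 12 servers needs 4 per AZ, 16 in total.
print(servers_needed(12, 4))     # 16
```

If servers normally run hotter than the first number, losing an AZ means degraded performance rather than a transparent failover, which is the trade-off described above.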
At the database tier we are seeing more and more customers running a “warm DR” scenario with a replicating slave database (or node of a replica set) running in a separate region with the rest of the infrastructure in a non-operational, but “ready to be launched” state. While this replication traffic needs to be secured and is more costly due to public Internet rates, the price of this insurance policy is one that many customers are willing to incur.
There should be particular emphasis on the non-AWS based load balancing because in the recent outages, API access was also affected. This prevents not only launching new instances but also reconfiguring systems like elastic load balancers and elastic IPs. However, non-DNS based implementations of this will be challenging because that implies out of AWS latency at the edge load balancer, which could be significant depending on the location.
"However, non-DNS based implementations of this will be challenging because that implies out of AWS latency at the edge load balancer, which could be significant depending on the location."
Y..yeah. That's what threw me for a loop in that description. No silver bullet, I suppose.