Like the Spanish Inquisition, nobody expects cascading failures. Here's how Google handles them.
This excerpt is a particularly interesting and comprehensive chapter—Chapter 22 - Addressing Cascading Failures—from Google's awesome book on Site Reliability Engineering. Worth reading if it hasn't been on your radar. And it's free!
Written by Mike Ulrich
"If at first you don't succeed, back off exponentially."
Dan Sandler, Google Software Engineer
"Why do people always forget that you need to add a little jitter?"
Ade Oshineye, Google Developer Advocate
A cascading failure is a failure that grows over time as a result of positive feedback.[107] It can occur when a portion of an overall system fails, increasing the probability that other portions of the system fail. For example, a single replica for a service can fail due to overload, increasing load on remaining replicas and increasing their probability of failing, causing a domino effect that takes down all the replicas for a service.
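To make the domino effect concrete, here is a minimal sketch of a toy model, not anything taken from the book. It assumes load is spread evenly across healthy replicas, that every replica has the same fixed capacity, and that an overloaded replica simply crashes. The function name `simulate_cascade` and all of the numbers are hypothetical, chosen only for illustration.

```python
def simulate_cascade(total_qps, replicas, capacity_qps, initial_failures=1):
    """Toy model of a cascading failure (illustrative assumptions only).

    Load is split evenly across healthy replicas. Whenever the per-replica
    load exceeds capacity, one more replica crashes and its share is
    redistributed to the survivors. Returns the healthy-replica count
    after the initial failure and after each subsequent crash.
    """
    healthy = replicas - initial_failures
    history = [healthy]
    # Positive feedback: each crash raises the load on the remaining replicas.
    while healthy > 0 and total_qps / healthy > capacity_qps:
        healthy -= 1
        history.append(healthy)
    return history


if __name__ == "__main__":
    # 5 replicas at 250 QPS each, serving 1100 QPS: fine until one is lost.
    print(simulate_cascade(total_qps=1100, replicas=5, capacity_qps=250))
    # -> [4, 3, 2, 1, 0]  (every replica eventually fails)

    # Same fleet at 900 QPS: the four survivors can absorb the load.
    print(simulate_cascade(total_qps=900, replicas=5, capacity_qps=250))
    # -> [4]  (the failure does not spread)
```

In this model the only difference between a contained failure and a total outage is how much spare capacity the surviving replicas have when the first one is lost.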
We’ll use the Shakespeare search service discussed in Shakespeare: A Sample Service as an example throughout this chapter. Its production configuration might look something like Figure 22-1.

Figure 22-1. Example production configuration for the Shakespeare search service
Causes of Cascading Failures and Designing to Avoid Them