
Recommend Google: Addressing Cascading Failures (Email)


Article Excerpt:

Like the Spanish Inquisition, nobody expects cascading failures. Here's how Google handles them.

This excerpt is a particularly interesting and comprehensive chapter—Chapter 22 - Addressing Cascading Failures—from Google's awesome book on Site Reliability Engineering. Worth reading if it hasn't been on your radar. And it's free!

Written by Mike Ulrich

"If at first you don't succeed, back off exponentially."

Dan Sandler, Google Software Engineer

"Why do people always forget that you need to add a little jitter?"

Ade Oshineye, Google Developer Advocate
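The two quotes above point at a common client-side retry pattern: exponential backoff with jitter. The following is a minimal sketch of that idea, not code from the chapter. The function name `backoff_delays`, its parameters, and the default values are assumptions made for illustration, and the "full jitter" variant (a uniformly random delay between zero and the exponential ceiling) is only one of several ways to randomize.

```python
import random


def backoff_delays(max_retries, base=0.1, cap=10.0, rng=random.random):
    """Return a list of sleep durations (seconds), one per retry attempt.

    Hypothetical helper for illustration only.
    - The ceiling doubles on every attempt (exponential backoff), up to `cap`.
    - Each delay is drawn uniformly from [0, ceiling) ("full jitter"), so
      clients that failed at the same moment do not all retry in lockstep.
    `rng` must return a float in [0, 1); it is injectable to make tests
    deterministic.
    """
    delays = []
    for attempt in range(max_retries):
        ceiling = min(cap, base * 2 ** attempt)
        delays.append(rng() * ceiling)
    return delays


if __name__ == "__main__":
    for attempt, delay in enumerate(backoff_delays(5)):
        print(f"retry {attempt}: sleep {delay:.3f}s")
```

Without the random factor, every client that saw the same failure would retry at the same instants, which can re-create the load spike that caused the failure in the first place.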

A cascading failure is a failure that grows over time as a result of positive feedback.¹⁰⁷ It can occur when a portion of an overall system fails, increasing the probability that other portions of the system fail. For example, a single replica for a service can fail due to overload, increasing load on remaining replicas and increasing their probability of failing, causing a domino effect that takes down all the replicas for a service.
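The domino effect described above can be illustrated with a toy model. This is a rough sketch under simplifying assumptions that are not in the chapter: total load stays constant, it is spread evenly across the surviving replicas, and one replica crashes per step whenever the per-replica load exceeds its capacity. The name `simulate_cascade` and all numbers are hypothetical.

```python
def simulate_cascade(total_qps, replicas, capacity_qps, initial_failures=1):
    """Return the number of surviving replicas after each step.

    Toy model (assumptions, not the chapter's method):
    - `total_qps` is constant and split evenly across surviving replicas.
    - A replica crashes when its share of the load exceeds `capacity_qps`.
    - One replica crashes per step, which shifts its load onto the rest.
    """
    alive = replicas - initial_failures
    history = [alive]
    while alive > 0 and total_qps / alive > capacity_qps:
        alive -= 1  # one more overloaded replica falls over
        history.append(alive)
    return history


if __name__ == "__main__":
    # 5 replicas at 200 QPS each are healthy with 220 QPS of capacity,
    # but losing one pushes the others to 250 QPS and the failure spreads.
    print(simulate_cascade(1000, 5, 220))  # [4, 3, 2, 1, 0]
    # With more headroom, the same single failure is absorbed.
    print(simulate_cascade(1000, 5, 300))  # [4]
```

The positive feedback is visible in the loop condition: every crash raises the per-replica load for the survivors, which makes the next crash more likely rather than less.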

We’ll use the Shakespeare search service discussed in Shakespeare: A Sample Service as an example throughout this chapter. Its production configuration might look something like Figure 22-1.

Figure 22-1. Example production configuration for the Shakespeare search service

Causes of Cascading Failures and Designing to Avoid Them
