Facebook and Site Failures Caused by Complex, Weakly Interacting, Layered Systems
Thursday, September 30, 2010 at 6:37AM
HighScalability Team in Strategy, admin, facebook, postmortem

Facebook has been so reliable that when a site outage does occur it's a definite learning opportunity. Fortunately for us we can learn something because in More Details on Today's Outage, Facebook's Robert Johnson gave a pretty candid explanation of what caused a rare 2.5 hour period of down time for Facebook. It wasn't a simple problem. The root causes were feedback loops and transient spikes caused ultimately by the complexity of weakly interacting layers in modern systems. You know, the kind everyone is building these days. Problems like this are notoriously hard to fix and finding a real solution may send Facebook back to the whiteboard. There's a technical debt that must be paid. 

The outline and my interpretation (reading between the lines) of what happened is:

This kind of thing happens in complex systems as abstractions leak all over each other at the most inopportune times. So the typical Internet reply to every failure of "how could they be so stupid, that would never happen to someone as smart as me", doesn't really apply. Complexity kills. Always.

Based on nothing but pure conjecture, what are some of the key issues here for system designers?

There are a lot of really interesting system design issues here. To what degree you want to handle these type of issues depends a lot on your SLAs, resources, and so on. But as systems grow in complexity there's a lot more infrastructure code that needs to be written to keep it all working together without killing each other. Much like a family :-)

Article originally appeared on (http://highscalability.com/).
See website for complete article licensing information.