The email sent will contain a link to this article, the article title, and an article excerpt (if available). For security reasons, your IP address will also be included in the sent email.

For a long time now stateless services have been the royal road to scalability. Nearly every treatise on scalability declares statelessness as the best practices approved method for building scalable systems. A stateless architecture is easy to scale horizontally and only requires simple round-robin load balancing.
What’s not to love? Perhaps the increased latency from the roundtrips to the database. Or maybe the complexity of the caching layer required to hide database latency problems. Or even the troublesome consistency issues.
But what of stateful services? Isn’t preserving identity by shipping functions to data instead of shipping data to functions a better approach? It often is, but we don’t hear much about how to build stateful services. In fact, do a search and there’s very little in the way of a systematic approach to building stateful services. Wikipedia doesn’t even have an entry for stateful service.
Caitie McCaffrey, Tech Lead for Observability at Twitter, is fixing all that with a refreshing talk she gave at the Strange Loop conference on Building Scalable Stateful Services (slides).
Refreshing because I’ve never quite heard of building stateful services in the way Caitie talks about building them. You’ll recognize most of the ideas--Sticky Sessions, Data Shipping Paradigm, Function Shipping Paradigm, Data Locality, CAP, Cluster Membership, Gossip Protocols, Consistent Hashing, DHT---but she weaves them around the theme of building stateful services in a most compelling way.
The highlight of the talk for me is when Caitie ties the whole talk together around the discussion of her experiences developing Halo 4 using Microsoft’s Orleans on top of Azure. Orleans doesn’t get enough coverage. It’s based on an inherently stateful distributed virtual Actor model; a highly available Gossip Protocol is used for cluster membership; and a two tier system of Consistent Hashing plus a Distributed Hash Table is used for work distribution. With this approach Orleans can rebalance a cluster when a node fails, or capacity is added/contracted, or a node becomes hot. The result is Halo was able to run a stateful Orleans cluster in production at 90-95% CPU utilization across the cluster.
Orleans isn't the only example system covered. Facebook's Scuba and Uber's Ringpop are also analyzed using Caitie's stateful architecture framework. There's also a very interesting section on how Facebook cleverly implements fast database restarts for large in-memory databases by decoupling the memory lifetime from the process lifetime.
So let’s jump in and learn how to build stateful services...
Stateless Services are Wasteful