Paper: Dapper, Google's Large-Scale Distributed Systems Tracing Infrastructure
Tuesday, April 27, 2010 at 7:25AM
HighScalability Team in Paper, logging

Imagine a single search request coursing through Google's massive infrastructure. A single request can run across thousands of machines and involve hundreds of different subsystems. And oh by the way, you are processing more requests per second than any other system in the world. How do you debug such a system? How do you figure out where the problems are? How do you determine if programmers are coding correctly? How do you keep sensitive data secret and safe? How do you ensure products don't use more resources than they are assigned? How do you store all the data? How do you make use of it?

That's where Dapper comes in. Dapper is Google's tracing system, and it was originally created to understand the system behaviour of a search request. Now Google's production clusters generate more than 1 terabyte of sampled trace data per day. So how does Dapper do what Dapper does?

Dapper is described in a very well written and intricately detailed paper: Dapper, a Large-Scale Distributed Systems Tracing Infrastructure by Benjamin H. Sigelman, Luiz Andre Barroso, Mike Burrows, Pat Stephenson, Manoj Plakal, Donald Beaver, Saul Jaspan, and Chandan Shanbhag. The description of Dapper from The Datacenter as a Computer: An Introduction to the Design of Warehouse-Scale Machines is:

The Dapper system, developed at Google, is an example of an annotation-based tracing tool that remains effectively transparent to application-level software by instrumenting a few key modules that are commonly linked with all applications, such as messaging, control flow, and threading libraries. Finally, it is extremely useful to build the ability into binaries (or run-time systems) to obtain CPU, memory, and lock contention profiles of in-production programs. This can eliminate the need to redeploy new binaries to investigate performance problems.

The full paper is worth a full read and a re-read, but we'll just cover some of the highlights:

As you might expect, Google has produced an elegant and well thought out tracing system. In many ways it is similar to other tracing systems, but it has that unique Google twist. A tree structure, probabilistically unique keys, sampling, emphasising common infrastructure insertion points, technically minded data exploration tools, a global system perspective, MapReduce integration, sensitivity to index size, enforcement of system wide invariants, an open API—all seem very Googlish.
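The first three items in that list fit together neatly, and a small sketch may make them concrete (class and field names here are illustrative, not Dapper's): each unit of work is a span in a tree, IDs are probabilistically unique random 64-bit integers rather than centrally assigned, and the keep-or-drop sampling decision is made once at the root and inherited by every child so a trace is collected whole or not at all.

```python
import random

SAMPLE_RATE = 1 / 1024  # hypothetical rate; actual rates vary by workload

class Span:
    """One node in a Dapper-style trace tree (a sketch, not the real schema)."""
    def __init__(self, name, trace_id=None, parent_id=None, sampled=None):
        self.name = name
        # Probabilistically unique 64-bit IDs: collisions are possible
        # but vanishingly unlikely, so no central ID service is needed.
        self.trace_id = trace_id if trace_id is not None else random.getrandbits(64)
        self.span_id = random.getrandbits(64)
        self.parent_id = parent_id
        # Sampling is decided once at the trace root and propagated,
        # so a trace is either kept end to end or dropped end to end.
        self.sampled = sampled if sampled is not None else (random.random() < SAMPLE_RATE)

    def child(self, name):
        """A child span shares the trace ID and sampling decision."""
        return Span(name, trace_id=self.trace_id,
                    parent_id=self.span_id, sampled=self.sampled)

root = Span("frontend.Search")
shard = root.child("backend.QueryShard")
assert shard.trace_id == root.trace_id
assert shard.parent_id == root.span_id
assert shard.sampled == root.sampled
```

Sampling at the root, rather than per span, is what keeps that 1 terabyte per day coherent: every collected trace is a complete tree, not a random scattering of fragments.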

The largest apparent weakness, in my mind, is that developers have to keep separate logs. This sucks for developers trying to figure out what the heck is going on in a system. The same tools available to people looking across the system should be available to developers trying to drill down into it. The same bias is evident in the lack of detailed logging about queue depths, memory, locks, task switching, disk, task priorities, and other environmental details. When things are slow, it's often these details that are the root cause.

Despite those in-the-trenches issues, Dapper seems like a very cool system that any organization can learn from.

Article originally appeared on (http://highscalability.com/).