Debugging Distributed Systems. Excellent overview of all the challenges of debugging distributed systems: heterogeneity, concurrency, distributed state, and partial failures. It also covers attempted solutions such as testing, model checking, theorem proving, record and replay, tracing, log analysis, and visualization. What really shines is
ShiViz, a tool for studying executions of distributed systems, which "displays the happens-before relation. Given event e at node n, the happens-before relation indicates all the events that logically precede e." The time-space diagram is a huge step up from multiple terminals running grep on logs. It also has a diff feature for comparing program runs, which would be handy when you are wondering what the heck is different this time.
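
To make the happens-before idea concrete, here is a minimal sketch of how the relation can be computed, assuming each logged event carries a vector clock (a per-node counter map). The event names, the `happens_before` and `predecessors` helpers, and the sample clocks are all made up for illustration; this is not ShiViz's code or API, just one common way to derive the ordering it visualizes.

```python
from typing import Dict, List

# A vector clock maps a node name to that node's logical counter.
# Missing entries are treated as 0. (Illustrative type, not from ShiViz.)
VectorClock = Dict[str, int]


def happens_before(a: VectorClock, b: VectorClock) -> bool:
    """Return True if the event stamped `a` logically precedes the one stamped `b`.

    a -> b holds when every component of a is <= the matching component
    of b, and at least one component is strictly smaller.
    """
    nodes = set(a) | set(b)
    all_le = all(a.get(n, 0) <= b.get(n, 0) for n in nodes)
    some_lt = any(a.get(n, 0) < b.get(n, 0) for n in nodes)
    return all_le and some_lt


def predecessors(target: str, events: Dict[str, VectorClock]) -> List[str]:
    """List every event that logically precedes `target`, in log order."""
    return [
        name
        for name, clock in events.items()
        if happens_before(clock, events[target])
    ]


# Hypothetical three-node run: A sends a message that B receives as b2;
# C does unrelated local work.
events = {
    "a1": {"A": 1},            # A: send
    "b1": {"B": 1},            # B: local step
    "b2": {"A": 1, "B": 2},    # B: receive A's message
    "c1": {"C": 1},            # C: local step, concurrent with the rest
}

print(predecessors("b2", events))
```

Here a1 and b1 precede b2, while c1 is concurrent with it: neither clock dominates the other, so no ordering can be claimed. That "no ordering" case is exactly what grepping wall-clock timestamps across terminals tends to hide.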