Entries in Paper (127)

Thursday
Oct 31, 2013

Paper: Everything You Always Wanted to Know About Synchronization but Were Afraid to Ask

Awesome paper on how particular synchronization mechanisms scale on multi-core architectures: Everything You Always Wanted to Know About Synchronization but Were Afraid to Ask.

The goal is to pick a locking approach that doesn't degrade as the number of cores increases. Like everything else in life, that doesn't appear to be generically possible:

None of the nine locking schemes we consider consistently outperforms any other one, on all target architectures or workloads. Strictly speaking, to seek optimality, a lock algorithm should thus be selected based on the hardware platform and the expected workload.
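
To make that concrete, here is a minimal C11 sketch of two of the simpler lock classes the paper benchmarks, a test-and-test-and-set spinlock and a ticket lock. The names and layout are mine, not the paper's benchmark code; the sketch only shows why the two behave differently as core counts grow.

    #include <stdatomic.h>
    #include <stdbool.h>
    #include <stdint.h>

    /* Test-and-test-and-set: spin on a plain read, then try the atomic swap.
     * Cheap when uncontended, but every release still invalidates the cache
     * line in all waiting cores, which is what hurts at high core counts. */
    typedef struct { atomic_bool locked; } ttas_lock;

    void ttas_acquire(ttas_lock *l) {
        for (;;) {
            while (atomic_load_explicit(&l->locked, memory_order_relaxed))
                ;  /* read-only spin keeps the line in shared state */
            if (!atomic_exchange_explicit(&l->locked, true, memory_order_acquire))
                return;
        }
    }

    void ttas_release(ttas_lock *l) {
        atomic_store_explicit(&l->locked, false, memory_order_release);
    }

    /* Ticket lock: FIFO fairness and one fetch-and-add per acquisition, but
     * all waiters still spin on the same "owner" counter. */
    typedef struct {
        atomic_uint_fast32_t next;   /* ticket dispenser */
        atomic_uint_fast32_t owner;  /* ticket currently being served */
    } ticket_lock;

    void ticket_acquire(ticket_lock *l) {
        uint_fast32_t my = atomic_fetch_add_explicit(&l->next, 1, memory_order_relaxed);
        while (atomic_load_explicit(&l->owner, memory_order_acquire) != my)
            ;  /* wait until our ticket comes up */
    }

    void ticket_release(ticket_lock *l) {
        atomic_fetch_add_explicit(&l->owner, 1, memory_order_release);
    }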


Thursday
Sep 5, 2013

Paper: MillWheel: Fault-Tolerant Stream Processing at Internet Scale

Ever wonder what powers Google's world-spirit-sensing Zeitgeist service? No, it's not a homunculus of Georg Wilhelm Friedrich Hegel sitting in each browser. It's actually a stream-processing system (think streaming MapReduce on steroids) called MillWheel, described in this very well-written paper: MillWheel: Fault-Tolerant Stream Processing at Internet Scale. MillWheel isn't just used for Zeitgeist at Google; it also powers streaming joins for a variety of Ads customers, a generalized anomaly-detection service, and network switch and cluster health monitoring.

Abstract:

MillWheel is a framework for building low-latency data-processing applications that is widely used at Google. Users specify a directed computation graph and application code for individual nodes, and the system manages persistent state and the continuous flow of records, all within the envelope of the framework’s fault-tolerance guarantees.

 

This paper describes MillWheel’s programming model as well as its implementation. The case study of a continuous anomaly detector in use at Google serves to motivate how many of MillWheel’s features are used. MillWheel’s programming model provides a notion of logical time, making it simple to write time-based aggregations. MillWheel was designed from the outset with fault tolerance and scalability in mind. In practice, we find that MillWheel’s unique combination of scalability, fault tolerance, and a versatile programming model lends itself to a wide variety of problems at Google.
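
For a flavor of the programming model the abstract describes, here is a rough C rendering of a MillWheel-style per-key computation: user code reacts to records and to timers, while the framework owns per-key persistent state and the low-watermark bookkeeping. The real API is C++ and much richer; the names below only approximate the hooks the paper describes.

    #include <stddef.h>
    #include <stdint.h>

    /* A record arrives with a key, a value, and a logical timestamp. */
    typedef struct {
        const char *key;
        const void *value;
        size_t      value_len;
        uint64_t    timestamp;
    } record_t;

    /* A user computation is a pair of callbacks plus framework-persisted state.
     * process_record fires for each incoming record; process_timer fires when a
     * timer set by the computation comes due, e.g. once the low watermark
     * guarantees no earlier-timestamped records will arrive for this key. */
    typedef struct computation {
        void (*process_record)(struct computation *c, const record_t *rec);
        void (*process_timer)(struct computation *c, const char *key, uint64_t when);
        void *per_key_state;   /* checkpointed by the framework between calls */
    } computation_t;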


Thursday
Aug 22, 2013

The Datacenter as a Computer: An Introduction to the Design of Warehouse-Scale Machines, Second edition

Google has released an epic second edition of their groundbreaking The Datacenter as a Computer book. It's called an introduction, but at 156 pages I would love to see what the Advanced version would look like!

John Fries, in a G+ comment, gives what I think is a perfect summary of the spirit of the book:

It's funny, when I was at Google I was initially quite intimidated by interacting with an enormous datacenter, and then I started imagining the entire datacenter was shrunk down into a small box sitting on my desk, and realized it was just another machine and the physical size didn't matter anymore.

It's such a far-ranging book that it's impossible to characterize simply. It covers an amazing diversity of topics: an introduction to warehouse-scale computing; workloads and software infrastructure; hardware; datacenter architecture; energy and power efficiency; cost structures; how to deal with failures and repairs; and a closing discussion of key challenges, including rapidly changing workloads, building responsive large-scale systems, energy proportionality of non-CPU components, overcoming the end of Dennard scaling, and Amdahl's cruel law.

In reading it I get the sense the Faerie Queen has transported us to the land of Faerie, a special other place of timeless truths, where dragons roam, and mortal danger lurks. And if you do escape, nothing is quite the same ever again. 


Wednesday
Aug 7, 2013

Raft - In Search of an Understandable Consensus Algorithm

If, like many humans, you've found even Paxos Made Simple a bit difficult to understand, you might enjoy Raft, as described in In Search of an Understandable Consensus Algorithm by Stanford's Diego Ongaro and John Ousterhout. The video presentation of the paper is given by John Ousterhout. Both the paper and the video are delightfully accessible.

mcherm has a good summary of the paper:

A consensus algorithm does this: a cluster of servers records a series of records ("log entries") in response to requests from clients of the cluster. (It may also take action based on those entries.) It does so in a way that guarantees that the responses seen by clients of the cluster will be consistent EVEN in the face of servers crashing in unpredictable ways (but not losing data that was synched to disk), and networks introducing unpredictable delays or communication blockages.
Here's what Raft does. First, it elects a leader. The leader records the master version of the log, tells the other cluster servers what's in that master record, and "commits" a log entry, acknowledging it to the client of the cluster, only once more than half the cluster has recorded that entry. That works unless the leader crashes or loses communication with too many of the others; in such a case Raft elects a new leader. The election process is designed to guarantee that any newly elected leader will have (at least) all of the already-committed entries.
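
The majority rule in that summary is small enough to sketch. Here's a toy C version of the leader's commit decision, with everything else (terms, the log-matching check, the restriction that a leader only directly commits entries from its own term) left out; it's an illustration of the idea, not Raft itself.

    #include <stdbool.h>
    #include <stddef.h>

    typedef struct {
        size_t cluster_size;  /* total servers, leader included */
        size_t match_count;   /* servers known to hold this entry; start at 1
                                 to count the leader's own copy */
        bool   committed;
    } log_entry_t;

    /* Leader-side handler: a follower has acknowledged replicating the entry.
     * Once a strict majority has it, the entry is committed and the client
     * request that produced it can finally be acknowledged. */
    void on_append_ack(log_entry_t *e) {
        e->match_count++;
        if (2 * e->match_count > e->cluster_size)
            e->committed = true;
    }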

We also have a treat in the form of a great roundtable discussion of the topic via a Think Distributed hangout, featuring several folks from Basho, Peter Bailis, and Diego Ongaro. 

Perhaps the most interesting part of the talk came late in the discussion, when Peter commented that he was astounded that an academic paper already has so many open source implementations. Raft already has 40 or so implementations in many different languages.

The key that others can learn from is understandability. Most academic papers are opaque, to put it generously. Diego, by contrast, made understandability an explicit design goal for Raft.


Thursday
Jun 27, 2013

Paper: XORing Elephants: Novel Erasure Codes for Big Data

Erasure codes are one of those seemingly magical mathematical creations that, with the developments described in the paper XORing Elephants: Novel Erasure Codes for Big Data, are set to replace triple replication as the data storage protection mechanism of choice.

The result, says Robin Harris (StorageMojo) in his excellent article Facebook's advanced erasure codes: "WebCos will be able to store massive amounts of data more efficiently than ever before. Bad news: so will anyone else."

Robin says that with cheap disks, triple replication made sense and was economical. With ever-bigger BigData, that overhead has become costly. But erasure codes have always suffered from unacceptably long repair times. This paper describes new Locally Repairable Codes (LRCs) that are efficient to repair in terms of disk I/O and network bandwidth:

These systems are now designed to survive the loss of up to four storage elements – disks, servers, nodes or even entire data centers – without losing any data. What is even more remarkable is that, as this paper demonstrates, these codes achieve this reliability with a capacity overhead of only 60%.
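
The locality idea is easy to show in miniature. The toy C sketch below uses a plain XOR parity over a small local group, so a single lost block can be rebuilt by reading only that group instead of a whole wide stripe. This is an illustration of the locality principle, not the paper's actual LRC construction, but the repair saving it measures comes from the same place.

    #include <stddef.h>
    #include <stdint.h>

    /* XOR-accumulate one block into a parity (or reconstruction) buffer. */
    void xor_into(uint8_t *acc, const uint8_t *block, size_t len) {
        for (size_t i = 0; i < len; i++)
            acc[i] ^= block[i];
    }

    /* Rebuild a single missing block of its local group:
     * missing = local_parity XOR (all surviving blocks in the group).
     * Only the small group is read, not the full stripe -- that's the win. */
    void repair_block(uint8_t *out, const uint8_t *local_parity,
                      const uint8_t *const *survivors, size_t n_survivors,
                      size_t len) {
        for (size_t i = 0; i < len; i++)
            out[i] = local_parity[i];
        for (size_t s = 0; s < n_survivors; s++)
            xor_into(out, survivors[s], len);
    }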

They examined a large Facebook analytics Hadoop cluster of 3000 nodes with about 45 PB of raw capacity. On average about 22 nodes failed each day, but on some days failures spiked to more than 100.

The LRC tests produced several key results:
  • Disk I/O and network traffic were reduced by half compared to RS codes.
  • LRC required 14% more storage than RS, which is information-theoretically optimal for the obtained locality.
  • Repair times were much lower thanks to the local repair codes.
  • Reliability was much greater thanks to the fast repairs.
  • Reduced network traffic makes the codes suitable for geographic distribution.

I wonder if we'll see a change in NoSQL database systems as well? 


Wednesday
Jun 19, 2013

Paper: MegaPipe: A New Programming Interface for Scalable Network I/O

The paper MegaPipe: A New Programming Interface for Scalable Network I/O (video, slides) hits the common theme that if you want to go faster you need a better car design, not just a better driver. That's why the authors started with a clean slate and designed a network API from the ground up, with support for concurrent I/O, a requirement for achieving high performance while scaling to large numbers of connections per thread, multiple cores, etc. What they created is MegaPipe, "a new network programming API for message-oriented workloads to avoid the performance issues of BSD Socket API."
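
One of the core ideas is amortizing the per-message syscall cost by batching. The C sketch below is only a stand-in for that idea: the struct and function are mine, and this user-space version just loops over the batch with ordinary read/write calls where a real implementation would complete the whole batch per kernel crossing. It shows the shape of the interface, nothing more.

    #include <stddef.h>
    #include <unistd.h>

    typedef enum { IO_READ, IO_WRITE } io_op_t;

    typedef struct {
        io_op_t op;
        int     fd;       /* connection descriptor */
        void   *buf;
        size_t  len;
        ssize_t result;   /* filled in on completion */
    } io_req_t;

    /* Submit a whole batch of small I/O operations "at once". Amortizing the
     * user/kernel crossing over many small messages is where the big gains
     * for short connections come from. */
    void io_batch_submit(io_req_t *reqs, size_t n) {
        for (size_t i = 0; i < n; i++) {
            if (reqs[i].op == IO_READ)
                reqs[i].result = read(reqs[i].fd, reqs[i].buf, reqs[i].len);
            else
                reqs[i].result = write(reqs[i].fd, reqs[i].buf, reqs[i].len);
        }
    }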

The result: MegaPipe outperforms baseline Linux between 29% (for long connections) and 582% (for short connections). MegaPipe improves the performance of a modified version of memcached between 15% and 320%. For a workload based on real-world HTTP traces, MegaPipe boosts the throughput of nginx by 75%.

What's this most excellent and interesting paper about?


Thursday
Jun 6, 2013

Paper: Memory Barriers: a Hardware View for Software Hackers

It's not often you get as enthusiastic a recommendation for a paper as Sergio Bossa gives Memory Barriers: a Hardware View for Software Hackers: "If you only want to read one piece about CPU architecture, cache coherency and memory barriers, make it this one."

It is a clear and well-written article. It even has a quiz. What's it about?

So what possessed CPU designers to cause them to inflict memory barriers on poor unsuspecting SMP software designers?

In short, because reordering memory references allows much better performance, and so memory barriers are needed to force ordering in things like synchronization primitives whose correct operation depends on ordered memory references.

Getting a more detailed answer to this question requires a good understanding of how CPU caches work, and especially what is required to make caches really work well. The following sections:

  1. present the structure of a cache,
  2. describe how cache-coherency protocols ensure that CPUs agree on the value of each location in memory, and, finally,
  3. outline how store buffers and invalidate queues help caches and cache-coherency protocols achieve high performance.

We will see that memory barriers are a necessary evil that is required to enable good performance and scalability, an evil that stems from the fact that CPUs are orders of magnitude faster than are both the interconnects between them and the memory they are attempting to access.
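
The classic demonstration the article builds toward is the store-buffer litmus test. Here it is in C11 atomics (my condensation, not code from the article): without the fences, hardware store buffering allows both threads to read 0, which is exactly the reordering the barriers exist to forbid.

    #include <stdatomic.h>

    atomic_int x, y;   /* both start at 0 */

    int thread0(void) {
        atomic_store_explicit(&x, 1, memory_order_relaxed);
        atomic_thread_fence(memory_order_seq_cst);   /* the "memory barrier" */
        return atomic_load_explicit(&y, memory_order_relaxed);   /* r1 */
    }

    int thread1(void) {
        atomic_store_explicit(&y, 1, memory_order_relaxed);
        atomic_thread_fence(memory_order_seq_cst);
        return atomic_load_explicit(&x, memory_order_relaxed);   /* r2 */
    }

    /* With the fences, at least one of r1, r2 must be 1. Remove them and each
     * CPU's store can sit in its store buffer while the following load runs,
     * making r1 == r2 == 0 a legal (and observed) outcome on real hardware. */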

Thursday
May 30, 2013

Google Finds NUMA Up to 20% Slower for Gmail and Websearch

When you have a large population of servers you have both the opportunity and the incentive to perform interesting studies. Authors from Google and the University of California conducted just such a study in Optimizing Google’s Warehouse Scale Computers: The NUMA Experience, taking a look at how jobs run on clusters of machines using a NUMA architecture. Since NUMA is common on server-class machines, it's a topic of general interest for anyone looking to maximize machine utilization across clusters.
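
The practical takeaway for mere mortals is locality: keep a task's memory on the NUMA node it runs on so loads don't cross the socket interconnect. A hedged sketch using libnuma (the library calls are real; the wrapper and its policy are just an illustration, and error handling is trimmed):

    #define _GNU_SOURCE
    #include <numa.h>     /* link with -lnuma */
    #include <sched.h>
    #include <stddef.h>
    #include <stdlib.h>

    /* Allocate memory on the NUMA node the calling thread is currently running
     * on, falling back to plain malloc on non-NUMA machines. Memory obtained
     * from numa_alloc_onnode must later be released with numa_free(ptr, bytes). */
    void *alloc_node_local(size_t bytes) {
        if (numa_available() < 0)
            return malloc(bytes);
        int node = numa_node_of_cpu(sched_getcpu());
        return numa_alloc_onnode(bytes, node);
    }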

Some of the results are surprising.


Thursday
May 23, 2013

Paper: Calvin: Fast Distributed Transactions for Partitioned Database Systems

Distributed transactions are costly because they require agreement protocols. Calvin says, surprisingly, that using a deterministic database lets you avoid running an agreement protocol for every transaction. The approach is to use a deterministic transaction layer that does all the hard work of ordering transactions before locks are acquired and execution begins.

Overview:
Many distributed storage systems achieve high data access throughput via partitioning and replication, each system with its own advantages and tradeoffs. In order to achieve high scalability, however, today’s systems generally reduce transactional support, disallowing single transactions from spanning multiple partitions. Calvin is a practical transaction scheduling and data replication layer that uses a deterministic ordering guarantee to significantly reduce the normally prohibitive contention costs associated with distributed transactions. Unlike previous deterministic database system prototypes, Calvin supports disk-based storage, scales near-linearly on a cluster of commodity machines, and has no single point of failure. By replicating transaction inputs rather than effects, Calvin is also able to support multiple consistency levels—including Paxos based strong consistency across geographically distant replicas—at no cost to transactional throughput.
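
The determinism trick is simple to caricature in code: fix a global order for a batch of transaction inputs once (that ordering is what gets replicated), then have every replica execute the same batch in the same order, so replicas converge without per-transaction agreement. A toy C sketch, with names that are mine rather than Calvin's:

    #include <stddef.h>

    typedef struct {
        long txn_id;
        void (*execute)(void *db_state);  /* deterministic transaction logic */
    } txn_input_t;

    /* Every replica receives an identical, already-ordered batch (the batch and
     * its order are agreed on once, e.g. via Paxos) and applies it verbatim.
     * Same inputs + same order + deterministic logic => identical replica state,
     * with no commit-time agreement protocol per transaction. */
    void execute_batch(void *db_state, const txn_input_t *batch, size_t n) {
        for (size_t i = 0; i < n; i++)
            batch[i].execute(db_state);
    }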

If you are interested, Daniel Abadi gives a very accessible overview of Calvin in If all these new DBMS technologies are so scalable, why are Oracle and DB2 still on top of TPC-C? A roadmap to end their dominance.


Thursday
May 16, 2013

Paper: Warp: Multi-Key Transactions for Key-Value Stores

Looks like an interesting take on "a completely asynchronous, low-latency transaction management protocol, in line with the fully distributed NoSQL architecture."

Warp: Multi-Key Transactions for Key-Value Stores overview:

Implementing ACID transactions has been a longstanding challenge for NoSQL systems. Because these systems are based on a sharded architecture, transactions necessarily require coordination across multiple servers. Past work in this space has relied either on heavyweight protocols such as Paxos or clock synchronization for this coordination.

This paper presents a novel protocol for coordinating distributed transactions with ACID semantics on top of a sharded data store. Called linear transactions, this protocol achieves scalability by distributing the coordination task to only those servers that hold relevant data for each transaction. It achieves high performance by serializing only those transactions whose concurrent execution could potentially yield a violation of ACID semantics. Finally, it naturally integrates chain-replication and can thus tolerate faults of both clients and servers. We have fully implemented linear transactions in a commercially available data store. Experiments show that this system achieves 1-9× more throughput than MongoDB, Cassandra and HyperDex on the Yahoo! Cloud Serving Benchmark, even though none of the latter systems provide transactional guarantees.
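
The scoping idea in that abstract, coordination only among the servers that actually hold a transaction's data, can be sketched in a few lines of C. The hashing and ordering below are illustrative placeholders, not Warp's placement or chain-construction rules; the point is simply that the participant set falls out of the transaction's key set.

    #include <stddef.h>
    #include <stdlib.h>

    #define NUM_SHARDS 64  /* illustrative cluster size */

    /* Toy placement function: which server owns a key. */
    size_t shard_of(const char *key) {
        size_t h = 5381;
        for (; *key; key++)
            h = h * 33 + (unsigned char)*key;
        return h % NUM_SHARDS;
    }

    static int cmp_size(const void *a, const void *b) {
        size_t x = *(const size_t *)a, y = *(const size_t *)b;
        return (x > y) - (x < y);
    }

    /* Derive the ordered, de-duplicated list of servers that must coordinate a
     * transaction touching `keys`; every other server stays out of the protocol.
     * chain_out must have room for n_keys entries; returns how many are used. */
    size_t coordination_chain(const char **keys, size_t n_keys, size_t *chain_out) {
        size_t n = 0, m = 0;
        for (size_t i = 0; i < n_keys; i++)
            chain_out[n++] = shard_of(keys[i]);
        qsort(chain_out, n, sizeof *chain_out, cmp_size);
        for (size_t i = 0; i < n; i++)
            if (i == 0 || chain_out[i] != chain_out[i - 1])
                chain_out[m++] = chain_out[i];
        return m;
    }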