Entries in Paper (127)

Monday
Oct 04, 2010

Paper: An Analysis of Linux Scalability to Many Cores  

An Analysis of Linux Scalability to Many Cores, by a number of MIT researchers, is a refreshingly practical paper on what it takes to scale Linux and common applications like Exim, memcached, Apache, PostgreSQL, gmake, Psearchy, and MapReduce to run on 48-core systems. A very timely paper, given that massively multicore systems are reportedly the near future of computing.

This paper must have taken a lot of work. The authors both tracked down bottlenecks in a number of applications and in the Linux kernel, and then tried to fix them. They say they made only "modest" changes to the kernel and applications, but there's nothing modest about what they did. It's excellent work.
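
To get a flavor of the kind of fix involved, one technique the paper introduces is the sloppy counter: rather than having every core fight over a single shared reference count (and its cache line), each core keeps a local count and reconciles with the shared one only occasionally. Here's a simplified, single-process sketch of the idea; it's a toy model, not the kernel code:

```python
# Toy model of a "sloppy counter": per-core local counts that are folded
# into the shared count in batches, so the contended update is rare.
import threading

class SloppyCounter:
    def __init__(self, ncores, batch=64):
        self.shared = 0
        self.lock = threading.Lock()        # stands in for the contended cache line
        self.local = [0] * ncores           # per-core counters, uncontended
        self.batch = batch

    def inc(self, core):
        self.local[core] += 1
        if self.local[core] >= self.batch:  # reconcile rarely, not on every inc
            with self.lock:
                self.shared += self.local[core]
            self.local[core] = 0

    def value(self):
        with self.lock:
            return self.shared + sum(self.local)

c = SloppyCounter(ncores=4)
for i in range(1000):
    c.inc(i % 4)
print(c.value())  # 1000
```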

After the abstract below is a list of the problems they found and how they fixed them.

Click to read more ...

Friday
Oct 01, 2010

Google Paper: Large-scale Incremental Processing Using Distributed Transactions and Notifications

This paper, Large-scale Incremental Processing Using Distributed Transactions and Notifications by Daniel Peng and Frank Dabek, is Google's much anticipated description of Percolator, their new real-time indexing system.

The abstract:

Updating an index of the web as documents are crawled requires continuously transforming a large repository of existing documents as new documents arrive. This task is one example of a class of data processing tasks that transform a large repository of data via small, independent mutations. These tasks lie in a gap between the capabilities of existing infrastructure. Databases do not meet the storage or throughput requirements of these tasks: Google’s indexing system stores tens of petabytes of data and processes billions of updates per day on thousands of machines. MapReduce and other batch-processing systems cannot process small updates individually as they rely on creating large batches for efficiency.

We have built Percolator, a system for incrementally processing updates to a large data set, and deployed it to create the Google web search index. By replacing a batch-based indexing system with an indexing system based on incremental processing using Percolator, we process the same number of documents per day, while reducing the average age of documents in Google search results by 50%. 
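
To make the incremental model concrete, here's a toy sketch of the observer idea: a write triggers a small follow-on computation instead of a batch job over the whole repository. The Table class and callback wiring below are invented for illustration; the real Percolator runs observers inside snapshot-isolation transactions on top of Bigtable:

```python
# Toy Percolator-style incremental processing: writes trigger observers.
import collections

class Table:
    def __init__(self):
        self.rows = collections.defaultdict(dict)
        self.observers = []                    # (column, callback) pairs

    def on_write(self, column, callback):
        self.observers.append((column, callback))

    def write(self, row, column, value):
        # In Percolator this would run inside a distributed transaction;
        # here it is a plain dict update followed by observer dispatch.
        self.rows[row][column] = value
        for col, cb in self.observers:
            if col == column:
                cb(self, row)

def index_document(table, row):
    # Observer: incrementally index one new document as it lands.
    doc = table.rows[row]["raw"]
    table.rows[row]["indexed"] = doc.lower().split()

repo = Table()
repo.on_write("raw", index_document)
repo.write("doc1", "raw", "Incremental processing with Percolator")
print(repo.rows["doc1"]["indexed"])
```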

Wednesday
Sep 01, 2010

Paper: The Case for Determinism in Database Systems  

Can you have your ACID cake and eat your distributed database too? Yes, explains Daniel Abadi, Assistant Professor of Computer Science at Yale University, in an epic post, The problems with ACID, and how to fix them without going NoSQL, coauthored with Alexander Thomson, on their paper The Case for Determinism in Database Systems. We've already seen VoltDB offer the best of both worlds; this sounds like a completely different approach.
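
The core intuition is easy to demonstrate: if every replica receives the same transactions in the same agreed-on global order and executes them deterministically, the replicas stay in lockstep without any cross-replica commit protocol. Here's a minimal sketch of that intuition (my illustration, not the paper's actual protocol):

```python
# Two replicas apply the same pre-ordered log of deterministic transactions
# and end in identical states -- no two-phase commit needed.
def transfer(db, src, dst, amount):
    # A deterministic transaction: same input state -> same output state.
    if db[src] >= amount:
        db[src] -= amount
        db[dst] += amount

log = [("a", "b", 30), ("b", "c", 10), ("a", "c", 5)]  # agreed-on global order

replica1 = {"a": 100, "b": 0, "c": 0}
replica2 = {"a": 100, "b": 0, "c": 0}
for src, dst, amount in log:
    transfer(replica1, src, dst, amount)
    transfer(replica2, src, dst, amount)

assert replica1 == replica2   # same order + determinism = same state
print(replica1)
```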

The solution they propose is:

Click to read more ...

Wednesday
Aug 18, 2010

Misco: A MapReduce Framework for Mobile Systems - Start of the Ambient Cloud?

Misco: A MapReduce Framework for Mobile Systems is a very exciting paper to me because it's really one of the first explorations of some of the ideas in Building Super Scalable Systems: Blade Runner Meets Autonomic Computing in the Ambient Cloud. What they are trying to do is efficiently distribute work across a set of cellphones using the now familiar MapReduce interface. Usually we think of MapReduce as working across large clusters hosted in data centers. Here the cluster nodes are cellphones: not contained in any data center, but compute nodes potentially distributed everywhere.

I talked briefly with Adam Dou, one of the paper's authors, and he said they don't see cellphone clusters replacing dedicated computer clusters, primarily because of the power required for both network communication and the map-reduce computations. Large multi-terabyte jobs aren't in the cards...yet. Adam estimates that, computationally, cellphones perform similarly to desktops of ten years ago. Instead, they want to focus on the unique characteristics of mobile devices--camera, microphone, GPS, and other directly collectable data--so the data can be processed where it is collected.

MapReduce was selected as the programming interface because it is familiar to programmers, it transparently supports programming multiple devices, and it can be implemented--especially in Python--in such a way that programmers are freed from the underlying details like concurrency, data distribution, and code management. A very smart move in my estimation.
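
For flavor, here's what a word count might look like against a Python MapReduce interface. The function names and the run_job driver below are illustrative stand-ins, not Misco's actual API; in Misco the runtime would ship the map and reduce tasks out to the phones and gather the results:

```python
# Word count against a hypothetical Python MapReduce interface.
import collections

def wordcount_map(key, value):
    # key: document id, value: document text
    for word in value.split():
        yield word.lower(), 1

def wordcount_reduce(key, values):
    yield key, sum(values)

def run_job(inputs, mapper, reducer):
    # Stand-in for the framework: group intermediate pairs by key, then reduce.
    groups = collections.defaultdict(list)
    for k, v in inputs.items():
        for ik, iv in mapper(k, v):
            groups[ik].append(iv)
    return dict(kv for ik in groups for kv in reducer(ik, groups[ik]))

docs = {"d1": "the quick brown fox", "d2": "the lazy dog"}
print(run_job(docs, wordcount_map, wordcount_reduce))
```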

It's interesting to contrast the economics of the ambient cloud with the economics of the data center cloud. The goal of a data center cloud is 100 percent utilization: use every possible CPU cycle or money is being wasted on unused equipment. In an ambient cloud the idea is more parasitic: deploy to as many resources as possible, yet leave the primary function of each device unaffected. It's a different perspective that may lead to different architectures.
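
As a sketch of what "parasitic" might mean in practice, an ambient-cloud scheduler could gate work on device state. The fields and thresholds below are purely invented for illustration:

```python
# Hypothetical admission policy: only take work when it won't interfere
# with the phone's primary job as a phone.
def should_accept_work(device):
    return (device["idle"]                  # owner isn't using it
            and device["charging"]          # don't drain the battery
            and device["battery"] > 0.5     # keep a margin even while charging
            and device["on_wifi"])          # don't burn metered data

phone = {"idle": True, "charging": True, "battery": 0.8, "on_wifi": True}
print(should_accept_work(phone))  # True: safe to run a map task now
```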

A quick introduction to Misco from the abstract:

Click to read more ...

Wednesday
Aug 04, 2010

Dremel: Interactive Analysis of Web-Scale Datasets - Data as a Programming Paradigm

If Google was a boxer then MapReduce would be a probing right hand that sets up the massive left hook that is Dremel, Google's scalable (thousands of CPUs, petabytes of data, trillions of rows), SQL-based, columnar, interactive (results returned in seconds), ad-hoc analytics system. If Google was a magician then MapReduce would be the shiny thing that distracts the mind while the trick goes unnoticed. I say that because even though Dremel has been around internally at Google since 2006, we have not heard a whisper about it. All we've heard about is MapReduce, clones of which have inspired entire new industries. Tricky.
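
To see why a columnar layout matters for interactive queries, consider this toy comparison: a query that touches one field only scans that field's column. This is a deliberate simplification; Dremel actually stores nested records column-wise using repetition and definition levels:

```python
# Row store vs. column store for a single-field aggregate.
rows = [
    {"url": "a.com", "clicks": 10, "country": "US"},
    {"url": "b.com", "clicks": 3,  "country": "DE"},
    {"url": "c.com", "clicks": 7,  "country": "US"},
]

# Row store: the query walks every whole record.
total_row_store = sum(r["clicks"] for r in rows)

# Column store: each field lives in its own contiguous array,
# so the query touches only the "clicks" column.
columns = {k: [r[k] for r in rows] for k in rows[0]}
total_col_store = sum(columns["clicks"])

assert total_row_store == total_col_store == 20
```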

Dremel, according to Brian Bershad, Director of Engineering at Google, is targeted at solving BigData class problems:

While we all know that systems are huge and will get even huger, the implications of this size on programmability, manageability, power, etc. are hard to comprehend. Alfred noted that the Internet is predicted to be carrying a zettabyte (10^21 bytes) per year in just a few years. And growth in the number of processing elements per chip may give rise to warehouse computers having 10^10 or more processing elements. To use systems at this scale, we need new solutions for storage and computation.

Click to read more ...

Wednesday
Jun 30, 2010

Paper: GraphLab: A New Framework For Parallel Machine Learning

In the never ending quest to figure out how to do something useful with never ending streams of data, GraphLab: A New Framework For Parallel Machine Learning wants to go beyond low-level programming, MapReduce, and dataflow languages with a new parallel framework for ML (machine learning) which exploits the sparse structure and common computational patterns of ML algorithms. GraphLab enables ML experts to easily design and implement efficient, scalable parallel algorithms by composing problem-specific computation, data dependencies, and scheduling. The main contributions (see the sketch after the list) include:

  • A graph-based data model which simultaneously represents data and computational dependencies. 
  • A set of concurrent access models which provide a range of sequential-consistency guarantees. 
  • A sophisticated modular scheduling mechanism. 
  • An aggregation framework to manage global state. 
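
Here's a minimal, single-threaded sketch of the update-function model described above: an update function runs on one vertex, reads its neighborhood, and may schedule neighbors for further work. The names are illustrative, not the GraphLab API, and the real framework runs updates in parallel under a chosen consistency model:

```python
# Toy vertex-update loop in the spirit of GraphLab's data graph + scheduler.
from collections import deque

graph = {"a": ["b", "c"], "b": ["a"], "c": ["a", "b"]}
value = {"a": 1.0, "b": 0.0, "c": 0.0}

def update(v):
    # Toy computation: move v's value toward the average of its neighbors;
    # return the neighbors to reschedule if anything actually changed.
    old = value[v]
    value[v] = sum(value[n] for n in graph[v]) / len(graph[v])
    return graph[v] if abs(value[v] - old) > 1e-6 else []

queue = deque(graph)            # initial schedule: every vertex
pending = set(queue)
while queue:
    v = queue.popleft()
    pending.discard(v)
    for n in update(v):         # updates may schedule further updates
        if n not in pending:
            pending.add(n)
            queue.append(n)

print(value)
```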

Click to read more ...

Friday
Jun 18, 2010

Paper: The Declarative Imperative: Experiences and Conjectures in Distributed Logic

The Declarative Imperative: Experiences and Conjectures in Distributed Logic is written by UC Berkeley's Joseph Hellerstein for a keynote speech he gave at PODS. The video version of the talk is here. You may have heard of Mr. Hellerstein through the Berkeley Orders Of Magnitude project (BOOM), whose purpose is to help people build systems that are OOM (orders of magnitude) bigger than are being built today, with OOM less effort than traditional programming methodologies require. A noble goal, which may be why BOOM was rated a top 10 emerging technology for 2010 by MIT Technology Review. Quite an honor.

The motivation for the talk is a familiar one: it's a dark period for computer programming, and if we don't learn how to write parallel programs, the children of Moore's law will destroy us all. We have more and more processors, yet we are stuck on figuring out how the average programmer can exploit them. The BOOM solution is the Bloom language, which is based on Dedalus:
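
To get a taste of the declarative style, here's a Datalog-flavored fixpoint made runnable in Python: state the rules and iterate to a fixpoint, never writing the traversal loop yourself. This is a generic Datalog idiom, not Bloom or Dedalus syntax:

```python
# Transitive closure as a fixpoint of two rules:
#   reach(X, Y) <- link(X, Y)
#   reach(X, Z) <- reach(X, Y), link(Y, Z)
link = {("a", "b"), ("b", "c"), ("c", "d")}

reach = set(link)
while True:
    derived = {(x, z) for (x, y) in reach for (y2, z) in link if y == y2}
    if derived <= reach:
        break                 # fixpoint: no new facts, we're done
    reach |= derived

print(sorted(reach))
```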

Click to read more ...

Wednesday
Jun 09, 2010

Paper: Propagation Networks: A Flexible and Expressive Substrate for Computation 

Alexey Radul, in his fascinating 174-page dissertation Propagation Networks: A Flexible and Expressive Substrate for Computation, offers to help us break free of the tyranny of linear time by organizing computation as a network of autonomous but interconnected machines, each of which is free to run when it pleases, propagating information around the network as proves possible. The consequence of this freedom is that the structure of the aggregate does not impose an order of time. The abstract from his thesis is:
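
Here's a minimal sketch of the propagator idea: cells hold information, and propagators fire whenever their input cells gain content, in no globally imposed order. This toy omits the partial-information machinery that makes the dissertation interesting:

```python
# Toy propagator network: an adder fires as soon as both inputs are known.
class Cell:
    def __init__(self):
        self.content = None
        self.watchers = []

    def add_content(self, value):
        if self.content is None:
            self.content = value
            for run in self.watchers:   # wake anyone who depends on this cell
                run()

def adder(a, b, total):
    def run():
        if a.content is not None and b.content is not None:
            total.add_content(a.content + b.content)
    for cell in (a, b):
        cell.watchers.append(run)

x, y, z = Cell(), Cell(), Cell()
adder(x, y, z)
x.add_content(3)        # nothing happens yet: y is still unknown
y.add_content(4)        # now the adder can fire
print(z.content)        # 7
```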

Click to read more ...

Wednesday
May 26, 2010

End-To-End Performance Study of Cloud Services

Cloud computing promises a number of advantages for the deployment of data-intensive applications. Most prominently, these include reducing cost with a pay-as-you-go business model and (virtually) unlimited throughput by adding servers if the workload increases. At the Systems Group, ETH Zurich, we did an extensive end-to-end performance study to compare the major cloud offerings regarding their ability to fulfill these promises and their implied cost.

Click to read more ...

Tuesday
Apr 27, 2010

Paper: Dapper, Google's Large-Scale Distributed Systems Tracing Infrastructure

Imagine a single search request coursing through Google's massive infrastructure. A single request can run across thousands of machines and involve hundreds of different subsystems. And oh by the way, you are processing more requests per second than any other system in the world. How do you debug such a system? How do you figure out where the problems are? How do you determine if programmers are coding correctly? How do you keep sensitive data secret and safe? How do you ensure products don't use more resources than they are assigned? How do you store all the data? How do you make use of it?

That's where Dapper comes in. Dapper is Google's tracing system, and it was originally created to understand system behaviour starting from a search request. Now Google's production clusters generate more than 1 terabyte of sampled trace data per day. So how does Dapper do what Dapper does?
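
As a mental model, here's a toy version of the span-and-sampling scheme: each request carries a trace id, every unit of work records its parent span, and the root's sampling decision is inherited downstream so only a small fraction of traces is ever collected. Class and field names here are illustrative, not Dapper's:

```python
# Toy distributed-tracing spans with root-level sampling.
import random, time, uuid

SAMPLE_RATE = 0.001          # sample aggressively (on the order of 1 in 1000)
spans = []                   # stand-in for the trace collection pipeline

class Span:
    def __init__(self, name, trace_id=None, parent_id=None, sampled=None):
        self.name = name
        self.trace_id = trace_id or uuid.uuid4().hex
        self.span_id = uuid.uuid4().hex
        self.parent_id = parent_id
        # The root makes the sampling decision; children inherit it.
        self.sampled = random.random() < SAMPLE_RATE if sampled is None else sampled
        self.start = time.time()

    def child(self, name):
        return Span(name, self.trace_id, self.span_id, self.sampled)

    def finish(self):
        if self.sampled:
            spans.append((self.trace_id, self.parent_id, self.name,
                          time.time() - self.start))

root = Span("frontend-request")
rpc = root.child("spell-check-rpc")   # same trace id, parent points to root
rpc.finish()
root.finish()
```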

Click to read more ...
