Entries in Paper (127)

Friday
May 1, 2009

FastBit: An Efficient Compressed Bitmap Index Technology

Data mining and fast queries always land in that bin of hard-to-do things where doing something smarter can yield big results. Bloom filters are one such do-it-smarter strategy; compressed bitmap indexes are another. In one application "FastBit outruns other search indexes by a factor of 10 to 100 and doesn’t require much more room than the original data size." The data size is an interesting metric: our old standard b-trees can be two to four times larger than the original data. In a test searching an Enron email database, FastBit outran MySQL by 10 to 1,000 times.

FastBit is a software tool for searching large read-only datasets. It organizes user data in a column-oriented structure, which is efficient for on-line analytical processing (OLAP), and utilizes compressed bitmap indices to further speed up query processing. Analyses have proven the compressed bitmap index used in FastBit to be theoretically optimal for one-dimensional queries. Compared with other optimal indexing methods, bitmap indices are superior because they can be efficiently combined to answer multi-dimensional queries, whereas other optimal methods cannot.
It's not all just map-reduce and add more servers until your attic is full.
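
To make the idea concrete, here is a minimal sketch of an equality-encoded bitmap index (my own illustration; FastBit itself layers Word-Aligned Hybrid compression on top of bitmaps like these, which is not shown):

```python
# Minimal sketch of an equality-encoded bitmap index (illustrative only --
# FastBit adds compression on top of bitmaps like these). One bitmap per
# distinct column value; multi-dimensional predicates reduce to cheap
# bitwise ANDs/ORs over those bitmaps.

class BitmapIndex:
    def __init__(self, values):
        self.n = len(values)
        self.bitmaps = {}                      # value -> int used as a bit vector
        for row, v in enumerate(values):
            self.bitmaps[v] = self.bitmaps.get(v, 0) | (1 << row)

    def eq(self, value):
        return self.bitmaps.get(value, 0)      # bitmap of rows where column == value

# Two columns of a read-only dataset, indexed independently.
state = BitmapIndex(["CA", "TX", "CA", "NY", "CA"])
year  = BitmapIndex([2001, 2001, 2002, 2001, 2001])

# "state = 'CA' AND year = 2001" is just an AND of two bitmaps.
hits = state.eq("CA") & year.eq(2001)
rows = [r for r in range(state.n) if hits >> r & 1]
print(rows)   # -> [0, 4]
```

The point of the structure is visible in the last few lines: a multi-dimensional query never scans the raw rows, it just ANDs a few bitmaps together.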

Related Articles

  • FastBit: Digging through databases faster. An excellent description of how FastBit works, especially compared to b-trees.


    Thursday
    Apr 16, 2009

    Paper: The End of an Architectural Era (It’s Time for a Complete Rewrite)

    Update 3: A Comparison of Approaches to Large-Scale Data Analysis: MapReduce vs. DBMS Benchmarks. Although the process to load data into and tune the execution of parallel DBMSs took much longer than the MR system, the observed performance of these DBMSs was strikingly better.

    Update 2: H-Store: A Next Generation OLTP DBMS is the project implementing the ideas in this paper: the goal of the H-Store project is to investigate how these architectural and application shifts affect the performance of OLTP databases, and to study what performance benefits would be possible with a complete redesign of OLTP systems in light of these trends. Our early results show that a simple prototype built from scratch using modern assumptions can outperform current commercial DBMS offerings by around a factor of 80 on OLTP workloads.

    Update: interesting related thread on Lambda the Ultimate.

    A really fascinating paper bolstering many of the anti-RDBMS threads that have popped up on the intertubes lately. The spirit of the paper is found in the following excerpt: in summary, the current RDBMSs were architected for the business data processing market in a time of different user interfaces and different hardware characteristics. Hence, they all include the following System R architectural features:

    • Disk oriented storage and indexing structures
    • Multithreading to hide latency
    • Locking-based concurrency control mechanisms
    • Log-based recovery

    Of course, there have been some extensions over the years, including support for compression, shared-disk architectures, bitmap indexes, support for user-defined data types and operators, etc. However, no system has had a complete redesign since its inception. This paper argues that the time has come for a complete rewrite.

    Of particular interest is the discussion of H-Store, which seems like a nice database for the data center. H-Store runs on a grid of computers. All objects are partitioned over the nodes of the grid. Like C-Store [SAB+05], the user can specify the level of K-safety that he wishes to have. At each site in the grid, rows of tables are placed contiguously in main memory, with conventional B-tree indexing. B-tree block size is tuned to the width of an L2 cache line on the machine being used. Although conventional B-trees can be beaten by cache conscious variations [RR99, RR00], we feel that this is an optimization to be performed only if indexing code ends up being a significant performance bottleneck. Every H-Store site is single threaded, and performs incoming SQL commands to completion, without interruption. Each site is decomposed into a number of logical sites, one for each available core. Each logical site is considered an independent physical site, with its own indexes and tuple storage. Main memory on the physical site is partitioned among the logical sites. In this way, every logical site has a dedicated CPU and is single threaded.

    The paper goes through how databases should be written with modern CPU, memory, and network resources. It's a fun and interesting read. Well worth your time.
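
    The per-core, single-threaded execution model is the most striking part, and it is easy to picture concretely. Here is a toy sketch (my own illustration, not H-Store code) of routing operations to logical sites that each own a private in-memory partition and run every command to completion, so no locking is needed inside a site:

```python
# Toy illustration of the execution model described above (not the real
# system): data is partitioned across "logical sites", each logical site
# is single threaded, owns its own in-memory storage, and runs each
# command to completion -- so no latching or locking is needed inside a site.

from queue import Queue
from threading import Thread

class LogicalSite:
    def __init__(self, site_id):
        self.site_id = site_id
        self.store = {}                  # this site's private tuple storage
        self.inbox = Queue()
        Thread(target=self._run, daemon=True).start()

    def _run(self):
        while True:
            op, key, value, reply = self.inbox.get()
            if op == "put":
                self.store[key] = value
                reply.put("ok")
            else:                        # "get"
                reply.put(self.store.get(key))

class Grid:
    def __init__(self, n_sites):
        self.sites = [LogicalSite(i) for i in range(n_sites)]

    def execute(self, op, key, value=None):
        site = self.sites[hash(key) % len(self.sites)]   # route by partition key
        reply = Queue()
        site.inbox.put((op, key, value, reply))
        return reply.get()

grid = Grid(4)
grid.execute("put", "account:42", 100)
print(grid.execute("get", "account:42"))   # -> 100
```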


    Thursday
    Mar 12, 2009

    Paper: Understanding and Designing New Server Architectures for Emerging Warehouse-Computing Environments

    Authors: Kevin Lim, Parthasarathy Ranganathan, Jichuan Chang, Chandrakant Patel, Trevor Mudge, Steven Reinhardt

    This International Symposium on Computer Architecture paper seeks to understand and design next-generation servers for emerging "warehouse-computing" environments. From the abstract: We make two key contributions. First, we put together a detailed evaluation infrastructure, including a new benchmark suite for warehouse-computing workloads and detailed performance, cost, and power models, to quantitatively characterize bottlenecks. Second, we study a new solution that incorporates volume non-server-class components in novel packaging solutions, with memory sharing and flash-based disk caching. Our results show that this approach has promise, with a 2X improvement on average in performance-per-dollar for our benchmark suite.
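
    As a back-of-the-envelope illustration of the performance-per-dollar metric the paper optimizes for (the numbers below are invented for the example, not taken from the paper):

```python
# Hypothetical numbers (not from the paper) showing how a
# performance-per-dollar comparison is computed.
baseline = {"throughput": 1000.0, "cost_usd": 8000.0}   # server-class node
proposed = {"throughput":  700.0, "cost_usd": 2800.0}   # volume parts + flash cache

ppd_base = baseline["throughput"] / baseline["cost_usd"]
ppd_new  = proposed["throughput"] / proposed["cost_usd"]
print(f"performance-per-dollar improvement: {ppd_new / ppd_base:.1f}x")   # -> 2.0x
```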


    Tuesday
    Mar 10, 2009

    Paper: Consensus Protocols: Paxos  

    Update: Barbara Liskov’s Turing Award, and Byzantine Fault Tolerance.

    Henry Robinson has created an excellent series of articles on consensus protocols. We already covered his 2 Phase Commit article and he also has a 3 Phase Commit article showing how to handle 2PC under single node failures. But that is not enough! 3PC works well under node failures, but fails under network failures. So another consensus mechanism is needed that handles both network and node failures. And that's Paxos. Paxos correctly handles both types of failures, but it does this by becoming inaccessible if too many components fail. This is the "liveness" property of protocols. Paxos waits until the faults are fixed. Read queries can be handled, but updates will be blocked until the protocol thinks it can make forward progress. The liveness of Paxos is primarily dependent on network stability. In a distributed heterogeneous environment you are at risk of losing the ability to make updates. Users hate that. So when companies like Amazon do the seemingly insane thing of creating eventually consistent databases, it should be a little easier to understand now. Partitioning is required for scalability. Partitioning brings up these nasty consensus issues. Not being able to write under partition failures is unacceptable. Therefore, create a system that can always write and then work on consistency once the downed partitions/networks are repaired.
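
    To ground the discussion, here is a deliberately simplified single-decree Paxos sketch (my own toy illustration: synchronous, in-process, and failure-free; real Paxos must also handle message loss, retries, and competing proposers). Note how a proposal simply fails when a majority of acceptors is unreachable, which is the liveness trade-off described above:

```python
# A highly simplified, synchronous sketch of single-decree Paxos
# (illustration only, not a production protocol implementation).

class Acceptor:
    def __init__(self):
        self.promised = -1          # highest proposal number promised
        self.accepted = None        # (number, value) of accepted proposal, if any

    def prepare(self, n):
        if n > self.promised:
            self.promised = n
            return ("promise", self.accepted)
        return ("nack", None)

    def accept(self, n, value):
        if n >= self.promised:
            self.promised = n
            self.accepted = (n, value)
            return "accepted"
        return "nack"

def propose(acceptors, n, value):
    # Phase 1: get promises from a majority.
    promises = [a.prepare(n) for a in acceptors]
    granted = [p for p in promises if p[0] == "promise"]
    if len(granted) <= len(acceptors) // 2:
        return None                              # no majority -- liveness lost
    # If some acceptor already accepted a value, we must propose that value.
    prior = [p[1] for p in granted if p[1] is not None]
    if prior:
        value = max(prior)[1]
    # Phase 2: ask the acceptors to accept; succeed only with a majority.
    acks = sum(1 for a in acceptors if a.accept(n, value) == "accepted")
    return value if acks > len(acceptors) // 2 else None

acceptors = [Acceptor() for _ in range(5)]
print(propose(acceptors, 1, "commit txn 17"))    # -> 'commit txn 17'
```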

    Related Articles

  • Google's Paxos Made Live – An Engineering Perspective
  • ZooKeeper - A Reliable, Scalable Distributed Coordination System
  • Impossibility of Distributed Consensus with One Faulty Process by Fischer, Lynch, and Paterson
  • Consensus, impossibility results and Paxos by Ken Birman
  • Paxos for System Builders by Jonathan Kirsch and Yair Amir


    Monday
    Feb 9, 2009

    Paper: Consensus Protocols: Two-Phase Commit  

    Henry Robinson has created an excellent series of articles on consensus protocols. Henry starts with a very useful discussion of what all this talk about consensus really means: the consensus problem is the problem of getting a set of nodes in a distributed system to agree on something - it might be a value, a course of action, or a decision. Achieving consensus allows a distributed system to act as a single entity, with every individual node aware of and in agreement with the actions of the whole of the network. In this article Henry tackles Two-Phase Commit, the protocol most databases use to arrive at a consensus for database writes. The article is very well written, with lots of pretty and informative pictures. He did a really good job. In conclusion we learn that 2PC is very efficient: a minimal number of messages are exchanged and latency is low. The problem is that when a coordinator fails, availability is dramatically reduced. This is why 2PC isn't generally used in highly distributed systems. To solve that problem we have to move on to different algorithms, and that is the subject of other articles.
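
    The protocol itself is small enough to sketch in a few lines. This is a toy illustration of the coordinator/participant exchange only (a real implementation also writes decisions to a durable log and handles timeouts and recovery, which is exactly where the availability problem above comes from):

```python
# Minimal sketch of two-phase commit from the coordinator's point of view
# (illustrative; real 2PC also logs decisions durably and handles
# coordinator/participant failures).

class Participant:
    def __init__(self, name, can_commit=True):
        self.name = name
        self.can_commit = can_commit

    def prepare(self):                    # phase 1: vote
        return "yes" if self.can_commit else "no"

    def commit(self):
        print(f"{self.name}: committed")

    def abort(self):
        print(f"{self.name}: aborted")

def two_phase_commit(participants):
    # Phase 1: coordinator collects votes.
    votes = [p.prepare() for p in participants]
    # Phase 2: commit only if every participant voted yes; otherwise abort.
    if all(v == "yes" for v in votes):
        for p in participants:
            p.commit()
        return "commit"
    for p in participants:
        p.abort()
    return "abort"

print(two_phase_commit([Participant("db1"), Participant("db2")]))          # commit
print(two_phase_commit([Participant("db1"), Participant("db2", False)]))   # abort
```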


    Tuesday
    Feb 3, 2009

    Paper: Optimistic Replication

    To scale in the large you have to partition. Data has to be spread around, replicated, and kept consistent (keeping replicas sufficiently similar to one another despite operations being submitted independently at different sites). The result is a highly available, well performing, and scalable system. Partitioning is required, but it's a pain to do efficiently and correctly. Until quantum teleportation becomes a reality, how data is kept consistent across a bewildering number of failure scenarios is a key design decision. This excellent paper by Yasushi Saito and Marc Shapiro takes us on a wild ride (OK, maybe not so wild) through different approaches to achieving consistency. What's cool about this paper is they go over some real systems that we are familiar with and cover how they work: DNS (single-master, state-transfer), Usenet (multi-master), PDAs (multi-master, state-transfer, manual or application-specific conflict resolution), Bayou (multi-master, operation-transfer, epidemic propagation, application conflict resolution), CVS (multi-master, operation-transfer, centralized, manual conflict resolution). The paper then goes on to explain in detail the different approaches to achieving consistency. Most of us will never have to write the central nervous system of an application like this, but knowing about the different approaches and tradeoffs is priceless. The abstract:

    Data replication is a key technology in distributed data sharing systems, enabling higher availability and performance. This paper surveys optimistic replication algorithms that allow replica contents to diverge in the short term, in order to support concurrent work practices and to tolerate failures in low-quality communication links. The importance of such techniques is increasing as collaboration through wide-area and mobile networks becomes popular. Optimistic replication techniques are different from traditional “pessimistic” ones. Instead of synchronous replica coordination, an optimistic algorithm propagates changes in the background, discovers conflicts after they happen and reaches agreement on the final contents incrementally. We explore the solution space for optimistic replication algorithms. This paper identifies key challenges facing optimistic replication systems — ordering operations, detecting and resolving conflicts, propagating changes efficiently, and bounding replica divergence—and provides a comprehensive survey of techniques developed for addressing these challenges.
    If you can't wait to know the ending, here's the summary of the paper:
    We summarize some of the lessons learned from our own experience and in reviewing the literature. Optimistic, asynchronous data replication is an appealing technique; it indeed improves networking flexibility and scalability. Some environments or application areas could simply not function without optimistic replication. However, optimistic replication also comes with a cost. The algorithmic complexity of ensuring eventual consistency can be high. Conflicts usually require application-specific resolution, and the lost update problem is ultimately unavoidable. Hence our recommendations:

    (1) Keep it simple. Traditional, pessimistic replication, with many off-the-shelf solutions, is perfectly adequate in small-scale, fully connected, reliable networking environments. Where pessimistic techniques are the cause of poor performance or lack of availability, or do not scale well, try single-master replication: it is simple, conflict-free, and scales well in practice. State transfer using Thomas’s write rule works well for many applications. Advanced techniques such as version vectors and operation transfer should be used only when you need flexibility and semantically rich conflict resolution.

    (2) Propagate operations quickly to avoid conflicts. While connected, propagate often and keep replicas in close synchronization. This will minimize divergence when disconnection does occur.

    (3) Exploit commutativity. Commutativity should be the default; design your system so that non-commutative operations are the uncommon case. For instance, whenever possible, partition data into small, independent objects. Within an object, use monotonic data structures such as an append-only log, a monotonically increasing counter, or a union-only set. When operations are dependent upon each other, represent the invariants explicitly.
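
    The last two recommendations are easy to make concrete. Below is a toy sketch (mine, not the paper's) of state transfer with Thomas's write rule as a last-writer-wins register, and of a commutative, union-only set whose merges can be applied in any order:

```python
# Toy sketches of two recommendations above (illustrative only): a
# last-writer-wins register using Thomas's write rule, and a commutative
# grow-only set whose merges can be applied in any order.

class LWWRegister:
    """State transfer with Thomas's write rule: highest timestamp wins."""
    def __init__(self):
        self.value, self.ts = None, 0

    def write(self, value, ts):
        if ts > self.ts:                 # discard stale updates
            self.value, self.ts = value, ts

    def merge(self, other):              # replica reconciliation
        self.write(other.value, other.ts)

class GrowOnlySet:
    """Union-only set: adds commute, so merge order doesn't matter."""
    def __init__(self):
        self.items = set()

    def add(self, item):
        self.items.add(item)

    def merge(self, other):
        self.items |= other.items

a, b = LWWRegister(), LWWRegister()
a.write("v1", ts=1); b.write("v2", ts=2)
a.merge(b); b.merge(a)
print(a.value, b.value)                  # both converge to 'v2'

s1, s2 = GrowOnlySet(), GrowOnlySet()
s1.add("x"); s2.add("y")
s1.merge(s2); s2.merge(s1)
print(s1.items == s2.items)              # True: replicas converge either way
```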

    Related Articles

  • The End of an Architectural Era (It’s Time for a Complete Rewrite)
  • Big Table
  • Google's Paxos Made Live – An Engineering Perspective
  • Dynamo: Amazon’s Highly Available Key-value Store
  • Eventually Consistent - Revisited by Werner Vogels


    Monday
    Jan 26, 2009

    Paper: Scalability by Design - Coding for Systems With Large CPU Counts

    The multi-cores are coming, and software designed for fewer cores usually doesn't work on more cores without substantial redesign. For a taste of the issues take a look at No new global mutexes! (and how to make the thread/connection pool work), which shows some of the difficulties of making MySQL perform on SMP servers. In this paper, Richard Smith, a Staff Engineer at Sun, goes into some nice detail on multi-core issues. His take home lessons are (a small sketch of the locking lesson follows the list):

  • Use fine-grained locking or lock-free strategy
  • Document the strategy, including correctness criteria (invariants)
  • Keep critical sections short
  • Profile the code at both light and heavy load
  • Collect HW performance counter data
  • Identify bottleneck resource (there's always at least one!)
  • Eliminate or ameliorate it
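
    Here is the promised sketch (my own toy example, not from the paper) contrasting a single global mutex with lock striping, which keeps critical sections short and lets threads touching different keys proceed in parallel:

```python
# Toy illustration of the "fine-grained locking" lesson: replace one
# global mutex with lock striping so threads touching different keys
# rarely contend.

import threading

class GlobalLockCounter:
    def __init__(self):
        self.lock = threading.Lock()         # every update serializes here
        self.counts = {}

    def incr(self, key):
        with self.lock:
            self.counts[key] = self.counts.get(key, 0) + 1

class StripedCounter:
    def __init__(self, stripes=64):
        self.locks = [threading.Lock() for _ in range(stripes)]
        self.counts = [dict() for _ in range(stripes)]

    def incr(self, key):
        i = hash(key) % len(self.locks)      # short critical section, per stripe
        with self.locks[i]:
            self.counts[i][key] = self.counts[i].get(key, 0) + 1

c = StripedCounter()
threads = [threading.Thread(target=lambda: [c.incr(f"k{n}") for n in range(1000)])
           for _ in range(8)]
for t in threads: t.start()
for t in threads: t.join()
print(sum(sum(d.values()) for d in c.counts))   # -> 8000
```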


    Thursday
    Jan 8, 2009

    Paper: Sharding with Oracle Database

    The upshot of the paper is that Oracle rules and MySQL sucks for sharding. Which is probably true technically, if you don't throw in minor points like cost and ease of use. The points where they think Oracle wins: online schema changes, more robust replication, higher availability, better corruption handling, better use of large RAM and multiple cores, better and better-tested partitioning features, better monitoring, and better gas mileage.


    Monday
    Jan 5, 2009

    Lessons Learned at 208K: Towards Debugging Millions of Cores

    How do we debug and profile a cloud full of processors and threads? It's a problem more of us will be seeing as we code big scary programs that run on even bigger, scarier clouds. Logging gets you far, but sometimes finding the root cause of a problem requires delving deep into a program's execution. I don't know about you, but setting up 200,000+ gdb instances doesn't sound all that appealing. Tools like STAT (the Stack Trace Analysis Tool) are being developed to help with this huge task. STAT "gathers and merges stack traces from a parallel application’s processes." So STAT isn't a low level debugger, but it will help you find the needle in a million haystacks. Abstract:

    Petascale systems will present several new challenges to performance and correctness tools. Such machines may contain millions of cores, requiring that tools use scalable data structures and analysis algorithms to collect and to process application data. In addition, at such scales, each tool itself will become a large parallel application – already, debugging the full BlueGene/L (BG/L) installation at the Lawrence Livermore National Laboratory requires employing 1664 tool daemons. To reach such sizes and beyond, tools must use a scalable communication infrastructure and manage their own tool processes efficiently. Some system resources, such as the file system, may also become tool bottlenecks. In this paper, we present challenges to petascale tool development, using the Stack Trace Analysis Tool (STAT) as a case study. STAT is a lightweight tool that gathers and merges stack traces from a parallel application to identify process equivalence classes. We use results gathered at thousands of tasks on an Infiniband cluster and results up to 208K processes on BG/L to identify current scalability issues as well as challenges that will be faced at the petascale. We then present implemented solutions to these challenges and show the resulting performance improvements. We also discuss future plans to meet the debugging demands of petascale machines.
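
    The core merge step is intuitive: collect one stack trace per process and group identical traces into equivalence classes, so one report line can stand in for thousands of tasks and the lone straggler stands out. Here is a toy sketch of that grouping (not STAT itself, which merges traces into a prefix tree and gathers them over the scalable communication infrastructure the abstract describes):

```python
# Toy sketch of the stack-trace merging idea behind STAT (not the real
# tool): traces from many processes are grouped so that processes with
# identical traces collapse into one equivalence class.

from collections import defaultdict

def merge_traces(traces):
    """traces: {rank: [frame0, frame1, ...]} -> {tuple(trace): set(ranks)}"""
    classes = defaultdict(set)
    for rank, frames in traces.items():
        classes[tuple(frames)].add(rank)
    return classes

traces = {
    0: ["main", "mpi_barrier"],
    1: ["main", "mpi_barrier"],
    2: ["main", "compute", "solve"],       # the straggler stands out
    3: ["main", "mpi_barrier"],
}

for frames, ranks in merge_traces(traces).items():
    print(f"{len(ranks)} task(s) {sorted(ranks)} at: {' -> '.join(frames)}")
# 3 task(s) [0, 1, 3] at: main -> mpi_barrier
# 1 task(s) [2] at: main -> compute -> solve
```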

    Lessons Learned

    At the end of the paper they identify several insights they had about developing petascale tools:
  • We find that sequential daemon launching becomes a bottleneck at this scale. We improve both scalability and portability by eschewing ad hoc sequential launchers in favor of LaunchMON, a portable daemon spawner that integrates closely with native resource managers.
  • As daemons run, we find that it is critical that they avoid data structures that represent, or even reserve space to represent, a global view. Instead, we adopt a hierarchical representation that dramatically reduces data storage and transfer requirements at the fringes of the analysis tree.
  • We find that seemingly-independent operations across daemons can suffer scalability bottlenecks when accessing a shared resource, such as the file system. Our scalable binary relocation service is able to optimize the file operations and reduce file system accesses to constant time regardless of system size.

    Unsurprisingly, these lessons aren't much different from the ones other builders of scalable programs have had to learn.

    Related Articles

  • Livermore Lab pioneers debugging tool by Joab Jackson in Government Computer News.


    Sunday
    Jan 4, 2009

    Paper: MapReduce: Simplified Data Processing on Large Clusters

    Update: MapReduce and PageRank Notes from Remzi Arpaci-Dusseau's Fall 2008 class. Collects interesting facts about MapReduce and PageRank. For example, the history of the solution to searching for the term "flu" is traced through multiple generations of technology.

    With Google entering the cloud space with Google AppEngine and a maturing Hadoop product, the MapReduce scaling approach might finally become a standard programmer practice. This is the best paper on the subject and is an excellent primer on a content-addressable memory future. Some interesting stats from the paper: Google executes 100k MapReduce jobs each day; more than 20 petabytes of data are processed per day; more than 10k MapReduce programs have been implemented; machines are dual processor with gigabit ethernet and 4-8 GB of memory. One common criticism ex-Googlers have is that it takes months to get up to speed and be productive in the Google environment. Hopefully a way will be found to lower the learning curve and make programmers more productive faster.

    From the abstract: MapReduce is a programming model and an associated implementation for processing and generating large datasets that is amenable to a broad variety of real-world tasks. Users specify the computation in terms of a map and a reduce function, and the underlying runtime system automatically parallelizes the computation across large-scale clusters of machines, handles machine failures, and schedules inter-machine communication to make efficient use of the network and disks. Programmers find the system easy to use: more than ten thousand distinct MapReduce programs have been implemented internally at Google over the past four years, and an average of one hundred thousand MapReduce jobs are executed on Google’s clusters every day, processing a total of more than twenty petabytes of data per day.

    Thanks to Kevin Burton for linking to the complete article.
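
    The programming model itself fits in a few lines. Here is a minimal local sketch of the map/reduce contract (word count), with none of the real system's distribution, fault tolerance, or network shuffle:

```python
# Minimal local sketch of the MapReduce programming model (word count).
# The real system distributes map tasks, shuffles intermediate pairs by
# key across the network, and handles machine failures; none of that is
# shown here.

from collections import defaultdict

def map_fn(doc):                       # emit (word, 1) for every word
    for word in doc.split():
        yield word.lower(), 1

def reduce_fn(word, counts):           # sum the counts for one word
    return word, sum(counts)

def mapreduce(docs):
    groups = defaultdict(list)         # the "shuffle" phase, in-process
    for doc in docs:
        for key, value in map_fn(doc):
            groups[key].append(value)
    return dict(reduce_fn(k, v) for k, v in groups.items())

docs = ["the quick brown fox", "the lazy dog", "the fox"]
print(mapreduce(docs))   # {'the': 3, 'quick': 1, 'brown': 1, 'fox': 2, 'lazy': 1, 'dog': 1}
```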

    Related Articles

  • MapReducing 20 petabytes per day by Greg Linden
  • 2004 Version of the Article by Jeffrey Dean and Sanjay Ghemawat
