Thursday, April 16, 2009

Paper: The End of an Architectural Era (It’s Time for a Complete Rewrite)

Update 3: A Comparison of Approaches to Large-Scale Data Analysis: MapReduce vs. DBMS Benchmarks. Although the process to load data into and tune the execution of parallel DBMSs took much longer than the MR system, the observed performance of these DBMSs was strikingly better.
Update 2: H-Store: A Next Generation OLTP DBMS is the project implementing the ideas in this paper: The goal of the H-Store project is to investigate how these architectural and application shifts affect the performance of OLTP databases, and to study what performance benefits would be possible with a complete redesign of OLTP systems in light of these trends. Our early results show that a simple prototype built from scratch using modern assumptions can outperform current commercial DBMS offerings by around a factor of 80 on OLTP workloads.
Update: interesting related thread on Lambda the Ultimate.

A really fascinating paper bolstering many of the anti-RDBMS threads that have popped up on the intertubes lately. The spirit of the paper is found in the following excerpt:

In summary, the current RDBMSs were architected for the business data processing market in a time of different user interfaces and different hardware characteristics. Hence, they all include the following System R architectural features:
* Disk oriented storage and indexing structures
* Multithreading to hide latency
* Locking-based concurrency control mechanisms
* Log-based recovery



Of course, there have been some extensions over the years, including support for compression, shared-disk architectures, bitmap indexes, support for user-defined data types and operators, etc. However, no system has had a complete redesign since its inception. This paper argues that the time has come for a complete rewrite.
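
To make the legacy design concrete, here is a minimal sketch, in Python, of two of the System R-era features listed in the excerpt: locking-based concurrency control and log-based (write-ahead) recovery. It's my own illustration, not code from the paper, and names like ToyEngine are invented for the example.

import threading
from collections import defaultdict

class ToyEngine:
    """Toy single-node store: per-row locks plus an append-only redo log."""

    def __init__(self):
        self.rows = {}                            # row_id -> value (the "table")
        self.locks = defaultdict(threading.Lock)  # locking-based concurrency control
        self.log = []                             # write-ahead log (kept in memory here)

    def update(self, txn_id, row_id, value):
        with self.locks[row_id]:                  # take the row lock before touching data
            # log first, then apply: the write-ahead rule behind log-based recovery
            self.log.append(("UPDATE", txn_id, row_id, self.rows.get(row_id), value))
            self.rows[row_id] = value

    def recover(self):
        """Rebuild state by replaying the redo log, as after a crash."""
        rebuilt = {}
        for _op, _txn, row_id, _old, new in self.log:
            rebuilt[row_id] = new
        self.rows = rebuilt

engine = ToyEngine()
engine.update("t1", "acct:1", 100)
engine.update("t2", "acct:1", 250)
engine.recover()
print(engine.rows)  # {'acct:1': 250}

The paper's argument, echoed in the comments below, is that once the working set fits in main memory this machinery is mostly overhead.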



Of particular interest is the discussion of H-Store, which seems like a nice database for the data center.
H-Store runs on a grid of computers. All objects are partitioned over the nodes of the grid. Like C-Store [SAB+05], the user can specify the level of K-safety that he wishes to have.

At each site in the grid, rows of tables are placed contiguously in main memory, with conventional B-tree indexing. B-tree block size is tuned to the width of an L2 cache line on the machine being used. Although conventional B-trees can be beaten by cache conscious variations [RR99, RR00], we feel that this is an optimization to be performed only if indexing code ends up being a significant performance bottleneck.

Every H-Store site is single threaded, and performs incoming SQL commands to completion, without interruption. Each site is decomposed into a number of logical sites, one for each available core. Each logical site is considered an independent physical site, with its own indexes and tuple storage. Main memory on the physical site is partitioned among the logical sites. In this way, every logical site has a dedicated CPU and is single threaded.
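
As a rough illustration of the execution model described above (data partitioned across logical sites, each site single threaded, commands run to completion), here is a small Python sketch. It's my own approximation under those assumptions, not H-Store code; names like LogicalSite and Grid are invented, and real H-Store executes SQL stored procedures rather than simple get/put commands.

import queue
import threading

class LogicalSite(threading.Thread):
    """One logical site: private in-memory storage, one thread, no locks needed."""

    def __init__(self, site_id):
        super().__init__(daemon=True)
        self.site_id = site_id
        self.storage = {}           # this site's partition of the data
        self.inbox = queue.Queue()  # serialized command stream, run to completion

    def run(self):
        while True:
            op, key, value, reply = self.inbox.get()
            if op == "PUT":
                self.storage[key] = value
                reply.put(True)
            elif op == "GET":
                reply.put(self.storage.get(key))

class Grid:
    """Routes each command to the logical site that owns the key's partition."""

    def __init__(self, num_sites):
        self.sites = [LogicalSite(i) for i in range(num_sites)]
        for site in self.sites:
            site.start()

    def _site_for(self, key):
        return self.sites[hash(key) % len(self.sites)]

    def put(self, key, value):
        reply = queue.Queue()
        self._site_for(key).inbox.put(("PUT", key, value, reply))
        return reply.get()

    def get(self, key):
        reply = queue.Queue()
        self._site_for(key).inbox.put(("GET", key, None, reply))
        return reply.get()

grid = Grid(num_sites=4)    # e.g., one logical site per core
grid.put("acct:42", 1000)
print(grid.get("acct:42"))  # 1000

Because each logical site owns its partition and executes one command at a time, single-partition transactions need no locks or latches, which is the property the single-threaded design is meant to exploit.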


The paper goes through how databases should be written given modern CPU, memory, and network resources. It's a fun and interesting read. Well worth your time.


Reader Comments (10)

Great paper indeed. Thanks for the link!

December 31, 1999 | Unregistered Commenter mpermar

Fascinating paper, and not particularly daunting for the inexperienced user. I'm not a db admin or db expert by any means, but I was able to follow most of the points and understand the general argument being made.

Callum (http://www.callum-macdonald.com/)

December 31, 1999 | Unregistered Commenter chmac

This is the paper behind a new product called Vertica (http://www.vertica.com). The author of the paper, Michael Stonebraker, is a database heavyweight responsible for Ingres and lots of other database technology. Vertica is optimized for data warehousing. Another Stonebraker project, StreamBase (http://streambase.com/), is for things like stock prices that are constantly flooding systems that need super-low latency.

December 31, 1999 | Unregistered Commenter Jay Jakosky

For those who don't know him: Michael Stonebraker already proclaimed the end of relational databases in the '90s, to be supplanted, of course, by products from one of his companies. Go figure!

December 31, 1999 | Unregistered Commenter Anonymous

Stonebraker wrote this:

http://www.databasecolumn.com/2007/09/one-size-fits-all.html

Not sure if Vertica is the final answer, but I agree generally with his comments.

December 31, 1999 | Unregistered Commenter gerryg

Hmm, this group of people published another paper at SIGMOD 2008, titled "OLTP Through the Looking Glass, and What We Found There".

I'm pretty critical about their agenda, though the arguments certainly fly both ways.

However, I found that in their SIGMOD 2008 paper they pretty much strip away all the goodness that an RDBMS gives you, and then declare the result a decisive improvement in performance. Well, hands up: who didn't know that if you remove locking, buffer management, logging, etc. from a DBMS you would get huge performance gains?

December 31, 1999 | Unregistered Commenter Mathew

These papers are interesting, but they don't point the way to greater "scalability."

Their obsession is with getting the maximum throughput possible with a single CPU. Other processors could be used as replication servers to increase availability, but not to increase the throughput at which transactions can be run.

What they're talking about is a limited sort of product: kind of like the next generation of SQLite, Microsoft Access, or SQL Server Compact Edition. Something that gets awesome performance on a single-processor machine, but isn't going to scale to the heights possible with a conventional RDBMS if you were to throw either a big SMP or a shared-nothing cluster at the problem.

Some of their ideas might point to something more scalable, but they see "single threaded execution" as a fundamental optimization.

There is something appealing about the "all transactions run in stored procedures" model that they use, but I've found that the ability to do ad-hoc and long-running queries is important in both business systems and web publishing systems. They suggest that their kind of system could be integrated into a "data warehouse," but that's not a problem they've solved. Also, there's a big difference between a "data warehouse" system (which usually involves a lot of cleaning and integration of various sources) and the ability to run OLAP queries to see the state of a system in real time.

December 31, 1999 | Unregistered Commenter Paul Houle

Well, it seems that you, the writer, have rewritten this article yourself, as the date indicates!

December 31, 1999 | Unregistered Commenter farhaj

The link isn't working (404 error).

December 31, 1999 | Unregistered Commenter Anonymous

The reference source link has changed: http://cs-www.cs.yale.edu/homes/dna/papers/vldb07hstore.pdf

February 17, 2010 | Unregistered Commenter number3
