« MySQL and Memcached: End of an Era? | Main | Hot Scalability Links for February 24, 2010 »
Thursday
Feb252010

Paper: High Performance Scalable Data Stores 

The world of scalable databases is not a simple one. They come in every race, creed, and color. Rick Cattell has brought some harmony to that world by publishing High Performance Scalable Data Stores, a nicely detailed one stop shop paper comparing scalable databases soley on the content of their character. Ironically, the first step in that evaluation is dividing the world into four groups:

  • Key-value stores: Redis, Scalaris, Voldmort, and Riak.
  • Document stores: Couch DB, MongoDB, and SimpleDB.
  • Record stores: BigTable, HBase, HyperTable, and Cassandra.
  • Scalable RDBMSs: MySQL Cluster, ScaleDB, Drizzle, and VoltDB.

The paper describes each system and then compares them on the dimensions of Concurrency Control, Data Storage Replication, Transaction Model, General Comments, Maturity, K-hits, License Language.

And the winner is: there are no winners. Yet. Rick concludes by pointing to a great convergence:

I believe that a few of these systems will gain critical mass and key players, and will pull away from the others by next year.  At that point, open source contributors will likely migrate to those players.

From the paper:

In recent years, a dozen new storage systems, sometimes called “NoSQL” systems, have been introduced to provide indexed data storage that is much higher performance than existing relational database products like MySQL, Oracle, DB2, and SQL Server.  These data storage systems have a number of features in common:
• a call level interface (in contrast to a SQL binding)
• fast indexes on large amounts of data,
• ability to horizontally scale throughput over many servers, and
• ability to dynamically define attributes or data schema.


They differ in other ways, and in this paper I contrast those differences.  They range in functionality from the simplest distributed hashing, as supported by memcached, to highly scalable partitioned tables, as supported by Google’s BigTable.  Those two systems in particular provided a “proof of concept” that I believe has led to much of the popularity of these alternative data stores:
• Memcached demonstrated that in-memory indexes can be scalable, distributing
and replicating objects over multiple nodes.
• BigTable demonstrated that persistent record storage could be scaled to thousands
of nodes, a feat that all of the other systems aspire to.


Good performance on a single multicore node is important, but I believe the key feature of these and the various NoSQL systems is “shared nothing” horizontal scaling (over many CPUs) for applications that require a large number of simple read/write operations per second, such as Web 2.0 applications.  This load is traditionally called OLTP (online transaction processing).  To date, RDBMSs have not provided good horizontal scaling for OLTP, but in this paper we will look at RDBMSs making progress in that direction.  (Data warehousing RDBMSs provide horizontal scaling of complex joins and queries when the database is read-only or read-mostly, but that is not the focue of these new
systems.)

Related Articles

Reader Comments (6)

I also recommend this paper for in-depth overview of NoSQL: "No Relation: The Mixed Blessings of Non-Relational Databases" http://ianvarley.com/UT/MR/Varley_MastersReport_Full_2009-08-07.pdf

February 25, 2010 | Unregistered Commenterhelwr

Just skimming the section on Redis shows the author doesn't have all the facts, so makes me wonder about the rest of the paper.

February 25, 2010 | Unregistered Commenterprogi

Any idea why he didn't analyze any of the enterprise players like GemFire, Coherence, XAP or SQL Fabric?

February 25, 2010 | Registered CommenterPierce Lamb

It is worth mentioning graph database (e.g. Neo4j)

February 26, 2010 | Unregistered CommenterKrzysztof Tomczyk

Also, how about scimoreDB? Its distributed db for OLTP.

February 26, 2010 | Unregistered CommenterMarius Slyzius

That paper is a waste of time in terms of getting a better overview than already exists on the web. Although it's written in an academic style, a better reference for "state of the union" today is the link at Rackspace (also linked from this in Feb 2010) - http://www.rackspacecloud.com/blog/2009/11/09/nosql-ecosystem/.

The paper doesn't do equivalent depth and comparison research between the various systems, providing a highly biased view towards what the author has looked at with really only an obligatory nod to alternate technologies. In the end, no conclusion other than "the space will consolidate", but without any logic to drive that conclusion or hint at what will be the deciding factors in consolidation.

February 28, 2010 | Unregistered CommenterJoe

PostPost a New Comment

Enter your information below to add a new comment.
Author Email (optional):
Author URL (optional):
Post:
 
Some HTML allowed: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <code> <em> <i> <strike> <strong>