Thursday, August 4, 2011

Jim Starkey is Creating a Brave New World by Rethinking Databases for the Cloud

Jim Starkey, founder of NuoDB, in this thread on the Cloud Computing group, delivers a masterful post on why he thinks the relational model is the best overall compromise amongst the different options, why NewSQL can free itself from the limitations of legacy SQL architectures, and how this creates a brave new lock-free world...

I'll [Jim Starkey] go into more detail later in the post for those who care, but the executive summary goes like this: Network latency is relatively high and human attention span is relatively low. So human-facing computer systems have to perform their work in a small number of trips between the client and the database server. But the human condition leads inexorably to data complexity. There are really only two strategies to manage this problem. One is to use coarse granularity storage, glomming together related data into a single blob and letting intelligence on the client make sense of it. The other is storing fine granularity data on the server and using intelligence on the server to aggregate data to be returned to the client.

NoSQL uses the former for a variety of reasons. One is that the parallel nature of scalable implementations makes maintaining consistent relationships among data elements problematic at best. Another is that clients are easy to run in parallel on multiple nodes, so moving computation from servers to clients makes sense. The downside is that there is only one efficient way to view data. A shopping cart application, for example, can store everything about a single user session in a single blob that is easy to store and fetch and contains just about everything necessary to take an order to completion. But it also makes it infeasible to do reporting without moving vast quantities of unnecessary data.
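
A rough sketch of the two strategies, using Python's standard-library sqlite3 and json modules; the table names, session key, and items are invented for illustration:

```python
# Sketch contrasting coarse- and fine-grained storage (hypothetical schema).
import json
import sqlite3

db = sqlite3.connect(":memory:")

# Coarse granularity: one blob per session. One fetch returns everything,
# but any report means shipping and parsing every blob on the client.
db.execute("CREATE TABLE carts_blob (session_id TEXT PRIMARY KEY, blob TEXT)")
cart = {"user": "alice", "items": [{"sku": "A1", "qty": 2, "price": 9.99}]}
db.execute("INSERT INTO carts_blob VALUES (?, ?)", ("s-42", json.dumps(cart)))

# Fine granularity: one row per item, so the server can aggregate in place.
db.execute("CREATE TABLE cart_items (session_id TEXT, sku TEXT, qty INTEGER, price REAL)")
db.execute("INSERT INTO cart_items VALUES ('s-42', 'A1', 2, 9.99)")

# Single round trip for a whole cart: the coarse model's sweet spot.
blob = db.execute("SELECT blob FROM carts_blob WHERE session_id = 's-42'").fetchone()[0]
print(json.loads(blob)["items"])

# Server-side report over fine-grained rows; with blobs this would require
# moving every cart to the client just to count units sold.
print(db.execute("SELECT sku, SUM(qty) FROM cart_items GROUP BY sku").fetchall())  # [('A1', 2)]
```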

SQL databases support ACID transactions that make consistent fine granularity data possible, but they also require server-side intelligence to manage the aggregation. The SQL language, ugly as it may be, performs the role with sufficient adequacy (we could do better, but not significantly better). Stored procedures using any sort of data manipulation language also work. But distributed intelligence, declarative or procedural, pretty much requires regularity of data. Hence schemas. They don't have to be rigid or constraining, but whatever intelligence is going to operate on bulk data needs a clue of what it looks like and what the relationships are.
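
To illustrate "intelligence on the server" operating over schema-described data, here is a hedged sketch in which sqlite3's create_function stands in for a real stored-procedure facility; the orders table and discount rule are made up for the example:

```python
# Sketch: server-resident logic over regular, schema-described rows, so only
# the aggregated answer crosses the network (toy schema, not a real system).
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, total REAL)")
db.executemany("INSERT INTO orders (customer, total) VALUES (?, ?)",
               [("alice", 120.0), ("bob", 35.5), ("alice", 9.99)])

# A hypothetical business rule registered next to the data; it only works
# because the schema tells us there is a numeric 'total' column to apply it to.
db.create_function("discounted", 1, lambda total: total * 0.9 if total > 100 else total)

rows = db.execute(
    "SELECT customer, SUM(discounted(total)) FROM orders GROUP BY customer"
).fetchall()
print(rows)  # one aggregated row per customer comes back, not the raw rows
```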

So that's the short version. The long version requires a historical understanding of how we got from abstract to structured views of data.

One of the earliest high performance database systems (they weren't called that yet) was a mainframe product, circa 1969, whose name eventually settled on Model 204 (personal note: I did the port from DOS/360 to OS/360 MVT). Model 204 had records that were arbitrary collections of attribute/value pairs. The spooks loved it -- dump everything you know into a cauldron and throw queries at it. But it couldn't handle relationships among records without a lot of procedural code.

There were two data models that attempted to capture data structure (relationships). One was the hierarchical model, IBM's IMS and the ARPAnet Datacomputer (personal note here, too). The other was the CODASYL (aka network) data model, where individual records "owned" chains of other records. IDMS, the premier CODASYL product, roamed the earth in the late Jurassic period until an alternative emerged from the primordial swamps (sigh, I was on the team that ported IDBMS to the PDP-11, often likened to an elephant on a bicycle). Both the hierarchical and network models required data manipulation languages that an orc's mother couldn't love.

Codd's relational model, once stripped of the unnecessary mathematical trappings, was a very happy compromise between simple regular tables and ad hoc joins for inter-table relationships.  And it made for straightforward transactional semantics as well.  It was a hit when the first commercial implementation came out and has persevered (I put out DEC's first relational database, Rdb/ELN, in 1984).
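
A minimal sketch of that compromise: plain regular tables, with the inter-table relationship expressed by an ad hoc join rather than owner/member chains (toy tables, not any particular product):

```python
# Sketch: relationships live in the data and are resolved by a declarative
# join, with no navigation code or pointer chains (hypothetical schema).
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
db.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL)")
db.executemany("INSERT INTO customers VALUES (?, ?)", [(1, "alice"), (2, "bob")])
db.executemany("INSERT INTO orders VALUES (?, ?, ?)", [(10, 1, 120.0), (11, 2, 35.5)])

print(db.execute("""
    SELECT c.name, o.total
      FROM customers c
      JOIN orders o ON o.customer_id = c.id
     ORDER BY c.name
""").fetchall())  # [('alice', 120.0), ('bob', 35.5)]
```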

The OO data model (encapsulated objects and pointers) pops up and dies with great regularity, as does its close relative, object/relational mapping. Neither is particularly suited to the distributed world, but the refugees can generally find work in other companies.

Over the years I've had professional affiliations with the amorphous data model (Model 204), hierarchical (Datacomputer), CODASYL (DBMS-11), and relational (Rdb/ELN, Interbase, Firebird, MySQL, and NimbusDB). And my friend Tom Atwood started at least half of the OO database companies. So, if I can't claim objectivity, I can at least claim in-depth personal experience.

Everything is a compromise. And I deeply believe that the relational model is the best compromise among simplicity, power, performance, and flexibility of implementation. It does require data description, but so does every other useful database management system, regardless of what that description is called.

The scale problems with legacy relational databases have nothing to do with SQL, the relational model, or ACID transactions. The essence of the problem is theoretical -- the conflation of consistency and serializability. Serializability is a sufficient condition for database consistency but not a necessary condition. The NewSQL generation of products recognizes this and refuses to be limited by lock managers of any ilk.

It's a brave new world out there, ladies and gentlemen.

If this seems like magic, take a look at How NimbusDB Works for more details. Are Cloud Based Memory Architectures The Next Big Thing? has a lot of clarifying quotes from Jim Starkey as well.


Reader Comments (3)

Great post! I agree with most of your points. However, here are my questions.

I am not sure about your justification of "refuse to be limited by lock managers of any ilk." Do you imply that it is not possible to implement a scalable "new SQL" DB using locks? If I remember correctly, there are efficient distributed lock managers in the literature and in implementations. Also, if your point of "refuse to be limited" is, for example, by using single-threaded execution (e.g. VoltDB), isn't that the same as using a single "global" lock (masqueraded as a "single thread of execution") on the data item?

Another question: I guess I can understand your point on "Serializability is a sufficient condition for database consistency but not a necessary condition." But my gut feeling is that any consistency condition that can be expressed in a general and declarative way will probably require serializability. As you can see, there are a lot of "speculative" words in this question; this is just my gut feeling. So, if you can enlighten me with some more "non-speculative" :-) arguments or counterexamples, I would appreciate it.

Tipin

August 4, 2011 | Unregistered CommenterTipin

You say that ACID is not part of the problem, but that lock managers are.

But aren't locks required for any implementation of ACID?

August 11, 2011 | Unregistered CommenterBrandon K

The issue isn't the lock manager, per se, but two-phase locking, where locks on each record are held for the duration of the transaction. Two-phase locking ensures that transactions can be serialized and requires a lock manager to implement.

There are many problems with two-phase locking. Readers block writers, for example, and writers block readers, rendering ad hoc access to on-line applications untenable. Another major issue is latency. A distributed lock manager generally requires a network round trip per lock seized. Even on fast networks this latency adds up to the point of unacceptability.
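
To make the problem concrete, here is a toy sketch of strict two-phase locking, not any product's implementation: locks are acquired as records are touched and released only at commit, so every record a transaction has read stays blocked against writers until it finishes (and in a distributed setting each acquire would cost a network round trip):

```python
# Toy strict two-phase locking sketch (exclusive locks only, for brevity;
# real 2PL distinguishes shared and exclusive locks, but the blocking
# behavior until commit is the same).
import threading

class LockManager:
    def __init__(self):
        self._locks = {}                 # record id -> lock
        self._guard = threading.Lock()

    def lock_for(self, rec_id):
        with self._guard:
            return self._locks.setdefault(rec_id, threading.RLock())

class Transaction:
    def __init__(self, mgr, store):
        self.mgr, self.store, self.held = mgr, store, []

    def read(self, rec_id):
        lock = self.mgr.lock_for(rec_id)
        lock.acquire()                   # growing phase: take the lock, never release early
        self.held.append(lock)
        return self.store.get(rec_id)

    def write(self, rec_id, value):
        lock = self.mgr.lock_for(rec_id)
        lock.acquire()
        self.held.append(lock)
        self.store[rec_id] = value

    def commit(self):
        for lock in self.held:           # shrinking phase: release everything at commit
            lock.release()
        self.held.clear()

# Usage: a reader that has touched record 1 keeps it locked until commit,
# so a writer running in another thread would block on that record.
store = {1: "v0"}
mgr = LockManager()
reader = Transaction(mgr, store)
print(reader.read(1))                    # acquires and holds the lock on record 1
reader.commit()                          # only now can a concurrent writer proceed
```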

The common denominator of "new SQL" databases is avoidance of two-phase locking for concurrency control. NuoDB (formerly NimbusDB) and many others use multi-version concurrency control (MVCC) which, if done properly, is consistent without requiring serializability. VoltDB uses a formal transaction scheduler that will not start a transaction unless it can complete without contention. Xeround uses a dynamic partitioning scheme with optimistic locking (at least as far as I understand it). None uses a distributed lock manager with a network round trip per record.
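
For contrast with the sketch above, here is a minimal multi-version sketch, not NuoDB's actual engine, with commit visibility and write-conflict detection deliberately elided: each write appends a version stamped with its transaction id, and a reader sees only versions created before it began, so readers never wait on writers:

```python
# Minimal MVCC sketch: versioned values plus snapshot reads (illustrative only).
import itertools

class MVCCStore:
    def __init__(self):
        self._versions = {}                 # key -> list of (writer_txn_id, value)
        self._txn_ids = itertools.count(1)

    def begin(self):
        return next(self._txn_ids)          # the id doubles as the snapshot point

    def write(self, txn_id, key, value):
        # A real engine would expose the version only after commit and detect
        # write-write conflicts; both are omitted here.
        self._versions.setdefault(key, []).append((txn_id, value))

    def read(self, txn_id, key):
        versions = self._versions.get(key, [])
        visible = [v for tid, v in versions if tid < txn_id]
        return visible[-1] if visible else None

store = MVCCStore()
t1 = store.begin()
store.write(t1, "x", "old")
t2 = store.begin()                          # t2's snapshot includes t1's write
t3 = store.begin()
store.write(t3, "x", "new")                 # a concurrent writer...
print(store.read(t2, "x"))                  # ...neither blocks nor disturbs t2: prints 'old'
```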

I'm sorry about Tipin's intuition, but while serializability is a sufficient condition for consistency, it isn't a necessary condition. It is necessary to synchronize declared constraints -- primary key uniqueness, referential integrity, and concurrent record update. But while each of these may require a network round trip, the cost is per record updated, not per record accessed. Big difference.

August 16, 2011 | Unregistered CommenterJim Starkey
