Is NoSQL a Premature Optimization that's Worse than Death? Or the Lady Gaga of the Database World?

Michael Stonebraker sure knows how to stir up a storm. Unlike with others, that doesn't make him a troll in my mind; he's way too accomplished in the field to be that. But he does have a bit of Barnum & Bailey in him, which serves to get the discussion flowing, and that's a good thing. A lot of previously hidden wisdom and passion gets unlocked, which we'll try to capture here.

This disturbance in the force is over OldSQL vs NoSQL vs NewSQL. Warning: these are not crisp categories. There's leakage all over the place, so watch your step:

  • OldSQL (Oracle, MySQL, etc.) refers to what some want to call legacy relational databases, which don't scale out horizontally with aplomb.
  • NoSQL (CouchDB, Redis, Cassandra, HBase, MongoDB, Riak, Neo4j, etc.) refers to, well, a collection of technologies that aren't OldSQL. These are often designed to scale out horizontally, aren't on ACID, and use schemaless non-relational data models.
  • NewSQL (Xeround, Clustrix, NimbusDB, GenieDB, ScaleBase, VoltDB) refers to databases that preserve SQL, the relational model, ACID, and schemas, and that are scalable, though not necessarily horizontally (which I don't quite understand). Sharding should be transparent. The general pitch is that once you have ACIDy SQL goodness and elasticity, all on commodity hardware, there's no reason to use NoSQL.

OK, got it? Then you might be the only one...

The disturbance first started with this article by Derrick Harris, which gets a lot of mileage out of a few quotes by Stonebraker. The short of it is: 

35+ Use Cases for Choosing Your Next NoSQL Database

We've asked What The Heck Are You Actually Using NoSQL For?. We've asked 101 Questions To Ask When Considering A NoSQL Database. We've even had a webinar What Should I Do? Choosing SQL, NoSQL or Both for Scalable Web Applications.

Now we get to the point of considering use cases and which systems might be appropriate for those use cases.

What are your options?

101 Questions to Ask When Considering a NoSQL Database

You need answers, I know, but all I have here are some questions to consider when thinking about which database to use. These are taken from my webinar What Should I Do? Choosing SQL, NoSQL or Both for Scalable Web Applications. It's a companion article to What The Heck Are You Actually Using NoSQL For?

Actually, I don't even know if there are 101 questions, but there are a lot, maybe way too many. You might want to use these questions as a kind of NoSQL I Ching, guiding your way through the immense possibility space of options in front of you. Nothing is fated, all is interpreted, but it might just trigger a new insight or two along the way.

Where are you starting from?

The NewSQL Market Breakdown

Matt Aslett from the 451 Group coined the term “NewSQL”. On the definition of NewSQL, Aslett writes:

“NewSQL” is our shorthand for the various new scalable/high performance SQL database vendors. We have previously referred to these products as ‘ScalableSQL’ to differentiate them from the incumbent relational database products. Since this implies horizontal scalability, which is not necessarily a feature of all the products, we adopted the term ‘NewSQL’ in the new report.

And to clarify, like NoSQL, NewSQL is not to be taken too literally: the new thing about the NewSQL vendors is the vendor, not the SQL.

As with NoSQL, under the NewSQL umbrella you can see various providers, with various solutions.

I think these can be divided into several sub-types:

Paper: NoSQL Databases - NoSQL Introduction and Overview

Christof Strauch, from Stuttgart Media University, has written an incredible 120+ page paper titled NoSQL Databases as an introduction and overview to NoSQL databases. The paper was written between 2010-06 and 2011-02, so it may be a bit out of date, but if you are looking to take in the NoSQL world in one big gulp, this is your chance. I asked Christof to give us a short taste of what he was trying to accomplish in his paper:

Paper: A Co-Relational Model of Data for Large Shared Data Banks

Let's play a quick game of truth or sacrilege: are SQL and NoSQL really just two sides of the same coin? That's what Erik Meijer and Gavin Bierman would have us believe in their "we can all get along and make a lot of money" article in the Communications of the ACM, A Co-Relational Model of Data for Large Shared Data Banks. You don't believe it? It's math, so it must be true :-) Some key points:

In this article we present a mathematical data model for the most common noSQL databases—namely, key/value relationships—and demonstrate that this data model is the mathematical dual of SQL's relational data model of foreign-/primary-key relationships

...we believe that our categorical data-model formalization and monadic query language will allow the same economic growth to occur for coSQL key-value stores.

...In contrast to common belief, the question of big versus small data is orthogonal to the question of SQL versus coSQL. While the coSQL model naturally supports extreme sharding, the fact that it does not require strong typing and normalization makes it attractive for "small" data as well. On the other hand, it is possible to scale SQL databases by careful partitioning.
What this all means is that coSQL and SQL are not in conflict, like good and evil. Instead they are two opposites that coexist in harmony and can transmute into each other like yin and yang. Because of the common query language based on monads, both can be implemented using the same principles.
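
To make the duality a little more concrete, here is a minimal Python sketch. It is not taken from the paper: the album/track data and the helper names are made up for illustration. The assumption it illustrates is that in the relational style children point at their parent through a foreign key, while in the key/value style the parent holds the keys of its children.

```python
# Hypothetical data, invented for this sketch.

# Relational style: each track (child) carries a foreign key to its album (parent).
albums = [{"id": 1, "title": "Kind of Blue"}]
tracks = [
    {"id": 10, "album_id": 1, "name": "So What"},
    {"id": 11, "album_id": 1, "name": "Blue in Green"},
]

def tracks_for_album_sql(album_id):
    """Find children by scanning for rows whose foreign key matches the parent."""
    return [t["name"] for t in tracks if t["album_id"] == album_id]

# Key/value style: the album (parent) stores the keys of its tracks (children).
store = {
    "album:1": {"title": "Kind of Blue", "tracks": ["track:10", "track:11"]},
    "track:10": {"name": "So What"},
    "track:11": {"name": "Blue in Green"},
}

def tracks_for_album_kv(album_key):
    """Find children by following the keys held inside the parent value."""
    return [store[k]["name"] for k in store[album_key]["tracks"]]

sql_names = tracks_for_album_sql(1)
kv_names = tracks_for_album_kv("album:1")
```

Both lookups answer the same question; only the direction of the pointers is reversed, which is the intuition behind calling one model the dual of the other.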

I'm certainly in no position to judge this work, or what it means at some deep level. After reading 1000 treatments of monads I still have no idea what they are. But, like the Standard Model in physics, it would be satisfying if some unifying principles underlay all this stuff. Would we all get along? That's a completely different question...


Wordnik - 10 million API Requests a Day on MongoDB and Scala

Wordnik is an online dictionary and language resource that has both a website and an API component. Their goal is to show you as much information as possible, as fast as they can find it, for every word in English, and to give you a place where you can make your own opinions about words known. As cool as that is, what is really cool is the information they share in their blog about their experiences building a web service. They've written an excellent series of articles and presentations you may find useful:
  • What has technology done for words lately?
    • Eventual consistency. Using an eventually consistent model they can do work in parallel: they count as many words as possible when they can, and add them all up when there’s a lag. The count’s always in the ballpark, and they never have to stop.
    • Document-oriented storage. Dictionary entries are more naturally modeled as hierarchical documents, and using that model has made it quicker to find data and easier to develop against.
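
The "count now, add up later" idea can be sketched in a few lines of Python. This is only an illustration of the general pattern, not Wordnik's code: the class name, the worker layout, and the merge step are all assumptions.

```python
from collections import Counter

class LaggingWordCount:
    """Each worker counts locally; totals catch up only when merge() runs."""

    def __init__(self, workers):
        self.pending = [Counter() for _ in range(workers)]  # per-worker partial counts
        self.total = Counter()                              # last merged view

    def record(self, worker, text):
        # Workers never coordinate here, so they never have to stop.
        self.pending[worker].update(text.lower().split())

    def merge(self):
        # Fold every partial count into the shared total, then reset the partials.
        for partial in self.pending:
            self.total.update(partial)
            partial.clear()

    def approximate(self, word):
        # May lag behind reality until the next merge.
        return self.total[word]

counts = LaggingWordCount(workers=2)
counts.record(0, "the quick brown fox")
counts.record(1, "the lazy dog and the fox")
before_merge = counts.approximate("the")   # stale: nothing merged yet
counts.merge()
after_merge = counts.approximate("the")    # caught up
```

Readers see a slightly stale number between merges, which is the trade the excerpt describes: a ballpark count in exchange for never blocking the writers.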

Riak's Bitcask - A Log-Structured Hash Table for Fast Key/Value Data

How would you implement a key-value storage system if you were starting from scratch? The approach Basho settled on with Bitcask, their new backend for Riak, is an interesting combination: RAM holds a hash map of file pointers to values, and a log-structured file system provides efficient writes. In this excellent Changelog interview, some folks from Basho describe Bitcask in more detail.
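
As a rough illustration of that combination, here is a toy Python sketch: an append-only data file plus an in-memory dict mapping each key to the position of its latest value. It is not Bitcask's actual file format or API. The record layout, the class name TinyLogStore, and the missing pieces (crash recovery, compaction, checksums) are simplifications assumed for this example.

```python
import os
import struct
import tempfile

HEADER = struct.Struct(">II")  # assumed layout: key length, value length

class TinyLogStore:
    """Append-only log file plus an in-RAM index of key -> (offset, size)."""

    def __init__(self, path):
        self.path = path
        self.index = {}               # key -> (offset of value bytes, value size)
        self.f = open(path, "a+b")    # writes always append; reads may seek anywhere

    def put(self, key, value):
        self.f.seek(0, os.SEEK_END)
        start = self.f.tell()
        self.f.write(HEADER.pack(len(key), len(value)) + key + value)
        self.f.flush()
        # Old records stay in the file; the index simply points at the newest one.
        self.index[key] = (start + HEADER.size + len(key), len(value))

    def get(self, key):
        entry = self.index.get(key)
        if entry is None:
            return None
        offset, size = entry
        self.f.seek(offset)           # one seek, one read
        return self.f.read(size)

    def close(self):
        self.f.close()

# Usage: overwrite a key and read back the newest value.
with tempfile.TemporaryDirectory() as tmp:
    store = TinyLogStore(os.path.join(tmp, "data.log"))
    store.put(b"color", b"red")
    store.put(b"size", b"xl")
    store.put(b"color", b"blue")      # appended, not updated in place
    latest = store.get(b"color")
    other = store.get(b"size")
    missing = store.get(b"nope")
    log_size = os.path.getsize(store.path)
    store.close()
```

Writes are sequential appends and a read costs a dictionary lookup plus a single seek. The price is that stale records pile up in the log until something merges them away, and every key has to fit in memory.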

The essential Bitcask:

Paper: CRDTs: Consistency without concurrency control

For a great Christmas read forget The Night Before Christmas, a heartwarming poem written by Clement Moore for his children, which created the modern idea of Santa Claus we all know and anticipate each Christmas Eve. Instead, curl up with some potent eggnog, nog being any drink made with rum, and read CRDTs: Consistency without concurrency control by Mihai Letia, Nuno Preguiça, and Marc Shapiro, which talks about CRDTs (Commutative Replicated Data Types), data types whose operations commute when they are concurrent.
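
To see why commuting operations matter, consider this small Python sketch. It is far simpler than the data type discussed in the paper and is not drawn from it: the two replica classes and the operation lists are assumptions chosen to contrast an operation that commutes (adding a delta) with one that does not (overwriting a value).

```python
from itertools import permutations

class CounterReplica:
    """Operations are 'add delta'; addition commutes."""
    def __init__(self):
        self.value = 0
    def apply(self, delta):
        self.value += delta

class RegisterReplica:
    """Operations are 'set to x'; the last one applied wins, so order matters."""
    def __init__(self):
        self.value = None
    def apply(self, new_value):
        self.value = new_value

def final_states(replica_cls, ops):
    """Apply the same operations in every possible delivery order."""
    outcomes = set()
    for order in permutations(ops):
        replica = replica_cls()
        for op in order:
            replica.apply(op)
        outcomes.add(replica.value)
    return outcomes

counter_outcomes = final_states(CounterReplica, [5, -2, 7])
register_outcomes = final_states(RegisterReplica, [1, 2, 3])
```

Every delivery order leaves the counter at the same value, so replicas converge without locking or agreeing on an order. The plain register ends up wherever the last-delivered write put it, which is exactly the case that needs some form of concurrency control.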

From the introduction, which also serves as a nice concise overview of distributed consistency issues:

SQL + NoSQL = Yes !


This is a guest post by Frédéric Faure (architect at Ysance), you can follow him on twitter.

Data storage has always been one of the most difficult problems to address, especially as the quantity of stored data is constantly increasing. This is not simply due to the growing numbers of people regularly using the Internet, particularly with all the social networks, games and gizmos now available. Companies are also amassing more and more meticulous information relevant to their business, in order to optimize productivity and ROI (Return On Investment). I find the positioning of SQL and NoSQL (Not Only SQL) as opposites rather a shame: it’s true that the marketing wave of NoSQL has enabled the renewed promotion of a system that’s been around for quite a while, but which was only rarely considered in most cases, as after all, everything could be fitted into the « good old SQL model ». The reverse trend of wanting to make everything fit the NoSQL model is not very profitable either.

So, what’s new … and what isn’t?

