The NewSQL Market Breakdown
Matt Aslett of The 451 Group coined the term “NewSQL”. On its definition, Aslett writes:
“NewSQL” is our shorthand for the various new scalable/high performance SQL database vendors. We have previously referred to these products as ‘ScalableSQL’ to differentiate them from the incumbent relational database products. Since this implies horizontal scalability, which is not necessarily a feature of all the products, we adopted the term ‘NewSQL’ in the new report.
And to clarify, like NoSQL, NewSQL is not to be taken too literally: the new thing about the NewSQL vendors is the vendor, not the SQL.
As with NoSQL, under the NewSQL umbrella you can see various providers, with various solutions.
I think these can be divided into several sub-types:
- New MySQL storage engines. These give MySQL users the same programming interface, but scale very well. You can find Xeround or Akiban in this field. The good part is that you still use MySQL; the downside is that other databases are not supported (at least not easily), and even MySQL users need to migrate their data to these new engines.
- New databases. These completely new solutions can support your scalability requirements. Of course, some (hopefully minor) changes to the code will be required, and data migration is still needed. Some examples are VoltDB and NimbusDB.
- Transparent Sharding. ScaleBase, which offers such a solution, lets you get the scalability you need from the database, but instead of rewriting the database, you can use your existing one. This allows you to reuse your existing skill set and ecosystem, and you don’t need to rewrite your code or perform any data migration – everything is simple and quick. dbShards is another solution in this field.
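To make the sharding idea concrete, here is a minimal sketch of how a routing layer might map a key to one of several existing database instances. It is an illustration only, not how ScaleBase or dbShards work internally; the names `shard_for`, `ShardRouter` and the shard labels are all hypothetical.

```python
import hashlib


def shard_for(key: str, num_shards: int) -> int:
    """Map a shard key to a shard index using a stable hash.

    A stable hash (here MD5 of the key bytes) is used instead of Python's
    built-in hash(), which is randomized per process for strings.
    """
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards


class ShardRouter:
    """Hypothetical routing layer sitting between the app and the databases."""

    def __init__(self, shard_names):
        self.shard_names = list(shard_names)

    def route(self, customer_id: str) -> str:
        # Every statement for the same customer lands on the same shard.
        return self.shard_names[shard_for(customer_id, len(self.shard_names))]


router = ShardRouter(["shard0", "shard1", "shard2"])
target = router.route("customer-42")
```

The application keeps issuing ordinary SQL; only the layer in the middle decides which instance receives each statement.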
As in NoSQL, I believe each NewSQL solution has its own spot, answering specific needs.
Reader Comments (5)
VoltDB is a distributed RDBMS and falls under the 2nd category. http://voltdb.com/content/voltdb-features-overview
Two points (and a disclaimer that I work for VoltDB):
First, I find it fascinating to compare how many NoSQL systems are open source with how many NewSQL systems are.
Second, the assumption that moving to "transparent" sharding systems requires no changes to app code or changes to database schema is probably false in most cases.
What these products do is provide a layer between the app and the legacy system that solves some of the difficult parts of managing multiple DB instances and distributing queries among them. Compared to building such a system in-house, they can offer huge advantages. If your business isn't distributed systems, and you can avoid it, then don't build distributed systems.
Moving from a single MySQL server to a scale-out database is usually going to be transparent or performant, but usually not both. Two queries that might both be lightning quick on a single box might differ in performance by an order of magnitude once you are partitioned. Suddenly a developer has to have some understanding of how a query is affected by data locality. There's a lot of fun research happening on auto-discovery of optimum locality and the corresponding query planning, but not much has been commercialized yet.
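A rough sketch of why locality matters, using in-memory SQLite databases as stand-ins for partitions. The `orders` table, the modulo placement rule and the function names are assumptions made for illustration; a real scale-out product would plan queries differently.

```python
import sqlite3

NUM_SHARDS = 4


def shard_of(user_id: int) -> int:
    # Hypothetical placement rule: rows are partitioned by user_id.
    return user_id % NUM_SHARDS


# One in-memory SQLite database stands in for each partition.
shards = [sqlite3.connect(":memory:") for _ in range(NUM_SHARDS)]
for db in shards:
    db.execute("CREATE TABLE orders (user_id INTEGER, total REAL)")

for user_id in range(100):
    shards[shard_of(user_id)].execute(
        "INSERT INTO orders VALUES (?, ?)", (user_id, float(user_id))
    )


def orders_for_user(user_id):
    """Filter on the partition key: exactly one shard is queried."""
    db = shards[shard_of(user_id)]
    rows = db.execute(
        "SELECT user_id, total FROM orders WHERE user_id = ?", (user_id,)
    ).fetchall()
    return rows, 1


def orders_over(total):
    """Filter on a non-key column: every shard is queried and results merged."""
    rows = []
    for db in shards:
        rows.extend(
            db.execute(
                "SELECT user_id, total FROM orders WHERE total > ?", (total,)
            ).fetchall()
        )
    return sorted(rows), len(shards)


local_rows, local_cost = orders_for_user(42)
fanout_rows, fanout_cost = orders_over(97.0)
```

Both queries would be a single lookup on one box. Once the data is partitioned, the first touches one shard while the second fans out to all of them.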
The good news with all of this is that with each passing year, the tech gets better and better.
You can find more information about The 451 Group's "NoSQL, NewSQL and Beyond" report and how we segment the various NewSQL players here: http://blogs.the451group.com/information_management/2011/04/15/nosql-newsql-and-beyond/
(disclaimer: I work for Codefutures - dbShards).
Transparent sharding can literally be dropped into most DB schemas and works out of the box as advertised.
There exist RDBMS schemas that are so complex that they are inherently difficult or impossible to EFFECTIVELY distribute without redundantly storing data (i.e. no matter how data is distributed, there exist joins that span multiple nodes, and these joins are done prohibitively often).
Neither transparent sharding nor VoltDB's automatic sharding helps in these cases, but this is known and obvious to most developers who understand the relational model. So "transparent" and "automatic" sharding are both to be taken w/ a grain of salt. Like almost all innovative technologies, they don't solve 100% of all use-cases, but they solve relevant use cases in new & better ways.
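The hard case described above can be sketched with a toy schema: orders that join to both customers and products. The row layout, the modulo placement rule and the helper `local_join_fraction` are hypothetical, chosen only to show that a single shard key cannot keep both joins local.

```python
NUM_SHARDS = 4


def shard_of(key: int) -> int:
    # Hypothetical placement rule shared by all tables.
    return key % NUM_SHARDS


# Hypothetical order rows: (order_id, customer_id, product_id).
orders = [(i, i % 5, i % 7) for i in range(1, 36)]

CUSTOMER, PRODUCT = 1, 2  # column positions inside an order row


def local_join_fraction(order_shard_key: int, joined_key: int) -> float:
    """Fraction of orders stored on the same shard as the row they join to.

    order_shard_key is the column orders are sharded by; joined_key is the
    column used to join to a table that is sharded by its own primary key.
    """
    local = sum(
        1
        for order in orders
        if shard_of(order[order_shard_key]) == shard_of(order[joined_key])
    )
    return local / len(orders)


# Shard orders by customer_id: the customer join is local, the product join is not.
customer_join_a = local_join_fraction(CUSTOMER, CUSTOMER)
product_join_a = local_join_fraction(CUSTOMER, PRODUCT)

# Shard orders by product_id: the situation flips.
customer_join_b = local_join_fraction(PRODUCT, CUSTOMER)
product_join_b = local_join_fraction(PRODUCT, PRODUCT)
```

Whichever key is chosen, one of the two joins spans nodes for some rows. Storing the smaller table redundantly on every shard is the usual way out, which is the redundancy the comment refers to.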
In practice, transparent sharding is best used to distribute the important (i.e. HUGE and/or high traffic) tables in a database. This distributes as little of the schema as necessary, but distributes the bottlenecking portion, and retains the relational model. It is amazingly efficient, highly performant, and simple to implement.
The mechanisms of transparent sharding are straightforward and understood by database developers w/ little to no learning/effort. Designing a good database schema implies the ability to organize data so it can be effectively queried, which requires an understanding of data locality. Transparent sharding requires that the developer understand how the data is sharded, so queries to sharded tables must take this into account, much as queries to indexed tables need to use an index to avoid full table scans. Database developers are thoroughly versed in this type of analytical thinking.
Transparent sharding is a cost-effective solution to distribute and scale RDBMS schemas. It also adds high availability, and it works today :)
I prefer the term "PostSQL". To me it means
-people recognise that transacted, indexed SQL databases have a place, but their role is limited to where you want ACID operations, and that you recognise the limits of performance this gives you.
-people recognise that lots of data are better unindexed; that logs, binaries, etc, should be best kept outside the SQL DB, and that the NoSQL tooling can help here
-people who care about scalability, and are prepared to have sharded or eventually consistent databases can use some of the newsql tools
There are some key features to this world
-open source is a key player: Hadoop, CouchDB, MySQL, HBase, Cassandra, etc. Even if you don't use these tools, they apply price pressure to all the closed source products in both sales and support. The NewSQL sales teams have to justify their premiums, rather than just cost less than Oracle.
-the OSS technologies aren't "owned" by anyone. MySQL is the closest to being owned by Oracle, but MariaDB shows that it doesn't always have to be so.
-SQL is still a really good query API (see Hive for an example), because so many of the web application developers know it. Even if you can't do transactions, a SELECT * FROM weblog WHERE weblog.addr = '192.168.1.2'; is still handy.
-a lot of people are coming to the Hadoop world with utterly unrealistic expectations about the technology: "Is it true that we can replace Oracle with Hadoop?" At the same time, a lot of the OSS technologies are pretty hard to get up and running with.
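As a small illustration of the weblog SELECT mentioned in the list above, here is the same query run from Python against an in-memory SQLite database. SQLite is only a convenient stand-in here, not Hive, and the `weblog` columns and sample rows are made up for the example.

```python
import sqlite3

db = sqlite3.connect(":memory:")
# Hypothetical weblog table; only the addr column comes from the comment above.
db.execute("CREATE TABLE weblog (addr TEXT, path TEXT)")
db.executemany(
    "INSERT INTO weblog VALUES (?, ?)",
    [
        ("192.168.1.2", "/index.html"),
        ("10.0.0.7", "/about.html"),
        ("192.168.1.2", "/contact.html"),
    ],
)

# The address is passed as a bound parameter rather than pasted into the SQL.
rows = db.execute(
    "SELECT * FROM weblog WHERE weblog.addr = ?", ("192.168.1.2",)
).fetchall()
```

No transaction handling is involved; it is just a filter over log rows, which is the kind of use the comment has in mind.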