MemSQL Architecture - The Fast (MVCC, InMem, LockFree, CodeGen) and Familiar (SQL)
Tuesday, August 14, 2012 at 9:13AM
HighScalability Team in MySQL, Product, databases, lock-free

This is an interview with MemSQL cofounder’s Eric Frenkiel and Nikita Shamgunov, in which they try to answer critics by going into more depth about their technology.

MemSQL ruffled a few feathers with their claim of being the fastest database in the world. According to their benchmarks MemSQL can execute 200K TPS on an EC2 Quadruple Extra Large and on a 64 core machine they can push 1.2 million transactions a second.

Benchmarks are always a dark mirror, so make of them what you will, but the target market for MemSQL is clear: projects looking for something both fast and familiar. Fast as in a novel design using a combination of technologies like MVCC, code generation, lock-free data structures, skip lists, and in-memory execution. Familiar as in SQL and nothing but SQL. The only interface to MemSQL is SQL.

It’s right to point out MemSQL gets a boost by being a first release. Only a limited subset of SQL is supported, neither replication or sharding are implemented yet, and writes queue in memory before flushing to disk. The next release will include a baseline distributed system, native replication, n-way joins, and subqueries. Maintaining performance as more features are added is a truer test.

And MemSQL is RAM based, so of course it’s fast, right? Even among in-memory databases MemSQL hopes to convince you they’ve made some compelling design choices. The reasoning for their design goes something like:


On the first hearing of this strange brew of technologies you would not be odd in experiencing a little buzzword fatigue. But it all ends up working together. The mix of lock-free data structures, code generation, skip lists, and MVCC makes sense when you consider the driving forces of data living in memory and the requirement for blindingly fast execution of SQL queries.

In a single machine environment MemSQL makes an excellent case for their architecture. In a distributed environment they are limited in the same way every distributed databases is limited. MVCC doesn’t offer any magic as it doesn’t translate easily across shards. The choices  MemSQL has made reflect their primary use case of fast transactions plus fast real-time SQL based analytics. MemSQL uses a sharded shared nothing approach where queries are run independently on each shard and merged together on aggregation nodes. Transactions across shards won’t be supported until two phase commit is implemented, but then they will perform like any other database.  What they really want to do well is run fast real-time aggregations across a cluster so that’s what their design reflects.

A lot of other questions come to mind with such a novel design. Will MemSQL perform common operations like “return the top 5 X” as programmers have come to expect? Will MemSQL still perform when hit with a wide variety of different SQL queries? They say yes given code generation and their data structure choices. Is SQL expressive enough to solve real world problems across many domains? When you start adding stored procs or user defined functions will the carefully orchestrated dance of data structures still work?

To answer these questions and more, let’s take a deeper look into the technology behind MemSQL.

Stats

Information Sources

Use Cases

Origin Story

Why Faster?

The Y Combinator Cabal

Lock-Free Data Structures

Skip Lists

MVCC

MemSQL implements multi version concurrency control to implement transactional semantics

MemSQL Durability

Code Generation

Replication

The Distributed Story

Deployment and Management

SQL Support

Pricing

Lessons Learned

Related Articles

Article originally appeared on (http://highscalability.com/).
See website for complete article licensing information.