Masstree - Much Faster than MongoDB, VoltDB, Redis, and Competitive with Memcached

The EuroSys 2012 systems conference has excellent live blog summaries of its talks: Day 1, Day 2, Day 3 (thanks to Henry at the Paper Trail blog). Summaries of each of the accepted papers are here.
One of the more interesting papers from a NoSQL perspective was Cache Craftiness for Fast Multicore Key-Value Storage, a wonderfully detailed description of the low-level techniques used to implement Masstree:
A storage system specialized for key-value data in which all data fits in memory, but must persist across server restarts. It supports arbitrary, variable-length keys. It allows range queries over those keys: clients can traverse subsets of the database, or the whole database, in sorted order by key. On a 16-core machine Masstree achieves six to ten million operations per second on parts A–C of the Yahoo! Cloud Serving Benchmark (YCSB), more than 30× as fast as VoltDB [5] or MongoDB [2].
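For readers who think in code, here's a hypothetical interface sketch (the names and signatures are mine, not the authors') of the operations described above: point gets and puts on arbitrary variable-length keys, plus range scans in sorted key order.

```cpp
// Hypothetical interface (not the authors' API): the operations the
// description above lists -- arbitrary variable-length keys, point get/put,
// and range scans in sorted key order.
#include <functional>
#include <optional>
#include <string>
#include <string_view>

struct KVStore {
    virtual ~KVStore() = default;

    // Point operations on arbitrary, variable-length keys.
    virtual std::optional<std::string> get(std::string_view key) = 0;
    virtual void put(std::string_view key, std::string_view value) = 0;

    // Traverse the database in sorted key order starting at start_key;
    // the visitor returns false to stop the scan early.
    virtual void scan(std::string_view start_key,
                      const std::function<bool(std::string_view key,
                                               std::string_view value)>& visit) = 0;
};
```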
If you are looking for innovative detailed high performance design, this paper is for you. An example from the section on writer-writer coordination:
Masstree writers coordinate using per-node spinlocks. A node’s lock is stored in a single bit in its version counter. Any modification to a node’s keys or values requires holding the node’s lock. Some data is protected by other nodes’ locks, however. A node’s parent pointer is protected by its parent’s lock, and a border node’s prev pointer is protected by its previous sibling’s lock. This minimizes the simultaneous locks required by split operations; when an interior node splits, for example, it can assign its children’s parent pointers without obtaining their locks.
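To make the lock-bit-in-the-version-counter idea concrete, here's a minimal C++ sketch. The bit layout, field names, and memory orderings are illustrative assumptions, not the paper's exact design (the real version word also carries inserting/splitting flags and separate insert and split counters).

```cpp
// A minimal sketch of a per-node spinlock stored as one bit of the node's
// version word, assuming a 32-bit version. Layout is illustrative.
#include <atomic>
#include <cstdint>

struct NodeVersion {
    static constexpr uint32_t LOCK_BIT     = 1u << 0;  // writer holds the node
    static constexpr uint32_t VERSION_STEP = 1u << 1;  // remaining bits: change counter

    std::atomic<uint32_t> v{0};

    // Spin until we atomically set the lock bit.
    void lock() {
        for (;;) {
            uint32_t cur = v.load(std::memory_order_relaxed);
            if (!(cur & LOCK_BIT) &&
                v.compare_exchange_weak(cur, cur | LOCK_BIT,
                                        std::memory_order_acquire)) {
                return;
            }
            // A real implementation would pause/back off here.
        }
    }

    // Bump the change counter and clear the lock bit in one store, so
    // lock-free readers that overlapped with this writer see a changed
    // version and retry.
    void unlock() {
        uint32_t cur = v.load(std::memory_order_relaxed);
        v.store((cur + VERSION_STEP) & ~LOCK_BIT, std::memory_order_release);
    }
};
```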
Here's the live blog writeup:
Let’s build a new fast KV-store! Would like to perform well on hard workloads! Should support range queries, skewed key popularity, small K-V pairs, many puts, and arbitrary keys. Their first attempt was a fast binary tree, achieving 3.7M qps, given sufficiently high-bandwidth network and disk hardware. The bottleneck turns out to be DRAM, so they optimize with cache craftiness and get 1.5x better performance. Their optimized thing is called “Masstree”. They evaluated it on a 16-node cluster with a bunch of SSDs, 64 GB RAM and 10 Gbps NIC per machine. In fact, when looking at local performance (without network/disk bottlenecks), cache craftiness results in a 1.7x improvement (I thought they said that they’d already evaded those bottlenecks?!).
To reduce DRAM latency, they constructed a lock-free, unbalanced 4-way tree with the same concurrency properties as a binary tree, but only half as deep for the same amount of data -- plus each node fits into a cache line. However, it is pessimal (O(N)) for sequential inserts, so they look at a balanced B+tree instead. Their particular implementation uses optimistic concurrency with versioning. It turns out, however, that the B+tree is 11% slower than the 4-way tree (because of cache line optimization, and since all the 4-way tree’s nodes are full, while for their B+tree only ~75% are)! Now realize that we can do software prefetch to read multiple cache lines at a time with the B+tree, and things get 9% better than the 4-way tree. However, there are consistency issues with concurrent inserts in the B+tree (not with the 4-way tree, as keys don’t move), so prepend a “permuter” (index list array) to each node, so we can atomically swap keys around. But there is an issue with long keys -- they require multiple memory accesses, and throughput drops rapidly as keys get longer. So they use a trie of B+trees, with each level (a B+tree) responsible for 8 bytes of key length. Now throughput scales much better. “Masstree” is the union of all these optimizations (I think), and despite having a trie of B+trees in it, it is 8% more efficient than a single B+tree with all the other optimizations.
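The permuter trick is worth spelling out. Here's a hedged sketch, assuming a node with at most 15 key slots (the field widths and helper names are illustrative): the sorted order of a node's physical slots is packed into a single 64-bit word, so an insert can write the new key into an unused slot and then publish the new ordering with one atomic store.

```cpp
// Sketch of a "permuter": 15 slot indices x 4 bits each, plus a 4-bit count,
// packed into one 64-bit word. Layout is illustrative, not the paper's exact one.
#include <atomic>
#include <cstdint>

struct Permuter {
    // bits 0-3: number of used slots;
    // bits 4*(i+1)..4*(i+1)+3: physical slot holding the i-th smallest key.
    uint64_t bits = 0;

    int size() const { return static_cast<int>(bits & 0xF); }
    int slot_at(int i) const { return static_cast<int>((bits >> (4 * (i + 1))) & 0xF); }

    // Return a new permutation with physical `slot` inserted at sorted position `pos`.
    Permuter insert(int pos, int slot) const {
        Permuter p;
        int n = size();
        p.bits = static_cast<uint64_t>(n + 1) & 0xF;
        for (int i = 0, j = 0; i <= n; ++i) {
            int s = (i == pos) ? slot : slot_at(j++);
            p.bits |= static_cast<uint64_t>(s & 0xF) << (4 * (i + 1));
        }
        return p;
    }
};

// Publishing the new order is a single 64-bit store, which concurrent readers
// observe either entirely before or entirely after the insert:
//   node_permutation.store(new_perm.bits, std::memory_order_release);
```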
Evaluation finds that they are much faster than common NoSQL databases (MongoDB, VoltDB), faster than Redis, and competitive with memcached.
Now, how do we scale this to multi-core? Other systems have an instance per core, and partition the key space. Masstree uses a single, shared tree that is accessed by all cores (presumably using the OCC+versioning techniques they alluded to). They find that their performance is basically constant as they scale load when running on 16 cores, while a statically partitioned Masstree performs better under low load, but loses out subsequently (it didn’t become entirely clear how they varied the load). They also find that they scale fairly well to 16 cores (about a ⅕ drop in performance relative to perfect scalability).
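The “OCC+versioning” the live blogger alludes to is the standard optimistic pattern: readers take no locks; they snapshot a node's version, read, and retry if a writer changed the node underneath them. A minimal sketch, assuming a version word whose low bit is the writer lock as in the earlier sketch; the names and retry loop are illustrative:

```cpp
// Lock-free read with version validation (optimistic concurrency control).
#include <atomic>
#include <cstdint>

struct OccNode {
    std::atomic<uint32_t> version{0};   // low bit = writer lock, rest = change counter
    // ... keys and values live here ...

    // Wait until no writer holds the lock, then return a stable version snapshot.
    uint32_t stable_version() const {
        uint32_t v;
        do {
            v = version.load(std::memory_order_acquire);
        } while (v & 1u);               // spin while the lock bit is set
        return v;
    }

    // True if the node changed since `before` was observed.
    bool changed_since(uint32_t before) const {
        return version.load(std::memory_order_acquire) != before;
    }
};

// Reader pattern: read without locking, retry on concurrent modification.
template <typename ReadFn>
auto occ_read(const OccNode& node, ReadFn read) {
    for (;;) {
        uint32_t v = node.stable_version();
        auto result = read(node);       // read keys/values without locking
        if (!node.changed_since(v))
            return result;              // no writer interfered: result is valid
        // Otherwise a writer raced with us; retry.
    }
}
```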
From the conclusion of the paper:
Masstree is a persistent in-memory key-value database. Its design pays particular attention to concurrency and to efficiency for short and simple queries. Masstree keeps all data in memory in a tree, with fanout chosen to minimize total DRAM delay when descending the tree with prefetching. The tree is shared among all cores to preserve load balance when key popularities are skewed. It maintains high concurrency using optimistic concurrency control for lookup and local locking for updates. For good performance for keys with long shared prefixes, a Masstree consists of a trie-like concatenation of B+ trees, each of the latter supporting only fixed-length keys for efficiency. Logging and checkpointing provide consistency and durability.
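The “trie-like concatenation of B+ trees” works by cutting keys into fixed-length slices: each trie layer is a B+tree indexed by one 8-byte slice, and keys that share a slice descend into the next layer. Here's a hedged sketch of the slicing itself; the encoding details are my assumption and the paper's actual layout may differ.

```cpp
// Cutting a variable-length key into 8-byte slices, one per trie layer.
#include <algorithm>
#include <cstdint>
#include <cstring>
#include <string_view>
#include <vector>

// Encode 8 bytes of the key, starting at `offset`, as a big-endian integer so
// that integer comparison inside a layer matches lexicographic byte order.
uint64_t key_slice(std::string_view key, size_t offset) {
    unsigned char buf[8] = {0};                     // zero-pad short tails
    size_t n = key.size() > offset ? std::min<size_t>(8, key.size() - offset) : 0;
    if (n)
        std::memcpy(buf, key.data() + offset, n);
    uint64_t slice = 0;
    for (int i = 0; i < 8; ++i)
        slice = (slice << 8) | buf[i];
    return slice;
}

// A key of length L yields max(1, ceil(L / 8)) slices; a lookup descends one
// trie layer per slice until the remaining key fits in a single layer.
std::vector<uint64_t> slices_for(std::string_view key) {
    std::vector<uint64_t> out;
    for (size_t off = 0; off < key.size() || off == 0; off += 8)
        out.push_back(key_slice(key, off));
    return out;
}
```

Because each layer only ever compares fixed-length 8-byte slices, lookups within a layer avoid the repeated variable-length memory accesses that made long keys slow in the single-B+tree design.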
Reader Comments (5)
Now I wonder where the code for Masstree is :-)
It will be interesting to see if the code is made available or if it goes straight to commercial software.
"Evaluation finds that they are much faster than common NoSQL databases (MongoDB, VoltDB), faster than Redis, and competitive with memcached."
As a stand-alone statement, that's pretty misleading; it leaves readers to parse the fine print around workloads and access patterns ("key-value", "short and simple queries", "optimistic concurrency", etc.). Yet it's also sensationalized in the title of this blog post.
Indeed, the EuroSys 2012 conference's summary reviews include concerns about whether the Masstree authors were making appropriate comparisons:
"There was much initial concern over the [Masstree] evaluation, in particular that it focused on evaluation against heavier weight databases. In their rebuttal, the authors promised a more detailed comparison against Redis which seemed to be a better comparison point ..."
Hopefully, the authors will follow through on their promise and, in the process, offer a more balanced perspective on the application of their research.
First off, the paper is brilliant; these guys are data structure naturals, and it is very interesting to see them battle against hardware characteristics/limitations. Also, any contribution to lockless multicore execution is a step forward -> great paper.
Masstree supports range queries, but does not support range updates/deletes. With Masstree's architecture, if they were to support range updates/deletes, they would not be able to maintain their ultra-fast speed AND replicate data. The "row" level locks required in range updates across many cores can not be trivially serialised in order to be replicated w/ a deterministic order.
Range updates/deletes are not the most important features, but are nice when using secondary indexes. However, range updates GREATLY complicate any DB's architecture, so maybe NOT supporting them (from the beginning) is the correct idea to achieve linear scalability (vertical or horizontal).
Redis considered going multi-threaded to reach memcached speed (on many cores); here is the community back and forth: http://groups.google.com/group/redis-db/browse_thread/thread/378a32de50a782c5/08aa2c290286ba41
Some of Redis' operations are logically range updates/deletes, so enabling multi-threading would have excluded replication (something memcached does not have to worry about).
This is an important concept for people to get. Every data model enrichment results in some sort of trade-off. So Masstree has a richer data model than memcached, but cannot become as rich as Redis w/o giving up some speed or the ability to replicate.
I also think that if range updates/deletes are supported in a DB, then the data MUST be partitioned to individual cores (and globally coordinated) to maximise performance IF deterministic replication is required (and replication should be a prerequisite for any in-memory DB). Of course this approach relegates range updates/deletes to second-class citizens, which is what they should be.
Looks like their code is on GH: https://github.com/kohler/masstree-beta