Monday, January 10, 2011

Riak's Bitcask - A Log-Structured Hash Table for Fast Key/Value Data

How would you implement a key-value storage system if you were starting from scratch? The approach Basho settled on with Bitcask, their new backend for Riak, is an interesting combination: RAM holds a hash map from each key to a file pointer, while the values live in append-only, log-structured files that make writes efficient. In this excellent Changelog interview, some folks from Basho describe Bitcask in more detail.

The essential Bitcask:

  • Keys are stored in memory for fast lookups. All keys must fit in RAM.
  • Writes are append-only, which means they are strictly sequential and do not require seeking. Writes are write-through: every time a value is updated, the record is appended to the data file on disk and the in-memory key index is updated with the new file pointer (see the sketch after this list).
  • Reads require at most one random disk seek, so lookup cost is O(1). Latency is very predictable if all keys fit in memory because there's no random seeking around through a file.
  • For reads, the file system cache in the kernel is used instead of writing a complicated caching scheme in Riak.
  • Old values are compacted, or "merged", to free up space. Bitcask has windowed merges: it performs periodic merges over all non-active files to compact the space occupied by old versions of stored data. In certain situations this can cause memory and CPU spikes on the Riak node where the merge is taking place, so Basho added the ability to specify when Bitcask will perform merges.
  • Get and set concurrency are implemented using vector clocks by the software layer above Bitcask.
  • The key-to-value index exists in memory and is persisted in the filesystem in hint files. A hint file is generated when data files are merged. On restart the index only needs to be rebuilt for non-merged files, which should be a small percentage of the data.
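
To make the structure concrete, here is a minimal Python sketch of a Bitcask-style store: one append-only data file plus an in-memory "keydir" that maps each key to the offset and size of its latest value. The class name and record format are hypothetical illustrations of the design described above, not Basho's Erlang implementation; real Bitcask adds CRCs, timestamps, tombstones for deletes, multiple data files, and the merge/hint machinery.

```python
import os
import struct

class MiniBitcask:
    """Toy Bitcask-style store: append-only log + in-memory keydir."""

    HEADER = struct.Struct(">II")  # (key length, value length)

    def __init__(self, path):
        self.f = open(path, "a+b")
        self.keydir = {}           # key -> (value_offset, value_size)
        self._rebuild()            # scan the log once to rebuild the index

    def _rebuild(self):
        self.f.seek(0)
        while True:
            header = self.f.read(self.HEADER.size)
            if len(header) < self.HEADER.size:
                break
            klen, vlen = self.HEADER.unpack(header)
            key = self.f.read(klen)
            offset = self.f.tell()
            self.f.seek(vlen, os.SEEK_CUR)        # skip over the value bytes
            self.keydir[key] = (offset, vlen)     # later entries win

    def put(self, key, value):
        self.f.seek(0, os.SEEK_END)               # writes are strictly sequential
        self.f.write(self.HEADER.pack(len(key), len(value)))
        self.f.write(key)
        offset = self.f.tell()
        self.f.write(value)
        self.f.flush()
        os.fsync(self.f.fileno())                 # write-through to disk
        self.keydir[key] = (offset, len(value))   # then update the index

    def get(self, key):
        offset, size = self.keydir[key]           # O(1) in-memory lookup...
        self.f.seek(offset)                       # ...then a single seek + read
        return self.f.read(size)
```

Reopening the same path rebuilds the keydir by scanning the whole log, which is exactly the restart cost the hint files described above are designed to avoid.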

Eric Brewer (of CAP theorem fame) came up with the idea for Bitcask by observing that if you have the capacity to keep all keys in memory, which is quite likely on modern systems, you can build a storage system that is relatively easy to design and implement. The commit log can be used as the database itself, providing atomicity and durability. Only one write is required to persist the data; separate writes to a data file and a commit log are not necessary.

When a value is updated it is first appended to the on-disk commit log. Then the in-memory hash table that maps keys to disk pointers is updated to point to the file and the offset of the record within it. So a read takes just one file I/O: the hash lookup locates the file pointer, and you just seek to the offset and read the value. A write is just an append to the file. Pretty slick. It's a good compromise between a pure in-memory database and a disk-based data store backed by a virtual memory layer.
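
The merge and hint-file steps from the list above can be sketched the same way. This hypothetical merge function walks the keydir of the MiniBitcask sketch, copies only the live version of each value into a fresh file, and writes a companion hint file recording where each key now lives; real Bitcask uses a binary per-file hint format rather than the JSON used here for readability.

```python
import json
import struct

HEADER = struct.Struct(">II")  # same record header as the sketch above

def merge(store, merged_path):
    """Copy only live entries into a compact file and emit a hint file
    so a restart can rebuild the keydir without scanning whole records."""
    new_keydir = {}
    with open(merged_path, "wb") as out:
        for key, (offset, size) in store.keydir.items():
            store.f.seek(offset)
            value = store.f.read(size)               # latest value only
            out.write(HEADER.pack(len(key), len(value)))
            out.write(key)
            new_keydir[key] = (out.tell(), len(value))
            out.write(value)
    # Hint file: key -> (offset, size). Assumes UTF-8-decodable keys.
    with open(merged_path + ".hint", "w") as hint:
        json.dump({k.decode(): v for k, v in new_keydir.items()}, hint)
    return new_keydir
```

On the next start, loading the hint file repopulates the index for merged data directly; only files written after the merge still need a full scan.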

Some potential issues:

  • If you suspect you will have more keys than RAM then an architecture that keeps a working set in memory would be a better choice.
  • It will be slower than a pure in-memory database.
  • Problems commonly occur during the garbage collection phase, as resource usage spikes while the space held by deleted values is reclaimed. Bitcask hopes to lessen this cost by allowing garbage collection to be scheduled for certain periods, though in a 24x7 service with an international set of users this may not be sufficient.
  • Syncing on every write could be a little painful. Write throughput could be increased if writes were buffered and the data was replicated synchronously to a backup node for high availability (a group-commit sketch follows this list).
  • I trust operating system caches not. An OS cache can't know your access patterns. A custom cache might introduce complexity, but it's hard to believe it wouldn't perform better or be more tunable when things go wrong. Basho seems happy with this approach, but it still makes me queasy. What happens when the traffic has a uniform distribution, or a Pareto-like distribution? Benchmark your app!
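
On the syncing point, one standard way to buffer writes without giving up durability is group commit: concurrent writers queue their records, and a single writer thread appends a whole batch and calls fsync once before acknowledging everyone. A hypothetical sketch, not part of Bitcask:

```python
import os
import queue
import threading

class GroupCommitLog:
    """Toy group commit: one fsync is amortized over a batch of writes."""

    def __init__(self, path):
        self.f = open(path, "ab")
        self.q = queue.Queue()
        threading.Thread(target=self._writer, daemon=True).start()

    def append(self, record):
        done = threading.Event()
        self.q.put((record, done))
        done.wait()                      # returns once the record is durable

    def _writer(self):
        while True:
            batch = [self.q.get()]       # block for the first record...
            while not self.q.empty():    # ...then drain whatever arrived meanwhile
                batch.append(self.q.get())
            for record, _ in batch:
                self.f.write(record)
            self.f.flush()
            os.fsync(self.f.fileno())    # one sync for the whole batch
            for _, done in batch:
                done.set()               # wake every waiting writer
```

Under concurrent load many writers share each fsync, so throughput rises while every individual write still waits for durability before returning.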

Reader Comments (5)

"If you suspect you will have more keys than RAM then an architecture that keeps a working set in memory would be a better choice."

Yep, Berkeley DB Java Edition (http://www.oracle.com/technetwork/database/berkeleydb/overview/index-093405.html) pretty much does this out of the box and takes a relatively similar overall approach to Bitcask. There are many examples of similar systems over the years; Basho have chosen to go back to basics and build something specific to their own needs. Much of the theory can be found in Gray and Reuter's Transaction Processing: Concepts and Techniques, for example (with roots in the likes of System R: http://en.wikipedia.org/wiki/IBM_System_R).

"Syncing on every write could be a little painful. Write throughput could be increased if writes were buffered and the data was replicated synchronously to a backup node for high availability."

Well, if you are genuinely synchronous and don't want to delay those waiting on the writes, buffering has limited benefit. In reality, many systems introduce delays so that buffering can be done. Replicating synchronously to a backup node brings its own set of problems, requiring a suitably tuned network stack. You've also got all the hassles of dealing with the loss of your replicas. Generally Riak handles that outside of the storage layer, I believe.

"Old values are compacted or "merged" to free up space. " - some might call that a checkpoint in DB-speak.

"I trust operating system caches not. An OS cache can't know your access patterns. A custom cache might introduce complexity, but it's hard to believe it wouldn't perform better or be more tunable when things go wrong."

A custom cache can only know about access patterns seen by its developers prior to it being written or updated. Or, put another way, any cache is only as good as the set of access patterns it's been exposed to and tuned for so far. OS caches have, over the course of time, been exposed to many different access patterns and represent a reasonable balance most of the time. I'm sure for specific cases a custom cache will be better, but are Riak workloads sufficiently specific/predictable? Probably not right now, maybe later.

January 12, 2011 | Unregistered Commenter Dan Creswell

I forgot to say, things like battery-protected write caches are another way to make disk syncs cheaper.

January 12, 2011 | Unregistered Commenter Dan Creswell

In an article comparing the architectures of Squid and Varnish, the author claims that custom memory management can hurt performance because of paging. With that in mind, he suggests that you'd better just leave it to the OS.

Wouldn't those arguments be valid to the situation discussed here too?

January 14, 2011 | Unregistered Commenter Hugh Farago

@Hugh The Redis developer is now mulling over his previous decision to bypass the OS page cache - https://groups.google.com/d/topic/redis-db/1ES8eGaJvek/discussion

I had reminded him about that nice Varnish article last year - see comment #29 here: http://antirez.com/post/redis-virtual-memory-story.html

Well, some like to learn it the hard way ;-)

January 16, 2011 | Unregistered Commenter Ashwin Jayaprakash

The link to vector clocks is broken. The correct link should be http://basho.com/posts/technical/why-vector-clocks-are-easy/

January 6, 2019 | Unregistered Commenter JD Kemsley
