Tuesday
Jul 31 2007
BerkeleyDB & other distributed high performance key/value databases

I currently use BerkeleyDB as an embedded database (http://www.oracle.com/database/berkeley-db/), a decision initially prompted by learning that Google used BerkeleyDB for its universal sign-on feature.
Lustre looks impressive, but their white paper cites 800 file creations per second as a good number. By contrast, BerkeleyDB on my Mac mini does 200,000 row creations per second, and with replication it can be made to behave like a distributed file system.
I'm having I/O scalability issues with BerkeleyDB on one machine, and I'm about to implement their distributed replication feature (and go multi-machine), which in effect makes it work like a distributed file system, but with local access speeds. That's why I was looking at Lustre.
The key difference between BerkeleyDB and Lustre is that BerkeleyDB keeps a complete copy of all the data on each computer, which makes it unsuitable for truly massive databases. However, if your total possible data fits on one disk (under 1 TB, say), it seems to me that a replicated local key/value database is the fastest solution.
I haven't found much discussion of people using this kind of technology for highly scalable web sites.
Over the years, I've had extremely good performance results with dbm files, and have found that nothing beats local data accessed through C APIs backed by btree or hash table implementations. I have never tried replicated/redundant versions of this approach, and I'm curious if others have, and what your experience has been.
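For anyone who hasn't worked with the C API directly, here's a minimal sketch of the kind of local btree access I mean, using the standard Berkeley DB calls (the file name and key are placeholders, and error handling is trimmed for brevity):

```c
#include <stdio.h>
#include <string.h>
#include <db.h>

int main(void) {
    DB *dbp;
    DBT key, data;
    char *k = "user:42", *v = "john";

    if (db_create(&dbp, NULL, 0) != 0)
        return 1;
    /* DB_BTREE keeps keys sorted; DB_HASH is the other common choice. */
    if (dbp->open(dbp, NULL, "example.db", NULL, DB_BTREE, DB_CREATE, 0664) != 0)
        return 1;

    memset(&key, 0, sizeof(key));
    memset(&data, 0, sizeof(data));
    key.data = k;   key.size = strlen(k) + 1;
    data.data = v;  data.size = strlen(v) + 1;
    dbp->put(dbp, NULL, &key, &data, 0);

    memset(&data, 0, sizeof(data));
    if (dbp->get(dbp, NULL, &key, &data, 0) == 0)
        printf("%s -> %s\n", k, (char *)data.data);

    dbp->close(dbp, 0);
    return 0;
}
```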
Reader Comments (5)
Embedded products like BerkeleyDB have a few problems:
1. It's not SQL and it's not an RDBMS. Those are what most people are familiar with and they probably already have language bindings for these environments.
2. You can't access a BerkeleyDB database over a remote file system, so you have to message a server anyway; so why bother?
3. Other tools you may use require an RDBMS.
4. The commercial license is not cheap.
I've used it in embedded environments and the database is solid and the tools are solid. So if the downsides are acceptable, I would go for it.
I agree, this increases development time considerably over SQL. However, you get a massive increase in performance. MySQL appears to be a SQL parsing engine that sits on top of simple key/value database systems, and MySQL can optionally use BerkeleyDB as its engine. However, since I get about 200,000 queries/second out of BerkeleyDB, it also means that I don't have to spend time on the complicated performance-optimization strategies I need with my SQL-backed web sites.
Most interesting to me is that Amazon and Google appear to use key/value databases extensively, so they must find the performance-versus-development-time trade-off worth it.
However, your point is very well taken. I use AOLserver & BerkeleyDB, both of which are high-performance development tools, but the two available drivers for making them talk to each other have significant limitations (BerkeleyDB's Tcl driver is not thread-safe, and the third-party driver I use is buggy). I'm currently considering writing my own language bindings to BerkeleyDB, which is a pretty formidable task, which is why I'm re-examining my technology base on this forum before doing so :D
I agree, this is a serious issue, and I did add my own client/server layer to BerkeleyDB at one point to see how well it performed (so-so, is the answer). However, why use a remote file system at all? Why not use the local BerkeleyDB database with the replication manager enabled? That automatically pushes database writes over TCP/IP to a master server, which looks much like a traditional database server, while reads come from a local copy of the master database. Since most applications are read-heavy, this "use the server for writes, but read from a local copy" approach gives an architecture similar to client/server SQL, but with a lot more performance.
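As a rough sketch of what that setup looks like through the replication manager's C API (the host names, port, and thread count here are only illustrative, and the exact call names vary somewhat between releases):

```c
#include <db.h>

/* Open one replicated environment per machine: writes are forwarded to
 * the master, reads are served from the local copy. */
DB_ENV *open_replicated_env(const char *home, int is_master) {
    DB_ENV *env;
    int eid;

    db_env_create(&env, 0);

    /* This machine's listening address, plus one known peer. */
    env->repmgr_set_local_site(env, "node1.example.com", 6000, 0);
    env->repmgr_add_remote_site(env, "node2.example.com", 6000, &eid, 0);

    env->open(env, home,
              DB_CREATE | DB_RECOVER | DB_THREAD |
              DB_INIT_LOCK | DB_INIT_LOG | DB_INIT_MPOOL |
              DB_INIT_TXN | DB_INIT_REP, 0);

    /* Start 3 replication threads; DB_REP_ELECTION could be used instead
     * to let the group pick its own master. */
    env->repmgr_start(env, 3, is_master ? DB_REP_MASTER : DB_REP_CLIENT);
    return env;
}
```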
I agree; so far I've always had to use PHP and MySQL as well on all my web sites, in order to have access to other good open-source work and not reinvent everything.
For web sites, since there is no redistribution of software, the way I read their open source license, BerkeleyDB is completely free. http://www.oracle.com/technology/software/products/berkeley-db/htdocs/licensing.html
- John Buckman
BookMooch.com / Magnatune.com
> For web sites, since there is no redistribution of software
I did not know that. Thanks.
The idea of using BerkeleyDB replication is a good one for reads. I imagine you round-robin or hash to the different replicas? Any idea when you'll hit the write wall? And still, I wonder if memcached might give you more bang for the buck in this scenario? But it's hard to beat 200,000 qps!
No, what I do is have a pool of machines, each with a complete copy of the web site application and its own BerkeleyDB instance, and round-robin client access across the web server pool. I can be lazy and just assign multiple IPs to www.mysitename.com, or be fancier and use a reverse proxy on a single IP with some load-balancing scheme.
It's highly application-dependent (i.e., whether your app does a lot of writes), and also highly dependent on the write-cache flush interval. BerkeleyDB can keep all the replicas in perfect sync, at a high performance cost; the more delay you allow, the higher the performance. Since it's unlikely a web user could fetch another web page less than 1 second after causing a write, I'm using a 1-second write flush time.
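Roughly, the knobs involved look like this in the C API; the one-second interval and the ack policy shown are just the trade-off I described, not the only options:

```c
#include <unistd.h>
#include <db.h>

/* Trade durability for throughput on an already-opened environment. */
void relax_durability(DB_ENV *env) {
    /* Commit without forcing the transaction log to disk every time... */
    env->set_flags(env, DB_TXN_NOSYNC, 1);
    /* ...and don't wait for replicas to acknowledge each commit. */
    env->repmgr_set_ack_policy(env, DB_REPMGR_ACKS_NONE);
}

/* Run in a background thread: flush the log roughly once a second,
 * bounding how much work can be lost in a crash. */
void *log_flusher(void *arg) {
    DB_ENV *env = arg;
    for (;;) {
        sleep(1);
        env->log_flush(env, NULL);
    }
    return NULL;
}
```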
Anything that moves reads onto TCP/IP will incur a huge performance penalty versus in-memory local C calls. At a FOO Camp talk where Google was discussing BigTable, they proudly stated that they got single-query times down to 10 ms each. That's impressive, but it means just 100 queries/second, and that's with their PhD geniuses working on it, versus the crazy speeds you get with unoptimized BerkeleyDB. I should mention that I get 200,000 queries/second on my Mac mini, and the BerkeleyDB web site claims 1 million queries/second on more typical server hardware.
I also find that because I have this kind of query speed, a lot of things I'd otherwise be optimizing, such as string lookup tables for page translation, I instead handle with database lookups. That really simplifies the programming, and lets me do some neat features: on BookMooch I offer wiki-style "correct this translation", which changes the database values for each string on a web page. To enable this, every string on every page on BookMooch is a database lookup, so a typical page can involve 50 database lookups. This technique wouldn't be practical with SQL. Other people who offer a similar feature usually do it with a PHP array text file that they rewrite and then distribute via rsync. That works too, but it's more programming and a bit clumsy.
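As a sketch of what each of those string lookups amounts to (the function name and the idea of an already-open "strings" database are just illustrative):

```c
#include <string.h>
#include <db.h>

/* Look up the current translation for a message id; fall back to the id
 * itself if no translation has been stored yet. The returned pointer is
 * owned by Berkeley DB and only valid until the next call on this handle. */
const char *msg(DB *strings, const char *id) {
    DBT key, data;

    memset(&key, 0, sizeof(key));
    memset(&data, 0, sizeof(data));
    key.data = (void *)id;
    key.size = strlen(id) + 1;

    if (strings->get(strings, NULL, &key, &data, 0) == 0)
        return (char *)data.data;
    return id;
}
```

A wiki-style correction is then just a put() on the same key, which each web server's local replica picks up through replication.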
All this being said, the replication documentation in BerkeleyDB is pretty hard to understand, and quite involved, so I've been procrastinating on learning about it.
For now, I'm going to rewrite the language bindings for BerkeleyDB in C, and then probably rewrite a few key (and load-intensive) web pages, like full-text search, in C as well. My BookMooch web site has had 28% month-over-month growth for 6 months now, so scalability is something I really have to focus on (right now, at peak times, I'm seeing terrible 5-to-20-second page load times).
- John Buckman
BookMooch.com / Magnatune.com
This post just made me think of a distributed hash using the memcached protocol and BerkeleyDB. Clearly, it's for those not looking for a relational DB, just fast key-based lookups.
"Tugela Cache is derived from Memcached. Much of the code remains the same, but notably, these changes: Internal slab allocator replaced by BerkeleyDB B-Tree database. ..."
http://meta.wikimedia.org/wiki/Tugelacache