This is the third guest post (part 1, part 2) in a series by Greg Lindahl, CTO of blekko, the spam-free search engine. Previously, Greg was Founder and Distinguished Engineer at PathScale, where he was the architect of the InfiniPath low-latency InfiniBand HCA, used to build tightly-coupled supercomputing clusters.
blekko's home-grown NoSQL database was designed from the start to support a web-scale search engine, with 1,000s of servers and petabytes of disk. Data replication is a very important part of keeping the database up and serving queries. Like many NoSQL database authors, we decided to keep R=3 copies of each piece of data in the database, and not use RAID to improve reliability. The key goal we were shooting for was a database which degrades gracefully when there are many small failures over time, without needing human intervention.
Most big storage systems use RAID levels like 3, 4, 5, or 10 to improve reliability. A conservative choice for a big NoSQL database might be to use a bunch of nodes with RAID, and then replicate the data to several nodes. That doesn't seem to be common among big NoSQL databases. Here are some reasons why RAID is disliked:
The alternative is to keep R=3 copies of each piece of data, on 3 different servers, and to use software consistency built into the database to keep the data consistent. This R=3 strategy fixes all of our RAID complaints:
Why 3 copies? It's a trade-off between expense and availability. With R=3, the cluster stays available as long as we can re-copy the data from failed disks before a third overlapping failure takes out the last remaining replica of some chunk.
It's true that keeping 3 copies of everything involves more disks than the typical RAID setup. However, if you stay in the sweet spot of direct-attached storage, the first 8-12 direct-attach disks are pretty inexpensive compared to what you're spending on CPUs, RAM, flash, and your network. Having more disks helps performance -- it gives you more read bandwidth. The extra disks don't help write bandwidth, since everything must be written R=3 times. Network bandwidth also takes a hit, although that's necessary in any scheme which protects against server failures.
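As a rough illustration of this bandwidth trade-off, here's a back-of-the-envelope sketch in Python; the per-disk numbers are assumptions for the example, not measurements from our hardware.

# Back-of-the-envelope bandwidth math for an R=3 node with direct-attached disks.
# The disk count and per-disk bandwidth below are illustrative assumptions.

DISKS_PER_SERVER = 10     # direct-attached disks (assumed)
DISK_MB_PER_SEC = 100     # sequential bandwidth of one disk (assumed)
R = 3                     # replicas of every piece of data

# Reads can be served from any single replica, so read bandwidth
# scales with the number of spindles.
read_bw = DISKS_PER_SERVER * DISK_MB_PER_SEC          # 1000 MB/s

# Every logical write lands on R disks, so effective write bandwidth
# is the raw disk bandwidth divided by R.
write_bw = DISKS_PER_SERVER * DISK_MB_PER_SEC / R     # ~333 MB/s

print(f"aggregate read bandwidth:  {read_bw:.0f} MB/s")
print(f"effective write bandwidth: {write_bw:.0f} MB/s")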
Rebuilds tend to be expensive in any system that preserves redundancy. A 5-disk, fully-used RAID-5 volume with a failure needs to read 4 disks completely and write one disk completely to return to full protection and performance. This can take quite a while, and is getting worse over time as disk capacities increase faster than disk bandwidth. In an R=3 cluster, a single disk contains chunks of data replicated to other disks elsewhere in the cluster. For rebuild purposes, it's nice to have at least 10 chunks on a disk, and it's also nice for the 2 other replicas of those 10 chunks to be on as many servers as possible. A rebuild then involves copying 1/10 of a disk between 10 pairs of servers, and the entire rebuild can be done in 1/10 the time of reading an entire disk. Rebuilding all the data on a failed server with 10 disks would involve 100 pairs of servers. While this eats up network bandwidth in comparison to a rebuild that takes place within a single server, the performance impact is spread out over the entire cluster instead of creating a performance hot-spot. We've chosen 10 chunks per disk at blekko; some other NoSQL databases have as many as a million chunks per disk. Too few chunks per disk means a slow rebuild whose performance impact is concentrated on a small number of nodes, creating a hot-spot, and failing to spread the replicas of those chunks across many servers has the same effect: longer rebuilds and hot-spots.
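To make that arithmetic concrete, here's a small sketch comparing a whole-disk rebuild to a chunked one; the disk size and bandwidth are assumptions chosen for illustration.

# Rough rebuild-time comparison: rebuilding one whole disk vs. re-copying
# 10 chunks in parallel across the cluster.  Disk size and bandwidth are
# illustrative assumptions.

DISK_MB = 4_000_000        # a 4 TB disk (assumed)
DISK_MB_PER_SEC = 100      # sustained disk bandwidth (assumed)
CHUNKS_PER_DISK = 10       # blekko's choice, per the text

whole_disk_seconds = DISK_MB / DISK_MB_PER_SEC

# RAID-style rebuild: one replacement disk has to be written end to end,
# so the rebuild is gated by a single disk's bandwidth.
raid_hours = whole_disk_seconds / 3600

# Chunked rebuild: 10 different pairs of servers each re-copy 1/10 of a
# disk at the same time.
chunked_hours = whole_disk_seconds / CHUNKS_PER_DISK / 3600

print(f"whole-disk rebuild: ~{raid_hours:.1f} hours, concentrated on one node")
print(f"chunked rebuild:    ~{chunked_hours:.1f} hours, spread over "
      f"{CHUNKS_PER_DISK} pairs of servers")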
There's a fair bit of jargon which isn't standardized yet in the world of R=3 NoSQL databases, so let me give you a few terms that we use inside blekko. We call the R-level of a cluster the minimum replica count of any chunk in the cluster, but just to confuse things, we also refer to the R-level goal of the cluster (usually 3) or the R-level of an individual chunk of data. So if our R3 (a goal of 3 copies of everything) cluster has a single chunk that's R2 (2 copies of this chunk), we call the entire cluster R2 (the worst R-level of any chunk in the cluster). While you might think this triple usage is confusing, in practice it seems to work out OK. As long as we're at R1 or better, we can read all of the data in the database. R1 is a scary place to be, though, since a single failure might cause the cluster to go to R0, where some data can't be read. Also, at some point between R3 and R1, we stop being able to write new data into the cluster efficiently. If all chunks are at R1, for example, and we return to R3, a lot of data will need to be replicated for all the replica chunks to become current.
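A minimal sketch of how the cluster-wide R-level falls out of per-chunk replica counts; the chunk table here is hypothetical example data, not our real metadata format.

# Cluster R-level = the worst (minimum) replica count of any chunk.
# The chunk -> replica-count map is hypothetical example data.

def cluster_r_level(replica_counts):
    """Return the R-level of the whole cluster."""
    return min(replica_counts.values())

chunks = {
    "chunk-0001": 3,
    "chunk-0002": 3,
    "chunk-0003": 2,   # one replica lost: this chunk is R2
}

level = cluster_r_level(chunks)
print(f"cluster is R{level}")       # -> cluster is R2

if level <= 1:
    print("R1 or worse: one more failure may make some data unreadable (R0)")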
In an ideal world, each failure in an R3 cluster would only reduce the overall R-level of the cluster by one. Failures aren't random: they happen at the disk, server, PDU (power circuit), and network switch levels. If we let several replicas of a chunk live on the same disk, same server, same PDU, or same non-redundant network switch domain, we reduce the resilience of our database. Improving resilience in this circumstance is easy: let's make a rule that only one replica of a chunk can live on any disk, server, PDU (1 rack in our datacenter), or leaf network switch (3 racks in our datacenter). Our central switch infrastructure is redundant, and when it does fail it takes the whole cluster down anyway, so chunk placement can't help with that failure mode. Now that we've made the cluster behave more gracefully in response to failures, let's also consider administrative convenience. It's nice to be able to take down a large fraction of the cluster at once in order to do things like add RAM or upgrade software. We already have the ability to take down 3 racks (1 leaf network switch) at once without reducing the R-level by more than 1, but in theory we should be able to take down up to 1/3 of the cluster at once. To make this possible, we divide the cluster into 3 or 4 "zones", and make sure no chunk is replicated twice in a zone.
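Here's a sketch of the placement check these rules imply; the failure-domain record and its fields are assumptions for illustration, not our actual code.

# Placement rule sketch: at most one replica of a chunk per server (which
# also covers the per-disk rule), per PDU (1 rack), per leaf switch
# (3 racks), and per zone.  The Location record is hypothetical.

from dataclasses import dataclass

@dataclass(frozen=True)
class Location:
    server: str
    pdu: str          # one rack
    leaf_switch: str  # three racks
    zone: str         # 1/3 to 1/4 of the cluster

def placement_ok(existing, candidate):
    """True if the candidate shares no failure domain with any existing replica."""
    for loc in existing:
        if loc.server == candidate.server:        # also rules out the same disk
            return False
        if loc.pdu == candidate.pdu:
            return False
        if loc.leaf_switch == candidate.leaf_switch:
            return False
        if loc.zone == candidate.zone:
            return False
    return True

replicas = [Location("srv017", "pdu-04", "leaf-02", "zone-a")]
candidate = Location("srv023", "pdu-04", "leaf-03", "zone-b")
print(placement_ok(replicas, candidate))   # False: same PDU (same rack)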
Our NoSQL database has a separate daemon, the Repair Daemon, responsible for duplicating, moving, and removing chunks in the database, subject to the placement rules in the previous section. This daemon also has some additional abilities related to "draining" a server or disk known to be failing. The information that something is about to fail can come from sources such as disk errors and ECC errors in syslog, SMART daemon complaints, and failed fans or high-temperature events reported by the BMC. Draining chunks off before the disk or server fails completely can reduce the time a cluster spends at R2, which makes it less likely that the cluster will go to R1 or R0 due to an unlucky rapid series of failures. The repair daemon drains a disk or server by making a temporary 4th copy of the chunks involved, and then deleting the chunks on the drained disk or server. When all chunks are removed, the disk or server can be repaired. Failing disks happen often enough in our environment that we have fully automated the decision to drain a disk, or an entire server if the failing disk is the system disk.
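And a sketch of the drain sequence itself, using an in-memory chunk map as a hypothetical stand-in for the real cluster metadata; the real repair daemon also applies the placement rules when it picks a destination.

# Drain sketch: each chunk on the failing disk gets a temporary 4th copy
# elsewhere before its local replica is deleted, so no chunk's R-level
# ever drops during the drain.  The chunk map and disk names are
# hypothetical example data.

cluster = {
    # chunk id -> set of disks holding a replica
    "chunk-0001": {"srv01:sdb", "srv09:sdc", "srv17:sdd"},
    "chunk-0002": {"srv01:sdb", "srv12:sda", "srv20:sdb"},
}

def drain_disk(cluster, failing_disk, spare_disks):
    for chunk, replicas in cluster.items():
        if failing_disk not in replicas:
            continue
        # Pick a destination that doesn't already hold this chunk (a real
        # implementation would also enforce the placement rules here).
        dest = next(d for d in spare_disks if d not in replicas)
        replicas.add(dest)              # temporary 4th copy
        replicas.remove(failing_disk)   # then drop the draining replica
    # No chunk depends on failing_disk any more; it can be pulled and repaired.

drain_disk(cluster, "srv01:sdb", spare_disks=["srv05:sdb", "srv23:sda"])
print(cluster)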
With all this talk of reliability, it's important to remember that there will always be classes of errors which can easily cause R0. One such problem is bugs in the NoSQL database itself: if a particular row of data causes the database to crash, all 3 replicas of the chunk containing that row can fail in the same way. This sort of bug has happened to us several times.
There are a few storage-related topics that I'd like to briefly talk about, even though they aren't directly related to R=3 databases.
You can probably tell from reading this post that we've learned a lot of lessons the hard way. We've run the system as described for about 4 years now, starting with 150 servers and growing to 1,500, scaling our raw disk storage from 1 petabyte to 15. During this time, we have never gone below R1 due to hardware failures. We've had site downtime, but it's always been related to software bugs (the "baseball game incident" was very unfortunate) or network issues.