Monday, November 30, 2009

Why Existing Databases (RAC) are So Breakable!

One of the core assumptions behind many of today’s databases is that disks are reliable. In other words, your data is “safe” if it is stored on a disk, and indeed most database solutions rely heavily on that assumption. Is it a valid assumption?

Read the full story here

Reader Comments (14)

Not at my company. We back up all of our masters and binlogs regularly (a matter of hours with regular snapshots) from the machines themselves onto network storage and eventually onto tape. Of course we take machines offline for maintenance all of the time (RAM and disk failures). I think that's pretty normal when you have hundreds of database servers or just very big and active databases (which requires swapping data on and off of disk).

----------------------------------------------------------------------------
http://blog.pe-ell.net/1_jons_ramblings/archive/5_what_i_do.html

November 30, 2009 | Registered Commenter Jon Stephens

Is it just me or is that link to the full article wrong?

November 30, 2009 | Unregistered Commenter linkfail

First, the link seems to be broken. I had to Google for your blog and find the actual article there.

Second, what a huge strawman. Nobody who's serious about data integrity just throws it on a single disk and assumes that it's safe forever, yet you act as though statistics given for single disks apply equally to arrays with internal redundancy. They don't. You also seem rather keen to go looking for numbers on disk failures and then conclude that RAM can be more reliable . . . without any effort to find equivalent numbers for RAM failure rates. Here's a hint: RAM *also* fails more often than people think. Replication is equally necessary regardless of the underlying storage technology, for the very reason that Jason McHugh alludes to: *any* component can fail, and if you have enough of a component then failure will become a normal case.

There are definitely problems with disk-based solutions, but those don't have to do with innate unreliability. They have to do with latency, and contention, and things like that. Just replacing disks with memory solves *absolutely no* reliability problems, and any approach that can be used to make memory reliable can be used with disks as well. Then there's the issue of very real problem sets that simply do not fit even into the aggregate memory of large clusters. It's always sad when someone whose company has a huge hole in their portfolio tries to spread FUD so they can sell more of what they do have instead of admitting that the hole needs to be filled.

November 30, 2009 | Unregistered Commenter Jeff Darcy

Give it another try. I made the link point to Nati's blog post.

November 30, 2009 | Registered Commenter HighScalability Team

Products providing both controller-based and host-based RAID-1 are available, and some of these allow multiple sites and hundreds of kilometers between the mirrored disks and the host boxes; your central limit here is not the box, but the available bandwidth and the latency of the link(s) between the boxes.

Ignoring memory and bus errors, simply loading (more) memory onto a box is no panacea; memory bus length and latency are key factors here, too. Much like the number of PCIe slots that can be available within a box (which is one of the factors limiting the maximum number of disks that can be configured), the number of memory slots that can be physically present in a box is limited by bus-length and signaling restrictions.

And if disk reliability (studies from Google and CMU) and memory reliability (qv: Google) are intriguing topics, then the discussions and various claims around processor RAS features can be intriguingly and frustratingly nebulous. In the absence of independent studies, and given the experience of the Google and CMU studies, these processor RAS features can certainly be (and given sufficiently valuable data, probably should be) viewed with some skepticism.

November 30, 2009 | Unregistered Commenter Stephen Hoffman

Jeff, see my responses to your comments below:

"Nobody who's serious about data integrity just throws it on a single disk"

That was not my assumption. My argument applies to the fact that database clusters rely on shared storage for maintaining their cluster state. In most cases, if you pull out that storage device, the entire cluster breaks.


"you act as though statistics given for single disks apply equally to arrays with internal redundancy"


Quoting from my post:

There is NO correlation between failure rate and disk type – whether it is SCSI, SATA, or Fibre Channel. Most data centers are based on the assumption that investing in high-end disks and storage devices will increase their reliability – well, it turns out that high-end disks exhibit more or less the same failure patterns as regular disks! John Mitchell had an interesting comment on this matter during our QCon talk, when someone pointed to RAID disks as their solution for reliability. John said that since RAID is based on an exact H/W replica that lives in the same box, there is a very high likelihood that if a particular disk fails, its replica will fail in the same way. This is because they are all the exact same model, handle the exact same load and share the same power/temperature.
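
A toy calculation makes that point concrete (the failure probability and the correlation factor below are assumptions for illustration only, and rebuild windows are ignored):

    # Toy illustration of the correlated-failure argument, with made-up numbers:
    # mirroring helps enormously when failures are independent, far less when the
    # two disks share model, load, power and temperature.

    p_disk = 0.04   # assumed annual failure probability of a single disk
    rho    = 0.3    # assumed chance the mirror fails the same way, given the first failure

    p_pair_independent = p_disk * p_disk                      # ~0.0016 per year
    p_pair_correlated  = p_disk * (rho + (1 - rho) * p_disk)  # ~0.013 per year

    print("mirror-pair loss per year, independent: %.4f" % p_pair_independent)
    print("mirror-pair loss per year, correlated:  %.4f" % p_pair_correlated)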

"then conclude that RAM can be more reliable"

My specific words were "Memory can be more reliable than disk". I deliberately used the word "can" and not "is", as I'm sure there are cases where it can be less reliable. By default most people assume that memory is not reliable and disks are reliable. My point in that comment was to show that those assumptions can be quite the opposite. I was also very explicit about the fact that I couldn't find data points to back my statement other than pointing to common sense.

"There are definitely problems with disk-based solutions, but those don't have to do with innate unreliability."

I beg to differ.


"They have to do with latency, and contention, and things like that. Just replacing disks with memory solves *absolutely no* reliability problems, and any approach that can be used to make memory reliable can be used with disks as well."

Right, but that was my exact point! Most high-end database clusters were not built with that model; instead they rely on shared storage and high-end hardware. They were designed under the assumption that failure can be prevented using this approach.

I was referring to Amazon S3 as an example of reliable storage. S3 relies on file systems but took a very different approach from database clusters to provide a high degree of reliability. To be clear, I wasn't trying to say that you can't build a reliable system on top of disks, but that if you want to do that you have to build your system in a way that tolerates failure in its architecture.

"Then there's the issue of very real problem sets that simply do not fit even into the aggregate memory of large clusters."

I agree.

November 30, 2009 | Registered Commenter Nati Shalom

"since RAID is based on an exact H/W replica that lives in the same box, there is a very high likelihood that if a particular disk fails, its replica will fail in the same way."

Ditto for disks in separate boxes, or for RAM in separate boxes. At my last company, we had to deal with problems that turned out to be related to bad batches of RAM, with problems spread out across dozens of nodes. How many "in memory data grids" are deliberately built using ECC RAM from multiple batches, with that knowledge built in to their data placement algorithms to avoid correlated failures? No more than is the case for disk arrays (except for the very low end), I'll bet. Most don't even turn on the OS features that would allow them to *measure* RAM error rates. They just know that their systems crash once in a while.
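
What batch- or rack-aware placement might look like is easy to sketch; the function below is purely hypothetical (node names, domain labels and the greedy strategy are all assumptions, not anyone's actual data grid):

    # Illustrative sketch: choose replica nodes so that no two copies share a
    # failure domain such as a RAM batch, rack or power feed.

    def pick_replica_nodes(nodes, domain_of, copies=3):
        """Return `copies` nodes, each from a different failure domain.

        nodes     -- iterable of node identifiers
        domain_of -- dict mapping node -> failure-domain label (e.g. RAM batch id)
        copies    -- number of replicas wanted
        """
        chosen, used_domains = [], set()
        for node in nodes:
            domain = domain_of[node]
            if domain in used_domains:
                continue  # this batch/rack already holds a copy
            chosen.append(node)
            used_domains.add(domain)
            if len(chosen) == copies:
                return chosen
        raise RuntimeError("not enough distinct failure domains for %d copies" % copies)

    # Example: spread three replicas across three different RAM batches.
    nodes = ["n1", "n2", "n3", "n4", "n5"]
    batch = {"n1": "batchA", "n2": "batchA", "n3": "batchB", "n4": "batchC", "n5": "batchB"}
    print(pick_replica_nodes(nodes, batch))  # -> ['n1', 'n3', 'n4']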

The real acid test is this. Take some reasonably sized data set, say 25TB (one day's worth of Facebook logs IIRC). Decide what your MTDL (Mean Time to Data Loss) target is. Then, using the best available information about component failure rates, figure out what number and arrangement of disks are necessary to reach that target. You'll have to take topology into account, because an internally redundant disk array can experience multiple failures without losing access to even a single copy of data, while server-resident disk is subject to the server's own failure rates along with those of NICs, switches, etc. The price premium for a decent disk array can therefore be outweighed by the additional redundancy necessary for a server-based approach. Then repeat the experiment for RAM, which is always singly hosted and will have to be spread across more servers, switches, etc. Even if you had one tenth the failure rate, if you need ten times as many components you're not getting ahead . . . and you'll be well behind once cost enters the picture. By the time all is said and done, RAM-based architectures will often be hard pressed to meet cost/capacity/MTDL targets that are trivially achievable with disk. That doesn't mean they can't be deployed as a *performance* enhancer, of course, but there's another generation or two of magic that has to happen before anyone more storage-savvy than Tim Bray will seriously think about relegating disks to a purely archival role.
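
A back-of-envelope version of that comparison fits in a few lines; every number below (capacities, failure rates, replica count) is an assumption for illustration, and it counts expected component failures per year rather than MTDL itself, which also depends on rebuild time and placement:

    # Rough sketch of the component-count argument: even if DRAM failed less
    # often per part, holding 25TB x 3 replicas in RAM needs far more servers.

    DATASET_TB = 25.0
    REPLICAS   = 3                    # copies of each byte, on either medium

    disk_tb           = 1.0           # assumed capacity of one disk
    disks_per_server  = 12
    ram_tb_per_server = 0.128         # assumed 128 GB of DRAM per server
    disk_afr          = 0.04          # per-disk annual failure rate (Google/CMU ballpark)
    server_afr        = 0.05          # assumed whole-server failure rate

    raw_tb       = DATASET_TB * REPLICAS
    disk_count   = raw_tb / disk_tb                  # 75 disks
    disk_servers = disk_count / disks_per_server     # ~6 servers
    ram_servers  = raw_tb / ram_tb_per_server        # ~586 servers

    disk_failures = disk_count * disk_afr + disk_servers * server_afr
    ram_failures  = ram_servers * server_afr

    print("disk: %6.1f servers, ~%5.1f component failures/year" % (disk_servers, disk_failures))
    print("RAM:  %6.1f servers, ~%5.1f component failures/year" % (ram_servers, ram_failures))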

November 30, 2009 | Unregistered Commenter Jeff Darcy

Jeff

Take the examples of Facebook's Cassandra, Google's Bigtable and Amazon's Dynamo: each is a live testimonial to how an extremely large-scale system can work in a reliable fashion even with the limitations that you pointed out, i.e. managing lots of machines.

There are basically two schools of thought here:

1. Prevent failure from happening, by putting redundancy within the box and using high-end hardware devices.
2. Assume that failures are inevitable and design your architecture to cope with them.

My argument is that approach 1 has an inherent fallacy, and the statistics gathered from years of experience in large data centers prove that. Jason (Amazon) takes that point even further, arguing that failure can happen anywhere, and therefore the assumption that you're "safe" if your data is stored in a database that is backed by storage can't work either. Instead you should design the entire application to cope with failure, not just your database.

Nowhere in my arguments did I say that you can use memory to solve all the problems in the world. I actually gave two references: 1) S3 as large-scale data storage based on commodity disks, and 2) an in-memory cluster that uses a distributed data grid to store data in memory.

IMO both approaches are complementary, i.e. I would use disk-based storage for large, long-term storage and memory for real-time access to data at limited capacity. Obviously RAM is (still) significantly more expensive than disk, so the combination of the two is probably your best choice, and BTW that was the exact conclusion I pointed out during my QCon presentation.

December 1, 2009 | Registered Commenter Nati Shalom

Yes, Dynamo/Cassandra/etc. have proven some very important points about a particular set of approaches working at large scale. I've written about that myself many times. OTOH, Lustre/PVFS2/GlusterFS have also proven some points about another set of approaches - those which you deride - also working at very large scale. The difference is less about "works" vs. "doesn't work" than about CAP-theorem tradeoffs.

The papers you cite do show (or showed two years ago when I read them) that disks fail more often than people think, but they're not a killer blow to the whole notion of internally redundant shared storage. Underestimating the value of more reliable storage is just as fallacious as overestimating it. Even though software ultimately has to take responsibility for dealing with failures, that software can still be significantly simplified if some subset of errors are handled transparently and some other subset can be handled by accessing an existing replica through a different path (including a different server) instead of having to generate a new replica. Good system designs can be based on that. Is RAC a good system design? Quite likely not, but that doesn't really reflect on the storage. If RAC is crap on top of shared storage, it could be crap on top of dispersed storage as well.

December 1, 2009 | Unregistered Commenter Jeff Darcy

Of course disks fail -- it happens all the time in a big environment, and replacing them is routine. The shared-disk architecture described here is actually most vulnerable to hardware failures in the host systems and memory. Corrupted buffers flushed from DRAM to shared disk can destroy data integrity for the entire cluster.

December 2, 2009 | Unregistered Commenter Johann Schleier-Smith

That's why you should use a real filesystem that checksums the data end to end, like ZFS or BTRFS.

December 2, 2009 | Unregistered Commenter gustav

Checksums definitely help. You need them in the application code as well as the filesystem, and the Oracle database can do block checksums. With the default settings there are still ways for corruption to slip past, though.
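
For readers wondering what application-level checksumming amounts to, here is a minimal sketch (not Oracle's or ZFS's actual mechanism, just the general idea of an end-to-end digest stored with each block and verified on every read):

    # Minimal end-to-end checksum sketch, independent of filesystem or database.

    import hashlib

    def write_block(payload: bytes) -> bytes:
        """Prepend a SHA-256 digest so corruption anywhere in the path is detectable."""
        return hashlib.sha256(payload).digest() + payload

    def read_block(stored: bytes) -> bytes:
        """Verify the digest before handing the data back to the application."""
        digest, payload = stored[:32], stored[32:]
        if hashlib.sha256(payload).digest() != digest:
            raise IOError("block checksum mismatch: corruption between write and read")
        return payload

    block = write_block(b"some row data")
    assert read_block(block) == b"some row data"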

I sleep best when the storage for each system is isolated from the others.

December 3, 2009 | Unregistered Commenter Johann Schleier-Smith

I got an interesting reference to recent research from the Department of Computer Science at Stanford University, "The Case for RAMClouds: Scalable High-Performance Storage Entirely in DRAM". Below is an interesting quote from this research:

With today’s technologies, if a 1KB record is accessed at least once every 30 hours, it is not only faster to store it in memory than on disk, but also cheaper (to enable this access rate only 2% of the disk space can be utilized).
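
The arithmetic behind that break-even is easy to reproduce roughly; the IOPS and capacity figures below are assumptions, so the result lands near, rather than exactly at, the paper's 2%:

    # Rough reconstruction of the RAMCloud break-even arithmetic (assumed numbers).

    record_bytes    = 1024
    access_interval = 30 * 3600    # each record touched once every 30 hours
    disk_iops       = 100          # assumed random reads/sec for one spindle
    disk_gb         = 1000         # assumed disk capacity

    servable_records = disk_iops * access_interval            # ~10.8 million records
    servable_gb      = servable_records * record_bytes / 1e9  # ~11 GB

    print("usable fraction of the disk: %.1f%%" % (100.0 * servable_gb / disk_gb))
    # At ~1-2% utilization the cost per *usable* gigabyte on disk is many times
    # the raw price, which is where DRAM starts to win despite its higher $/GB.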

December 5, 2009 | Registered Commenter Nati Shalom

Seems the article was more about failures of clustered systems than about failures of a clustered DB.

I managed an Oracle RAC through 3 generations of RH & Oracle versions; we did have failures of the SAN, network & hosts, but we never lost data or had any outages, as we ran more than 1 host. Plus we ran a duplicate RAC at another site (though on lesser-powered equipment) for total redundancy.

December 6, 2009 | Unregistered Commenter clarke
