Monday, December 10, 2012

Switch your databases to Flash storage. Now. Or you're doing it wrong.

This is a guest post by Brian Bulkowski, CTO and co-founder of Aerospike, a leading clustered NoSQL database. He has worked on high-performance commodity systems since 1989.

Why flash rules for databases

The economics of flash memory are staggering. If you’re not using SSD, you are doing it wrong. 

Not quite true, but close. Some small applications fit entirely in memory – less than 100GB – great for in-memory solutions. There’s a place for rotational drives (HDD) in massive streaming analytics and petabytes of data. But for the vast space between, flash has become the only sensible option. 

For example, the Samsung 840 costs $180 for 250GB. The manufacturer rates this drive at 96,000 random 4K read IOPS and 61,000 random 4K write IOPS. The Samsung 840 is not alone at this price-performance point: a 300GB Intel 320 is $450, and an OCZ Vertex 4 256GB is $235, with the Intel rated slowest but showing solid performance in our internal testing. Most datacenter chassis will accommodate four data drives, and adding four Samsung 840s creates a system with 1TB of storage, 384,000 read IOPS, and 244,000 random write IOPS, for a storage street cost of $720 and an extra 0.3 watts of server power draw.

If you have a dataset under 10TB, and you’re still using rotational drives, you’re doing it wrong. The new low cost of flash makes rotational drives useful only for the lightest of workloads.

Most operational, non-analytic workloads require only a few IOPS per transaction. A good database should require just one.

HDDs have a price of about $0.10 per GB (10x cheaper than flash), but each spindle supports only about 200 IOPS: the number of seeks per second. Until the recent advent of flash, databases were IOPS limited, requiring large arrays to reach high performance. Estimating cost per IOPS is difficult, as smaller drives provide the same performance at lower cost. But achieving performance similar to the 96,000 IOPS of a $180 Samsung 840 would require over 400 HDDs, at a price of hundreds of thousands of dollars.
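For the back-of-envelope readers, here is the same comparison as a tiny sketch in C, using the article's December 2012 figures. The numbers are purely illustrative, not a benchmark:

/* Cost-per-IOPS sketch using the article's December 2012 numbers. */
#include <stdio.h>

int main(void) {
    /* Samsung 840 (per the article): $180, 96K random 4K read IOPS */
    double ssd_price = 180.0, ssd_read_iops = 96000.0;
    /* Commodity 1TB HDD: roughly $0.10/GB and ~200 random IOPS per spindle */
    double hdd_price = 100.0, hdd_iops = 200.0;

    printf("SSD cost per read IOPS: $%.4f\n", ssd_price / ssd_read_iops);
    printf("HDD cost per IOPS:      $%.2f\n", hdd_price / hdd_iops);
    printf("Spindles needed to match one Samsung 840: %.0f\n",
           ssd_read_iops / hdd_iops);
    return 0;
}

At roughly 480 spindles to match one consumer SSD, the "over 400 HDD" figure above falls straight out of the arithmetic.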

Let’s compare the economics of memory. Dell is currently (December 2012) charging $20 per GB for DRAM (a 16GB DIMM at $315), and a fully loaded R720 with RDIMMs tops out at 384GB for $13,000, or about $33 per GB fully loaded. Memory has no IOPS limit, and main memory databases have been measured at over 1M transactions per second. Memory is faster, but we’ll see that for most use cases, network bottlenecks cancel out RAM’s performance advantage.

Step back: $33 per GB for RAM versus about $1 per GB for flash. High density 12TB flash solutions can be built with the current Dell R720, compared to a high density 384GB memory system at about the same price ($13K per server). RAM’s power draw tips the equation even further.

Flash storage provides random access capabilities, which means your application developers are spending less time optimizing query patterns. All the queries go fast.  That fast random access results in architectural flexibility, and allows you to change your data patterns and applications rapidly. That’s priceless.

The lure of main memory databases 

Main memory sounds ideal - it’s blindingly fast for random data access patterns. Reads and writes are predictable, and memcache is one of the most loved balms to fix performance issues discovered in deployment. 

It’s easy to write a new main memory database, and to simply cache data in your application. As a programmer, you never have to write an I/O routine and never have to deal with thread context switches. Using a standard allocator and standard threading techniques—without even optimizing for memory locality (NUMA optimizations)—a database built on main memory principles will be faster than 1G and 10G networking.

In Russ’s blog post about 1M TPS on a single $5K server, he showed blistering performance on an in-memory dataset. Clustered, distributed databases utilize k-safety and allow persistence to disk, so losing your data is not an issue. A Dell R520 with 96GB of memory can be had for $6,000, and if your business problem fits in a few hundred gigabytes, main memory is a great choice.

The problems come when you scale out. You start buying a lot of RAM, and you find interesting applications that are not cost-effective – where 100GB of data was a good start, but a few terabytes of storage would create a very compelling application. 

I recently visited two modest social networking companies and found each had 4TB of memcache servers – a substantial main memory investment. As they broadened their reach and tried to build applications that spanned more variety for each user request, they just kept beefing up their cache tier. The CTOs at both companies didn’t complain about the cost. Instead, they were afraid to roll out the best user experience ideas they had—new features such as expanded friend-of-friend display—because that would expand cache requirements, thrash the caching layer, and bring down their service. Or require another several racks of servers.

With the right database, your bottleneck is the network driver, not flash

Networks are measured in bandwidth (throughput), but if your access patterns are random and low latency is required, each request is an individual network packet. Even with the improvements in Linux network processing, we find an individual core is capable of resolving about 100,000 packets per second through the Linux kernel.

100,000 packets per second aligns well with the capability of flash storage, at about 20,000 to 50,000 IOPS per device, and 4 to 10 devices fit well in a current chassis. RAM is faster – in Aerospike, we can easily go past 5,000,000 TPS in main memory if we remove the network bottleneck through batching – but for most applications, batching can’t be cleanly applied.

This bottleneck still exists with high-bandwidth networks, since the bottleneck is the processing of network interrupts. As multi-queue network cards become more prevalent (not available today on many cloud servers, such as the Amazon High I/O Instances), this bottleneck will ease – and don’t think switching to UDP will help. Our experiments show TCP is 40% more efficient than UDP for small transaction use cases. 

Rotational disk drives create a bottleneck much earlier than the network. A rotational drive tops out at about 250 random transactions per second. Even with a massive RAID 10 configuration, 24 direct attach disks would create a bottleneck at about 6,000 transactions per second. Rotational disks never make sense when you are querying and seeking. However, they are appropriate for batch analytics systems, such as Hadoop, which stream data without selecting. 

With flash storage, even if you need to do 10 to 20 I/Os per database transaction, your bottleneck is the network. If you’re in memory, the bottleneck is still the network. 

If you choose the main memory path, you’ve thrown away a lot of money on RAM, you’re burning money on powering that RAM every minute of every day, and, very probably, your servers aren’t going any faster.

The top myths of flash

1. Flash is too expensive. 

Flash is 10x more expensive than rotational disk. However, you’ll make up the few thousand dollars you’re spending simply by saving the cost of the meetings to discuss the schema optimizations you’ll need to try to keep your database together. Flash goes so fast that you’ll spend less time agonizing about optimizations. 

2. I don’t know which flash drives are good.

Aerospike can help. We have developed and open-sourced a tool, the Aerospike Certification Tool (ACT), that benchmarks drives for real-time use cases, and we’re providing our measurements for older drives. You can run these benchmarks yourself and see which drives are best for real-time use, or see our latest test results.

3. They wear out and lose my data.

Wear is an issue with flash, although rotational drives fail too. There are several answers. When a flash drive fails, you can still read the data. With a clustered database and multiple copies of the data, you gain reliability – a server-level form of RAID. As drives fail, you replace them. Importantly, new flash technology arrives every year with higher durability, such as this year’s Intel S3700, which claims each drive can be rewritten 10 times a day for 5 years before failure. Next year may bring another generation of reliability. With a clustered solution, you simply upgrade drives on machines while the cluster stays online.
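To put that endurance claim into concrete terms, here is a rough sketch; the 400GB capacity is an assumption chosen for illustration, so check the spec sheet for a given model:

/* Back-of-envelope endurance math for a drive rated at 10 full drive
 * writes per day for 5 years. The 400GB capacity is an assumption. */
#include <stdio.h>

int main(void) {
    double capacity_tb    = 0.4;          /* assumed drive capacity in TB */
    double writes_per_day = 10.0;         /* rated full-drive writes per day */
    double days           = 365.0 * 5.0;  /* 5-year rating period */

    double lifetime_tb = capacity_tb * writes_per_day * days;
    printf("Rated lifetime writes: %.0f TB (about %.1f PB)\n",
           lifetime_tb, lifetime_tb / 1000.0);
    return 0;
}

Under those assumptions the rating works out to several petabytes of writes per drive over its service life.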

4. I need the speed of in-memory.

Many NoSQL databases will tell you that the only path to speed is in-memory. While in-memory is faster, a database optimized for flash using the techniques below can provide millions of transactions per second with latencies under a millisecond.

Techniques for flash optimization

Many projects work with main memory because the developers don’t know how to unleash flash’s performance. Relational databases speed up only 2x or 3x when put on a storage layer that supports 20x more I/Os. The following are three programming techniques that significantly improve performance with flash.

1. Go parallel with multiple threads and/or AIO

Different SSD drives have different controller architectures, but in every case there are multiple controllers and multiple memory banks—or the logical equivalent. Unlike a rotational drive, the core underlying technology is parallel.

You can use ACT to benchmark the level of parallelism at which a particular flash device performs optimally, but we find the sweet spot is north of 8 and south of 64 parallel requests per device. Make sure your code can cheaply queue hundreds, if not thousands, of outstanding requests to the storage tier. If you are using your language’s asynchronous event mechanism (such as a co-routine dispatch system), make sure it maps efficiently to an OS primitive like asynchronous I/O rather than spawning threads and waiting.
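As a minimal illustration of keeping many requests in flight against one device, here is a sketch using the Linux kernel AIO interface (libaio). The device path, queue depth, and block size are illustrative assumptions, not Aerospike's implementation:

/* Minimal Linux AIO sketch: keep a batch of random 4K reads in flight
 * against one device. Link with -laio. Device path, queue depth, and
 * block size are illustrative assumptions. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <libaio.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define QUEUE_DEPTH 32          /* inside the 8-64 per-device sweet spot */
#define BLOCK_SIZE  4096

int main(void) {
    int fd = open("/dev/sdb", O_RDONLY | O_DIRECT);
    if (fd < 0) { perror("open"); return 1; }

    io_context_t ctx;
    memset(&ctx, 0, sizeof(ctx));
    if (io_setup(QUEUE_DEPTH, &ctx) < 0) { perror("io_setup"); return 1; }

    struct iocb cbs[QUEUE_DEPTH], *cbp[QUEUE_DEPTH];
    void *bufs[QUEUE_DEPTH];

    for (int i = 0; i < QUEUE_DEPTH; i++) {
        /* O_DIRECT requires block-aligned buffers */
        if (posix_memalign(&bufs[i], BLOCK_SIZE, BLOCK_SIZE)) return 1;
        long long offset = (long long)(rand() % 1000000) * BLOCK_SIZE;
        io_prep_pread(&cbs[i], fd, bufs[i], BLOCK_SIZE, offset);
        cbp[i] = &cbs[i];
    }

    /* Submit the whole batch at once, then reap the completions. */
    if (io_submit(ctx, QUEUE_DEPTH, cbp) != QUEUE_DEPTH) { perror("io_submit"); return 1; }

    struct io_event events[QUEUE_DEPTH];
    int done = io_getevents(ctx, QUEUE_DEPTH, QUEUE_DEPTH, events, NULL);
    printf("%d reads completed\n", done);

    io_destroy(ctx);
    close(fd);
    return 0;
}

The point of the sketch is the shape of the loop: the submitting thread never blocks on an individual read, so a single thread can keep a device's internal parallelism busy.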

2. Don’t use someone else’s file system 

File systems are very generic beasts. They are databases in their own right, with their own key-value syntax and interfaces, optimized for particular uses such as multiple names for one object and hierarchical naming. The POSIX file system interface supplies only one consistency guarantee. To run at the speed of flash, you have to remove the bottleneck of existing file systems.

Many programmers try to circumvent the file system by using direct device access and the O_DIRECT flag. Linus Torvalds famously derided O_DIRECT as braindamaged and wished he could remove it from the Linux kernel. Here’s how he put it in 2007:

Date: Wed, 10 Jan 2007 19:05:30 -0800 (PST)
From: Linus Torvalds
Subject: Re: O_DIRECT question

The right way to do it is to just not use O_DIRECT.

The whole notion of "direct IO" is totally braindamaged. Just say no.

        This is your brain: O
        This is your brain on O_DIRECT: .

        Any questions?

 

Our measurements show that the page cache dramatically increases latency. At the speeds of flash storage, the page cache is disastrous. Linus agreed substantially in his own post:

 

Side note: the only reason O_DIRECT exists is because database people are
too used to it, because other OS's haven't had enough taste to tell them
to do it right, so they've historically hacked their OS to get out of the
way.

As a result, our madvise and/or posix_fadvise interfaces may not be all
that strong, because people sadly don't use them that much. It's a sad
example of a totally broken interface (O_DIRECT) resulting in better
interfaces not getting used, and then not getting as much development
effort put into them.

So O_DIRECT not only is a total disaster from a design standpoint (just
look at all the crap it results in), it also indirectly has hurt better
interfaces. For example, POSIX_FADV_NOREUSE (which _could_ be a useful and
clean interface to make sure we don't pollute memory unnecessarily with
cached pages after they are all done) ends up being a no-op ;/

Sad. And it's one of those self-fulfilling prophecies. Still, I hope some
day we can just rip the damn disaster out.

In a test scenario we tried, enabling the page cache caused some requests to take 16 to 32 milliseconds, and a substantial portion (3% to 5%) took more than 1ms. With O_DIRECT, no request took more than 2ms, and more than 99.9% of requests completed in under 1ms. We did not test madvise; we also found it is not hooked up correctly in some versions of Linux.

Aerospike uses O_DIRECT, because it works.
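For reference, here is a minimal sketch of what opening a device with O_DIRECT looks like; the device path and alignment value are assumptions, and a real engine would layer its own buffering, scheduling, and error handling on top:

/* Minimal O_DIRECT read sketch: bypass the page cache by opening the raw
 * device with O_DIRECT and using a properly aligned buffer. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(void) {
    int fd = open("/dev/sdb", O_RDWR | O_DIRECT);
    if (fd < 0) { perror("open"); return 1; }

    /* O_DIRECT requires the buffer, offset, and length to be aligned,
       typically to the device's logical block size (512B or 4KB). */
    const size_t align = 4096, len = 4096;
    void *buf;
    if (posix_memalign(&buf, align, len) != 0) {
        fprintf(stderr, "posix_memalign failed\n");
        return 1;
    }

    ssize_t n = pread(fd, buf, len, 0);   /* read the first 4KB, no page cache */
    if (n < 0) perror("pread");
    else printf("read %zd bytes directly from the device\n", n);

    free(buf);
    close(fd);
    return 0;
}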

3. Use large block writes and small block reads

Flash storage is different from any other storage because it is asymmetric: reads behave differently from writes, unlike RAM and unlike rotational disk. Flash chips are like an Etch A Sketch: you can draw individual lines, but erasing requires shaking the entire screen. Flash chips work in native blocks of around 1MB, and writing at the same size as the fundamental block of the flash chip means the device keeps the simplest possible map.

Reads can be done anywhere on the device, unlike writes. You can exploit this characteristic by writing data together and reading randomly.

Over time, flash device firmware will improve, and small block writes will become more efficient and correct, but we are still early in the evolution of flash storage. Today, writing only in large blocks leads to lower write amplification and lower read latency.
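Here is a sketch of the large-block-write, small-block-read pattern: small records are accumulated in a 1MB buffer and flushed as a single large write, while reads fetch individual records at their offsets. The file name, record size, and layout are illustrative assumptions; a production engine would combine this with O_DIRECT, aligned buffers, and an index over record locations:

/* Large-block-write / small-block-read sketch: batch records into a 1MB
 * buffer, flush it with one large write, read records back individually. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define WRITE_BLOCK (1024 * 1024)   /* match the flash device's native block */
#define RECORD_SIZE 512

static char   wbuf[WRITE_BLOCK] __attribute__((aligned(4096)));
static size_t wpos = 0;
static off_t  file_off = 0;

/* Append a record to the write buffer; flush the whole 1MB block when full.
   Returns the offset where the record will live on the device. */
static off_t append_record(int fd, const void *rec) {
    if (wpos + RECORD_SIZE > WRITE_BLOCK) {
        if (pwrite(fd, wbuf, WRITE_BLOCK, file_off) != WRITE_BLOCK) abort();
        file_off += WRITE_BLOCK;
        wpos = 0;
    }
    off_t rec_off = file_off + (off_t)wpos;
    memcpy(wbuf + wpos, rec, RECORD_SIZE);
    wpos += RECORD_SIZE;
    return rec_off;
}

int main(void) {
    int fd = open("data.log", O_RDWR | O_CREAT, 0644);
    if (fd < 0) { perror("open"); return 1; }

    char rec[RECORD_SIZE] = "hello flash";
    off_t where = append_record(fd, rec);

    /* Force the buffered block out so the read below can see it. */
    if (pwrite(fd, wbuf, WRITE_BLOCK, file_off) != WRITE_BLOCK) { perror("pwrite"); return 1; }

    /* Small random read: fetch just the one record. */
    char out[RECORD_SIZE];
    if (pread(fd, out, RECORD_SIZE, where) != RECORD_SIZE) { perror("pread"); return 1; }
    printf("read back: %s\n", out);

    close(fd);
    return 0;
}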

Flash is better than you think; use it and prosper

Knowing how to use flash, and using flash-optimized databases, can give your designs a massive competitive benefit. Problems such as social metadata, graph analysis, user profile storage, massive online multiplayer games with shared social gameplay, security threat pattern analysis, and real-time advertising and audience analysis all benefit from immediately available, highly random database storage.


Reader Comments (19)

A trio of notes.

1. It's pretty disingenuous to compare Dell's price for RAM to Newegg's price for SSDs. You should see what Dell charges for SSDs (it's unconscionable). Also, it's important to consider TCO, as the MTBF for consumer SSDs like the Samsung 840 is much lower than that of decent ECC RAM. Once you consider datacenter-targeted SSDs (even cheap ones), you're only really paying a 3x overhead or so. That said, if you want a TB of memory in one machine, you currently need a big piece of iron or expensively dense DIMMs to get there. Even with scale-out systems, there's a number of TB where SSDs look attractive.

2. Performance differences between memory and flash vary dramatically depending on your workload. For massively parallel, key-value applications, network bandwidth and message throughput are often the bottleneck. For workloads that need to access more than one piece of data in an operation, or for synchronous workloads, the performance of memory can make a big difference.

This latter issue is why VoltDB uses stored procedures to achieve memory-optimized performance on boring gigabit ethernet. If you're limited by how many packets the server can exchange with the client, why not load up on the amount of work you can do per packet? Many of our customers start out with key-value applications and add richness to the put and get operations. Examples include enrichment, dynamically-computed values, validation and even fraud detection. These features can be added without additional network load.

3. Finally, one great thing about in-memory systems is they give new purpose to rotational media. Yes, spindles are terrible at fsyncs, but they're not terrible at append-only writes. This makes them excellent for snapshots or logs, where data is ordered in memory. Add an SLC flash-buffered disk controller, and they can fsync fast too. VoltDB still recommends spindles with smart controllers as the persistence backing of choice for its in-memory store.

December 10, 2012 | Unregistered CommenterJohn Hugg

This reads like marketing collateral for Fusion IO and the author's startup. Can you provide benchmarks on the tps you're claiming?

December 10, 2012 | Unregistered CommenterJay Peters

High density 12T solutions can be built with the current Dell R720

I'd love to hear some more details on how? I thought that box had 16 drive bays. Using 300GB drives in RAID 1 you're left with 2.4 terabytes.

Is it possible to cost effectively build a system closer to 10 terabytes with drive redundancy?

December 10, 2012 | Unregistered CommenterBen

If you aren't using ssd write/read caching, you are doing it wrong!

See how that works?

The fact that you are using MLC SSDs with horrible TBW ("total bytes written") performance seems to show you really don't like your data. Until Intel releases the 710 MLC-based drives, which will have petabyte endurance, doing so without knowing the consequences is playing with fire.

The better approach is to use ssd's as a caching layer. Or in your words, you are doing it wrong :-P

There's:
- cachecade 2.0 from LSI, which does this in your controller without caring about the OS (I do this and get 100K iops on 1TB of storage on a warehouse db)
- bcache for linux
- zfs for linux (linux patches from LLNL, quite stable)
- btrfs, if you are feeling adventurous.


The reality is that caching your most used data with SSDs is the better way to do this. Else, I hope you are building RAID 10 arrays across drives from different manufacturers and definitely keeping enough replicas... because MLC SSDs fail "predictively". Meaning, aside from manufacturing defects, they fail AT A SET amount of writes, plus or minus a few gigabytes... and guess what's going to happen to all of your dbs at similar time frames?


So while you can use MLCs in production, and I do all the time, you have to be really really careful. Advocating their blanket use, and definitely showing you don't understand the hardware concepts and/or that there are equally great solutions out there (SSD caching layers), reflects more on you than them :-P

December 10, 2012 | Unregistered CommenterJavier

Block caching is another option if you are caught in between the need for large database size and performance for a heavily used portion. Here's some benchmarking:
http://www.accelcloud.com/2012/04/18/linux-flashcache-and-bcache-performance-testing/
http://bcache.evilpiepirate.org/

Seems you still have to use a patched kernel for bcache whereas flashcache is available as a kernel module. Not sure of the integrity of the WAL under failure for either of these, so do your own homework.

December 10, 2012 | Unregistered CommenterBrian Knapp

Good points in this post; even if prices vary, the basic architecture of spinning rust vs. flash is well described. One problem is that decades of development of traditional database architectures is based on making disks work better, and to get the benefits of flash in full you need a very different storage engine that doesn't do in-place inserts or overwrite blocks, and instead writes large immutable files. This is what Apache Cassandra does, and we run it on SSD-based AWS instances.

December 11, 2012 | Unregistered Commenteradrianco

Any time I see an article with the title "If you’re not doing X, you are doing it wrong." wherein the author then immediately contradicts themselves ("Not quite true, but close") I see the whole article as linkbait marketing crap. Do yourself a favor and write more considered content.

December 11, 2012 | Unregistered CommenterJames

Great article, especially the part about in-memory databases being network-packet bound in the real world.

One interesting note about SSDs is that when running mixed read-write random workloads on a 50-75% full drive, the sustainable IOPS is 2-10x lower than the nominal ratings that you see thrown around (most good consumer SSDs sustain more like 20K IOPS in these cases).

December 11, 2012 | Unregistered CommenterDave Rosenthal

I worry for the author's performance. About three times he states it costs so much more, but you save because you don't have to worry about your algorithms for your writes/reads, because SSD just takes care of it by being so much faster. So what happens to your application when the speed of SSD becomes the standard and your application runs slow as heck because you didn't make it efficient? Not to mention it's like buying a sports car and putting 87 octane in the tank. It'll work, but you won't get the performance you spent all the money for.
My point is no, you won't save money, so the systems still end up costing three times more than standard options with efficient coding. Now with that said, if you build it out right you can get performance gains, but really there never is just one solution.
You should use in-memory for parts of the system, shifting to SSD when appropriate and then to HDD based on the system's needs at each step.

December 11, 2012 | Unregistered Commentermtcoder

Great article. Disk is the new Tape.

To really make ssd cost effective and reliable, add deduplication and a type of RAID that is specifically designed to protect against the 3 flash failure modes (device failure, bit errors, and performance variability).

As for database performance and behavior, you can check out my blog entries at blog.purestorage.com

December 11, 2012 | Unregistered CommenterChas. Dye

Scale vertically. Yes, this has worked for decades!

December 11, 2012 | Unregistered CommenterCraig

Write a provocative article -> get a lot of heated responses :)

SSDs are a topic riddled w/ conflicting arguments/claims/benchmarks, etc. We as a collective engineering community have yet to 100% agree on how to best use SSDs, a hard task as SSDs are constantly changing/improving, plus there is a huge disparity between drives.

SSDs are not all built the same; some perform well and then fall off sharply after a month or two in production, but some really do perform as expected (within a degree of margin) for extended periods IFF the software writing to them is tailored to/aware of SSDs' idiosyncrasies.

There is a lot of unfounded hype and a lot of unfounded fear surrounding SSDs and hopefully discussions like this article (and the ensuing [sometimes conflicting] comments) can start to shed light on the truths of SSDs in production use, so we can make forward progress on how to best use SSDs.

The point that there is a real sweet spot for SSDs in databases, on thoroughly tested SSDs w/ software optimized to SSDs' idiosyncrasies, is a valid one. Once that is agreed, the task becomes defining the sweet spot, which is unfortunately terribly complicated :)

December 11, 2012 | Unregistered CommenterRussell Sullivan

There's an awful lot of FUD being thrown around in the comments here by people with self-centered vested interests in thrashing this blog post. You don't need "enterprise" SSDs to scale out. SLC is a waste of money for the author's approach. "Enterprise MLC" is just overprovisioned MLC flash. You can do this yourself with the regular "consumer" drives. The Intel 320s that the author mentioned will easily last 4-5 years under all but the most brutal write workloads. Don't believe me? Try it yourself and see what your media wearout indicators are after a year. I bet they're in the 80-95 range. Who cares if your storage doesn't last 8 years? You're going to replace it all in 4 anyway.

If you care about your site running well and dealing with more important things, I'd recommend using SSDs in your databases like the author suggested. If you like being an uber nerd about the "best" and "most optimal" solution, you should keep reading random blogs and getting pedantic in the comment threads.

December 11, 2012 | Unregistered Commenterdiq

You must be careful when building a DRAM-based DB. In real life, power outages do happen all the time, even in double and triple UPS systems. Seen it one too many times - the problem is just like with data backup: it works wonderfully until you need to recover... then you're SOL! SSDs are great for casual data (i.e. the data you can afford to lose). Failure rates are still on the high side, and it will be a while (IMHO) until the technology and real MTBF catch up with good old HDs. If you must, go with RAID, and I mean RAID 10 and multiple drives, just as Javier advises (above post).

December 11, 2012 | Unregistered CommenterDmitry Gorin

For those who still live in 2008: modern enterprise MLC SSDs have lifetimes measured in petabytes. The Intel S3700 can sustain 5 full drive overwrites each day for 5 years. This is more than enough for any "enterprise" application.

December 12, 2012 | Unregistered CommenterVladimir Rodionov

You say you can take 4x 250GB flash drives and get 1TB of storage. So you recommend RAID 0 in production environments? All credibility lost with this recommendation. Wow.

Also you can't quote prices from Dell for RAM. $33/GB is insane. You can get server RAM (ECC/Registered) for $10 or less per GB from most non-enterprise vendors.

December 13, 2012 | Unregistered CommenterSean

The Intel S3700 does get more life, but it is also more than double the price of normal MLC. So it is >20x the price of an HDD. The author forgets to mention this detail. And secondly, the author suggests doing mirroring, etc., to deal with failures. Well, add 2x to that. So, overall, we are looking at 40x the cost of HDD :)

December 21, 2012 | Unregistered CommenterSR

Thanks for sharing this nice post!!!!
http://www.netdepot.com/

December 30, 2012 | Registered Commentertyler jones

Now I know where to put my database. Thanks for this article

October 11, 2017 | Unregistered CommenterJeff
