Russ’ 10 Ingredient Recipe for Making 1 Million TPS on $5K Hardware

My name is Russell Sullivan, I am the author of AlchemyDB: a highly flexible NoSQL/SQL/DocumentStore/GraphDB-datastore built on top of redis. I have spent the last several years trying to find a way to sanely house multiple datastore-genres under one roof while (almost paradoxically) pushing performance to its limits.
I recently joined the NoSQL company Aerospike (formerly Citrusleaf) with the goal of incrementally grafting AlchemyDB’s flexible data-modeling capabilities onto Aerospike’s high-velocity horizontally-scalable key-value data-fabric. We recently completed a peak-performance TPS optimization project: starting at 200K TPS, pushing to the recent community edition launch at 500K TPS, and finally arriving at our 2012 goal: 1M TPS on $5K hardware.
Getting to one million over-the-wire client-server database-requests per-second on a single machine costing $5K is a balance between trimming overhead on many axes and using a shared nothing architecture to isolate the paths taken by unique requests.
Even if you aren't building a database server, the techniques described in this post might be interesting, as they are not database-server specific. They could be applied to an FTP server, a static web server, and even to a dynamic web server.
Here is my personal recipe for getting to this TPS per dollar.
The Hardware
Hardware is important, but pretty cheap at 200 TPS per dollar spent:
- Dual Socket Intel motherboard
- 2*Intel X5690 Hexacore @3.47GHz
- 32GB DRAM 1333
- 2 ports of an Intel quad-port NIC (each port has 8 queues)
Select the Right Ingredients
Reaching optimal peak performance relies on combining and tweaking ALL of the architecture/software/OS ingredients below to hit the sweet spot and achieve a VERY stable 1M over-the-wire database read requests per second.
It is difficult to quantify the importance of each ingredient, but in general they are listed in descending order of importance.
Select the Right Architecture
First, it is imperative to start out with the right architecture: both vertical and horizontal scalability (which are essential for peak performance on modern hardware) flow directly from architectural decisions:
1. 100% shared nothing architecture. This is what allows you to parallelize/isolate. Without this, you are eventually screwed when it comes to scaling.
2. 100% in-memory workload. Don’t even think about hitting disk for 0.0001% of these requests. SSDs are better than HDDs, but nothing beats DRAM for the dollar for this type of workload.
3. Data lookups should be dead-simple, i.e.:
- Get packet from event loop (event-driven)
- Parse action
- Lookup data in memory (this is fast enough to happen in-thread)
- Form response packet
- Send packet back via non-blocking call
4. Data-Isolation. The previous lookup is lockless and requires no hand-off from thread to thread: this is where a shared-nothing architecture helps you out. You can determine which core on which machine a piece of data will be written to/served from, and the client can map a tcp-port to this core so all lookups go straight to the data (a minimal sketch of this key-to-core mapping follows this list). The operating system then provides the multi-threading & concurrency for your system.
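To make the data-isolation idea concrete, here is a minimal sketch in C of how a client might map a key to the core, and therefore the TCP port, that owns it. The hash function, core count, and port numbering are illustrative assumptions, not Aerospike's actual client protocol:

#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define CORES_PER_NODE 10     /* cores running event-loop threads (illustrative) */
#define BASE_PORT      3000   /* port served by the thread pinned to core 0      */

/* Any decent hash will do; FNV-1a is shown for brevity. */
static uint64_t fnv1a(const char *key, size_t len) {
    uint64_t h = 14695981039346656037ULL;
    for (size_t i = 0; i < len; i++) {
        h ^= (uint8_t)key[i];
        h *= 1099511628211ULL;
    }
    return h;
}

/* The client hashes the key to pick the owning core; each core maps 1:1
 * onto a TCP port, so the request lands directly on the thread that owns
 * the data and never needs a thread-to-thread hand-off. */
static int port_for_key(const char *key) {
    return BASE_PORT + (int)(fnv1a(key, strlen(key)) % CORES_PER_NODE);
}

int main(void) {
    printf("dataX lives behind port %d\n", port_for_key("dataX"));
    return 0;
}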
Select the Right OS, Programming Language, and Libraries
Next, make sure your operating system, programming language, and libraries are the ones proven to perform:
5. Modern Linux kernel. Anything less than CentOS 6.3 (kernel 2.6.32) has serious problems w/ software interrupts. This is also the space where we can expect a 2X improvement in the near future; the Linux kernel is currently being upgraded to improve multi-core efficiency.
6. The C language. Java may be fast, but not as fast as C, and more importantly: Java is less in your control and control is the only path to peak performance. The unknowns of garbage collection frustrate any and all attempts to attain peak performance.
7. Epoll. Event-driven/non-blocking I/O: a single-threaded event loop for high-speed code paths (a skeleton of such a loop is sketched below).
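As a rough illustration of ingredients 6 and 7 together, here is a minimal single-threaded epoll loop in C. It simply echoes each request back; the parse/lookup/response work of a real server would go where handle_request() does its write. This is a hedged sketch, not Aerospike's actual event loop, and it omits error handling for brevity:

#define _GNU_SOURCE           /* for accept4() and SOCK_NONBLOCK */
#include <arpa/inet.h>
#include <sys/epoll.h>
#include <sys/socket.h>
#include <unistd.h>

#define MAX_EVENTS 1024

/* Stand-in for "parse action, look up data in memory, form response":
 * here it just echoes the request back with a non-blocking write. */
static void handle_request(int fd) {
    char buf[1500];
    ssize_t n = read(fd, buf, sizeof(buf));
    if (n <= 0) { close(fd); return; }
    write(fd, buf, (size_t)n);
}

int main(void) {
    int listen_fd = socket(AF_INET, SOCK_STREAM | SOCK_NONBLOCK, 0);
    struct sockaddr_in addr = { .sin_family = AF_INET,
                                .sin_port   = htons(3000),
                                .sin_addr.s_addr = htonl(INADDR_ANY) };
    bind(listen_fd, (struct sockaddr *)&addr, sizeof(addr));
    listen(listen_fd, 1024);

    int epfd = epoll_create1(0);
    struct epoll_event ev = { .events = EPOLLIN, .data.fd = listen_fd };
    epoll_ctl(epfd, EPOLL_CTL_ADD, listen_fd, &ev);

    struct epoll_event events[MAX_EVENTS];
    for (;;) {                                        /* the single-threaded event loop */
        int n = epoll_wait(epfd, events, MAX_EVENTS, -1);
        for (int i = 0; i < n; i++) {
            int fd = events[i].data.fd;
            if (fd == listen_fd) {                    /* new connection */
                int conn = accept4(listen_fd, NULL, NULL, SOCK_NONBLOCK);
                struct epoll_event cev = { .events = EPOLLIN, .data.fd = conn };
                epoll_ctl(epfd, EPOLL_CTL_ADD, conn, &cev);
            } else {                                  /* request packet arrived */
                handle_request(fd);
            }
        }
    }
}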
Tweak and Taste Until Everything is Just Right
Finally, use the features of the system you have designed. Tweak the Hardware & OS to isolate performance critical paths:
8. Thread-Core-Pinning. Event loop threads reading and writing tcp packets should each be pinned to their own core, and no other threads should be allowed on these cores. These threads are so critical to performance that any context switching on their designated cores will degrade peak performance significantly (a sketch of core pinning and IRQ affinity follows this list).
9. IRQ affinity from the NIC. Set IRQ affinity so that soft interrupts (generated by tcp packets) do not ALL bottleneck on a single core. There are different methodologies depending on the number of cores you have:
- For QuadCore CPUs: round-robin spread the IRQ affinity (of the NIC's queues) across the Network-facing-event-loop-threads (e.g. with 8 queues, map 2 queues to each core)
- On Hexacore (and greater) CPUs: reserve 1+ cores to do nothing but IRQ processing (i.e. send IRQs to these cores and don't let any other thread run on them) and use ALL other cores for Network-facing-event-loop-threads (similarly running w/o competition on their own designated core). The core receiving the IRQ will then signal the recipient core, and the packet has a near 100% chance of being in L3 cache, so the transport of the packet from core to core is near optimal.
10. CPU-Socket-Isolation via PhysicalNIC/PhysicalCPU pairing. Multiple CPU sockets holding multiple CPUs should be used like multiple machines. Avoid inter-CPU communication; it is dog-slow when compared to communication between cores on the same CPU die. Pairing a physical NIC port to a PhysicalCPU is a simple means to attain this goal and can be achieved in 2 steps:
- Use IRQ affinity from this physical NIC port to the cores on its designated PhysicalCPU
- Configure IP routing on each physical NIC port (interface) so packets are sent from its designated CPU back to the same interface (instead of to the default interface)
This technique isolates CPU/NIC pairs; when the client respects this, a Dual-CPU-socket machine works like 2 single-CPU-socket machines (at a much lower TCO).
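Here is a rough sketch of ingredients 8 and 9 on Linux: pinning an event-loop thread to a single core with pthread_setaffinity_np(), and steering one NIC queue's IRQ to a chosen core by writing a cpumask to /proc/irq/<irq>/smp_affinity (the same thing echoing a mask into that file does as root). The IRQ numbers and core layout below are assumptions; check /proc/interrupts for your NIC's actual queues:

#define _GNU_SOURCE           /* for pthread_setaffinity_np() */
#include <pthread.h>
#include <sched.h>
#include <stdio.h>

/* Pin the calling event-loop thread to exactly one core. */
static int pin_self_to_core(int core) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    return pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}

/* Route one IRQ (e.g. one NIC queue's IRQ) to a single core by writing a
 * cpumask to /proc/irq/<irq>/smp_affinity (requires root). */
static int set_irq_affinity(int irq, int core) {
    char path[64];
    snprintf(path, sizeof(path), "/proc/irq/%d/smp_affinity", irq);
    FILE *f = fopen(path, "w");
    if (!f) return -1;
    fprintf(f, "%x\n", 1U << core);
    return fclose(f);
}

int main(void) {
    pin_self_to_core(8);          /* e.g. an event-loop thread owns core 8            */
    set_irq_affinity(45, 11);     /* e.g. NIC queue IRQ 45 -> dedicated IRQ core 11   */
    return 0;
}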
That is it. The 10 ingredients are fairly straightforward, but putting them all together and making your system really hum turns out to be a pretty difficult balancing act in practice. The basic philosophy is to isolate on all axes.
The Proof is Always in the Pudding
Any 10-step recipe is best illustrated via an example: the client knows (via multiple hashings) that dataX is presently on core8 of ipY, which has a predefined mapping to ipY:portZ.
The connection from the client to ipY:portZ has previously been created, so the request goes from the client to ipY:(NIC2):portZ.
- NIC2 sends all of its IRQs to CPU2, where the packet gets to core8 w/ minimal hardware/OS overhead.
- The packet creates an event, which triggers a dedicated thread that runs w/o competition on core8.
- The packet is parsed; the operation is to look up dataX, which will be in its local NUMA memory pool.
- DataX is retrieved from local memory, an operation fast enough to do in-thread rather than handing it off (a sketch of NUMA-local allocation follows this walkthrough).
- The thread then replies with a non-blocking packet that goes back through only cores on the local CPU2, and out via NIC2 (which sends ALL of its IRQs to CPU2).
Everything is isolated and nothing collides (e.g. w/ NIC1/CPU1). Software interrupts are handled locally on a CPU. IRQ affinity ensures software interrupts don't bottleneck on a single core and that they go to, and come from, their designated NIC. Core-to-core communication happens ONLY within the CPU die. There are no unnecessary context switches on performance-critical code paths. TCP packets are processed as events by a single thread running dedicated on its own core. Data is looked up in the local memory pool. This isolated path is the closest software path to what actually physically happens in a computer and the key to attaining peak performance.
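The "local NUMA memory pool" above deserves a line of code of its own. Below is a hedged sketch using libnuma (link with -lnuma) to place a per-core data partition on the node that owns the core, so a thread pinned there never reaches across the CPU interconnect for its lookups; the core number and pool size are illustrative:

#include <numa.h>       /* libnuma: link with -lnuma */
#include <stdio.h>

int main(void) {
    if (numa_available() < 0) {
        fprintf(stderr, "NUMA not available on this machine\n");
        return 1;
    }
    int node = numa_node_of_cpu(8);          /* the socket/node that owns core 8 */
    size_t pool_size = (size_t)1 << 30;      /* 1GB partition, illustrative      */
    void *pool = numa_alloc_onnode(pool_size, node);
    if (!pool) return 1;
    /* ... build this core's hash table / index inside the pool ... */
    numa_free(pool, pool_size);
    return 0;
}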
At Aerospike, I knew I had it right when I watched the output of the "top" command (viewing all cores): there was near-zero idle CPU% and a very uniform balance across cores. Each core had exactly the same signature, something like: us%39 sy%35 id%0 wa%0 si%22.
Which is to say: software interrupts from tcp packets were using 22% of each core, context switches and kernel work passing tcp packets back and forth to the operating system were taking up 35%, and our software was taking up 39% to do the database transaction.
When this near-perfect balance across cores was achieved, performance was optimal from an architectural standpoint. We can still streamline our software, but at least the flow of packets to and from Aerospike is near optimal.
Data is Served
Those are my 10 ingredients that got Aerospike’s server to one million over-the-wire database requests on a $5K commodity machine. Mixed correctly, they not only give you incredible raw speed, they give you stability/predictability/over-provisioning-for-spikes at lower speeds. Enjoy ☺
Reader Comments (17)
Maybe I'm being naive, but I have to ask:
For RAM DBs like that, how do you avoid losing data when the server crashes or goes down for some reason? I didn't see anything about replicas.
>> For RAM DBs like that, how do you avoid losing data when the server crashes or goes down for some reason? I didn't see anything about replicas.
This is a very relevant question. Any RAM-DB needs to synchronously replicate data to ensure durability in case of a server crash. This is commonly referred to as K-safety. Aerospike supports synchronous replication w/ a default replication factor of 2. Different read/write loads are reported here.
Since tcp packet marshaling is the major bottleneck in this type of performance benchmark, writes with synchronous replication do not perform as well as reads; as a (very) general rule, write operations with synchronous replication have half the throughput (double the packets) and twice the latency (double the hops) of read operations.
Several in memory databases (VoltDB, MemSQL, Redis) support replication and disk based persistence. Aerospike supports replication and disk based persistence according to their docs. They also support larger than memory working sets where indexes fit in memory.
The huge speed gain for an in-memory database is mostly from eliminating random reads, which cause all kinds of issues when you want to do hundreds of thousands of queries/sec per node. There is nothing particularly slow about persistence, especially when logging is asynchronous with respect to query execution.
Flash mitigates the random read issue to a certain extent but doesn't solve the concurrency issues (exacerbated by transaction support) created by disk based data structures that typically assume that they will store data sets that are larger than memory.
Aerospike has two different configuration modes. The first is the 100% in-memory mode used to boast the 1M TPS on $5K hardware. The second is a SSD optimized mode where indexes are kept in memory and data is kept on SSD. SSDs are very strong at random reads, but writing to them and maintaining peak performance (especially as the months go by) requires specially optimized software.
Aerospike has invested a huge amount of effort on optimizations focusing on writing to SSDs and maintaining consistent read/write speeds over time.
SSD benchmark speeds can be found in the second box, here. This topic has tons of technical details and definitely warrants its own blog post, so I will get on that :)
Thank you. Very interesting reading. CitrusLeaf does not support range queries?
Everything I see online says that IRQ affinity to a group of CPUs *hurts* performance, but this article implies the opposite. Having all CPUs perform IRQ processing seems to be the recommendation.
>> Thank you. Very interesting reading. CitrusLeaf does not support range queries?
Aerospike (formerly Citrusleaf) has been a distributed highly available key-value store running in production since 2009.
We have recently built in distributed secondary indexes and are looking for beta customers this fall (2012). The secondary indexes support range queries, compound indexes, and simple aggregation operators.
These new features are all part of Aerospike's acquisition of AlchemyDB.
>> Everything I see online says that IRQ affinity to a group of CPUs *hurts* performance, but this article implies the opposite. Having all CPUs perform IRQ processing seems to be the recommendation.
Yes and no.
With quadcores, we found having all cores do IRQ processing yielded optimal performance. With hexacores (plus), we found that leaving 1 (or more) core(s) free (meaning no processes running on them {via taskset}) to do IRQ processing yielded optimal performance.
Both results surprised me in different ways. The quadcore simply couldn't spare an entire core to ONLY do IRQ processing w/o suffering serious performance degradation. The hexacore(plus) findings point to a very clever way to leverage more cores in peak performance benchmarks of this sort.
Needless to say, we tried every possible configuration we could think of. Thankfully, our software's thread-responsibilities & their process-ids are query-able, so the different configurations can be made entirely w/in the OS and even dynamically :)
How much did thread and IRQ binding specifically buy you? Automating that sort of thing looks annoying, and error prone when you take into account things like hyper-threading (or what AMD does). It seems like getting it wrong could be worse than letting the OS manage threading. Binding groups of cooperating threads to a NUMA node seems like an easier way to get the NUMA and L3 benefits.
Once you mix in compression, replication, and disk IO, and start looking at response times at lower concurrency rather than peak throughput, it becomes a tougher sell to do exactly one thread per core, because that means thread-exclusive data is inaccessible while a task is running, even if the task doesn't involve the thread-exclusive data.
I am looking forward to it. I am definitely interested in seeing what can be done when you focus on SSDs and assume indexes fit in memory.
>> How much did thread and IRQ binding specifically buy you?
At peak speed, avoiding unneeded context switching plus controlling which cores did IRQ processing gave us roughly a 2-2.5X throughput increase, and had the added benefit of almost boring stability: the variable of the linux scheduler was largely not present.
>> Automating that sort of thing looks annoying, and error prone when you take into account things like hyper-threading (or what AMD does). It seems like getting it wrong could be worse than letting the OS manage threading.
Again, at peak speed, the linux scheduler is much worse than manually tweaking a system. What we strived to do with these performance improvements was to expose our threading internals, so our database can be dynamically dialed between peak performance mode and the default settings (which are more predictable/stable during node-failures/node-additions for the reasons you mention above).
>> Binding groups of cooperating threads to a NUMA node seems like an easier way to get the NUMA and L3 benefits.
It is easier, and it is a low hanging fruit, but the 10 ingredients above all used together give a very significant BOOST.
>> http://www.afewmoreamps.com/2012/04/jitcask.html
Good stuff, I will read through this tonight. SSDs are a tricky marriage of hardware and software indeed :)
Silly question here. What do you do with all default OS processes? Do you pin them all with something like taskset to one core? Thanks for an excellent post!
>> What do you do with all default OS processes? Do you pin them all with something like taskset to one core?
This question is pretty generic, so ...
My philosophy is to pin the default OS processes AWAY from any bottleneck, which in this type of workload is largely IRQ processing.
So on Quadcores, just let the linux task scheduler take care of OS processes.
On Hexacore+ where IRQ processing is taskset to run on 1+ dedicated cores, use taskset to keep default OS processes AWAY from these dedicated cores.
Excellent article - thanks for sharing!
35% kernel to user mode context switch overhead. Damn! I'm looking forward to the kernel mode version of your software... hehe!
Terms corrections:
key/value, not "key-value"
client/server, not "client-server"
Hi Russell -- interesting article. If you would indulge me, I have a few questions for you:
1. Did you use the off-the-shelf YCSB benchmark (workload C, 100% read) for this test, or did you create a special purpose benchmark for this particular test (either a new one, or a modified version of YCSB)?
2. If the benchmark wasn't the off-the-shelf YCSB benchmark, could you make the benchmark available so as to attempt a comparison across different vendors?
3. What was the size of the database that you used for this test, and what was the size of each record?
4. How many parallel client processes and/or threads did you use to achieve the 1M TPS?
5. What was the average read latency for these tests as you scaled up?
6. Lastly, and this is probably fairly obvious, but I wanted to ask anyways: in your description, you clearly point at isolation of various kinds (thread-core pinning, IRQ affinity to/from the NIC, CPU<->socket isolation). What I'm curious about is where you say " Multiple CPU sockets holding multiple CPUs should be used like multiple machines." Does this imply that you had multiple instances of Aerospike on this 12-core machine? As in, looking at the dual-X5690 processor database machine you used, you'd end up with a system with 4 CPU sockets total, each with 6 cores (24 cores total). So, did you have a single instance of the Aerospike database on this configuration, or were there 4 instances each serving roughly 1/4th of the total 1M TPS (so around 250,000 TPS/instance)?
Thanks!
Vik
Hi Vik,
This post is about 2 years old and I am not at Aerospike anymore, so I will do my best answering your questions, but I can't say w/ 100% accuracy if I remember all the details correctly :)
> 1. Did you use the off-the-shelf YCSB benchmark (workload C, 100% read) for this test, or did you create a special purpose benchmark for this particular test (either a new one, or a modified version of YCSB)?
We used a homemade client. When you go hard-core on performance testing you want to rule out possible client inefficiencies.
> 2. If the benchmark wasn't the off-the-shelf YCSB benchmark, could you make the benchmark available so as to attempt a comparison across different vendors?
Aerospike is now open source, so the tool we used to benchmark this is included somewhere in the source tree (but I have no clue where). Thumbtack did some benchmarking comparing multiple vendors; perhaps they open sourced their testing tools and you can do an apples-to-apples comparison.
> 3. What was the size of the database that you used for this test, and what was the size of each record?
The database fit in memory, which on this machine was in the 30GB range. As long as the data fits in memory it doesn't really matter how big it is, provided you respect NUMA locality. The record size was smallish to really push performance: small enough not to cause two network packets per transaction. A single-byte record and a 1KB record show similar performance, not the exact same, but same ballpark. Once you need more than a single packet per transaction, TCP performance can degrade non-linearly.
For this test we didn't experiment with jumbo frames and larger record sizes, but that has probably been done in the meantime and it may be a winner for larger record sizes.
> 4. How many parallel client processes and/or threads did you use to achieve the 1M TPS?
A lot and we had to experiment to find sweet spots for the number of client-threads and number of client-processes per client test machine. I am real hazy on the exact numbers here, but I think we used 4 smaller machines (quad-cores @2.4GHz & 1 NIC) as test machines. I think we pinned a client-process to each core and each client-process used 64-128 threads.
> 5. What was the average read latency for these tests as you scaled up?
This performance test was a precise setup, so as we scaled up everything worked exactly as we had carefully set it up and read latencies did not change at all. I am guessing they remained sub-millisecond, as each read basically left the client, went via the 1Gb switch directly to a server node, did an in-memory lookup, and replied via the 1Gb switch. The network was not saturated and the server CPUs were maxed out, but there was no server-side thrashing and minimal inter-CPU-socket I/O, so the scale-up was smooth.
> 6. Lastly, and this is probably fairly obvious, but I wanted to ask anyways: in your description, you clearly point at isolation of various kinds (thread-core pinning, IRQ affinity to/from the NIC, CPU<->socket isolation). What I'm curious about is where you say " Multiple CPU sockets holding multiple CPUs should be used like multiple machines." Does this imply that you had multiple instances of Aerospike on this 12-core machine? As in, looking at the dual-X5690 processor database machine you used, you'd end up with a system with 4 CPU sockets total, each with 6 cores (24 cores total). So, did you have a single instance of the Aerospike database on this configuration, or were there 4 instances each serving roughly 1/4th of the total 1M TPS (so around 250,000 TPS/instance)?
We used a single Aerospike server instance on the server machine. We toyed with the idea of using an AS server instance per CPU socket because it is a really simple way to isolate at the CPU-socket level, but it complicates fail-overs in distributed databases and is a non-option in practice. In a distributed database, if you have 4 logical servers on a physical server and the physical server dies, you lose 4 logical servers, which is a massive failure for a distributed database: you would need K-safety of 5 to ensure you didn't lose any data.
There are other hacks where you can misuse rack awareness settings to put multiple logical servers on a single physical server and tolerate failures without increasing K-safety, but I personally cringe at this practice.
> Thanks!
Cheers, hope the answers aren't too hazy ;)
Hi Vik,
One more thing: the wording in my post can be wrongly interpreted to read that the server had 24 cores, but it only had 12 cores. I should have written 2 CPU sockets, each with an Intel X5690 Hexacore @3.47GHz, for a total of 12 cores.
- Russell