Entries in Performance (43)

Wednesday
Sep 4, 2013

Wide Fast SATA: the Recipe for Hot Performance

This is a guest post by Brian Bulkowski, CTO and co-founder of Aerospike, a leading clustered NoSQL database. Brian has worked in the area of high performance commodity systems since 1989.

This blog post will tell you exactly how to build a multi-terabyte high throughput datacenter server. A fast, reliable multi-terabyte data tier can be used for recent behavior (messages, tweets, plays, actions), or anywhere you use Redis or Memcache today.

You need to know:

  • Which SSDs work
  • Which chassis work
  • How to configure your RAID cards

Intel’s SATA solutions – combined with a high capacity storage server like the Dell R720xd, a host bus adapter based on the LSI 2208, and a Flash-optimized database like Aerospike – enable high throughput and low latency.

In a wide configuration, with 12 to 20 drives per 2U server, individual servers can cost-effectively serve at high throughput with 16 TB of flash at $2.50 per GB with the Intel S3700 (roughly $40,000 in drives per server), or $1.25 per GB with the S3500 (roughly $20,000). Other SSD offerings – from Crucial (Micron) and Samsung (S843) – hit other density and price-performance points.

This is in-memory computing at a stunningly new, accessible price level – but there are some details you need to know...

Click to read more ...

Wednesday
Jun 19, 2013

Paper: MegaPipe: A New Programming Interface for Scalable Network I/O

The paper MegaPipe: A New Programming Interface for Scalable Network I/O (video, slides) hits the common theme that if you want to go faster you need a better car design, not just a better driver. That's why the authors started with a clean slate and designed a network API from the ground up with support for concurrent I/O, a requirement for achieving high performance while scaling to large numbers of connections per thread, multiple cores, and so on. What they created is MegaPipe, "a new network programming API for message-oriented workloads to avoid the performance issues of BSD Socket API."

The result: MegaPipe outperforms baseline Linux between 29% (for long connections) and 582% (for short connections). MegaPipe improves the performance of a modified version of memcached between 15% and 320%. For a workload based on real-world HTTP traces, MegaPipe boosts the throughput of nginx by 75%.
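To see what MegaPipe is reacting against, here is a minimal readiness-based epoll echo loop written with the plain BSD socket API; this is our illustration, not code from the paper. Every accept, read, and write costs a separate system call, and with multiple threads all cores would contend on the single listening socket; those per-message and shared-state overheads are what MegaPipe's per-core channels, lightweight sockets, and batched completion notifications are designed to eliminate.

```c
/* Baseline BSD-socket/epoll echo server (illustrative; error handling
 * trimmed). Note how many distinct syscalls one message can cost. */
#include <sys/epoll.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <unistd.h>

int main(void) {
    int lfd = socket(AF_INET, SOCK_STREAM, 0);
    struct sockaddr_in addr = {0};
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port = htons(8080);
    bind(lfd, (struct sockaddr *)&addr, sizeof addr);
    listen(lfd, 128);

    int ep = epoll_create1(0);
    struct epoll_event ev = { .events = EPOLLIN, .data.fd = lfd };
    epoll_ctl(ep, EPOLL_CTL_ADD, lfd, &ev);

    struct epoll_event events[64];
    char buf[4096];
    for (;;) {
        int n = epoll_wait(ep, events, 64, -1);           /* syscall */
        for (int i = 0; i < n; i++) {
            int fd = events[i].data.fd;
            if (fd == lfd) {
                int cfd = accept(lfd, NULL, NULL);        /* syscall */
                struct epoll_event cev = { .events = EPOLLIN,
                                           .data.fd = cfd };
                epoll_ctl(ep, EPOLL_CTL_ADD, cfd, &cev);  /* syscall */
            } else {
                ssize_t r = read(fd, buf, sizeof buf);    /* syscall */
                if (r <= 0) { close(fd); continue; }
                write(fd, buf, r);                        /* syscall */
            }
        }
    }
}
```

MegaPipe batches those per-message operations into fewer syscalls and gives each core its own channel and accept queue, which is where the large gains on short connections come from.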

What's this most excellent and interesting paper about?

Click to read more ...

Thursday
Jun 13, 2013

Busting 4 Modern Hardware Myths - Are Memory, HDDs, and SSDs Really Random Access?

"It’s all a numbers game – the dirty little secret of scalable systems"

Martin Thompson is a high-performance computing specialist on a real mission to teach programmers how to understand the innards of modern computing systems. He has many talks and classes (listed below) on caches, buffers, memory controllers, processor architectures, cache lines, and more.

His view is that programmers do not put a proper value on understanding how the underpinnings of our systems work; we gravitate to the shiny and trendy. His approach is not to teach people specific programming strategies, but to teach programmers to fish so they can feed themselves. Without a real understanding, strategies are easy to apply wrongly. It's strange how programmers will put a lot of effort into understanding a complicated framework like Hibernate, but little effort into understanding the underlying hardware their programs run on.

A major tenet of Martin's approach is to "lead by experimental observation rather than what folks just blindly say," so it's no surprise he chose a MythBusters theme for his talk Mythbusting Modern Hardware to Gain "Mechanical Sympathy." Mechanical Sympathy is a term coined by Jackie Stewart, the race car driver, to express that you get the best out of a racing car when you have a good understanding of how the car works. A driver must work in harmony with the machine to get the most out of it. Martin extends this notion to say we need to know how the hardware works to get the most out of our computers. And he thinks normal developers can understand the hardware they are using. If you can understand Hibernate, you can understand just about anything.

The structure of the talk is to take a few commonly held myths and go all MythBusters on them, testing whether they are really true. Along the way there's incredible detail on how different systems work, far too much detail to gloss here, but it's an absolutely fascinating talk. Martin really knows what he is talking about, and he is a good teacher as well.

The most surprising part of the talk is the counterintuitive idea that many of the devices we think of as random access, like RAM, HDDs, and SSDs, effectively become serial devices in certain circumstances. A disk, for example, is really just a big tape that's fast; it's not true random access. Keep on reading to see why that is...
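You can see the "big fast tape" effect on plain RAM with a few lines of C. This microbenchmark sketch (ours, not Martin's) sums the same array twice, once visiting elements in order and once in a shuffled order; on typical hardware the sequential pass is several times faster because cache lines and the hardware prefetcher reward predictable access patterns.

```c
/* Sequential vs. random walks over the same ~64 MB array (a sketch;
 * numbers vary by machine, but sequential should win decisively). */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N (16 * 1024 * 1024)    /* 16M ints: far larger than any cache */

static long long walk(const int *a, const int *order) {
    long long sum = 0;
    for (size_t i = 0; i < N; i++)
        sum += a[order[i]];
    return sum;
}

int main(void) {
    int *a   = malloc(N * sizeof *a);
    int *seq = malloc(N * sizeof *seq);
    int *rnd = malloc(N * sizeof *rnd);
    for (size_t i = 0; i < N; i++) { a[i] = (int)i; seq[i] = (int)i; rnd[i] = (int)i; }

    srand(42);                  /* crude shuffle; fine for a demo */
    for (size_t i = N - 1; i > 0; i--) {
        size_t j = (size_t)rand() % (i + 1);
        int t = rnd[i]; rnd[i] = rnd[j]; rnd[j] = t;
    }

    clock_t t0 = clock();
    long long s1 = walk(a, seq);
    clock_t t1 = clock();
    long long s2 = walk(a, rnd);
    clock_t t2 = clock();
    printf("sequential: %.2fs  random: %.2fs  (checksums %lld %lld)\n",
           (double)(t1 - t0) / CLOCKS_PER_SEC,
           (double)(t2 - t1) / CLOCKS_PER_SEC, s1, s2);
    return 0;
}
```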

Click to read more ...

Thursday
May 30, 2013

Google Finds NUMA Up to 20% Slower for Gmail and Websearch

When you have a large population of servers you have both the opportunity and the incentive to perform interesting studies. Authors from Google and the University of California, in Optimizing Google’s Warehouse Scale Computers: The NUMA Experience, conducted such a study, looking at how jobs run on clusters of machines that use a NUMA architecture. Since NUMA is common on server-class machines, it's a topic of general interest for anyone looking to maximize machine utilization across clusters.
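For readers new to NUMA, here is a minimal libnuma sketch (our illustration, not from the paper) of the effect the study measures: memory on a remote node is slower to reach than node-local memory, so placement that splits a job's threads and pages across nodes quietly taxes it. The file name is made up; build with gcc -O2 numa_demo.c -lnuma on a multi-socket Linux box.

```c
/* Touch node-local vs. remote memory (sketch; on a single-node
 * machine both regions land on node 0 and there is no difference). */
#include <numa.h>
#include <stdio.h>

int main(void) {
    if (numa_available() < 0) {
        fprintf(stderr, "no NUMA support on this machine\n");
        return 1;
    }
    int local_node  = 0;
    int remote_node = numa_max_node();      /* highest-numbered node */
    size_t sz = 256UL * 1024 * 1024;

    numa_run_on_node(local_node);           /* pin this thread to node 0 */
    char *local  = numa_alloc_onnode(sz, local_node);
    char *remote = numa_alloc_onnode(sz, remote_node);

    /* Time these two loops separately and the remote pass is slower on
     * multi-socket hardware; the study reports up to ~20% impact at the
     * application level for NUMA-blind placement. */
    for (size_t i = 0; i < sz; i += 4096) local[i] = 1;
    for (size_t i = 0; i < sz; i += 4096) remote[i] = 1;

    numa_free(local, sz);
    numa_free(remote, sz);
    return 0;
}
```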

Some of the results are surprising:

Click to read more ...

Wednesday
May 22, 2013

Strategy: Stop Using Linked-Lists

What data structure is more sacred than the linked list? If we got rid of it, what silly interview questions would we use instead? But not using linked lists is exactly what Aater Suleman recommends in Should you ever use Linked-Lists?

In The Secret To 10 Million Concurrent Connections, one of the important strategies is to avoid scribbling data all over memory via pointers, because following pointers increases cache misses, which reduces performance. And there's nothing more iconic of pointers than the linked list.
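The point is easy to make concrete. In this small C sketch (ours, not from Aater's article), summing through a linked list chases one pointer per element, and each hop can be a cache miss landing anywhere in memory, while the array version streams through contiguous cache lines the prefetcher can stay ahead of.

```c
/* Linked-list vs. array traversal (sketch): same data, very different
 * memory access patterns. */
#include <stdio.h>
#include <stdlib.h>

typedef struct node { int value; struct node *next; } node;

static long long sum_list(const node *head) {
    long long sum = 0;
    for (const node *n = head; n != NULL; n = n->next)
        sum += n->value;        /* each ->next may point anywhere */
    return sum;
}

static long long sum_array(const int *a, size_t n) {
    long long sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += a[i];            /* contiguous: prefetcher-friendly */
    return sum;
}

int main(void) {
    enum { N = 1000000 };
    int *a = malloc(N * sizeof *a);
    node *head = NULL;
    for (int i = 0; i < N; i++) {
        a[i] = i;
        node *nd = malloc(sizeof *nd);  /* heap order, not logical order */
        nd->value = i;
        nd->next = head;
        head = nd;
    }
    printf("%lld %lld\n", sum_list(head), sum_array(a, N));
    return 0;
}
```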

Here are Aater's reasons to be anti-linked-list:

Click to read more ...

Monday
May 13, 2013

The Secret to 10 Million Concurrent Connections - The Kernel is the Problem, Not the Solution

Now that we have the C10K concurrent connection problem licked, how do we level up and support 10 million concurrent connections? Impossible, you say? Nope, systems right now are delivering 10 million concurrent connections using techniques that are as radical as they are unfamiliar.

To learn how it’s done we turn to Robert Graham, CEO of Errata Security, and his absolutely fantastic talk at Shmoocon 2013 called C10M Defending The Internet At Scale.

Robert has a brilliant way of framing the problem that I've never heard before. He starts with a little bit of history, relating how Unix wasn't originally designed to be a general server OS; it was designed to be a control system for a telephone network. It was the telephone network that actually transported the data, so there was a clean separation between the control plane and the data plane. The problem is we now use Unix servers as part of the data plane, which we shouldn't do at all. If we were designing a kernel to handle one application per server, we would design it very differently than a multi-user kernel.

Which is why he says the key is to understand:

  • The kernel isn’t the solution. The kernel is the problem.

Which means:

  • Don’t let the kernel do all the heavy lifting. Take packet handling, memory management, and processor scheduling out of the kernel and put them into the application, where they can be done efficiently. Let Linux handle the control plane and let the application handle the data plane.

The result will be a system that can handle 10 million concurrent connections with 200 clock cycles for packet handling and 1400 clock cycles for application logic. As a main memory access costs 300 clock cycles, it's key to design in a way that minimizes code and cache misses.

With a data plane oriented system you can process 10 million packets per second. With a control plane oriented system you only get 1 million packets per second.
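Those budgets are tight. At 10 million packets per second, roughly 1,600 cycles per packet comes to about 16 billion cycles per second of work, the better part of a modern multi-core CPU, so the cores doing that work can't also be running kernel housekeeping. Here is a minimal C sketch (ours, not Robert's code) of one ingredient: pinning a data-plane thread to a core reserved with the isolcpus boot parameter, so the Linux scheduler never preempts or migrates the hot loop.

```c
/* Pin a data-plane thread to a dedicated core (sketch; runs forever).
 * Pair with booting the kernel with isolcpus=3 so nothing else is
 * scheduled there. Build with: gcc -O2 pin.c -pthread */
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>

static void *data_plane(void *arg) {
    int core = *(int *)arg;
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    pthread_setaffinity_np(pthread_self(), sizeof set, &set);

    for (;;) {
        /* Busy-poll a user-space NIC ring here (e.g. via a kernel-bypass
         * framework such as DPDK or netmap) instead of sleeping in the
         * kernel. This sketch leaves the loop body empty. */
    }
    return NULL;
}

int main(void) {
    int core = 3;               /* the core we assume was isolated */
    pthread_t t;
    pthread_create(&t, NULL, data_plane, &core);
    printf("data plane pinned to core %d\n", core);
    pthread_join(t, NULL);      /* never returns in this sketch */
    return 0;
}
```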

If this seems extreme keep in mind the old saying: scalability is specialization. To do something great you can’t outsource performance to the OS. You have to do it yourself.

Now, let’s learn how Robert creates a system capable of handling 10 million concurrent connections...

Click to read more ...

Wednesday
Jan 30, 2013

Better Browser Caching is More Important than No JavaScript or Fast Networks for HTTP Performance

Performance guru Steve Souders gave his keynote presentation, Cache is King! (slides), at the HTML5DevCon. Besides being an extremely clear explanation of how caching works on the Internet and how to optimize your use of HTTP to get the best performance, the talk covers experiments Steve ran that produced some surprising results about what gives the best web site performance improvements.

In his baseline test, page loads took 7.65 seconds (median of three runs). Which change--Fast Network, No JavaScript, or Primed Cache--would make the biggest performance improvement? It was Primed Cache.

  • Fast Network - Using a fast FIOS network, the load time was 4.13 seconds. Steve was surprised how big a difference this made, given how much work must happen in the browser.
  • No JavaScript - 4.74 seconds after disabling JavaScript. This both reduces transfers and skips parsing in the browser. Steve thought the effect would have been larger.
  • Primed Cache - 3.46 seconds using a warm cache, less than half the empty-cache page view time, because it reduced the number of HTTP requests and the total transfer time. Key for mobile, where higher latencies are common.

The implication is that caching is important, so you must understand how HTTP caching works and how to make the best use of it. That's the rest of the talk.
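On the server side, much of that leverage comes down to a couple of response headers. This small C sketch (ours, not from the talk) prints two standard patterns: a far-future Cache-Control header on a fingerprinted static asset, so repeat views skip the request entirely, and revalidation headers on HTML, so repeat requests can be answered with a cheap 304 Not Modified. The URL and ETag values are made up.

```c
/* Emit HTTP caching headers for two common cases (illustrative). */
#include <stdio.h>

/* Fingerprinted asset, e.g. /js/app.3f9ab2.js: safe to cache for a
 * year because a new deploy changes the URL, not the cached bytes. */
static void static_asset_headers(void) {
    printf("HTTP/1.1 200 OK\r\n");
    printf("Content-Type: application/javascript\r\n");
    printf("Cache-Control: public, max-age=31536000\r\n");   /* one year */
    printf("\r\n");
}

/* Dynamic HTML: "no-cache" means store but revalidate, so the browser
 * sends If-None-Match and can get a 304 instead of the full body. */
static void html_headers(const char *etag) {
    printf("HTTP/1.1 200 OK\r\n");
    printf("Content-Type: text/html\r\n");
    printf("Cache-Control: private, no-cache\r\n");
    printf("ETag: \"%s\"\r\n", etag);
    printf("\r\n");
}

int main(void) {
    static_asset_headers();
    html_headers("v42");        /* hypothetical entity tag */
    return 0;
}
```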

Some key takeaways: 

Click to read more ...

Tuesday
Oct 9, 2012

Batoo JPA - The new JPA Implementation that runs over 15 times faster...

This post is by Hasan Ceylan, an Open Source software enthusiast from Istanbul.

I loved JPA 1.0 back in the early 2000s. I started using it together with EJB 3.0 even before the stable releases. I loved it so much that I contributed bits and pieces to the JBoss 3.x implementations.

Click to read more ...

Thursday
Aug 30, 2012

Dramatically Improving Performance by Debugging Brutally Complex Problems

Debugging complex problems is 90% persistence and 50% cool tools. Brendan Gregg in 10 Performance Wins tells a fascinating story of how a team at Joyent solved some weird and challenging performance issues deep in the OS. It took lots of effort, DTrace, Flame Graphs, the USE Method, and writing custom tools when necessary. Here's a quick summary of the solved cases:

  • Monitoring. 1000x improvement. An application blocked while paging anonymous memory back in. It was also blocked during file system fsync() calls. The application was misconfigured and sometimes briefly exceeded available memory, getting paged out.
  • Riak. 2x improvement. The Erlang VM used half the number of CPUs it was supposed to, so CPUs remained unused. The fix was a configuration change.
  • ...

Click to read more ...

Thursday
Sep 10, 2009

When optimizing - don't forget the Java Virtual Machine (JVM) 

Recently, I was working on a project that was coming to a close. It involved optimizing a database by using a Java-based in-memory cache to reduce the load. The application had to process up to a million objects per day and was characterized by heavy memory use and a high number of read, write, and update operations. These operations were found to be the most costly, so optimization efforts were concentrated there.

The project had already achieved impressive performance increases, but one question remained unanswered - would changing the JVM increase performance?


Read more at: http://bigdatamatters.com/bigdatamatters/2009/08/jvm-performance.html