The Secret to 10 Million Concurrent Connections - The Kernel is the Problem, Not the Solution
Monday, May 13, 2013 at 8:30AM
HighScalability Team

Now that we have the C10K concurrent connection problem licked, how do we level up and support 10 million concurrent connections? Impossible, you say. Nope, systems right now are delivering 10 million concurrent connections using techniques that are as radical as they are unfamiliar.

To learn how it’s done we turn to Robert Graham, CEO of Errata Security, and his absolutely fantastic talk at Shmoocon 2013 called C10M Defending The Internet At Scale.

Robert has a brilliant way of framing the problem that I'd never heard before. He starts with a little bit of history, relating how Unix wasn't originally designed to be a general server OS; it was designed as the control system for a telephone network. The telephone network itself transported the data, so there was a clean separation between the control plane and the data plane. The problem is that we now use Unix servers as part of the data plane, which we shouldn't do at all. If we were designing a kernel to handle one application per server, we would design it very differently than a multi-user kernel.

Which is why he says the key is to understand: the kernel isn't the solution, the kernel is the problem.

Which means: don't let the kernel do all the heavy lifting. Take packet handling, memory management, and processor scheduling out of the kernel and put them into the application, where they can be done efficiently. Let Linux handle the control plane and let the application handle the data plane.

The result will be a system that can handle 10 million concurrent connections with 200 clock cycles for packet handling and 1400 clock cycles for application logic. Since a main memory access costs about 300 clock cycles, it's key to design in a way that minimizes code and cache misses.

With a data plane oriented system you can process 10 million packets per second. With a control plane oriented system you only get 1 million packets per second.
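To see why those budgets are tight, here is a hedged back-of-envelope check. The 3 GHz clock rate and the idea of spreading the load across cores are my assumptions; the 200 + 1400 cycle budget and the 10 million packets/second target come from the talk.

```c
/* Back-of-envelope cycle budget using the numbers from the talk.
 * The 3 GHz clock rate is an illustrative assumption. */
#include <stdio.h>

int main(void) {
    double clock_hz        = 3e9;        /* assume a 3 GHz core */
    double packets_per_sec = 10e6;       /* C10M target: 10M packets/second */
    double cycles_per_pkt  = 200 + 1400; /* packet handling + application logic */

    /* A single core at 10M packets/second only has clock_hz / pps cycles per packet,
     * so the full 1600-cycle budget has to be spread across several cores. */
    printf("one core gives %.0f cycles per packet at 10M pps\n",
           clock_hz / packets_per_sec);
    printf("the full budget needs ~%.1f cores worth of cycles\n",
           packets_per_sec * cycles_per_pkt / clock_hz);
    return 0;
}
```

Note that a single main memory access (about 300 cycles) already blows the per-core budget, which is why cache misses dominate the design.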

If this seems extreme keep in mind the old saying: scalability is specialization. To do something great you can’t outsource performance to the OS. You have to do it yourself.

Now, let’s learn how Robert creates a system capable of handling 10 million concurrent connections...

C10K Problem - So Last Decade 

A decade ago engineers tackled the C10K scalability problem that prevented servers from handling more than 10,000 concurrent connections. The problem was solved by fixing OS kernels and moving away from threaded servers like Apache to event-driven servers like Nginx and Node. That migration has taken a decade, and adoption of scalable servers has only accelerated in the last few years.
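For readers who haven't written one, here is a minimal sketch of the event-driven pattern those servers use, built on Linux's epoll: one thread, non-blocking sockets, and readiness events instead of a thread per connection. The port number and echo behavior are illustrative choices, and error handling is pared down.

```c
/* Minimal event-driven server sketch (the post-C10K pattern). */
#include <stdio.h>
#include <unistd.h>
#include <fcntl.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/epoll.h>
#include <sys/socket.h>

#define MAX_EVENTS 1024

static void set_nonblocking(int fd) {
    fcntl(fd, F_SETFL, fcntl(fd, F_GETFL, 0) | O_NONBLOCK);
}

int main(void) {
    int listener = socket(AF_INET, SOCK_STREAM, 0);
    int one = 1;
    setsockopt(listener, SOL_SOCKET, SO_REUSEADDR, &one, sizeof(one));

    struct sockaddr_in addr = { .sin_family = AF_INET,
                                .sin_port = htons(8080),          /* illustrative port */
                                .sin_addr.s_addr = htonl(INADDR_ANY) };
    bind(listener, (struct sockaddr *)&addr, sizeof(addr));
    listen(listener, SOMAXCONN);
    set_nonblocking(listener);

    int ep = epoll_create1(0);
    struct epoll_event ev = { .events = EPOLLIN, .data.fd = listener };
    epoll_ctl(ep, EPOLL_CTL_ADD, listener, &ev);

    struct epoll_event events[MAX_EVENTS];
    char buf[4096];
    for (;;) {
        int n = epoll_wait(ep, events, MAX_EVENTS, -1);
        for (int i = 0; i < n; i++) {
            int fd = events[i].data.fd;
            if (fd == listener) {                 /* new connection: register it */
                int conn = accept(listener, NULL, NULL);
                if (conn < 0) continue;
                set_nonblocking(conn);
                struct epoll_event cev = { .events = EPOLLIN, .data.fd = conn };
                epoll_ctl(ep, EPOLL_CTL_ADD, conn, &cev);
            } else {                              /* data ready: echo it back */
                ssize_t len = read(fd, buf, sizeof(buf));
                if (len <= 0) { close(fd); continue; }
                write(fd, buf, (size_t)len);
            }
        }
    }
}
```

One event loop like this can juggle tens of thousands of idle connections because the cost per connection is a file descriptor and a little state, not a thread stack.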

The Apache Problem

The C10M Problem - The Next Decade 

In the very near future servers will need to handle millions of concurrent connections. With IPv6 the number of potential connections coming from each server is in the millions, so we need to go to the next level of scalability.

What the 10M Concurrent Connection Challenge means:

  1. 10 million concurrent connections

  2. 1 million connections/second - the sustained rate if each connection lasts about 10 seconds

  3. 10 gigabits/second connection - fast connections to the Internet.

  4. 10 million packets/second - expect current servers to be handling about 50K packets per second; this is going to a much higher level. Servers used to be able to handle about 100K interrupts per second, and every packet caused an interrupt.

  5. 10 microsecond latency - scalable servers might handle the scale but latency would spike.

  6. 10 microsecond jitter - limit the maximum latency

  7. 10 coherent CPU cores - software should scale to larger numbers of cores. Typically software only scales easily to four cores. Servers can scale to many more cores so software needs to be rewritten to support larger core machines.

We’ve Learned Unix, Not Network Programming 

How do you write software that scales?

How do you change your software to make it scale? A lot of our rules of thumb about how much hardware can handle are false. We need to know what the performance capabilities actually are.

To go to the next level the problems we need to solve are:

  1. packet scalability
  2. multi-core scalability
  3. memory scalability 

Packet Scaling - Write Your Own Custom Driver to Bypass the Stack
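The prescription here is to take packet handling away from the kernel's network stack. As a hedged illustration of that direction, the sketch below uses Linux's PACKET_MMAP receive ring, which lets the application walk incoming frames in shared memory instead of paying a read() per packet. It only approximates the custom-driver approach from the talk (real kernel-bypass stacks go further), and it needs root privileges to run.

```c
/* Sketch: poll packets from a memory-mapped RX ring (PACKET_MMAP).
 * An approximation of "bypass the stack"; not a custom driver. */
#include <stdio.h>
#include <arpa/inet.h>
#include <sys/socket.h>
#include <sys/mman.h>
#include <linux/if_packet.h>
#include <linux/if_ether.h>
#include <poll.h>

int main(void) {
    int fd = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ALL));
    if (fd < 0) { perror("socket (needs root)"); return 1; }

    struct tpacket_req req = {
        .tp_block_size = 4096, .tp_block_nr = 64,   /* illustrative ring geometry */
        .tp_frame_size = 2048, .tp_frame_nr  = 128,
    };
    if (setsockopt(fd, SOL_PACKET, PACKET_RX_RING, &req, sizeof(req)) < 0) {
        perror("PACKET_RX_RING"); return 1;
    }

    size_t ring_len = (size_t)req.tp_block_size * req.tp_block_nr;
    unsigned char *ring = mmap(NULL, ring_len, PROT_READ | PROT_WRITE,
                               MAP_SHARED, fd, 0);
    if (ring == MAP_FAILED) { perror("mmap"); return 1; }

    /* Walk frames in user space; the kernel fills the ring, no copy per read(). */
    unsigned frame = 0;
    for (;;) {
        struct tpacket_hdr *hdr =
            (struct tpacket_hdr *)(ring + (size_t)frame * req.tp_frame_size);
        if (!(hdr->tp_status & TP_STATUS_USER)) {
            struct pollfd pfd = { .fd = fd, .events = POLLIN };
            poll(&pfd, 1, -1);                  /* wait for the kernel to hand us a frame */
            continue;
        }
        printf("got %u bytes\n", hdr->tp_len);
        hdr->tp_status = TP_STATUS_KERNEL;      /* return the frame to the kernel */
        frame = (frame + 1) % req.tp_frame_nr;
    }
}
```

The point of the exercise is the shape of the loop: the application owns the packet buffers and the receive loop, and the kernel's per-packet work shrinks to filling a shared ring.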

Multi-Core Scalability

Multi-core scalability is not the same thing as multi-threading scalability. We’re all familiar with the idea that processors are not getting faster, but we are getting more of them.

Most code doesn’t scale past 4 cores. As we add more cores it’s not just that performance levels off; we can actually get slower and slower as we add more cores. That’s because the software is written badly. We want software to scale nearly linearly as we add cores: it should get faster with every core we add.

Multi-threading coding is not multi-core coding
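A hedged illustration of the distinction: many threads hammering one shared, locked counter serialize on a single cache line, while giving each thread (ideally each core) its own padded counter lets increments proceed with no contention. The thread count and 64-byte cache-line size below are assumptions for the sketch.

```c
/* Sketch: per-core (here, per-thread) data instead of one shared counter.
 * Each thread bumps its own cache-line-padded slot; a reader sums them. */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

#define NTHREADS 4   /* illustrative; one worker per core in a real design */

struct padded_counter {
    atomic_ulong value;
    char pad[64 - sizeof(atomic_ulong)];   /* keep each counter on its own cache line */
};

static struct padded_counter counters[NTHREADS];

static void *worker(void *arg) {
    struct padded_counter *c = arg;
    for (unsigned long i = 0; i < 10 * 1000 * 1000UL; i++)
        atomic_fetch_add_explicit(&c->value, 1, memory_order_relaxed);
    return NULL;
}

int main(void) {
    pthread_t tid[NTHREADS];
    for (int i = 0; i < NTHREADS; i++)
        pthread_create(&tid[i], NULL, worker, &counters[i]);
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(tid[i], NULL);

    unsigned long total = 0;
    for (int i = 0; i < NTHREADS; i++)
        total += atomic_load(&counters[i].value);
    printf("total = %lu\n", total);
    return 0;
}
```

Swap the per-thread slots for one shared counter (or a mutex around it) and throughput drops as cores are added, which is exactly the "slower with more cores" effect described above.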

Memory Scalability

Summary 

The control plane is left to Linux; the data plane gets nothing from the kernel. The data plane runs in application code. It never interacts with the kernel. There’s no thread scheduling, no system calls, no interrupts, nothing. 

Yet what you have is code running on Linux that you can debug normally; it’s not some sort of weird hardware system that requires custom engineering. You get the performance of custom hardware that you would expect for your data plane, but with your familiar programming and development environment. 
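As a rough sketch of what "no thread scheduling, no system calls, no interrupts" can look like in practice, the fragment below pins a data-plane thread to a dedicated core and busy-polls instead of blocking in the kernel. The core number, the poll_ring() placeholder, and the x86 pause hint are all assumptions for illustration.

```c
/* Sketch: dedicate a core to the data plane and never enter the kernel
 * on the hot path. poll_ring() stands in for draining a user-space ring
 * like the PACKET_MMAP example above. */
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdbool.h>

static bool poll_ring(void) {           /* placeholder: drain packets, return true if any */
    return false;
}

static void *data_plane(void *arg) {
    (void)arg;
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(1, &set);                   /* core 1 is an assumption; use an isolated core */
    pthread_setaffinity_np(pthread_self(), sizeof(set), &set);

    for (;;) {                          /* never block, never make a system call */
        if (!poll_ring())
            __asm__ __volatile__("pause");   /* x86 spin-wait hint */
    }
    return NULL;
}

int main(void) {
    pthread_t t;
    pthread_create(&t, NULL, data_plane, NULL);
    pthread_join(t, NULL);              /* the Linux control plane lives out here */
    return 0;
}
```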

