Paper: MegaPipe: A New Programming Interface for Scalable Network I/O

The paper MegaPipe: A New Programming Interface for Scalable Network I/O (video, slides) hits the common theme that if you want to go faster you need a better car design, not just a better driver. So the authors started with a clean slate and designed a network API from the ground up with support for concurrent I/O, a requirement for achieving high performance while scaling to large numbers of connections per thread, multiple cores, etc. What they created is MegaPipe, "a new network programming API for message-oriented workloads to avoid the performance issues of BSD Socket API."
The result: MegaPipe outperforms baseline Linux between 29% (for long connections) and 582% (for short connections). MegaPipe improves the performance of a modified version of memcached between 15% and 320%. For a workload based on real-world HTTP traces, MegaPipe boosts the throughput of nginx by 75%.
What's this most excellent and interesting paper about?
Message-oriented network workloads, where connections are short and/or message sizes are small, are CPU intensive and scale poorly on multi-core systems with the BSD Socket API. We present MegaPipe, a new API for efficient, scalable network I/O for message-oriented workloads. The design of MegaPipe centers around the abstraction of a channel, a per-core, bidirectional pipe between the kernel and user space, used to exchange both I/O requests and event notifications. On top of the channel abstraction, we introduce three key concepts of MegaPipe: partitioning, lightweight socket (lwsocket), and batching. We implement MegaPipe in Linux and adapt memcached and nginx. Our results show that, by embracing a clean-slate design approach, MegaPipe is able to exploit new opportunities for improved performance and ease of programmability. In microbenchmarks on an 8-core server with 64 B messages, MegaPipe outperforms baseline Linux between 29% (for long connections) and 582% (for short connections). MegaPipe improves the performance of a modified version of memcached between 15% and 320%. For a workload based on real-world HTTP traces, MegaPipe boosts the throughput of nginx by 75%.
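To make the channel abstraction concrete, here's a rough sketch of what a per-core channel event loop could look like. The mp_* names, types, and signatures below are illustrative placeholders loosely modeled on the paper's description, not the actual MegaPipe API:

```c
/* Hypothetical per-core channel event loop. All mp_* names, types, and
 * signatures are illustrative placeholders, not the real MegaPipe API. */
#include <sys/types.h>

struct mp_channel;                                      /* opaque channel handle          */
struct mp_event { int handle; ssize_t result; void *cookie; };

/* Placeholder declarations for the sketch. */
struct mp_channel *mp_create(void);
int  mp_register(struct mp_channel *ch, int fd, void *cookie);
int  mp_accept(struct mp_channel *ch, int listen_handle);
int  mp_dispatch(struct mp_channel *ch, struct mp_event *events, int max);
void handle_completion(struct mp_channel *ch, struct mp_event *ev);   /* app logic */

void worker_thread(int listen_handle)
{
    struct mp_channel *ch = mp_create();        /* one channel per core           */
    mp_register(ch, listen_handle, NULL);       /* attach the listening socket    */
    mp_accept(ch, listen_handle);               /* queue an asynchronous accept   */

    struct mp_event events[64];
    for (;;) {
        /* Flush queued requests and collect a batch of completion
         * notifications with a single kernel crossing. */
        int n = mp_dispatch(ch, events, 64);
        for (int i = 0; i < n; i++)
            handle_completion(ch, &events[i]);  /* application-level handler      */
    }
}
```

Requests and completions both flow through the channel, which is what lets the kernel and the application talk in batches rather than one system call per operation.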
Performance with Small Messages:
Small messages result in greater relative network I/O overhead than larger messages. In fact, the per-message overhead remains roughly constant and is thus independent of message size; compared with a 64 B message, a 1 KiB message adds only about 2% overhead from copying between user and kernel space on our system, despite the large size difference.
Partitioned listening sockets:
Instead of a single listening socket shared across cores, MegaPipe allows applications to clone a listening socket and partition its associated queue across cores. Such partitioning improves performance with multiple cores while giving applications control over their use of parallelism.
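MegaPipe's partitioned listening sockets aren't something you can call from stock Linux, but the closest mainline analogue is SO_REUSEPORT (Linux 3.9+), which likewise lets each worker thread bind its own listening socket so the kernel spreads incoming connections across the per-thread accept queues. A minimal sketch of that stand-in technique:

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Each worker thread/core calls this to get its own listening socket on the
 * same port; the kernel partitions incoming connections across all of them.
 * This is the closest stock-Linux analogue to MegaPipe's per-core
 * partitioned listening sockets, not MegaPipe itself. */
static int make_partitioned_listener(uint16_t port)
{
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0)
        return -1;

    int one = 1;
    setsockopt(fd, SOL_SOCKET, SO_REUSEPORT, &one, sizeof(one));

    struct sockaddr_in addr;
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port = htons(port);

    if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0 ||
        listen(fd, SOMAXCONN) < 0) {
        close(fd);
        return -1;
    }
    return fd;
}
```

Either way, the point is the same: no single shared accept queue becomes a cross-core contention point.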
Lightweight sockets:
Sockets are represented by file descriptors and hence inherit some unnecessary file-related overheads. MegaPipe instead introduces lwsocket, a lightweight socket abstraction that is not wrapped in file-related data structures and thus is free from system-wide synchronization.
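One way to picture why that matters: a regular socket must claim a slot in the process-wide file table, with its system-wide locking and VFS bookkeeping, while an lwsocket only needs an entry in its own channel's private table. The sketch below is purely illustrative bookkeeping, not MegaPipe's actual implementation:

```c
#define MAX_LWSOCKETS 4096

/* Per-channel table of lightweight sockets. Because the table belongs to a
 * single core's channel, allocation and lookup need no system-wide locking,
 * unlike the shared file-descriptor table behind regular sockets. */
struct lwsocket { void *tcp_state; };          /* would reference the kernel's TCB */

struct channel_table {
    struct lwsocket slots[MAX_LWSOCKETS];
    int next_free;
};

/* Hand out a channel-local id; the id means nothing outside this channel
 * and no global lock is taken. Purely illustrative bookkeeping. */
static int lwsocket_alloc(struct channel_table *t, void *tcp_state)
{
    if (t->next_free >= MAX_LWSOCKETS)
        return -1;
    int id = t->next_free++;
    t->slots[id].tcp_state = tcp_state;
    return id;
}
```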
System Call Batching:
MegaPipe amortizes system call overheads by batching asynchronous I/O requests and completion notifications within a channel.
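Here's the batching idea in miniature. Again, the mp_* names are placeholders rather than the real API; the point is that a whole batch of requests and their completions rides on a single kernel crossing:

```c
#include <sys/types.h>

struct mp_channel;                                   /* opaque, as in the earlier sketch */
struct mp_event { int handle; ssize_t result; };     /* placeholder completion record    */

int  mp_read(struct mp_channel *ch, int handle, void *buf, size_t len);     /* placeholder */
int  mp_dispatch(struct mp_channel *ch, struct mp_event *events, int max);  /* placeholder */
void process_reply(struct mp_channel *ch, struct mp_event *ev);             /* placeholder */

/* Queue one read per connection, then submit the whole batch and harvest
 * completions with a single mp_dispatch() call: one kernel crossing instead
 * of one system call per operation. */
void serve_batch(struct mp_channel *ch, int *handles, int nconn, char bufs[][4096])
{
    for (int i = 0; i < nconn; i++)
        mp_read(ch, handles[i], bufs[i], sizeof(bufs[i]));  /* queued in user space, no syscall yet */

    struct mp_event events[64];
    int n = mp_dispatch(ch, events, 64);    /* one kernel crossing for the whole batch */

    for (int i = 0; i < n; i++)
        process_reply(ch, &events[i]);      /* application callback */
}
```

With a batch of B operations, the per-operation system call cost drops to roughly 1/B of the unbatched cost, which is the throughput-versus-latency trade-off one of the comments below points out.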
Reader Comments (3)
This sounds pretty great. So, I guess what I'm not seeing are (1) what are the downsides and (2) why isn't this already the way we do things?
I only read the slides, but the downside might be increased latency. The benefits they advertise are increased throughput (proven with benchmarks) and a more intuitive API (subjective, but I am inclined to agree). The throughput increase is achieved by reducing the number of system calls per transaction from 1 to 1/(number of operations in a batch). For use cases that are not sensitive to a moderate latency increase, such as streaming video, there is no disadvantage.
The reason this isn't the way we do things already is that it is a non-obvious innovation.
Very nice talk; however, I wonder why it's just a talk and not source code. It's not mentioned anywhere, not even in the Google search results. I would have loved to look at it, see if I could integrate it into my own projects, and see how performance scales then.
OTOH, I am pretty sure the Linux kernel devs would have implemented it already if it were really better, no?