Netflix: Use Less Chatty Protocols in the Cloud - Plus 26 Fixes
In 5 Lessons We’ve Learned Using AWS, Netflix's John Ciancutti says one of the big lessons they've learned is to create less chatty protocols:
In the Netflix data centers, we have a high capacity, super fast, highly reliable network. This has afforded us the luxury of designing around chatty APIs to remote systems. AWS networking has more variable latency. We’ve had to be much more structured about “over the wire” interactions, even as we’ve transitioned to a more highly distributed architecture.
There's not a lot of advice out there on how to create protocols. Combine that with a rush to the cloud and you have a perfect storm for chatty applications crushing application performance. Netflix is far from the first to be surprised by the less than stellar networks inside AWS.
A chatty protocol is one where a client makes a series of requests to a server and must wait on each reply before sending the next request. On a LAN this can work great. LANs are typically fast, wide, and drop few packets.
Move that same application to a different network, one where round trip times can easily be an order of magnitude larger or more because the network is slow, lossy, or poorly designed, and a protocol that takes many requests to complete a transaction will suffer a dramatic drop in performance.
My WAN acceleration friends say Microsoft's Common Internet File System (CIFS) is infamous for being chatty. Transferring a 30MB file could tally something like 300msecs of latency on a LAN. On a WAN that could stretch to 7 minutes. Very unexpected results. What is key here is how the quality characteristics of the pipe interact with the protocol design.
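To see how numbers like these can arise, here's a back-of-the-envelope model of a stop-and-wait protocol. The 4 KB block size and the RTT figures are illustrative assumptions, not measurements of CIFS itself:

```python
# Transfer time for a chatty protocol is dominated by
# (number of round trips) x (round trip time), not by bandwidth.

def transfer_time_s(file_bytes, block_bytes, rtt_s):
    """Seconds spent waiting on round trips when every block
    requires its own request/reply exchange."""
    round_trips = file_bytes // block_bytes
    return round_trips * rtt_s

FILE = 30 * 1024 * 1024   # 30 MB file
BLOCK = 4 * 1024          # assume one round trip per 4 KB block

lan = transfer_time_s(FILE, BLOCK, 0.00004)  # assume ~40 microsecond LAN RTT
wan = transfer_time_s(FILE, BLOCK, 0.055)    # assume ~55 ms WAN RTT

print(f"LAN: {lan:.2f} s")        # LAN: 0.31 s
print(f"WAN: {wan / 60:.1f} min") # WAN: 7.0 min
```

The same 7,680 round trips that cost a third of a second on the LAN cost seven minutes on the WAN. Nothing about the data changed; only the latency per round trip did.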
OK, chatty protocols are bad. What can you do about it? The overall goal is to reduce the number of times a client has to wait for a response and reduce the number of packets sent.
- eduPERT has a great example of how sending mail using SMTP was reduced from 9 to 4 roundtrips through pipelining.
- Great paper on improving X Window System Network Performance using protocol-level performance analysis combined with passive network monitoring.
Some possible strategies are:
- Identify and remove unnecessary roundtrips. This may be obvious, but take a look at your protocols and see if any of the roundtrips can be combined or eliminated. There may be a lot of low hanging fruit here with just a little thought.
- Think services, not objects. One of the reasons protocols become so chatty is programmers are thinking in terms of fine-grained distributed objects rather than coarser-grained services. This leads to chatty protocols because there is a lot of back and forth for each method call, as methods tend to be pretty low in granularity. What you want is a service interface that does as large an amount of work for each request as possible. To transform an image, for example, you want to make one call and pass in all the transforms so they can be done in that one call rather than making a separate remote method invocation for each transform. This is a classic and common mistake when first using CORBA, for example.
- Aggregate data to reduce the number of roundtrips.
- Use token buckets. Token buckets are a technique for allowing a certain amount of burstiness while imposing a limit on the average data transmission rate.
- Batch/pipeline requests. Send more than one operation in a request and receive a single response.
- Batch via integration timers. Queue up requests for X milliseconds so they can be sent in a batch rather than one by one.
- Use fat pings. A fat ping carries the changed data in the notification itself, saving the follow-up request; compared to constant polling this reduces both round trips and the total data sent.
- Data Reduction Techniques. Keep data as small as possible so more work can fit inside a single TCP segment.
- Data compression.
- Use a binary protocol. Please, no flamewars, but a custom binary protocol will generally be more space efficient than text + compression.
- Do not transmit redundant information. This is more effective than compressing the information.
- Use high level descriptions rather than low level descriptions. Can you create a DSL or other way of describing your data and what you want done using a more compact code?
- Allow requests to reuse recently transmitted information. If data has already been sent, can later data refer back to it so it doesn't have to be transmitted again?
- Keep a local cache. Simply avoid going over the network by caching data locally. This of course pushes the problem out to become one of maintaining consistency.
- Use Asynchronous APIs. Don't pause an application waiting for replies. This doesn't fix the protocol necessarily, but may lessen the effect.
- Use ackless protocols. Don't even worry about replies. Fire off requests to a message queue. When a worker processes a message it can forward the response on to the next stage of the processing pipeline.
- Lazy and bulk initialization. When applications start they sometimes need to go through an elaborate initial configuration phase. Try to minimize the amount of configuration an application needs. When configuration is needed try to do it in bulk so that a program will get all its configuration attributes in one big structure rather than making a lot of individual calls. Another approach is to use lazy initialization so you just get configuration as needed rather than all at once. Maybe you won't need it?
- Move functionality to the client and out of the server. Is it possible for the client to do more work so nothing has to go back to the server? The X Window System saved a lot by moving font storage to the client.
- Realize more bandwidth is usually not a fix. Latency and packet loss are usually the problems so simply adding more bandwidth won't solve the problem, unless of course a lack of bandwidth is really the problem.
- Use rack affinity and availability zones. Latency and RTT times are increased by going through a long series of routers, switches etc. Take care that data stays within the same availability zone as much as possible. Some databases like Cassandra even let you specify rack affinity to further reduce this overhead.
- Avoid fine grained operations. Don't, for example, implement a drag and drop operation over the network that can take hundreds of requests to complete. Older versions of Windows would fetch the directory listing on the remote side and then perform an individual file stat operation for every attribute of every file. What a WAN accelerator does is get all the file data at once, create a local cache, and serve those requests from the local cache. A good transparent workaround, but of course the idea is to avoid these fine grained operations completely.
- Use persistent connections. Create a connection and keep it nailed up. This avoids the overhead of handshaking and connection establishment. On the web consider using Comet.
- Get right to business. Pass real data back and forth as soon as possible. Don't begin with a long negotiation phase or other overhead.
- Use stateful protocols. Do you need to send the same state all the time? For an application protocol the answer is probably no. For something more general like HTTP statelessness is probably the higher goal.
- Tune TCP. This is one of the typical functions of a WAN accelerator. TCP was not developed with high latency in mind, but you can fake it into behaving. On a WAN you could try creating your own MPLS network, but inside a cloud you don't have a lot of options. Hopefully in the future the L2/L3 network fabrics inside datacenters will improve as will datacenter interconnects.
- Colocate behavior and state. Colocating behavior and state reduces the number of network hops necessary to carry out operations.
- Use the High Performance Computing (HPC) cluster. The HPC cluster is backed by a high performance network; if your use case fits, it could be a big win.
- Measure the effect of changes in a systematic fashion. Profile your application to see what's really going on. Surprises may abound. When a change is made verify an improvement is actually being made.
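As a concrete illustration of the token bucket strategy above, here's a minimal sketch in Python. The class name and parameters are my own for illustration, not from any particular library:

```python
import time

class TokenBucket:
    """Allows bursts up to `capacity` requests while capping
    the long-run average rate at `rate` requests per second."""

    def __init__(self, rate, capacity):
        self.rate = float(rate)          # tokens added per second
        self.capacity = float(capacity)  # maximum burst size
        self.tokens = float(capacity)    # start with a full bucket
        self.last = time.monotonic()

    def consume(self, n=1):
        """Refill based on elapsed time, then take n tokens if available."""
        now = time.monotonic()
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= n:
            self.tokens -= n
            return True
        return False

bucket = TokenBucket(rate=10, capacity=5)  # average 10 req/s, bursts of 5
sent = sum(bucket.consume() for _ in range(20))
print(sent)  # typically 5: the initial burst passes, the rest are throttled
```

The sender gets its burstiness for free (the first five requests go out back to back), but a sustained flood is smoothed down to the configured average rate.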
Netflix posted a related article, Redesigning the Netflix API, where they go into more detail about their process. The idea is to ask: why is a device so chatty? Can the number of requests be cut in half? Since this will drive up payload sizes, they will try using partial responses as a countermeasure. Their goal is to conceptualize the API as a database. What does this mean? A partial response is like a SQL select statement where you specify only the fields you want back. Previously all fields for an object were returned, even if the client didn't need them. So the goal is to reduce payload sizes by being more selective about what data is returned.
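A toy sketch of the partial-response idea. The record fields and function below are hypothetical, not Netflix's actual API:

```python
# A full record as the server stores it; most clients only need a subset.
FULL_RECORD = {
    "id": 42,
    "title": "Example Title",
    "synopsis": "A long description the client may not need...",
    "cast": ["..."],
    "box_art_urls": ["..."],
}

def partial_response(record, fields):
    """Like a SQL SELECT: return only the columns the client asked for,
    silently skipping fields the record doesn't have."""
    return {k: record[k] for k in fields if k in record}

print(partial_response(FULL_RECORD, ["id", "title"]))
# {'id': 42, 'title': 'Example Title'}
```

The client names the fields it wants in the request, and the payload shrinks from the full object down to exactly those fields.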
Reader Comments (4)
This is kind of hinted at by some of your rules (e.g. 2, 4, 12) but I've found one very general guideline is to keep operations at a high semantic level whenever possible. For example, the problem in your #12 is that Windows expresses a high-level operation of listing a directory in terms of low-level operations on individual files. Another example is SQL; instead of expressing separate operations on individual rows, express a single operation to be performed on multiple rows that match a criterion. (One of the reasons I think REST sucks, BTW, is that in its "purer" forms it practically forces people to violate this principle by passing many nouns back and forth instead of one verb.) In practice, replacing naive fetch-examine-modify-store loops with predicates and filters/wildcards attached to the request itself can drastically reduce round trips. It also helps to preserve important semantic boundaries when caching/replicating/logging/reporting, provides separation of concerns between service and application, etc. At the extreme end of this spectrum you have various "move the code to the data" approaches, but once you start considering issues such as safety and security that quickly becomes a whole different discussion.
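The commenter's loop-versus-predicate point can be sketched with a toy in-memory "server". Names are illustrative, and the predicate is shown as a Python callable for brevity; in a real protocol it would be a serialized query expression carried in the request:

```python
# Chatty version: one round trip per row — fetch, examine, modify, store.
# High-level version: one round trip carrying a predicate and an update.

ROWS = [{"id": i, "status": "old" if i % 2 else "new"} for i in range(100)]

def server_update_where(predicate, update):
    """One request: the server applies the filter and the mutation itself,
    so the rows never cross the network."""
    changed = 0
    for row in ROWS:
        if predicate(row):
            row.update(update)
            changed += 1
    return changed

# One round trip instead of up to 100:
changed = server_update_where(lambda r: r["status"] == "old",
                              {"status": "archived"})
print(changed)  # 50
```

The semantic level of the operation ("archive everything old") survives intact at the server, which is also what makes caching, logging, and replication of the operation tractable.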
In a multi-cloud/region situation my database traffic is going over the public internet, so I want to encrypt it.
What would be the best approach in such a situation? An SSH tunnel would create pretty significant overhead, especially for small queries, and if the tunnel drops, the time to re-establish the session can be relatively lengthy.
Is there some other secure, less chatty method of ensuring my db traffic is protected?
Do those WAN accelerators make any significant difference in a cloud environment? With physical boxes I can understand that those devices talk directly to the hardware, but in a cloud/VM wouldn't those optimizations be limited by the hypervisor?
Regarding #16 Tune TCP: I wrote a patch for the EL5 (RHEL5 / CentOS) kernel that introduces several TCP tunable sysctl parameters. (The one that makes a huge difference is net.ipv4.tcp_iw.)
Below is a "readme" with brief descriptions of each new tunable parameter followed by the patch itself. Technical feedback / code review / improvements are welcomed and very much appreciated.
README
- net.ipv4.tcp_rto_init:
Description: Initial TCP retransmission timer in milliseconds.
Valid range: 30-3000
Default value: 1000
Governing Specification: RFC1122 (http://tools.ietf.org/html/rfc1122)
Non-patched kernel default value: 3000
- net.ipv4.tcp_rto_min:
Description: Minimum TCP retransmission timer in milliseconds.
Valid range: 30-1000
Default value: 30
Governing Specification: None.
Non-patched kernel default value: 1000
- net.ipv4.tcp_rto_max:
Description: Maximum TCP retransmission timer in milliseconds.
Valid range: 1000-60000
Default value: 60000
Governing Specification: None.
Non-patched kernel default value: 120000
- net.ipv4.tcp_iw:
Description: TCP initial congestion window.
Valid range: 2-16
Default value: 16
Governing Specification: RFC3390 (http://tools.ietf.org/html/rfc3390)
Non-patched kernel default value: 2
- net.ipv4.tcp_delayed_ack:
Description: When ON, RFC1122 defined TCP delayed acknowledgments are enabled.
Valid range: 0 = OFF, 1 = ON
Default value: 0
Governing Specification: RFC1122 (http://tools.ietf.org/html/rfc1122)
Non-patched kernel default value: 1
PATCH
--- include/linux/sysctl.h.org 2010-12-13 16:10:41.000000000 +0800
+++ include/linux/sysctl.h 2010-12-13 15:10:47.000000000 +0800
@@ -452,6 +452,11 @@ enum
NET_UDP_MEM=122,
NET_UDP_RMEM_MIN=123,
NET_UDP_WMEM_MIN=124,
+ NET_TCP_RTO_INIT=125,
+ NET_TCP_RTO_MIN=126,
+ NET_TCP_RTO_MAX=127,
+ NET_TCP_DELAYED_ACK=128,
+ NET_TCP_INITIAL_WINDOW=129,
};
enum {
--- include/net/tcp.h.org 2010-12-13 16:10:37.000000000 +0800
+++ include/net/tcp.h 2010-12-13 16:00:28.000000000 +0800
@@ -121,9 +121,9 @@ extern void tcp_time_wait(struct sock *s
#define TCP_DELACK_MIN 4U
#define TCP_ATO_MIN 4U
#endif
-#define TCP_RTO_MAX ((unsigned)(120*HZ))
-#define TCP_RTO_MIN ((unsigned)(HZ/5))
-#define TCP_TIMEOUT_INIT ((unsigned)(3*HZ)) /* RFC 1122 initial RTO value */
+#define TCP_RTO_MAX ((unsigned)(sysctl_tcp_rto_max))
+#define TCP_RTO_MIN ((unsigned)(sysctl_tcp_rto_min))
+#define TCP_TIMEOUT_INIT ((unsigned)(sysctl_tcp_rto_init))
#define TCP_RESOURCE_PROBE_INTERVAL ((unsigned)(HZ/2U)) /* Maximal interval between probes
* for local resources.
@@ -228,6 +228,19 @@ extern int sysctl_tcp_mtu_probing;
extern int sysctl_tcp_base_mss;
extern int sysctl_tcp_workaround_signed_windows;
extern int sysctl_tcp_slow_start_after_idle;
+extern int sysctl_tcp_rto_max;
+extern int tcp_rto_max_min;
+extern int tcp_rto_max_max;
+extern int sysctl_tcp_rto_min;
+extern int tcp_rto_min_min;
+extern int tcp_rto_min_max;
+extern int sysctl_tcp_rto_init;
+extern int tcp_rto_init_min;
+extern int tcp_rto_init_max;
+extern int sysctl_tcp_delayed_ack;
+extern int sysctl_tcp_iw;
+extern int tcp_iw_min;
+extern int tcp_iw_max;
extern atomic_t tcp_memory_allocated;
extern atomic_t tcp_sockets_allocated;
--- net/ipv4/sysctl_net_ipv4.c.org 2010-12-13 16:10:41.000000000 +0800
+++ net/ipv4/sysctl_net_ipv4.c 2010-12-13 15:31:32.000000000 +0800
@@ -836,6 +836,54 @@ ctl_table ipv4_table[] = {
.strategy = &sysctl_intvec,
.extra1 = &zero
},
+ {
+ .ctl_name = NET_TCP_RTO_INIT,
+ .procname = "tcp_rto_init",
+ .data = &sysctl_tcp_rto_init,
+ .maxlen = sizeof(int),
+ .mode = 0644,
+ .proc_handler = &proc_dointvec_minmax,
+ .extra1 = &tcp_rto_init_min,
+ .extra2 = &tcp_rto_init_max
+ },
+ {
+ .ctl_name = NET_TCP_RTO_MIN,
+ .procname = "tcp_rto_min",
+ .data = &sysctl_tcp_rto_min,
+ .maxlen = sizeof(int),
+ .mode = 0644,
+ .proc_handler = &proc_dointvec_minmax,
+ .extra1 = &tcp_rto_min_min,
+ .extra2 = &tcp_rto_min_max
+ },
+ {
+ .ctl_name = NET_TCP_RTO_MAX,
+ .procname = "tcp_rto_max",
+ .data = &sysctl_tcp_rto_max,
+ .maxlen = sizeof(int),
+ .mode = 0644,
+ .proc_handler = &proc_dointvec_minmax,
+ .extra1 = &tcp_rto_max_min,
+ .extra2 = &tcp_rto_max_max
+ },
+ {
+ .ctl_name = NET_TCP_DELAYED_ACK,
+ .procname = "tcp_delayed_ack",
+ .data = &sysctl_tcp_delayed_ack,
+ .maxlen = sizeof(int),
+ .mode = 0644,
+ .proc_handler = &proc_dointvec
+ },
+ {
+ .ctl_name = NET_TCP_INITIAL_WINDOW,
+ .procname = "tcp_iw",
+ .data = &sysctl_tcp_iw,
+ .maxlen = sizeof(int),
+ .mode = 0644,
+ .proc_handler = &proc_dointvec_minmax,
+ .extra1 = &tcp_iw_min,
+ .extra2 = &tcp_iw_max
+ },
{ .ctl_name = 0 }
};
--- net/ipv4/tcp_input.c.org 2010-12-13 16:10:41.000000000 +0800
+++ net/ipv4/tcp_input.c 2010-12-13 16:03:21.000000000 +0800
@@ -81,6 +81,27 @@ int sysctl_tcp_ecn;
int sysctl_tcp_dsack = 1;
int sysctl_tcp_app_win = 31;
int sysctl_tcp_adv_win_scale = 2;
+int sysctl_tcp_delayed_ack = 0;
+
+int sysctl_tcp_iw = 16;
+int tcp_iw_min = 2;
+int tcp_iw_max = 16;
+EXPORT_SYMBOL_GPL(sysctl_tcp_iw);
+
+int sysctl_tcp_rto_init = 1000;
+int tcp_rto_init_min = 30;
+int tcp_rto_init_max = 3000;
+EXPORT_SYMBOL_GPL(sysctl_tcp_rto_init);
+
+int sysctl_tcp_rto_min = 30;
+int tcp_rto_min_min = 30;
+int tcp_rto_min_max = 1000;
+EXPORT_SYMBOL_GPL(sysctl_tcp_rto_min);
+
+int sysctl_tcp_rto_max = 60000;
+int tcp_rto_max_min = 1000;
+int tcp_rto_max_max = 60000;
+EXPORT_SYMBOL_GPL(sysctl_tcp_rto_max);
int sysctl_tcp_stdurg;
int sysctl_tcp_rfc1337;
@@ -3647,6 +3668,8 @@ static void __tcp_ack_snd_check(struct s
&& __tcp_select_window(sk) >= tp->rcv_wnd) ||
/* We ACK each frame or... */
tcp_in_quickack_mode(sk) ||
+ /* Delayed ACK is disabled or ... */
+ sysctl_tcp_delayed_ack == 0 ||
/* We have out of order data. */
(ofo_possible &&
skb_peek(&tp->out_of_order_queue))) {
--- net/ipv4/tcp_ipv4.c.org 2010-12-15 04:25:45.000000000 +0800
+++ net/ipv4/tcp_ipv4.c 2010-12-15 01:57:49.000000000 +0800
@@ -1287,7 +1287,7 @@ static int tcp_v4_init_sock(struct sock
* algorithms that we must have the following bandaid to talk
* efficiently to them. -DaveM
*/
- tp->snd_cwnd = 2;
+ tp->snd_cwnd = sysctl_tcp_iw;
/* See draft-stevens-tcpca-spec-01 for discussion of the
* initialization of these values.
--- net/ipv4/tcp_minisocks.c.org 2010-12-15 04:25:45.000000000 +0800
+++ net/ipv4/tcp_minisocks.c 2010-12-15 01:57:49.000000000 +0800
@@ -381,7 +381,7 @@ struct sock *tcp_create_openreq_child(st
* algorithms that we must have the following bandaid to talk
* efficiently to them. -DaveM
*/
- newtp->snd_cwnd = 2;
+ newtp->snd_cwnd = sysctl_tcp_iw;
newtp->snd_cwnd_cnt = 0;
newtp->bytes_acked = 0;
--- net/ipv6/tcp_ipv6.c.org 2010-12-15 04:25:45.000000000 +0800
+++ net/ipv6/tcp_ipv6.c 2010-12-15 01:57:49.000000000 +0800
@@ -1457,7 +1457,7 @@ static int tcp_v6_init_sock(struct sock
* algorithms that we must have the following bandaid to talk
* efficiently to them. -DaveM
*/
- tp->snd_cwnd = 2;
+ tp->snd_cwnd = sysctl_tcp_iw;
/* See draft-stevens-tcpca-spec-01 for discussion of the
* initialization of these values.
I'm providing an update just in case somebody is using this patch.
Since I posted my original kernel TCP patch the mainline kernel has moved to IW10. I've also realised a larger initial congestion window inherently makes delayed ACKs less of an issue. Lastly I became a bit concerned that reducing the initial RTO could become an issue for mobile clients. Ultimately I've decided to KISS by simply backporting these three TCP IW10 related patches into the EL5 kernel:
* TCP: Increase the initial congestion window to 10
* TCP: Increase default initial receive window
* TCP: Fix >2 iw selection
EL5 kernel (2.6.18-238.1.1.el5) patch:
diff -Nur a/include/net/tcp.h b/include/net/tcp.h
--- a/include/net/tcp.h 2011-02-13 19:56:38.643053044 +0000
+++ b/include/net/tcp.h 2011-02-13 20:01:30.435922376 +0000
@@ -55,6 +55,9 @@
*/
#define MAX_TCP_WINDOW 32767U
+/* Offer an initial receive window. */
+#define TCP_DEFAULT_INIT_RCVWND 10
+
/* Minimal accepted MSS. It is (60+60+8) - (20+20). */
#define TCP_MIN_MSS 88U
@@ -186,6 +189,9 @@
#define TCP_NAGLE_CORK 2 /* Socket is corked */
#define TCP_NAGLE_PUSH 4 /* Cork is overridden for already queued data */
+/* TCP initial congestion window */
+#define TCP_INIT_CWND 10
+
extern struct inet_timewait_death_row tcp_death_row;
/* sysctl variables for tcp */
diff -Nur a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
--- a/net/ipv4/tcp_input.c 2011-02-13 19:56:41.183078107 +0000
+++ b/net/ipv4/tcp_input.c 2011-02-13 19:59:17.104613681 +0000
@@ -754,18 +754,13 @@
}
}
-/* Numbers are taken from RFC2414. */
__u32 tcp_init_cwnd(struct tcp_sock *tp, struct dst_entry *dst)
{
__u32 cwnd = (dst ? dst_metric(dst, RTAX_INITCWND) : 0);
- if (!cwnd) {
- if (tp->mss_cache > 1460)
- cwnd = 2;
- else
- cwnd = (tp->mss_cache > 1095) ? 3 : 4;
- }
- return min_t(__u32, cwnd, tp->snd_cwnd_clamp);
+ if (!cwnd)
+ cwnd = TCP_INIT_CWND;
+ return min_t(__u32, cwnd, tp->snd_cwnd_clamp);
}
/* Set slow start threshold and cwnd not falling to slow start */
@@ -846,6 +841,8 @@
tcp_bound_rto(sk);
if (inet_csk(sk)->icsk_rto < TCP_TIMEOUT_INIT && !tp->rx_opt.saw_tstamp)
goto reset;
+
+cwnd:
tp->snd_cwnd = tcp_init_cwnd(tp, dst);
tp->snd_cwnd_stamp = tcp_time_stamp;
return;
@@ -860,6 +857,7 @@
tp->mdev = tp->mdev_max = tp->rttvar = TCP_TIMEOUT_INIT;
inet_csk(sk)->icsk_rto = TCP_TIMEOUT_INIT;
}
+ goto cwnd;
}
static void tcp_update_reordering(struct sock *sk, const int metric,
diff -Nur a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
--- a/net/ipv4/tcp_output.c 2011-02-13 19:56:39.603062517 +0000
+++ b/net/ipv4/tcp_output.c 2011-02-13 19:47:51.517817495 +0000
@@ -208,19 +208,18 @@
}
}
- /* Set initial window to value enough for senders,
- * following RFC2414. Senders, not following this RFC,
- * will be satisfied with 2.
- */
- if (mss > (1<<*rcv_wscale)) {
- int init_cwnd = 4;
- if (mss > 1460*3)
- init_cwnd = 2;
- else if (mss > 1460)
- init_cwnd = 3;
- if (*rcv_wnd > init_cwnd*mss)
- *rcv_wnd = init_cwnd*mss;
- }
+ /* Set initial window to a value enough for senders starting with
+ * initial congestion window of TCP_DEFAULT_INIT_RCVWND. Place
+ * a limit on the initial window when mss is larger than 1460.
+ */
+ if (mss > (1 << *rcv_wscale)) {
+ int init_cwnd = TCP_DEFAULT_INIT_RCVWND;
+ if (mss > 1460)
+ init_cwnd =
+ max_t(u32, (1460 * TCP_DEFAULT_INIT_RCVWND) / mss, 2);
+ else
+ *rcv_wnd = min(*rcv_wnd, init_cwnd * mss);
+ }
/* Set the clamp no higher than max representable value */
(*window_clamp) = min(65535U << (*rcv_wscale), *window_clamp);