Entries in Strategy (358)

Monday
Jul 27, 2009

Handle 700 Percent More Requests Using Squid and APC Cache

This post on www.ilovebonnie.net documents some impressive system performance improvements by the addition of Squid Cache (a caching proxy) and APC Cache (opcode cache for PHP).

  • Apache is able to deliver roughly 700% more requests per second with Squid in front of it when serving 1KB and 100KB images.
  • Server load is reduced with Squid because the server doesn't have to spin up a bunch of Apache processes to handle the requests.
  • APC took a system that could barely handle 10-20 requests per second to handling 50-60 requests per second, roughly a 400% increase.
  • APC kept load times under 5 seconds even with 200 concurrent threads slamming the server.
  • Both caches are easy to install and set up, and they let you get a lot more performance out of the hardware you already have.
The post has an in-depth discussion and a number of supporting charts. The primary point is how simple it can be to improve performance and scalability by adding caching.

Saturday
Jul 25, 2009

Latency is Everywhere and it Costs You Sales - How to Crush it

Update 8: The Cost of Latency by James Hamilton. James summarizes latency findings from Steve Souders, Greg Linden, and Marissa Mayer. Speed [is] an undervalued and under-discussed asset on the web.

Update 7: How do you know when you need more memcache servers? Dathan Pattishall talks about using memcache not to scale, but to reduce latency and I/O spikes, and how to use stats to know when more servers are needed.
Update 6: Stock Traders Find Speed Pays, in Milliseconds. Goldman Sachs is making record profits off a 500 millisecond trading advantage. Yes, latency matters. As an interesting aside, Libet found 500 msecs is about the time it takes the brain to weave together an experience of consciousness from all our sensor inputs.
Update 5: Shopzilla's Site Redo - You Get What You Measure. At the Velocity conference Phil Dixon, from Shopzilla, presented data showing a 5 second speed up resulted in a 25% increase in page views, a 10% increase in revenue, a 50% reduction in hardware, and a 120% increase in traffic from Google. Built a new service oriented Java based stack. Keep it simple. Quality is a design decision. Obsessively measure everything. Used agile and built the site one page at a time to get feedback. Use proxies to incrementally expose users to new pages for A/B testing. Oracle Coherence Grid for caching. 1.5 second page load SLA. 650ms server side SLA. Make 30 parallel calls on the server. 100 million requests a day. SLAs measure the 95th percentile; averages are not useful. Little things make a big difference.
Update 4: Slow Pages Lose Users. At the Velocity Conference Jake Brutlag (Google Search) and Eric Schurman (Microsoft Bing) presented study data showing delays under half a second impact business metrics and delay costs increase over time and persist. Page weight not key. Progressive rendering helps a lot.
Update 3: Nati Shalom's Take on this article. Lots of good stuff on designing architectures for latency minimization.
Update 2: Why Latency Lags Bandwidth, and What it Means to Computing by David Patterson. Reasons: Moore's Law helps BW more than latency; Distance limits latency; Bandwidth easier to sell; Latency helps BW, but not vice versa; Bandwidth hurts latency; OS overhead hurts latency more than BW. Three ways to cope: Caching, Replication, Prediction. We haven't talked about prediction. Games use prediction, i.e., projecting where a character will go, but it's not a strategy much used on websites.
Update: Efficient data transfer through zero copy. Copying data kills. This excellent article explains the path data takes through the OS and how to reduce the number of copies to the big zero.

Latency matters. Amazon found every 100ms of latency cost them 1% in sales. Google found an extra .5 seconds in search page generation time dropped traffic by 20%. A broker could lose $4 million in revenues per millisecond if their electronic trading platform is 5 milliseconds behind the competition.

The Amazon results were reported by Greg Linden in his presentation Make Data Useful. In one of Greg's slides Google VP Marissa Mayer, in reference to the Google results, is quoted as saying "Users really respond to speed." And everyone wants responsive users. Ka-ching! People hate waiting and they're repulsed by seemingly small delays.

The less interactive a site becomes the more likely users are to click away and do something else. Latency is the mother of interactivity. Though it's possible through various UI techniques to make pages subjectively feel faster, slow sites generally lead to higher customer defection rates, which lead to lower conversion rates, which result in lower sales. Yet for some reason latency isn't a topic we talk about much for web apps. We talk a lot about building high-capacity sites, but very little about how to build low-latency sites. We apparently do so at the expense of our immortal bottom line.

I wondered: if latency went to zero, would sales be infinite? But alas, as Dan Pritchett says, Latency Exists, Cope!. So we can't hide the "latency problem" by appointing a Latency Czar to conduct a nice little war on latency. Instead, we need to learn how to minimize and manage latency. It turns out a lot of problems are better solved that way.

How do we recover that which is most meaningful--sales--and build low-latency systems?

I'm excited that the topic of latency came up. There are a few good presentations on this topic I've been dying for a chance to reference. And latency is one of those quantifiable qualities that takes real engineering to create. A lot of what we do is bolt together other people's toys. Building high-capacity, low-latency systems takes mad skills. Which is fun. And which may also account for why we see latency as a core design skill in real-time and market trading type systems, but not web systems. We certainly want our nuclear power plant plutonium fuel rod lowering hardware to respond to interrupts with sufficient alacrity. While less serious, trading companies are always in a technological arms race to create lower latency systems. He with the fastest system creates a sort of private wire for receiving and acting on information faster than everyone else. Knowing who has the bestest price the firstest is a huge advantage. But if our little shopping cart takes an extra 500 milliseconds to display, the world won't end. Or will it?

Latency Defined

My unsophisticated definition of latency is that it is the elapsed time between A and B, where A and B are something you care about. Low-latency and high-latency are relative terms. The latency requirements for a femtosecond laser are far different than for mail delivery via the Pony Express, yet both systems can be characterized by latency. A system is low-latency if its latency is low enough to meet requirements; otherwise it's a high-latency system.

Latency Explained

The best explanation of latency I've ever read is still It's the Latency, Stupid by admitted network wizard Stuart Cheshire. A wonderful and detailed rant explaining latency as it relates to network communication, but the ideas are applicable everywhere.

Stuart's major point: If you have a network link with low bandwidth then it's an easy matter of putting several in parallel to make a combined link with higher bandwidth, but if you have a network link with bad latency then no amount of money can turn any number of them into a link with good latency.

I like the parallel with sharding in this observation. We put shards in parallel to increase capacity, but request latency through the system remains the same. So if we want to increase interactivity we have to address every component in the system that introduces latency and minimize or remove its contribution. There's no "easy" scale-out strategy for fixing latency problems.

Sources of Latency

My parents told me latency was brought by Santa Claus in the dead of night, but that turns out not to be true! So where does latency come from?

  • Low Level Infrastructure. Includes the OS / kernel, processors / CPUs, memory, storage related I/O, and network related I/O.
  • High Level Infrastructure. Analysis of sources of latency in downloading web pages by Marc Abrams. The study examines several sources of latency: DNS, TCP, Web server, network links, and routers. Conclusion: In most cases, roughly half of the time is spent from the moment the browser sends the acknowledgment completing the TCP connection establishment until the first packet containing page content arrives. The bulk of this time is the round trip delay, and only a tiny portion is delay at the server. This implies that the bottleneck in accessing pages over the Internet is due to the Internet itself, and not the server speed.
  • Software Processing. Software processing accounts for much of the difficult to squeeze out latency in a system. In very rough terms a 2.0 GHz microprocessor can execute a few hundred lines of code every microsecond. Before a packet is delivered to an endpoint many thousands of instructions have probably already been executed. Then the handling software will spend many thousands more processing the message and then sending a reply. It all can add up to a substantial part of the latency budget. Included in this category are support services like databases, search engines, etc.
  • Frontend. 80-90% of the end-user response time is spent on the frontend, so it makes sense to concentrate efforts there before heroically rewriting the backend.
  • Service Dependency Latency. Dependent components increase latency. If component A calls component B then the latency is the sum of the latency of each component, and overall availability is reduced.
  • Propagation Latency. The speed at which data travels through a link. For fibre optic cable, the rate of signal propagation is roughly two-thirds the speed of light in vacuum. Every 20km adds about 100 microseconds of propagation latency. To reduce latency your only choice is to reduce the distance between endpoints.
  • Transmission Latency. The speed at which data is transmitted on a communication link. On a 1Gbps network a 1000 bit packet takes about one millionth of a second to transmit. It's not dependent on distance. To reduce latency you need a faster link. (A small back-of-the-envelope sketch of both effects follows this list.)
  • Geographical Distribution. BCP (Business Continuity Planning) requires running in multiple datacenters which means added WAN latency constraints.
  • Messaging Latency. The folks at 29west provide a great list of forces that increase message latency: Intermediaries, Garbage Collection, Retransmissions, Reordering, Batching, CPU Scheduling, Socket Buffers, Network Queuing, Network Access Control, Serialization, Speed of Light.
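    As a rough back-of-the-envelope check on the propagation and transmission figures above, here is a minimal sketch that simply turns those approximations into arithmetic (the distances and packet sizes in main are arbitrary examples):

```java
// Rough latency arithmetic for a single link, using the approximations above.
public class LinkLatency {

    // Signal speed in fibre is roughly two-thirds of c, about 200,000 km/s,
    // which works out to ~5 microseconds per km (100 us per 20 km).
    static double propagationMicros(double distanceKm) {
        return distanceKm * 5.0;
    }

    // Transmission time = packet size / link speed.
    static double transmissionMicros(double packetBits, double linkBitsPerSec) {
        return packetBits / linkBitsPerSec * 1_000_000;
    }

    public static void main(String[] args) {
        System.out.printf("1000 km of fibre          : %.0f us propagation%n",
                propagationMicros(1000));
        System.out.printf("1000 bit packet on 1 Gbps : %.1f us transmission%n",
                transmissionMicros(1000, 1e9));
        System.out.printf("1500 byte packet on 1 Gbps: %.1f us transmission%n",
                transmissionMicros(1500 * 8, 1e9));
    }
}
```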

    Draw out the list of every hop a client request takes and the potential number of latency gremlins is quite impressive.

    The Downsides of Latency

    Lower sales may be the terminal condition of latency problems, but the differential diagnosis is made of many and varied ailments. As latency increases work stays queued at all levels of the system, which puts stress everywhere. It's like dementia: the system forgets how to do anything. Some of the problems you may see are: Queues grow; Memory grows; Timeouts cascade; Paging increases; Retries cascade; State machines reset; Locks are held longer; Threads block; Deadlock occurs; Predictability declines; Throughput declines; Messages drop; Quality plummets.

    For a better list take a look at The Many Flavors of System Latency.. along the Critical Path of Peak Performance by Todd Jobson. A great analysis of the subject.

    Managing Latency

    The general algorithm for managing latency is:
  • Continually map, monitor, and characterize all sources of latency.
  • Remove and/or minimize all latency sources that are found.

    Hardly a revelation, but it's actually rare for applications to view their workflow in terms of latency. This is part of the Log Everything All the Time mantra. Time stamp every part of your system. Look at mean latency, standard deviation, and outliers. See if you can't make the mean a little nicer, pinch in that standard deviation, and chop off some of those spikes. With latency, variability is the name of the game, but that doesn't mean variability can't be better controlled and managed. Target your latency slimming efforts where they matter most and you get the most bang for your buck.
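    As a minimal sketch of that kind of bookkeeping (plain JDK code, no particular monitoring library assumed): collect one latency sample per request, then look at the mean, the standard deviation, and the tail.

```java
import java.util.Arrays;

// Summarize a batch of latency samples: mean, standard deviation, 95th percentile, max.
public class LatencyStats {

    public static void report(long[] latenciesMicros) {
        if (latenciesMicros.length == 0) return;
        long[] sorted = latenciesMicros.clone();
        Arrays.sort(sorted);

        double mean = Arrays.stream(sorted).average().orElse(0);
        double variance = Arrays.stream(sorted)
                .mapToDouble(l -> (l - mean) * (l - mean))
                .average().orElse(0);
        long p95 = sorted[Math.max(0, (int) Math.ceil(sorted.length * 0.95) - 1)];

        System.out.printf("mean=%.0fus stddev=%.0fus p95=%dus max=%dus%n",
                mean, Math.sqrt(variance), p95, sorted[sorted.length - 1]);
    }
}
```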

    Next we will talk about various ideas for what you can do about latency once you've found it.

    Dan Pritchett's Lessons for Managing Latency

    Dan Pritchett is one of the few who has openly written on architecting for latency. Here are some of Dan's suggestions for structuring systems to manage latency:
  • Loosely Couple Components. Loose coupling has a number of benefits: tightly coupled systems are impossible to distribute across data centers, tightly coupled systems fail together, and loosely coupled systems can be independently scaled and engineered for latency.
  • Use Asynchronous Interfaces. Set an expectation of asynchronous behavior between components. This allows you to add latency when you need to make changes. Getting users hooked on synchronous low-latency interactions doesn't allow for architectural flexibility. So start from the beginning with asynch semantics (a minimal sketch follows this list).
  • Horizontally Scale from the Start. It's very difficult to change a monolithic schema once you meet a scaling wall. Start with a horizontal architecture so you don't build in too many problems that will be hard to remove later.
  • Create an Active/Active Architecture. Most approaches to BCP take an active/passive approach, only one data center is active at a time. Creating an active/active system, where all data centers operate simultaneously allows users to be served from the closest data center which decreases latency.
  • Use a BASE (basically available, soft state, eventually consistent) Instead of an ACID (atomicity, consistency, isolation, durability) Shared Storage Model. BASE is derived from the CAP Theorem, the highly counterintuitive notion that database services cannot ensure all three of the following properties at once: Consistency, Availability, Partition tolerance. A BASE based system is more tolerant of latency because it is an inherently partitioned and loosely coupled architecture and it uses eventual consistency. With eventual consistency you can make an update to one partition and return. You don't have to coordinate a transaction across multiple database servers, which is what gives a system higher and more variable latency.
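    To make the asynchronous-interface point concrete, here's a minimal sketch (all names hypothetical) of handing work between two loosely coupled components through a queue instead of a blocking call, so the caller never depends on a synchronous low-latency reply:

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Hypothetical sketch: component A submits work and moves on; component B drains the
// queue at its own pace. Because callers never block on B, extra latency (a remote
// data center, a slower store) can be added later without changing the contract.
public class AsyncHandoff {
    private final BlockingQueue<String> requests = new LinkedBlockingQueue<>();

    // Component A: fire-and-forget submission.
    public void submit(String request) {
        requests.offer(request);
    }

    // Component B: a worker thread processes requests whenever it gets to them.
    public void startWorker() {
        Thread worker = new Thread(() -> {
            try {
                while (true) {
                    process(requests.take());
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        worker.setDaemon(true);
        worker.start();
    }

    private void process(String request) {
        // ... component B's actual work goes here ...
    }
}
```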

    Clearly each of these principles is a major topic all on its own. For more details please read Dan Pritchett's excellent papers on managing latency: The Challenges of Latency, Architecting for Latency, and Latency Exists, Cope!.

    GigaSpaces Lessons for Lowering Latency

    GigaSpaces is an in-memory grid vendor and as such is in the vanguard of the RAM is the New Disk style of application building. In this approach disk is pushed aside in favor of keeping all data in RAM. Following this line of logic GigaSpaces came up with these low latency architecture principles:

  • Co-location of the tiers (logic, data, messaging, presentation) on the same physical machine (but with a shared-nothing architecture so that there is minimal communication between machines)
  • Co-location of services on the same machine
  • Maintaining data in memory (caching)
  • Asynch communication to a persistent store and across geographical locations

    The thinking is that the primary source of latency in a system centers around accessing disk. So skip the disk and keep everything in memory. Very logical. As memory is an order of magnitude faster than disk it's hard to argue that latency in such a system wouldn't plummet.

    Latency is minimized because objects are kept in memory and work requests are routed directly to the machine containing the already in-memory object. The object implements the request behavior on the same machine. There's no pulling data from a disk. There isn't even the hit of accessing a cache server. And since all other object requests are also served from in-memory objects we've minimized the Service Dependency Latency problem as well.
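    A minimal sketch of the routing idea (hypothetical names, not GigaSpaces' actual API): hash the object key to pick the partition that already holds the object in memory, and ship the request there, so there's no disk read and no separate cache-server hop:

```java
import java.util.List;

// Hypothetical partition router: requests for a key always go to the node whose
// in-memory space owns that key, so the object is already local when the work runs.
public class PartitionRouter {
    private final List<String> nodes;   // e.g. ["node-0", "node-1", "node-2"]

    public PartitionRouter(List<String> nodes) {
        this.nodes = nodes;
    }

    // Same key, same node, every time: the request is shipped to the data.
    public String nodeFor(String objectKey) {
        int partition = Math.floorMod(objectKey.hashCode(), nodes.size());
        return nodes.get(partition);
    }
}
```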

    GigaSpaces isn't the only player in this market. You might also want to take a look at: ScaleOut Software, GridGain, Terracotta, GemStone, and Coherence. We'll have more on some of these products later.

    Miscellaneous Latency Reduction Ideas

  • Cache. No, really? Well it had to be said. See A Bunch of Great Strategies for Using Memcached and MySQL Better Together.
  • Use a CDN. No, really? See What CDN would you recommend?.
  • Use a Caching Proxy Server. At least this is a little less obvious. See Strategy: Front S3 with a Caching Proxy.
  • Enhance Your Web Operations Capability. There are plenty of available tools to help you pinpoint and correct operation related problems. See Velocity Conference for more information.
  • Use Yslow to Make Your Pages Go. Yslow is a tool to show sources of latency on the client side and suggest ways to fix any problems found. See Yslow to speed up your web pages.
  • Use an Edge DNS Accelerator. This type of service "will ensure that a name server most accessible to the end user will pick up the request and respond." See Edge Acceleration Strategies: Akamai.
  • Optimize Virtual Machines. People often forget VMs exact a performance tax. Virtualized I/O can suffer a substantial performance penalty. See if it can't be tuned.
  • Use Ajax to minimize perceived latency to the user. Clever UI design can make a site feel faster than it really is.
  • Use a faster network. A high speed InfiniBand link can have an end-to end latency of about 1 microsecond. Another option is a 10 GigE network.
  • Scale up. Faster processors means less software induced latency.
  • Optimize firewalls. An often hidden latency enhancer is your firewall system.
  • Use Small Memory Chunks When Using Java. GC in Java kills latency. One way to minimize the impact of garbage collection on latency is to use more VMs with less memory in each VM instead of one VM with a lot of memory. This prevents long GC runs and makes latency more predictable.
  • Use a TCP Offload Engine (TOE). TOE tech offloads the TCP/IP stack from the main CPU and puts it on the network controller. This means network adapters can respond faster which means faster end-to-end communication. Network adapters respond faster because bus wait time is reduced as the number of transactions across the system I/O bus and memory bus are reduced.
  • Design Low Latency Network Topologies. Phil Dykstra, in Issues Impacting Gigabit Networks: Why don't most users experience high data rates?, pinpoints poor network design as one major source of latency: On a single high performance network today, measured latencies are typically ~1.5x - 3x that expected from the speed of light in fiber. This is mostly due to taking longer than line-of-sight paths. Between different networks (via NAPs) latency is usually much worse. Some extra distance is required, based on the availability of fiber routes and interconnects, but much more attention should be given to minimizing latency as we design our network topologies and routing.
  • Make TCP Faster. FastTCP, for example, tweaks TCP to provide smoother and faster data delivery.
  • Copy Data Zero Times. See Efficient data transfer through zero copy. Copying data kills. This excellent article explains the path data takes through the OS and how to reduce the number of copies to the big zero (a minimal sketch follows this list).
  • Increase the speed of light. Warp capability could really help speed up communication. Get to work on that!
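    For the zero-copy item, a minimal JDK sketch using FileChannel.transferTo, which lets the kernel move file bytes straight to a socket without copying them through user-space buffers:

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.nio.channels.FileChannel;
import java.nio.channels.SocketChannel;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

// Send a file over a socket without copying it through user-space buffers.
public class ZeroCopySend {

    public static void send(String path, String host, int port) throws IOException {
        try (FileChannel file = FileChannel.open(Paths.get(path), StandardOpenOption.READ);
             SocketChannel socket = SocketChannel.open(new InetSocketAddress(host, port))) {
            long position = 0;
            long remaining = file.size();
            while (remaining > 0) {
                // transferTo may move fewer bytes than asked for, so loop until done.
                long sent = file.transferTo(position, remaining, socket);
                position += sent;
                remaining -= sent;
            }
        }
    }
}
```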

    Application Server Architecture Matters Again

    With the general move over the past few years to a standard shared nothing two-tierish architecture, discussion of application server architectures has become a neglected topic, mainly because there weren't application servers anymore. Web requests came in, data was retrieved from the database, and results were calculated and returned to the user. No application server. The web server became the application server. This was quite a change from previous architectures, which were more application server oriented. Though they weren't called application servers, they were called daemons or even just servers (as in client-server).

    Let's say we buy into RAM is the New Disk. This means we'll have many persistent processes filled with many objects spread over many boxes. A stream of requests is directed at each process and those requests must be executed in each process. How should those processes be designed?

    Sure, having objects in memory reduces latency, but it's very easy through poor programming practice to lose all of that advantage. And then some. Fortunately we have a ton of literature on how to structure servers. I have a more thorough discussion here in Architecture Discussion. Also take a look at SEDA, an architecture for highly concurrent servers and ACE, an OO network programming toolkit in C++.

    A few general suggestions:
  • Stop Serializing/Deserializing Messages. It boggles my mind why we still serialize and deserialize messages. Leave messages in a binary compressed format and decode only on access (a small sketch follows this list). Few activities waste more CPU, and cause more lock contention in the memory allocator, than serialization.
  • Load Balance Across Read Replicas. The more copies of objects you have the more work you can perform in parallel. Consider keeping object replicas for both high availability and high scalability. This is the same strategy distributed file systems use to handle more load. It works in-memory as well.
  • Don't Block. The goal for a program is to use its whole CPU time quanta when it's scheduled to run. Don't block. Don't give the processor back to the OS for someone else to get it. Block for any reason and your performance tanks because not only do you incur the latency of the operation but there's added rescheduling latency as well. Who knows when your thread will get scheduled again?
  • Minimize Paging. Thrashing is when a system experiences excessive page faults. More work is spent moving memory around than is given to tasks performing real work. It's usually three orders of magnitude slower to access a page from disk than from memory. Unfortunately, memory managers in most languages make reducing paging difficult as you have no control over where memory is placed or how it is used. With CPU speeds these days an operation is basically free when you are operating on paged-in memory.
  • Minimize/Remove locking. Locks add latency and variability to a processing pipeline. A lock is a blocking operation. So you are choosing not to run when you have the CPU which means you incur a number of different forms of latency. Select a server architecture that minimizes the need for locks.
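    A minimal sketch of the decode-on-access idea from the first suggestion. The wire format here is made up for illustration (a 2-byte length prefix followed by UTF-8 bytes at offset 0); the point is that the raw buffer is passed around untouched and a field is decoded only if somebody actually asks for it:

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Hypothetical lazily-decoded message: keep the wire bytes, decode fields on access.
public class LazyMessage {
    private final ByteBuffer raw;   // the message exactly as it arrived
    private String cachedSubject;   // decoded at most once, then reused

    public LazyMessage(byte[] wireBytes) {
        this.raw = ByteBuffer.wrap(wireBytes);
    }

    // Assume (for illustration only) the subject field is a 2-byte length
    // followed by that many UTF-8 bytes, starting at offset 0.
    public String subject() {
        if (cachedSubject == null) {
            int length = raw.getShort(0) & 0xFFFF;
            byte[] bytes = new byte[length];
            ByteBuffer view = raw.duplicate();   // independent position, same bytes
            view.position(2);
            view.get(bytes);
            cachedSubject = new String(bytes, StandardCharsets.UTF_8);
        }
        return cachedSubject;
    }

    // Forward the message downstream without ever decoding it.
    public ByteBuffer rawBytes() {
        return raw.asReadOnlyBuffer();
    }
}
```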

    Colocate

    Locating applications together reduces latency by reducing data hops. The number and location of network hops a message has to travel through is a big part of the end-to-end latency of a system.

    For example, a round trip message from New York to the London Stock Exchange takes 84 milliseconds, from Frankfurt it takes 18 milliseconds, and from Tokyo it takes 208 milliseconds. If you want to minimize latency then the clear strategy is to colocate your service at the London Stock Exchange. Distance is minimized and you can probably use a faster network too.

    Virtualization technology makes it easier than ever to compose separate systems together. Add a cloud infrastructure to that and it becomes almost easy to dramatically lower latencies by colocating applications.

    Minimize the Number of Hops

    Latency increases with each hop in a system. The fewer hops the less latency. So put those hops on a diet. Some hop reducing ideas are:
  • Colocation. Colocation is one hop reducing strategy. It reduces the number of WAN links, routers, etc. that a message has to go through. If a router takes 400 microseconds per packet, for example, getting rid of that router reduces latency. Colocation also works for code and data, as in the GigaSpaces architecture. They maintain a sharded in-memory object cache so an extra database hop is avoided when executing an operation.
  • Simplify Software Architecture. Remove intermediate daemons, brokers and other latency adding components. Dispatch work to where it will be processed as simply and fast as possible. Peer-to-peer architectures and sharding approaches are good at this. Avoid sending work into a hub for central dispatching. Dispatch as far out at the edge as possible.
  • Open a New Datacenter. Facebook opened a new datacenter on the east coast in order to save 70 milliseconds.

    Build Your own Field-programmable Gate Array (FPGA)

    This one may seem a little off the wall, but creating your own custom FPGA may be a killer option for some problems. An FPGA is a semiconductor device containing programmable logic. Typical computer programs are a series of instructions that are loaded and interpreted by a general purpose microprocessor, like the one in your desktop computer. Using an FPGA it's possible to bypass the overhead of a general purpose microprocessor and code your application directly into silicon. For some classes of problems the performance increases can be dramatic.

    FPGAs are programmed with your task specific algorithm, usually something compute intensive like medical imaging, modeling bond yields, cryptography, or matching patterns for deep packet inspection. I/O heavy operations probably won't benefit from FPGAs. Sure, the same algorithm could be run on a standard platform, but the advantage FPGAs have is that even though they may run at relatively low clock rates, they can perform many calculations in parallel. So perhaps orders-of-magnitude more work is being performed each clock cycle. Also, FPGAs often use content addressable memory which provides a significant speedup for indexing, searching, and matching operations. We also may see a move to FPGAs because they use less power. Stay lean and green.

    In embedded projects FPGAs and ASICs (application-specific integrated circuits) are avoided like the plague. If you can get by with an off-the-shelf microprocessor (Intel, AMD, ARM, PPC, etc) you do it. It's a time-to-market issue. Standard microprocessors are, well, standard, so that makes them easy to work with. Operating systems will already have board support packages for standard processors, which makes building a system faster and cheaper. Once custom hardware is involved it becomes a lot of work to support the new chip in hardware and software. Creating a software only solution is much more flexible in a world where constant change rules. Hardware resists change. So does software, but since people think it doesn't we have to act like software is infinitely malleable.

    Sometimes hardware is the way to go. If you are building a NIC that has to process packets at line speed the chances are an off-the-shelf processor won't be cost effective and may not be fast enough. Your typical high end graphics card, for example, is a marvel of engineering. Graphics cards are so powerful these days distributed computation projects like Folding@home get a substantial amount of their processing power from graphics cards. Traditional CPUs are creamed by NVIDIA GeForce GPUs which perform protein-folding simulations up to 140 times faster. The downside is GPUs require very specialized programming, so it's easier to write for a standard CPU and be done with it.

    That same protein folding power can be available to your own applications. ACTIV Financial, for example, uses a custom FPGA for low latency processing of high speed financial data flows. ACTIV's competitors use a traditional commodity box approach where financial data is processed by a large number of commodity servers. Let's say an application takes 12 servers. Using an FPGA the number of servers can be collapsed down to one because more instructions are performed simultaneously, which means fewer machines are needed. Using the FPGA architecture they process 20 times more messages than they did before and have reduced latency from one millisecond down to less than 100 microseconds.

    Part of the performance improvement comes from the high speed main memory and network IO access FPGAs enjoy with the processor. Both Intel and AMD make it relatively easy to connect FPGAs to their chips. Using these mechanisms data moves back and forth between your processing engine and the main processor with minimal latency. In a standard architecture all this communication and manipulation would happen over a network.

    FPGAs are programmed using hardware description languages like Verilog and VHDL. You can't get away from the hardware when programming FPGAs, which is a major bridge to cross for us software types. Many moons ago I took a Verilog FPGA programming class. It's not easy, nothing is ever easy, but it is possible. And for the right problem it might even be worth it.

    Related Articles

  • The Challenges of Latency by Dan Pritchett
  • Latency Exists, Cope! by Dan Pritchett
  • Architecting for Latency by Dan Pritchett
  • BASE: An ACID Alternative by Dan Pritchett, eBay
  • Comet: Sub-Second Latency with 10K+ Concurrent Users by Alexander Olaru
  • It's the Latency, Stupid by Stuart Cheshire
  • Latency and the Quest for Interactivity by Stuart Cheshire
  • The importance of bandwidth versus latency by Dion Almaer
  • Fallacies of Distributed Computing - The second fallacy is "Latency is Zero"
  • List of device bandwidths
  • Computing over a high-latency network means you have to bulk up by Raymond Chen
  • AJAX Latency problems: myth or reality? by Jep Castelein
  • RAM Guide: Part I DRAM and SRAM Basics and Part 2 by Jon "Hannibal" Stokes
  • The Many Flavors of System Latency.. along the Critical Path of Peak Performance and Processors and Performance : Chips, MIPS, and Sizing blips.. by Todd Jobson. A very detailed and helpful analysis of latency sources.
  • Ethernet Latency: The Hidden Performance Killer by Kevin Burton
  • Network latency vs. end-to-end latency by Nati Shalom
  • Low-Latency Delivery Enters Mainstream; But Standard Measurement Remains Elusive by Andrew Delaney
  • The three faces of latency by Scott Parsons, Chief Scientist at Exegy, Inc.
  • Architecture Discussion
  • The JVM needs Value Types - Solving the next bottleneck. Value types use less space, less paging, less memory allocation, better cache usage, better garbage collection profile.
  • Latency, Bandwidth, and Response Times by Chris Loosley.
  • Anatomy of real-time Linux architectures by M. Tim Jones.
  • True Cost of Latency by GemStone
Tuesday
Jun 30, 2009

    Hot New Trend: Linking Clouds Through Cheap IP VPNs Instead of Private Lines 

    You might think major Internet companies have a latency, availability, and bandwidth advantage because they can afford expensive dedicated point-to-point private line networks between their data centers. And you would be right. It's a great advantage. Or at least it was a great advantage. Cost is the great equalizer and companies are now scrambling for ways to cut costs. Many of the most recognizable Internet companies are moving to IP VPNs (Virtual Private Networks) as a much cheaper alternative to private lines. This is a strategy you can effectively use too.

    This trend has historical precedent in the data center. In the same way leading edge companies moved early to virtualize their data centers, leading edge companies are now virtualizing their networks using IP VPNs to build inexpensive private networks over a shared public network. In kindergarten we learned sharing was polite, it turns out sharing can also save a lot of money in both the data center and on the network.

    The line of reasoning for adopting IP VPNs goes something like this:

  • Major companies are saving 1/4 to 1/2 of their networking costs by moving from private lines to IP VPNs. This does not even include the benefits of lower equipment costs (GigE ports are basically free) and more flexible provisioning (any-to-any connectivity, easy bandwidth dialup).
  • Cheaper comes with a cost. Private lines are reliable. The Internet is inherently unreliable, especially when two endpoints are linked by potentially dozens of routers in between. In particular Internet connections suffer from: 1) dropped packets 2) out of order packets. Statistically this may happen for only 1% of packets, but when it does the user experience plummets. To get a feel for the impact imagine you have a 200ms latency link to Europe and you're trying to do something interactive. Lose a packet and you'll have to wait for a retransmission which will take at least 1 second. So IP VPNs can provide an order of magnitude more bandwidth for less money, but they often have less actual throughput and reliability.
  • Since latency and quality are so important to Internet companies, how can they possibly afford to use IP VPNs? They cheat. They fix the IP connection by using WAN accelerators.
  • WAN accelerators are typically thought to be mostly about caching, but they can also trick TCP into giving a better connection even over unreliable networks. It's like wearing corrective lenses for your network. And that's what you need when dropping dedicated lines for Internet connections.
  • Relatively inexpensive WAN accelerators can turn somewhat unreliable Internet connections into a very reliable cost effective connection option. Your customers won't believe it's not butter.
  • The result: lots of money saved and a quality customer experience.

    We take TCP for granted, so learning it has this unsightly packet loss/delay problem is a bit unsettling. Here's the impact packet loss has on throughput (a small sketch that reproduces these numbers follows the list):
  • Latency: 100ms, Loss: 1%, Throughput: 1.2 Mbps
  • Latency: 200ms, Loss: 1%, Throughput: 0.6 Mbps
  • Latency: 100ms, Loss: 0.5%, Throughput: 1.7 Mbps
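    These figures are consistent with the well-known Mathis et al. approximation for steady-state TCP throughput, throughput <= MSS / (RTT * sqrt(loss)), assuming a standard 1460-byte segment; a quick sketch that reproduces them:

```java
// Back-of-the-envelope TCP throughput bound (Mathis et al. approximation):
// throughput <= MSS / (RTT * sqrt(loss)), here with a 1460-byte segment.
public class TcpThroughputBound {

    static double mbps(double rttSeconds, double lossRate) {
        double mssBits = 1460 * 8;
        return mssBits / (rttSeconds * Math.sqrt(lossRate)) / 1_000_000;
    }

    public static void main(String[] args) {
        System.out.printf("100ms RTT, 1%% loss   -> %.1f Mbps%n", mbps(0.100, 0.01));
        System.out.printf("200ms RTT, 1%% loss   -> %.1f Mbps%n", mbps(0.200, 0.01));
        System.out.printf("100ms RTT, 0.5%% loss -> %.1f Mbps%n", mbps(0.100, 0.005));
    }
}
```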

    These numbers are independent of your WAN link capacity. You could have a 100Mbps link with 1% loss and 100ms latency and you're still limited to about 1 Mbps!

    The reason we have this bandwidth robbing state of affairs is that when TCP was designed, packet loss meant network congestion, and the way to deal with congestion is to stop sending data. This drops throughput drastically for a very long time. Over long distance WAN connections packets can be delayed, which looks like packet loss, or a single packet may actually be dropped; either way congestion avoidance measures kick in.

    The trick is convincing TCP that everything is cool so the full connection bandwidth can be used. WAN accelerators have a number of complex features to keep TCP happy. Damon Ennis, VP Product Management and Customer Support for Silver Peak, a WAN accelerator company, talks about why clouds, IP VPNs, and WAN accelerators are a perfect match:
    Moving applications into the cloud offers substantial cost savings for enterprises. Unfortunately those savings come at the cost of application performance. Often performance is so hampered that users’ productivity is severely limited. In extreme cases, users refuse to utilize the cloud-based application altogether and resort to old habits like saving files locally without centralized backup or returning to their old “thick” applications.

    The cloud limits performance because the applications must be accessed over the WAN. WANs are different from LANs in three ways – WAN bandwidth is a fraction of LAN bandwidth, WAN latency is orders of magnitude higher than LAN latency, and packet loss exists on the WAN where none existed on the LAN. Most IT professionals are familiar with the impacts of bandwidth on transfer times – a 100MB file takes approximately 1 second to transfer on a Gbps LAN and approximately 10 seconds to transfer on a 100Mbps LAN. They then extrapolate this thinking to the WAN and assume that it will take 10 seconds to transfer the same file on a 100Mbps WAN. Unfortunately, this isn’t the case. Introduce 100ms of latency and this transfer now takes almost 3 minutes. Introduce just 1% packet loss and this transfer now takes over 10 minutes.

    There’s a calculator available that will let you figure out the effective throughput of your own WAN if you know its bandwidth, latency, and loss. Once you know your effective throughput simply divide 800Mb (100MB) by your effective throughput to determine how long it would take to transfer the same example file over your WAN.

    Latency and loss don’t just impact file transfer times, they also have a dramatic impact on any applications that need to be accessed in real-time over the WAN. In this context a real-time application is one that requires real-time response to users’ keystrokes – think of any application that is served over a thin-client infrastructure or Virtual Desktop Infrastructure (VDI). Not only is the server 100 ms away but any lost packet will result in delays of up to half a second waiting for the loss to be detected and the retransmission to occur. This is the root cause of the frustrated user banging on the enter key looking for a response.
    This all seems like a lot of effort, doesn't it? Why not just dump TCP and move to a better protocol? Sounds good, but everything works on TCP, so to change now would be monumental. And as strange as it seems, TCP is doing its job. It's a protocol, so there's a separation of what's above it from what's below it, which allows innovation at the TCP level without breaking anything. And that's what layering is all about.

    The upshot is with a little planning you can take advantage of much cheaper IP VPN costs, improve latency, and maximize bandwidth usage. Just like the big guys.

    Related Articles

  • Cloud Computing Requires Infrastructure 2.0 by Gregory Ness
  • Myth of Bandwidth and Application Performance by Ameet Dhillon
  • How Does WAN Optimization Work? by Paul Rubens
  • SilverPeak Technology Overview
Wednesday
Jun 24, 2009

    Habits of Highly Scalable Web Applications 

    Nick Belhomme wrote up an excellent summary of a talk given by Eli White on building scalable web applications. Eli worked at digg.com and is now the PHP Community Manager & DevZone Editor-in-Chief at Zend Technologies. Eli takes us on a grand tour through various proven scaling strategies. On the trip you'll visit:

  • What is scalable application design
  • Tip 1: load balancing the webserver
  • Tip 2: scaling from a single DB server to a Master-Slave setup
  • Tip 3: Partitioning, Vertical DB Scaling
  • Tip 4: Partitioning, horizontal DB Scaling
  • Tip 5: Application Level Partitioning
  • Tip 6: Caching to get around your database
  • Resources
Tuesday
Jun 23, 2009

    Learn How to Exploit Multiple Cores for Better Performance and Scalability

    InfoQueue has this excellent talk by Brian Goetz on the new features being added to Java SE 7 that will allow programmers to fully exploit our massively multi-processor future. While the talk is about Java it's really more general than that and there's a lot to learn here for everyone.

    Brian starts with a short, coherent, and compelling explanation of why programmers can't expect to be saved by ever faster CPUs and why we must learn to exploit the strengths of multiple core computers to make our software go faster.

    Some techniques for exploiting multiple cores are given in an equally short, coherent, and compelling explanation of divide and conquer as the secret to multi-core bliss, fork-join, how the Java approach differs from map-reduce, and lots of other juicy topics.

    The multi-core "problem" is only going to get worse. Tilera founder Anant Agarwal estimates by 2017 embedded processors could have 4,096 cores, server CPUs might have 512 cores and desktop chips could use 128 cores. Some disagree saying this is too optimistic, but Agarwal maintains the number of cores will double every 18 months.

    An abstract of the talk follows though I would highly recommend watching the whole thing. Brian does a great job.

    Why is Parallelism More Important Now?

  • Coarse grain concurrency was all the rage for Java 5. The hardware reality has changed. The number of cores is increasing so applications must now search for fine grain parallelism (fork-join)
  • As hardware becomes more parallel, more and more cores, software has to look for techniques to find more and more parallelism to keep the hardware busy.
  • Clock rates have been increasing exponentially over the last 30 years or so. Allowed programmers to be lazy because a faster processor would be released that saved your butt. There wasn't a need to tune programs.
  • That wait-for-a-faster-processor game is up. Around 2003 clock rates stopped increasing. We hit the power wall. Faster processors require more power. Thinner chip conductor lines were required and the thinner lines can't dissipate the increased power without causing overheating, which affects the resistance characteristics of the conductors. So you can't keep increasing clock rate.
  • Fastest Intel CPU 4 or 5 years ago was 3.2 Ghz. Today it's about the same or even slower.
  • Easier to build 2.6 Ghz or 2.8 Ghz chips. Moore's law wasn't repealed so we can cram more transistors on each wafer. So more processing power could be put on a chip which leads to putting more and more processing cores on a chip. This is multicore.
  • Multicore systems are the trend. The number of cores will grow at exponential rate for the next 10 years. 4 cores at the low end. The high end 256 (Sun) and 800 (Azul) core systems.
  • More cores per chip instead of faster chips. Moore's law has been redirected to multicore.
  • The problem is it's harder to make a program go faster on a multicore system. A faster chip will run your program faster. If you have 100 cores your program won't go faster unless you explicitly design it to take advantage of those cores.
  • No free lunch anymore. You must now be able to partition your program so it can run faster by running on multiple cores. And you must be able to keep doing that as the number of cores keeps increasing.
  • We need a way to specify programs so they can be made parallel as topologies change by adding more cores.
  • As hardware evolves platforms must evolve to take advantage of the new hardware. We started off with coarse grain tasks, which was sufficient given the number of cores. This approach won't work as the number of cores increases.
  • Must find finer-grained parallelism. Example: sorting and searching data. The opportunities are around the data. The data for sorting can be chunked, sorted, and then brought together with a merge sort. Searching can be done in parallel by searching subregions of the data and merging the results.
  • Parallel solutions use more CPU in aggregate because of the coordination needed and that data needs to be handled more than once (merge). But the result is faster because it's done in parallel. This adds business value. Faster is better for humans.

    What has Java 7 Added to Support Parallelism?

  • Example problem is to find the max number from a list.
  • The coarse grained threading approach is to use a thread pool, divide up the numbers, and let the task pool compute the sub problems. A shared task pool is slow as the number of tasks increases, which forces the work to be more coarse grained. There's no way to load balance. The code is ugly and doesn't match the problem well. The runtime is dominated by how long it takes the longest subtask to run. You had to decide up front how many pieces to divide the problem into.
  • Solution using divide and conquer. Divide the set into pieces recursively until the problem is so small the sequential solution is more efficient. Sort the pieces. Merge the results. O(n log n), but the problem is parallelizable. Scales well and can keep many CPUs busy (a minimal fork-join sketch follows this list).
  • Divide and conquer uses fork-join to fork off subtasks, wait for them to complete, and then join the results. A typical thread pool solution is not efficient. It creates too many threads, and creating threads is expensive and uses a lot of memory.
  • This approach is portable because it's abstract. It doesn't know how many processors are available. It's independent of the topology.
  • The fork-join pool is optimized for fine grained operations whereas the thread pool is optimized for coarse grained operations. Best used for problems without IO, just computations using CPU that tend to fork off sub problems. Allows data to be shared read-only and used across different computations without copying.
  • This approach scales nearly linearly with the number of hardware threads.
  • The goal for fork-join: Avoid context switches; Have as many threads as hardware threads and keep them all busy; Minimize queue lock contention for data structures. Avoid common task queue.
  • The implementation uses work-stealing. Each thread has a work queue that is a double ended queue. Each thread pulls work from the head of its own queue and processes it. When there's nothing to do it steals work from the tail of another queue. There's no contention for the head because only one thread accesses it, and rare contention on the tail because stealing is infrequent: the stolen work is large and takes time to process. The process starts with one task. It breaks up the work. Other threads steal work and start the same process. It load balances without central coordination, with few context switches and little coordination.
  • The same approach also works for graph traversal, matrix operations, linear algebra, modeling, generate moves and evaluate the result. Latent parallelism can be found in a lot of places once you start looking.
  • Support higher level operations like ParallelArray. Can specify filtering, transformation, and aggregation options. Not a generalized in-memory database, but has a very transparent cost model. It's clear how many parallel operations are happening. Can look at the code and quickly know what's a parallel operation so you will know the cost.
  • Looks like map reduce except this is scaling across a multicore system, one single JVM, whereas map reduce is across a cluster. The strategy is the same: divide and conquer.
  • Idea is to make specifying parallel operations so easy you wouldn't even think of the serial approach.
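    Here's a minimal Java 7 fork-join sketch of the find-the-max example from the talk: split the array until chunks are small enough to scan sequentially, work on one half while forking the other, then join the partial results. The threshold and array size are arbitrary.

```java
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

// Divide and conquer with fork-join: find the max of an array.
public class MaxTask extends RecursiveTask<Integer> {
    private static final int SEQUENTIAL_THRESHOLD = 10_000;
    private final int[] data;
    private final int lo, hi;

    MaxTask(int[] data, int lo, int hi) {
        this.data = data;
        this.lo = lo;
        this.hi = hi;
    }

    @Override
    protected Integer compute() {
        if (hi - lo <= SEQUENTIAL_THRESHOLD) {      // small enough: solve sequentially
            int max = Integer.MIN_VALUE;
            for (int i = lo; i < hi; i++) max = Math.max(max, data[i]);
            return max;
        }
        int mid = (lo + hi) >>> 1;                  // otherwise split in half
        MaxTask left = new MaxTask(data, lo, mid);
        MaxTask right = new MaxTask(data, mid, hi);
        left.fork();                                // left half runs (or is stolen) asynchronously
        int rightMax = right.compute();             // work on the right half in this thread
        return Math.max(left.join(), rightMax);     // wait for the forked half and combine
    }

    public static void main(String[] args) {
        int[] numbers = new java.util.Random(42).ints(1_000_000).toArray();
        int max = new ForkJoinPool().invoke(new MaxTask(numbers, 0, numbers.length));
        System.out.println("max = " + max);
    }
}
```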

    Related Articles

  • The Free Lunch Is Over - A Fundamental Turn Toward Concurrency in Software By Herb Sutter
  • Intuition, Performance, and Scale by Dan Pritchett
  • "Multi-core Mania": A Rebuttal by Ted Neward
  • CPU designers debate multi-core future by Rick Merritt
  • Multicore puts screws to parallel-programming models by Rick Merritt
  • Challenges in Multi-Core Era – Part 1 and Part 2 by Gaston Hillar.
  • Running multiple processes to understand multicore CPUs power.
  • Learning to Program all Over Again by Vineet Gupta
Friday
May 8, 2009

    Eight Best Practices for Building Scalable Systems

    Wille Faler has created an excellent list of best practices for building scalable and high performance systems. Here's a short summary of his points:

  • Offload the database - Avoid hitting the database, and avoid opening transactions or connections unless you absolutely need to use them.
  • What a difference a cache makes - For read heavy applications caching is the easiest way to offload the database.
  • Cache as coarse-grained objects as possible - Coarse-grained objects save CPU and time by requiring fewer reads to assemble objects.
  • Don’t store transient state permanently - Is it really necessary to store your transient data in the database?
  • Location, Location - put things close to where they are supposed to be delivered.
  • Constrain concurrent access to limited resources - it's quicker to let a single thread do the work and finish rather than flooding finite resources with 200 client threads.
  • Staged, asynchronous processing - separate a process using asynchronicity into separate steps mediated by queues and executed by a limited number of workers in each step (a small sketch follows this list).
  • Minimize network chatter - Avoid remote communication if you can as it's slower and less reliable than local computation.
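    As an illustration of the staged, asynchronous processing point (a rough sketch with made-up stage names, not Faler's code): two stages connected by a bounded queue, each with its own small worker pool, so a flood of incoming work queues up instead of swamping the downstream resource.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Two stages joined by a bounded queue; each stage gets its own small, fixed worker pool.
public class StagedPipeline {
    private final BlockingQueue<String> stageTwoInput = new ArrayBlockingQueue<>(1_000);
    private final ExecutorService stageOne = Executors.newFixedThreadPool(4);  // CPU-bound work
    private final ExecutorService stageTwo = Executors.newFixedThreadPool(2);  // scarce resource (e.g. DB)

    public void start() {
        // Stage two workers drain the queue at their own pace.
        for (int i = 0; i < 2; i++) {
            stageTwo.submit(() -> {
                try {
                    while (!Thread.currentThread().isInterrupted()) {
                        persist(stageTwoInput.take());
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
        }
    }

    // Callers hand work to stage one and return immediately.
    public void handleRequest(String request) {
        stageOne.submit(() -> {
            String result = transform(request);   // first stage does its work
            stageTwoInput.offer(result);          // hand off to stage two without blocking the caller
        });
    }

    private String transform(String request) { return request.toUpperCase(); }

    private void persist(String result) { /* write to the database here */ }
}
```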


Friday
Mar 6, 2009

    Cloud Programming Directly Feeds Cost Allocation Back into Software Design

    Update 6: CARS = Cost Aware Runtimes and Services by William Louth.
    Update 5: Damn You Google, Damn You Yahoo! Why D'Ya Do This to Us? Free accounts on a cloud platform are a constant drain of money.
    Update 4: Caching becomes even more important in CPU based billing environments. Avoiding the CPU means saving money.
    Update 3: An interesting simple example of this idea showed up on the Google AppEngine list. With one paging algorithm and one use of AJAX the yearly cost of the site was $1000. By changing those algorithms the site went under quota and became free again. This will make life a lot more interesting for developers.
    Update 2: Business Model Influencing Software Architecture by Brandon Watson. The profitability of your project could disappear overnight on account of code behaving badly.
    Update: Amazon adds Elastic Block Store at $0.10 per 1 million I/O requests. Now I need some cost minimization storage algorithms!

    In the GAE Meetup yesterday a very interesting design rule came up: Design By Explicit Cost Model. A clumsy name I know, but it is explained like this:

     

    If you are going to be charged for an operation GAE wants you to explicitly ask for it. This is why some automatic navigation between objects isn't provided because that will force an explicit query to be written. Writing an explicit query is a sort of EULA for being charged. Click OK in the form of a query and you've indicated that you are prepared to pay for a database operation.

    Usually in programming the costs we talk about are time, space, latency, bandwidth, storage, person hours, etc. Listening to the Google folks talk about how one of their explicit design goals was to require programmers to be mindful of operations that will cost money made me realize in cloud programming cost will be another aspect of design we'll have to factor in.

    Instead of asking for the Big O complexity of an algorithm we'll also have to ask for the Big $ (or Big Euro) notation so we can judge an algorithm by its cost against a particular cloud profile. Maybe something like $(CPU=1.3, DISK=3, IN-BANDWIDTH=2, OUT-BANDWIDTH=3, DB=10). You could look at the Big $ notation for an algorithm and shake your head saying that approach will never work for GAE, but it could work for Amazon. Can we find a cheaper Big $?
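    A toy sketch of what turning a Big $ profile into a number might look like (the resource names and prices below are invented for illustration, not any provider's real rates):

```java
import java.util.Map;

// Toy cost model: multiply each resource an algorithm consumes per month by a
// per-unit price. All names and prices here are illustrative only.
public class BigDollar {

    public static double monthlyCost(Map<String, Double> unitsUsed,
                                     Map<String, Double> pricePerUnit) {
        return unitsUsed.entrySet().stream()
                .mapToDouble(e -> e.getValue() * pricePerUnit.getOrDefault(e.getKey(), 0.0))
                .sum();
    }

    public static void main(String[] args) {
        // Mirrors the made-up profile $(CPU=1.3, DISK=3, IN-BANDWIDTH=2, OUT-BANDWIDTH=3, DB=10).
        Map<String, Double> algorithmA = Map.of(
                "cpu-hours", 1.3, "disk-gb", 3.0, "in-gb", 2.0, "out-gb", 3.0, "db-ops-millions", 10.0);
        Map<String, Double> prices = Map.of(
                "cpu-hours", 0.10, "disk-gb", 0.15, "in-gb", 0.10, "out-gb", 0.17, "db-ops-millions", 0.25);
        System.out.printf("Algorithm A: $%.2f per month%n", monthlyCost(algorithmA, prices));
    }
}
```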

    Typically infrastructure costs are part of the capital budget. Someone ponies up for the hardware and software is then "free" until more infrastructure is needed. The dollar cost of software design isn't usually an explicit factor considered.

    Now software design decisions are part of the operations budget. Every algorithm decision you make will have a dollar cost associated with it, and it may become more important to craft algorithms that minimize operations cost across a large number of resources (CPU, disk, bandwidth, etc) than it is to trade off our old friends space and time.

    Different cloud architectures will force very different design decisions. Under Amazon CPU is cheap whereas under GAE CPU is a scarce commodity. Applications will not be easily ported between the two niches.

    Don't be surprised if soon you go into an interview and they quiz you on Big $ notation and skip the dusty old relic that is Big O notation :-)

Thursday
Mar 5, 2009

    Strategy: In Cloud Computing Systematically Drive Load to the CPU

    Update 2: Linear Bloom Filters by Edward Kmett. A Bloom filter is a novel data structure for approximating membership in a set. A Bloom join conserves network bandwidth by exchanging cheaper, more plentiful local CPU utilization and disk IO.
    Update: What are Amazon EC2 Compute Units? Cloud providers charge for CPU time in voodoo units like "compute units" and "core hours." Geva Perry takes on the quest of figuring out what these mean in real life.

    I attended Sebastian Stadil's AWS Training Camp Saturday and during the class Sebastian brought up a wonderfully counter-intuitive idea: CPU (EC2) costs a lot less than storage (S3, SDB), so you should systematically move as much work as you can to the CPU. This is said to be the Client-Cloud Paradigm. It leverages the well pummeled trend that CPU power follows Moore's Law while storage follows The Great Plains' Law (flat). And what sane computing professional would do battle with Sir Moore and his trusty battle sword of a law?

    Embedded systems often make similar environmental optimizations. CPU rich and memory poor means operate on compressed serialized data structures. Deserialized data structures use a lot of memory, so why use them? It's easy enough to create an object wrapper around a buffer. Programmers shouldn't care how their objects are represented anyway. Yet we waste ginormous amounts of time and memory uselessly transforming XML in and out of different representations. Just transport compressed binary objects around and use them in place. Serialization and deserialization happen only on access (Pimpl Idiom).

    It never occurred to me that in the land of AWS plenty, similar "tricks" would make sense. But EC2 is a loss leader in AWS. CPU is plentiful and cheap. It's IO and storage that cost you... The implication is that in your system design you should try and use EC2 as much as possible:

  • Compress data. Saves on bandwidth and storage (the expensive bits) and uses cheaper CPU to compress/decompress (see the sketch following this list).
  • Slurp data. Latency cost is higher than performing operations locally. SDB can take up to 400 msecs between data centers and 200 msecs inside the same data center. This is very slow. It's usually faster, but it can take that long. Following the more traditional serial processing path of "get a record do a record" will take forever and cost more. Slurp up all your records from SDB and farm them out to your CPU nodes to be worked on in parallel.
  • Think parallel. Do multiple operations at once on your cheap CPUs rather than serially performing high latency operations on expensive storage. With enough nodes, total execution time approaches max latency.
  • Client side joins. Pull all data from the relatively expensive SDB and perform client side joins on relatively cheap EC2 nodes.
  • Leverage SQS. It's a relatively cheap part of the ecosystem. Keeping a work queue in SDB would be far more expensive.

    When all the implications are fully explored it's a little different take on designing a system. I found some interesting numbers in a Slashdot thread comparing values:

    No persistent storage; not great value: And it's still not a great value. It seems cheap. $72/mo for a 1.7GB RAM server. Well, look at Slicehost and you can get a 2GB RAM Xen instance (same virtualization software as EC2) for $140 WITH persistent storage and 800GB of bandwidth. That doesn't sound like a great deal UNTIL you calculate what EC2 bandwidth costs. 800GB would cost you $144 at $0.18 per GB bringing the total cost to $216 ($76 more than Slicehost). That 18 cents doesn't sound like much, but it adds up. The same situation happens with Joyent. For $250 you get a 2GB RAM server from them (running under Solaris' Zones) with 10TB of bandwidth. That would cost you $1,872 with EC2. Even if you assume that you'll only use 10% of what Joyent is giving you, EC2 still comes in at a cost of $252 - and without persistent storage!
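    For the compress-data point, here's a minimal sketch using the JDK's built-in GZIP support to trade cheap CPU for expensive bandwidth and storage before a blob is written out (the actual store, S3 or SDB, is left out of the sketch):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Spend CPU (cheap) to shrink what goes over the wire and into storage (expensive).
public class CompressBeforeStore {

    public static byte[] compress(byte[] plain) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPOutputStream gzip = new GZIPOutputStream(out)) {
            gzip.write(plain);
        }
        return out.toByteArray();
    }

    public static byte[] decompress(byte[] compressed) throws IOException {
        try (GZIPInputStream gzip = new GZIPInputStream(new ByteArrayInputStream(compressed))) {
            return gzip.readAllBytes();   // Java 9+; loop with a buffer on older JDKs
        }
    }
}
```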


Sunday
Feb 22, 2009

    Building and Scaling a Startup on Rails: 12 Things We Learned the Hard Way

    Garry Tan, cofounder of Posterous, lists 12 lessons for scaling that apply to more than just Rails.

  • Use cloud storage for static files.
  • Use HTTP Cache Control to tell the browser what it can cache.
  • Use Sphinx for text search.
  • Use InnoDB for more crash resistant and faster writes.
  • Don't use textbook Rails ActiveRecord objects. Use New Relic to find exactly what is slow in your system.
  • Use memcache later so you find your database bottlenecks now.
  • Use mongrel proctitle to find your slow queries. You are only as fast as your slowest queries.
  • Use asynchronous job queuing to do work in parallel.
  • Use monitoring so you'll know when your site went down and why.
  • Learn by reading the source code, fixing problems, and submitting them back to the community.
  • Use new plugins. Old plugins can't be trusted.
  • Use new information. Old information can't be trusted.


Monday
Feb 16, 2009

    Handle 1 Billion Events Per Day Using a Memory Grid

    Moshe Kaplan of RockeTier shows the life cycle of an affiliate marketing system that starts off as a cub handling one million events per day and ends up a lion handling 200 million to even one billion events per day. The resulting system uses ten commodity servers at a cost of $35,000. Mr. Kaplan's paper is especially interesting because it documents a system architecture evolution we may see a lot more of in the future: database centric --> cache centric --> memory grid. As scaling and performance requirements for complicated operations increase, leaving the entire system in memory starts to make a great deal of sense. Why use cache at all? Why shouldn't your system be all in memory from the start?

    General Approach to Evolving the System to Scale

  • Analyze the system architecture and the main business processes. Detect the main hardware bottlenecks and the related business process causing them. Focus efforts on points of greatest return.
  • Rate the bottlenecks by importance and provide immediate and practical recommendation to improve performance.
  • Implement the recommendations to provide immediate relief to problems. Risk is reduced by avoiding a full rewrite and spending a fortune on more resources.
  • Plan a road map for meeting next generation solutions.
  • Scale up and scale out when redesign is necessary.

    One Million Event Per Day System

  • The events are common advertising system operations like: ad impressions, clicks, and sales.
  • Typical two tier system. Impressions and banner sales are written directly to the database.
  • The immediate goal was to process 2.5 million events per day so something needed to be done.

    2.5 Million Event Per Day System

  • PerfMon was used to check web server and DB performance counters. CPU usage was at 100% at peak usage.
  • Immediate fixes included: tuning SQL queries, implementing stored procedures, using a PHP compiler, removing include files and fixing other programming errors.
  • The changes successfully doubled the performance of the system within 3 months. The next goal was to handle 20 million events per day.

    20 Million Event Per Day System

  • To make this scaling leap a rethinking of how the system worked was in order.
  • The main load of the system was validating inputs in order to prevent forgery.
  • A cache was maintained in the application servers to cut unnecessary database access. The result was 50% reduction in CPU utilization.
  • An in-memory database was used to accumulate transactions over time (impression counting, clicks, sales recording).
  • A periodic process was used to write transactions from the in-memory database to the database server (a rough sketch of this accumulate-and-flush pattern follows this list).
  • This architecture could handle 20 million events using existing hardware.
  • Business projections required a system that could handle 200 million events.
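    A rough sketch of the accumulate-in-memory, flush-periodically pattern described above, using plain JDK classes rather than the in-memory database the paper used; the flush interval and the database write are invented for illustration:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.LongAdder;

// Count events in memory and write the totals to the database on a schedule,
// instead of hitting the database once per impression/click/sale.
public class EventAccumulator {
    private final ConcurrentHashMap<String, LongAdder> counts = new ConcurrentHashMap<>();
    private final ScheduledExecutorService flusher = Executors.newSingleThreadScheduledExecutor();

    public void start() {
        flusher.scheduleAtFixedRate(this::flush, 10, 10, TimeUnit.SECONDS);
    }

    // Called on every event; no database access on this path.
    public void record(String bannerId) {
        counts.computeIfAbsent(bannerId, id -> new LongAdder()).increment();
    }

    private void flush() {
        for (Map.Entry<String, LongAdder> e : counts.entrySet()) {
            long delta = e.getValue().sumThenReset();
            if (delta > 0) {
                writeToDatabase(e.getKey(), delta);   // one batched write per banner per interval
            }
        }
    }

    private void writeToDatabase(String bannerId, long delta) {
        // e.g. UPDATE impressions SET count = count + ? WHERE banner_id = ? -- left abstract here
    }
}
```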

    200 Million Event Per Day System

  • The next architectural evolution was to a scale out grid product. It's not mentioned in the paper but I think GigaSpaces was used.
  • A Layer 7 load balancer is used to route requests to sharded application servers. Each app server supports a different set of banners.
  • Data is still stored in the database as the data is used for statistics, reports, billing, fraud detection and so on.
  • Latency was slashed because logic was separated out of the HTTP request/response loop into a separate process and database persistence is done offline. At this point the architecture supports near-linear scaling and it's projected that it can easily scale to a billion events per day.

    Related Articles

  • GridGain: One Compute Grid, Many Data Grids
