Entries by Todd Hoff (380)

Monday
Jul 20, 2009

A Scalability Lament

In Scalability issues for dummies Alex Barrera talks movingly about the challenges he faces trying to scale his startup inkzee as the lone developer. Inkzee is an online news reader that automatically groups similar topics. It's a cool problem, and one you know right away will have some killer scalability problems as the number of feeds and users increases. These problems lead to the point of the post: to explain what scalability problems are and how deep their repercussions run for a small company, which Alex does admirably.

Some takeaways:

  • Sites are composed of a frontend and backend. The backend isn't visible to users, but it does all the work and this is where the scalability problems show up.
  • As more and more users use a site, it becomes slow because the added load reveals bottlenecks in the system that weren't visible before.
  • There can be many, many reasons for these bottlenecks. They are often very hard to find because backend systems are complex and have a lot of complex interactions.
  • It takes a while to find the bottlenecks and create fixes for them. You are never quite sure if the fixes will really work. Many of these problems are unique to the problem space so pre-canned solutions aren't always available. And because you don't want to destroy your production servers it takes a while to put fixes into the system. This means your release cycle is slow which means progress on your site is slow.
  • The process of redesign sucks up all your resources so progress on the site stalls almost completely, especially for a small development group. This stops growth as you can't handle new customers and your existing customers become disenchanted.
  • The process is unfortunately iterative. Solving one problem just puts you in the queue for the next problem. There's a reason it took Twitter a while to get on its feet.
  • Scalability problems aren't something you can dismiss as being ONLY technical; their roots might be technical, but their effects will shake the whole company.

    I found Alex's commentary quite touching and familiar. As I imagine many of you do too. It's the modern equivalent of an explorer following a dream. Going alone into uncharted territory where the Dragons live and trying to survive when everything seems against you. For every great returning hero there are 10 who do not make it back. And that's hard to deal with.

    Everyone will certainly have their ideas on how to "fix" the problem, as that's what engineers do. But it also doesn't hurt to use our Venus brain for a moment and simply recognize the toll this process can take. It can be dispiriting. The continual stream of problems and lack of positive feedback can wear you down after a while. To stick with it takes a bit of craziness in the heart.

    Switching back to being Mars brained I might suggest:
  • Find a partner. There's always a Jerry to go along with Ben. A Martin to go along with Lewis. And Rambo is just a movie. Going it alone is hard hard hard. Partners can pick each other up and give each other a breather when necessary.
  • Use the Cloud. Whatever your opinions of the economics of the cloud, it makes testing of the type Alex needed a breeze. Architect for the cloud and you can spin up a system in parallel, run a load generator, and never touch your production system at all. And do it all for cheap. This is one of the biggest wins for the cloud and would seem to solve a lot of problems.

    Related Articles

  • Scalable Web Architectures and Application State by Marton Trencseni
  • Scalability for dummies by Royans Tharakan
Thursday
Jul 16, 2009

    Scaling Traffic: People Pod Pool of On Demand Self Driving Robotic Cars who Automatically Refuel from Cheap Solar

Update 17: Are Wireless Road Trains the Cure for Traffic Congestion? by Addy Dugdale. The concept of road trains--up to eight vehicles zooming down the road together--has long been considered a faster, safer, and greener way of traveling long distances by car.

    Update 16: The first electric vehicle in the country powered completely by ultracapacitors. The minibus can be fully recharged in fifteen minutes, unlike battery vehicles, which typically take hours to recharge.

    Update 15: How to Make UAVs Fully Autonomous. The Sense-and-Avoid system uses a four-megapixel camera on a pan tilt to detect obstacles from the ground. It puts red boxes around planes and birds, and blue boxes around movement that it determines is not an obstacle (e.g., dust on the lens).
    Update 14: ATNMBL is a concept vehicle for 2040 that represents the end of driving and an alternative approach to car design. Upon entering ATNMBL, you are presented with a simple question: "Where can I take you?" There is no steering wheel, brake pedal or driver's seat. ATNMBL drives for you. Electric powered plus solar assist, with wrap-around seating for seven, ATNMBL offers living and/or working comfort, views, conversations, entertainment, and social connectedness.

    Update 13: The Next Node on the Net? Your Car! A new radio system developed in Australia is transforming the vehicles on the street into nodes on a network.
    Update 12: United Arab Emirates building network of driverless electric taxis. When the system's fully built, planners say the podcars will be able to deliver riders within 100 meters of any location in the city. The whole network of tracks for the cars will be two stories beneath street level.


    Update 11: Self-driving cars set to cut fuel consumption. Large-scale test seeks to put humans in the back seat. NEDO says it will start testing several key technologies that allow for autonomous driving between 2010 and 2012.
    Update 10: Fighting Traffic Jams With Data. Researchers from different universities are working on ways for cars to better communicate with each other and relay crucial driver information such as traffic speed, weather and road conditions.
    Update 9: Accident Ahead? New Software Will Enable Cars To Make Coordinated Avoidance Maneuvers. In dangerous situations, the cars can independently perform coordinated maneuvers without their drivers having to intervene. In this way, they can quickly and safely avoid one another.
    Update 8: Great article in Wired on Better Place's proposal for a new electric car distribution system. The idea is to blanket the country with "smart" charge spots. You buy your car from them and purchase a recharge plan. Profits come from selling electricity.
    Update 7: Capturing solar energy from asphalt pavements. An interesting way to make the system self-sufficient.
    Update 6: Why We Drive the Way We Do Unlocks How to Unclog Traffic. Vanderbilt says: The fundamental problem is that you've got drivers who make user-optimal rather than system-optimal decisions. Josh McHugh replies: Make the packets (cars) dumb and able to take marching orders from traffic routing nodes.
    Update 5: Traffic jams are not caused by flaws in road design but by flaws in human nature. Nearly 80 percent of crashes involve drivers not paying attention for up to three seconds. The thing about computers that is both good and scary is that they always pay attention.
    Update 4: Volvo Says It Will Have An Injury Proof Car By 2020.
    Update 3: Map Reading For Dummies. Europe (again) is developing a system that will read satellite navigation maps and warn the driver of upcoming hazards – sharp bends, dips and accident black spots – which may be invisible to the driver. Even better, the system can update the geographic database. Another key capability of the People Pod system.
    Update 2: Road Safety: The Uncrashable Car? A European research project could lead to a car that is virtually uncrashable. An uncrashable car would definitely ease people's concerns over computerized navigation.
    Update: Shockwave traffic jam recreated for first time - "Pinpointing the causes of shockwave jams is an exercise in psychology more than anything else. 'If they had set up an experiment with robots driving in a perfect circle, flow breakdown would not have occurred. Human error is needed to cause the fluctuations in behaviour.'"

    Traffic in the San Francisco Bay area is like Dolly Parton, 10 pounds in a 5 pound sack. Mass transit has been our unseen traffic woe savior for a while. But the ring of political fire circling the bay has prevented any meaningful region wide transportation solution. As everyone scrambles to live anywhere they can afford, we really need a region wide solution rather than the local fixes that can never go quite far enough. The solution: create a People Pod Pool of On Demand Self Driving Robotic Cars who Automatically Refuel from Cheap Solar.

    Commuters are Satisfied Not Carpooling

    You might think we would carpool more. But people of the bay don't like carpools and they don't much like mass transit either. The Metro, a local weekly, published a wonderful article, Fueling the Fire, on how we need to cure our car addiction using the same marginalization techniques used to "stop" smoking.

    A telling quote shows how difficult going cold turkey off our cars will be:

    Mitch Baer, a public policy and environment graduate student at George Mason University in Virginia, recently surveyed more than 2,000 commuters in the Washington, D.C., area. He found that people who drove to work alone were more emotionally satisfied with their commute than those who rode public transportation or carpooled with others.

    Even stuck in traffic jams, those commuters said they felt they had more control over their arrival and departure times as well as commuting route, radio stations and air conditioning levels.

    Commuters said that driving alone was both quicker and more affordable, according to the study.

    "They will have a tougher time moving people out of their cars," Baer said. "It's easier for most people to drive than take mass transit."

    The key phrase to me is: people who drove to work alone were more emotionally satisfied. How can people jostled in the great pinball machine that is our roadways be emotionally satisfied? That's crazy talk. Shouldn't we feel less satisfied?

    In Our Cars We Feel Good Because We Are in Control

    Solving the mystery of why we feel satisfied while stuck in traffic turns on an important psychological clue: the more we perceive ourselves in control of a situation the less stress we feel. Robert Sapolsky talks about this surprising insight into human nature in Why Zebras Don't Get Ulcers.

    Notice we simply need more "perceived" control. Take control of a situation in your mind and stress goes down. You don't actually need to be in more control of a situation to feel less stress. If you have diabetes, facing your possibly bleak future can be less stressful if you try to control your blood sugars. If you are a speed demon, buying a radar detector can make you feel more in control and less stressed as you zoom along the seldom empty highways. If you are bullied, figuring out ways to avoid your tormentor puts you more in control and therefore less stressed.


    Our Mass Transit System Must Supply Perceived Control

    Given the warm inner glow we feel from being wrapped in the cold steel of our cars, if you want people to get out of their cars and onto mass transit you must provide the same level of perceived control. None of our mass transit options do that now. Buses are on fixed schedules that don't go where I want to go when I want to go. Neither do trains, BART, or light rail. So the car it is. Unless a system could be devised that provided the benefits of mass transit plus the pleasing characteristics of control our cars give us.

    With Recent Technological Advances We Can Create a New Type of Mass Transit System

    New technologies are being developed that will allow us to create a mass transit system that matches our psychological and physical needs. Just berating people and telling them they should take mass transit to save the planet won't work. The pain is too near and the benefits are too far for the mental cost-benefit calculation to go the way of mass transit.

    The technologies I am talking about are:

  • Inexpensive solar with $1/watt solar panels. Our mass transit must of course be green and cost effective.
  • Breakthrough battery could boost electric cars. Toshiba promises 'energy solution' with nearly full recharge in 5 minutes.
  • Personal transportation pods. A reusable vehicle that can take anyone anywhere they want to go.
  • Self driving vehicles. We are making great strides in creating robot cars that can drive themselves in traffic. Already they drive better than most humans can drive (low bar, I know).

    Mix these all together and you get a completely different type of mass transit system. A mashup, if you will.

    Create a People Pod Pool of On Demand Autonomous Self Driving Robotic Cars that Automatically Refuel from Cheap Solar

    Many company campuses offer a pool of bicycles so workers can ride between buildings and make short trips. Some cities even make bikes available to their citizens. The idea is to do the same for cars, but with a twist or two.

    The cars (people pods) can be stored close to demand points and you can call for one anytime you wish. The cars are self-driving. You don't actually drive them and are free to work or play during transit. Different kinds would be available depending on your purpose. Just one person on a shopping trip would receive a different car than a family. The pods would autonomously search out and find energy sources as needed to recharge. There's no reason to assume a centralized charging and storage facility. When repair was needed they could drive themselves to a repair depot or wait for the people pod ambulance service.

    The advantages of such a system are:
  • Perceived control. You have a personal "car" you control the destination for, the interior environment of, and your own actions inside. This gets over the biggest hurdle with current mass transit options.
  • Better regional traffic flow. The autonomous cars could drive cooperatively to smooth out traffic jams. Traffic jams are largely caused by people speeding up and slowing down, which causes ripples of slowness up and down the road. An automated system could prevent that.
  • Go where you want to go. It would be used because people can go to exactly where they need to go and be picked up exactly where they need to leave from at exactly the time they wish. None of these are characteristic of current systems.
  • Leverage existing road ways. Creating light rail and trains is expensive and wasteful (except for the high speed point-to-point variety). They don't extend to where people live and they don't go where people go. So it creates a multi-hop mess out of every trip. We already have an expansive road system that goes where everyone wants to go. Using the road infrastructure more efficiently makes a lot more sense than creating hugely expensive partial solutions. And since these cars would be eco-friendly, most arguments against using cars fall away.
  • Cheaper delivery. One force keeping truly distributed manufacturing and retailing from blossoming is high delivery costs. A $2 item is simply too expensive to buy remotely and ship because shipping costs more than the product. An automated transportation system would make this model more affordable.
  • Live where you want to live. Most mass transit systems are based on trying to socially reengineer our current suburban and exurban living pattern into a high density live-work pattern. While this should be an option, most mass transit proposals assume this pattern as a given and can't deal with current realities. For the foreseeable future people will not give up their houses or their lifestyles. The People Pod approach solves the mass transit problem and the "difficulties" of having to change a whole populace to behave in a completely different way for less than compelling reasons.
  • Still can own your own car. This isn't a replacement for the current car culture. It's leveraging the car culture. You can still own and drive your own car. Nobody is trying to steal your car away from you.
  • Cleaner and safer. Mass transit is disliked by many because it is perceived as dirty and unsafe. The pods would be safe and clean.
  • Road safety. Our new robot overlords will make our lives safer. Hopefully, possibly, maybe...

    Funding

  • Current transportation budgets. Money could be redeployed from existing less than successful approaches.
  • Advertising. The outside of vehicles could contain advertising as could the inside, especially from the internal search system. Imagine wanting a new place to eat and asking the pod to suggest one. That's prime targeted marketing. Social networks and massive multi-player games could also be created between pods.
  • In-flight services. Movies on demand and so on.
  • Efficiencies. The plug-in cars are electric and efficient and low maintenance. That will save money.
  • Up sells. Individuals could buy their own pods and trick them out. Also, people could pay for a higher class of pod from the pod pool.
  • Licensing. Technology used in making the pods could be sold to other manufacturers. Create a standardized market so competition and cooperation can erupt.
  • Sponsorship. Companies could buy rights to play music, stock the food locker, use their equipment, etc.
  • Naming rights. The rights to name parts of the system could be sold.

    Implementation

  • Challenge prize. Maybe someone with a vision and a dream can put up a $50 million prize to get it going. Something like the Xprize.
  • Government funding. Don't laugh, it might happen.
  • Startup. I'm available if interested :-) With a large enough challenge prize this is a viable model.

    It's a Usable System so People Would Use It

    After a lot of reading on the topic and a lot of self-examination on why I am such a horrible person that I don't use mass transit more, this is the type of system I could really see myself using. It doesn't try to change the world, it uses what we got, and gives people what they want. It just might work.
Thursday
Jul 2, 2009

It Must Be Crap on Relational Databases Week

    It's hard to be a relational database lately. After years of faithful service, everywhere you look the world is turning against you:

  • Recently at the NoSQL conference 150 revolutionaries met with their new anti-RDBMS arms suppliers. And you know what happens when revolutionaries are motivated, educated, funded, and well armed.
  • The revolution has gone mainstream when Computerworld writes No to SQL? Anti-database movement gains steam. It's not just whispers anymore, it's everywhere.
  • And perennial revolutionary Michael Stonebraker runs from blog to blog shouting The End of a DBMS Era (Might be Upon Us). Relational vendors are selling legacy software, are 50x slower than other alternatives, and that cannot stand.
  • The Greek Chorus on Hacker News sings of anger and lies.

    Certainly some say stick with the past. It's your fault, you aren't doing it right, give us another chance and all will be as it ever was. Some smirk saying this is nothing but a return to a more ancient time when IBM was King.

    But it's in the air. It's in the code. A revolution is coming. To what? That is what is not yet clear.

    See also:
  • NoSQL? by Curt Monash.
  • CouchDB says BigTable clones are too complex.
  • Yahoo! Developer Network Blog. Very nice summary of the different talks.
Thursday
Jul 2, 2009

Product: HBase

    Update 3: Presentation from the NoSQL Conference: slides, video.
    Update 2: Jim Wilson helps with Understanding HBase and BigTable, explaining them from a "conceptual standpoint."
    Update: InfoQ interview: HBase Leads Discuss Hadoop, BigTable and Distributed Databases. "MapReduce (both Google's and Hadoop's) is ideal for processing huge amounts of data with sizes that would not fit in a traditional database. Neither is appropriate for transaction/single request processing."

    HBase is the open source answer to BigTable, Google's highly scalable distributed database. It is built on top of Hadoop, which implements functionality similar to Google's GFS and Map/Reduce systems.

    Both Google's GFS and Hadoop's HDFS provide a mechanism to reliably store large amounts of data. However, neither really provides a mechanism for organizing the data and accessing only the parts that are of interest to a particular application.

    BigTable (and HBase) provide a means for organizing and efficiently accessing these large data sets.
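
    Jim Wilson's "conceptual standpoint" boils down to this: a BigTable-style store is essentially a sparse, sorted map from (row, column, timestamp) to value. Here's a toy sketch of that data model in Python. It's purely illustrative, not the HBase API:

# Toy sketch of the BigTable/HBase data model: a sparse map of
# (row, column, timestamp) -> value. Illustrative only -- this is
# not the HBase API, just the conceptual structure it exposes.
import time
from collections import defaultdict

class ToyBigTable:
    def __init__(self):
        # row key -> column ("family:qualifier") -> {timestamp: value}
        self.rows = defaultdict(lambda: defaultdict(dict))

    def put(self, row, column, value, ts=None):
        self.rows[row][column][ts or time.time()] = value

    def get(self, row, column):
        """Return the most recent value for a cell, or None."""
        versions = self.rows[row][column]
        return versions[max(versions)] if versions else None

table = ToyBigTable()
table.put("com.example.www", "anchor:homepage", "Example")
print(table.get("com.example.www", "anchor:homepage"))  # Example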

    HBase is still not ready for production, but it's a glimpse into the power that will soon be available to your average website builder.

    Google is of course still way ahead of the game. They have huge core competencies in data center roll out and they will continually improve their stack.

    It will be interesting to see how these sorts of tools along with Software as a Service can be leveraged to create the next generation of systems.

Thursday
Jul 2, 2009

    Hypertable is a New BigTable Clone that Runs on HDFS or KFS

    Update 3: Presentation from the NoSQL conference: slides, video 1, video 2.

    Update 2: The folks at Hypertable would like you to know that Hypertable is now officially sponsored by Baidu, China's leading search engine. As a sponsor of Hypertable, Baidu has committed an industrious team of engineers, numerous servers, and support resources to improve the quality and development of the open source technology.

    Update: InfoQ interview on Hypertable Lead Discusses Hadoop and Distributed Databases. Hypertable differs from HBase in that it is a higher-performance implementation of BigTable.

    Skrentablog gives the heads up on Hypertable, Zvents' open-source BigTable clone. It's written in C++ and can run on top of either HDFS or KFS. Performance looks encouraging: 28M rows of data inserted at a per-node write rate of 7MB/sec.

Thursday
Jul 2, 2009

    Product: Facebook's Cassandra - A Massive Distributed Store

    Update 2: Presentation from the NoSQL conference: slides, video.
    Update: Why you won't be building your killer app on a distributed hash table by Jonathan Ellis. Why I think Cassandra is the most promising of the open-source distributed databases: you get a relatively rich data model and a distribution model that supports efficient range queries. These are not things that can be grafted on top of a simpler DHT foundation, so Cassandra will be useful for a wider variety of applications.

    James Hamilton has published a thorough summary of Facebook's Cassandra, another scalable key-value store for your perusal. It's open source and is described as a "BigTable data model running on a Dynamo-like infrastructure." Cassandra is used in Facebook as an email search system containing 25TB and over 100m mailboxes.

  • Google Code for Cassandra - A Structured Storage System on a P2P Network
  • SIGMOD 2008 Presentation.
  • Video Presentation at Facebook
  • Facebook Engineering Blog for Cassandra
  • Anti-RDBMS: A list of distributed key-value stores
  • Facebook Cassandra Architecture and Design by James Hamilton
Thursday
Jul 2, 2009

    Product: Project Voldemort - A Distributed Database

    Update: Presentation from the NoSQL conference: slides, video 1, video 2.

    Project Voldemort is an open source implementation of the basic parts of Dynamo (Amazon’s Highly Available Key-value Store) distributed key-value storage system. LinkedIn is using it in their production environment for "certain high-scalability storage problems where simple functional partitioning is not sufficient."

    From their website:

  • Data is automatically replicated over multiple servers.
  • Data is automatically partitioned so each server contains only a subset of the total data
  • Server failure is handled transparently
  • Pluggable serialization is supported to allow rich keys and values including lists and tuples with named fields, as well as to integrate with common serialization frameworks like Protocol Buffers, Thrift, and Java Serialization
  • Data items are versioned to maximize data integrity in failure scenarios without compromising availability of the system
  • Each node is independent of other nodes with no central point of failure or coordination
  • Good single node performance: you can expect 10-20k operations per second depending on the machines, the network, and the replication factor
  • Support for pluggable data placement strategies to support things like distribution across data centers that are geographically far apart.

    They also have a nice design page going over some of their architectural choices: key-value store only, no complex queries or joins; consistent hashing is used to assign data to nodes; JSON is used for schema definition; versioning and read-repair for distributed consistency; a strict layered architecture with put, get, and delete as the interface between layers.
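
    To make the consistent hashing choice concrete, here's a minimal hash ring sketch with virtual nodes. This is illustrative only, not Voldemort's actual partitioning code: keys map to the first node clockwise on the ring, so adding or removing a server remaps only a small fraction of keys.

# Minimal consistent hashing sketch (illustrative only). Each server
# is placed at many "virtual node" positions on a hash ring; a key
# belongs to the first virtual node clockwise from its own hash.
import bisect, hashlib

class HashRing:
    def __init__(self, nodes, replicas=100):
        self.ring = {}          # hash position -> node name
        self.sorted_keys = []   # sorted hash positions
        for node in nodes:
            for i in range(replicas):
                h = self._hash(f"{node}:{i}")
                self.ring[h] = node
                bisect.insort(self.sorted_keys, h)

    @staticmethod
    def _hash(key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def node_for(self, key):
        h = self._hash(key)
        idx = bisect.bisect(self.sorted_keys, h) % len(self.sorted_keys)
        return self.ring[self.sorted_keys[idx]]

ring = HashRing(["node-a", "node-b", "node-c"])
print(ring.node_for("user:1234"))  # deterministic node assignment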

    Just a hint when naming a project: don't name it after one of the most popular keywords in muggledom. The only way someone will find your genius via search is with a dark spell. As I am a Good Witch I couldn't find much on Voldemort in the real world. But the idea is great and is very much in line with current thinking on scalable database design. Worth a look.

    Related Articles

  • The CouchDB Project
Tuesday
Jun 30, 2009

    Hot New Trend: Linking Clouds Through Cheap IP VPNs Instead of Private Lines 

    You might think major Internet companies have a latency, availability, and bandwidth advantage because they can afford expensive dedicated point-to-point private line networks between their data centers. And you would be right. It's a great advantage. Or at least it was a great advantage. Cost is the great equalizer and companies are now scrambling for ways to cut costs. Many of the most recognizable Internet companies are moving to IP VPNs (Virtual Private Networks) as a much cheaper alternative to private lines. This is a strategy you can effectively use too.

    This trend has historical precedent in the data center. In the same way leading edge companies moved early to virtualize their data centers, leading edge companies are now virtualizing their networks using IP VPNs to build inexpensive private networks over a shared public network. In kindergarten we learned sharing was polite; it turns out sharing can also save a lot of money in both the data center and on the network.

    The line of reasoning for adopting IP VPNs goes something like this:

  • Major companies are saving 1/4 to 1/2 of their networking costs by moving from private lines to IP VPNs. This does not even include the benefits of lower equipment costs (GigE ports are basically free) and more flexible provisioning (any-to-any connectivity, easy bandwidth dialup).
  • Cheaper comes with a cost. Private lines are reliable. The Internet is inherently unreliable, especially when two endpoints are linked by potentially dozens of routers in between. In particular Internet connections suffer from: 1) dropped packets 2) out of order packets. Statistically this may happen for only 1% of packets, but when it does the user experience plummets. To get a feel for the impact imagine you have a 200ms latency link to Europe and you're trying to do something interactive. Lose a packet and you'll have to wait for a retransmission which will take at least 1 second. So IP VPNs can provide an order of magnitude more bandwidth for less money, but they often have less actual throughput and reliability.
  • Since latency and quality are so important to Internet companies, how can they possibly afford to use IP VPNs? They cheat. They fix the IP connection by using WAN accelerators.
  • WAN accelerators are typically thought to be mostly about caching, but they can also trick TCP into giving a better connection even over unreliable networks. It's like wearing corrective lenses for your network. And that's what you need when dropping dedicated lines for Internet connections.
  • Relatively inexpensive WAN accelerators can turn somewhat unreliable Internet connections into a very reliable cost effective connection option. Your customers won't believe it's not butter.
  • The result: lots of money saved and a quality customer experience.

    We take TCP for granted so to learn it has this unsightly packet loss/delay problem is a bit unsettling. But here's the impact packet loss has on throughput:
  • Latency: 100ms, Loss: 1%, Throughput: 1.2 Mbps
  • Latency: 200ms, Loss: 1%, Throughput: .6 Mbps
  • Latency: 100ms, Loss: .5%, Throughput: 1.7 Mbps

    These numbers are independent of your WAN link capacity. You could have a 100Mbps link with 1% loss and 100ms latency and you're limited to 1Mbps!
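
    These figures line up with the well-known Mathis et al. approximation for TCP throughput: throughput is roughly MSS / (RTT * sqrt(loss)). A quick sketch of the arithmetic, assuming a standard 1460-byte MSS (real TCP stacks vary, so treat the output as ballpark):

# Back-of-the-envelope TCP throughput estimate using the Mathis
# et al. approximation: throughput <= (MSS / RTT) * (C / sqrt(loss)).
# Assumes a 1460-byte MSS and C ~= 1; real stacks differ, so treat
# these as ballpark figures like the ones quoted above.
from math import sqrt

def tcp_throughput_mbps(rtt_ms, loss, mss_bytes=1460, c=1.0):
    rtt_s = rtt_ms / 1000.0
    bits_per_sec = (mss_bytes * 8 * c) / (rtt_s * sqrt(loss))
    return bits_per_sec / 1e6

for rtt, loss in [(100, 0.01), (200, 0.01), (100, 0.005)]:
    print(f"{rtt}ms, {loss:.1%} loss -> {tcp_throughput_mbps(rtt, loss):.1f} Mbps")

    Running it reproduces the numbers above: roughly 1.2, 0.6, and 1.7 Mbps, no matter how fat the pipe is.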

    The reason we have this bandwidth-robbing state of affairs is that when TCP was designed, packet loss meant network congestion, and the way to deal with congestion is to stop sending data. This drops throughput drastically for a very long time. Over long-distance WAN connections packets can be delayed, which looks like packet loss, and congestion avoidance measures kick in. Or only a single packet was dropped, and that too kicks in congestion avoidance.

    The trick is convincing TCP that everything is cool so the full connection bandwidth can be used. WAN accelerators have a number of complex features to keep TCP happy. Damon Ennis, VP Product Management and Customer Support for Silver Peak, a WAN accelerator company, talks about why clouds, IP VPNs, and WAN accelerators are a perfect match:
    Moving applications into the cloud offers substantial cost savings for enterprises. Unfortunately those savings come at the cost of application performance. Often performance is so hampered that users’ productivity is severely limited. In extreme cases, users refuse to utilize the cloud-based application altogether and resort to old habits like saving files locally without centralized backup or returning to their old “thick” applications.

    The cloud limits performance because the applications must be accessed over the WAN. WANs are different from LANs in three ways – WAN bandwidth is a fraction of LAN bandwidth, WAN latency is orders of magnitude higher than LAN latency, and packet loss exists on the WAN where none existed on the LAN. Most IT professionals are familiar with the impacts of bandwidth on transfer times – a 100MB file takes approximately 1 second to transfer on a Gbps LAN and approximately 10 seconds to transfer on a 100Mbps LAN. They then extrapolate this thinking to the WAN and assume that it will take 10 seconds to transfer the same file on a 100Mbps WAN. Unfortunately, this isn’t the case. Introduce 100ms of latency and this transfer now takes almost 3 minutes. Introduce just 1% packet loss and this transfer now takes over 10 minutes.

    There’s a calculator available that will let you figure out the effective throughput of your own WAN if you know its bandwidth, latency, and loss. Once you know your effective throughput simply divide 800Mb (100MB) by your effective throughput to determine how long it would take to transfer the same example file over your WAN.

    Latency and loss don’t just impact file transfer times, they also have a dramatic impact on any applications that need to be accessed in real-time over the WAN. In this context a real-time application is one that requires real-time response to users’ keystrokes – think of any application that is served over a thin-client infrastructure or Virtual Desktop Infrastructure (VDI). Not only is the server 100 ms away but any lost packet will result in delays of up to half a second waiting for the loss to be detected and the retransmission to occur. This is the root cause of the frustrated user banging on the enter key looking for a response.
    This all seems like a lot of effort, doesn't it? Why not just dump TCP and move to a better protocol? Sounds good, but everything works on TCP, so to change now would be monumental. And as strange as it seems, TCP is doing its job. It's a protocol, so there's a separation of what's above it from what's below it, which allows innovation at the TCP level without breaking anything. And that's what layering is all about.

    The upshot is with a little planning you can take advantage of much cheaper IP VPN costs, improve latency, and maximize bandwidth usage. Just like the big guys.

    Related Articles

  • Cloud Computing Requires Infrastructure 2.0 by Gregory Ness
  • Myth of Bandwidth and Application Performance by Ameet Dhillon
  • How Does WAN Optimization Work? by Paul Rubens
  • SilverPeak Technology Overview
Monday
Jun 29, 2009

    HighScalability Rated #3 Blog for Developers

    Hey we're moving up in the world, jumping from 19th place to 3rd place. In case you aren't sure what I'm talking about, Jurgen Appelo goes through this massive effort of ranking blogs according to Google PageRank, Technorati Authority, Alexa Rank, Google links, and Twitter Grader Rank.

    Through some obviously mistaken calculations HighScalability comes out #3. Given all the superb competition I'm not exactly sure how that can be. Well, thanks to all the excellent people who contribute and all the even more excellent people who read. Now at least I have something worthy to put on my tombstone :-)

Monday
Jun 29, 2009

    How to Succeed at Capacity Planning Without Really Trying : An Interview with Flickr's John Allspaw on His New Book


    Update 2: Velocity 09: John Allspaw, 10+ Deploys Per Day: Dev and Ops Cooperation at Flickr. Insightful talk. Some highlights: Change is good if you can build tools and culture to lower the risk of change. Operations and developers need to become of one mind and respect each other. An automated infrastructure is the one tool you need most. Common source control. One step build. One step deploy. Don't be a pussy, deploy. Always ship trunk. Feature flags - don't branch code, make features runtime configurable in code. Dark launch - release data paths early without UI component. Shared metrics. Adaptive feedback to prioritize important features. IRC for communication for human context. Best solutions occur when dev and op work together and trust each other. Trust is earned by helping each other solve their problems. Look at what new features imply for operations, what can go wrong, and how to recover. Provide knobs and levers to help operations. Devs should have access to production machines. Fire drills to train. No finger pointing - fix stuff first. Design like you'll get woken up first when there's a problem. Say you're sorry. Not easy - like any relationship.
    Update: Operational Efficiency Hacks, Web 2.0 Expo 2009, by John Allspaw. 131 picture perfect slides on operations porn. If you're interested in that kind of thing.

    Dream with me a little bit. Your startup becomes wildly successful. Hard work and random chance have smiled on you. To keep flirting with lady luck your system must scale. But how much stuff (space, hardware, software, etc) will you need to handle the growth, when will you need it and when will you need more?

    That's what Flickr's John Allspaw helps you figure out in his ground breaking new book on capacity planning: The Art of Capacity Planning: Scaling Web Resources.

    When I read statements about The Art of Capacity Planning like "capacity planning is a term that to me means paying attention," "all the information you need to make an educated forecast is in your historical metrics," and "startups that are going to experience massive growth simply don't have time for anything but a 'steering by your wake' approach," I get the same sea change feeling I felt when the industry ran from waterfall design and embraced agile design. Current capacity planning is heavy. All up-front. Too analytical and too divorced from real life.

    Other capacity planning books assault you with models, math, and simulations. Who has the time? John has developed a common sense, low math approach to capacity planning that works using the system you already have. John's goal is to have you say: Oh, right, duh. That's common sense, not voodoo.

    Here's my email interview with John Allspaw on The Art of Capacity Planning. Enjoy.

    Please tell us who you are and what you've brought to show and tell today?

    I'm John Allspaw. I manage the Operations team at Flickr.com, and I've written a book (The Art of Capacity Planning: Scaling Web Resources) about capacity planning for growing websites.

    After spending a good chunk of your life writing this book, can you summarize it in just a few sentences so people will know why it should matter to them?


    This book is basically a guide to adaptive capacity planning for growing websites. It's an approach that relies much less on benchmarking and simulation than on the close observation of production loads to guide future decisions. It's not rocket science, and I'm hoping people can use it to justify the what, why, and when of getting more resources to allow them to grow as fast as they need to. It's worked really well for me at Flickr and other organizations.

    Give me your ripple of evil. What happens without capacity planning?

    Capacity planning is a term that to me means paying attention. Web applications can fail in all sorts of dramatic ways, and you're not going to foresee all of them. What you can do, however, is make use of what you do know about what happens in your real world on a regular basis. Things like: my database can do X queries per second before it keels over. Or my cache can only keep Y minutes worth of changing objects. You're not going to predict every failure mode of the whole system, but knowing the failure modes of individual pieces should be considered mandatory. Armed with that, you can make decent forecasts about the future.

    I'm a guy or gal at a startup who's freaking out because my boss asked me how much hardware we need for the next quarter/year? What do I do now?

    Buy my book? ☺ All the information you need to make an educated forecast is in your historical metrics. You do have system and application-level statistics, right? Tying your system level stats (CPU, memory, network, storage, etc.) to application-level metrics (users, posts, photos, videos, page views, widgets sold, etc.) is key, because then you have history to back up the guesstimates. Your business, product or marketing team also has their own guesses for some of those application-level metrics, so you should get the two forecasts together and see where/how they match up or differ. Capacity should enable the business, not hinder it.
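
    To make that concrete, here's what tying system stats to application metrics can look like arithmetically. The numbers and metric names below are hypothetical, not Flickr's; the pattern is: measure the resource cost per unit of application work, then scale by the business forecast.

# Hypothetical example of tying system-level stats to application-level
# metrics: measure how much resource each unit of app work costs today,
# then scale by the business team's forecast to size hardware.
current_uploads_per_day = 500_000
current_db_cpu_pct = 40           # measured average at current load
server_count = 8

cpu_per_upload = (current_db_cpu_pct * server_count) / current_uploads_per_day

forecast_uploads_per_day = 1_200_000   # the marketing team's guess
target_cpu_pct = 60                    # leave headroom on each server

needed_servers = (cpu_per_upload * forecast_uploads_per_day) / target_cpu_pct
print(f"~{needed_servers:.0f} DB servers for {forecast_uploads_per_day:,} uploads/day")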

    So your book, is it the best book on Capacity Planning or the greatest book on Capacity Planning?

    My book is the best book on capacity planning. If it were the 'greatest' book, it'd be a lot bigger than it is. It's good for scaling your hardware. It's not good for flattening out posters.

    My site is gonna explode and I don't know at what point it's going to die. What do I do now?

    Argh! Panic! Handle the current explosion, and make it priority one to find out the limits of your capacity when the emergency is over.

    Try to panic gracefully. If it's dying right now, and you can't easily add any more capacity, then try some of the tried and trusted WebOps 101 tricks mentioned everywhere, including this blog:
    - disable features (preferably the heavier load-causing ones)
    - cache previously dynamic content into static bits
    - avoid loaded backend calls by serving stale content
    Of course it's easier to do those things when you have easy config flags to turn things on or off, and a list to run through of what things are acceptable to serve stale and static. We currently have about 195 'features' we can turn off at Flickr in dire circumstances. And we've used those flags when we needed to.
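
    Those config flags are a simple pattern to build. A minimal sketch of the idea follows; the flag names and storage are hypothetical, not Flickr's actual mechanism:

# Minimal feature-flag sketch of the "turn things off under load"
# pattern. The flag store here is an in-process dict for illustration;
# a real system would read flags from shared config at runtime.
FLAGS = {
    "related_photos": True,   # heavy DB work -- first to go in a crunch
    "realtime_stats": True,
    "search_suggest": True,
}

def feature_enabled(name):
    return FLAGS.get(name, False)

def expensive_related_query(photo_id):
    return f"live related photos for {photo_id}"   # stands in for heavy DB work

def cached_or_static_fallback(photo_id):
    return "related photos temporarily unavailable"  # stale/static content

def render_sidebar(photo_id):
    if feature_enabled("related_photos"):
        return expensive_related_query(photo_id)
    return cached_or_static_fallback(photo_id)

# In an emergency, ops flips the flag at runtime instead of deploying:
FLAGS["related_photos"] = False
print(render_sidebar(42))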

    Having said all of that, knowing when your resources are going to die should be mandatory, not optional. Know how many qps your databases, webservers, caching systems, and storage can handle before degradation. Test that stuff. With production traffic. If at all possible, with live production traffic, not just recorded and replayed production loads.

    How do you compare your approach with well known approaches like Guerrilla Capacity Planning? Isn't focusing on queuing theory, Little's Law, and Poisson arrival rates misguided? What will really help people on the ground?

    My approach is a bit different from Mr. Gunther's, although his book was an inspiration for mine. I do think that having a general understanding of queuing theory and the mathematics of open and closed systems can be important, don't get me wrong. But many of the startups that are going to experience massive growth simply don't have time for anything but a 'steering by your wake' approach, and done right, I think that approach will serve them well. I think some people recognize this, but in my experience, I'd say the development timelines are even tighter than most people realize.

    Almost any time and effort in constructing and running a simulation, model, or benchmark that involves the main moving parts of a web site's back-end is pretty much wasted due to how quickly application logic, use cases, and even hardware configurations can change. For example, by the time I could construct a useful model to capture the webserver-database interactions for Flickr, the results won't resemble production load, since the development cycle we have is so tight. As I write this, in the last week there were 50 code deploys of 550 changes by 19 Flickr staff. I realize that's a lot more than a lot of organizations, but that rate of change requires that the capacity planning process is easily adjustable.

    I also found the existing books on the topic to be pretty dry with the math. The foundations of queuing theory are interesting, but a proof of Little's Law isn't going to quickly justify to my finance guy that we need 11 more webservers.

    I have a cloud, do I really need to capacity plan anymore? That's so old school. Can't I decrease my administrator-to-server ratio and reduce the cost of managing systems by removing capacity planning resources?

    Nope, just trust The Cloud™ that everything will be all right. No need to pay any attention at all. Oh wait, that's a magic unicorn talking. Cloud services are a resource just like any in-house capacity: they have limits that should be paid attention to. The best use cases of cloud computing both recognize the benefits (shrinking 'procurement' times, for example) and the limitations. And those limitations are going to vary from application to application, organization to organization.

    You can build a real brand around a name like "Guerrilla Capacity Planning." Don't you think you could have come up with a better name for your approach?

    Yeah, I guess I should have a name for it. How about "Paying Attention Capacity Planning"?

    You recommend testing the limits of your hardware and software on a production site. Are you crazy?

    It's very possible that I'm crazy. But, using production traffic to define your resource ceilings in a controlled setting allows you to see firsthand what would happen when you run out of capacity in a particular resource. Of course I'm not suggesting that you run your site into the ground, but better to know what your real (not simulated) loads are while you're watching, than find out the hard way. In addition, a lot of unexpected systemic things can happen when load increases in a particular cluster or resource, and playing "find the butterfly effect" is a worthwhile exercise.

    Where does capacity planning sit in the stack? It seems to impact the entire application stack, but capacity planning is more of an operations thing and that may not have much impact in engineering.

    Capacity planning is a process, and should sit in the stack along with the other things that happen on a regular basis, like bug scrubs, like weekly meetings, etc. I'm spoiled as an Ops manager because we've got developers at Flickr who very much think like operations people. We're all addicted to graphs, and we're all pretty aware of how close each piece of the infrastructure is to becoming too 'hot', and we act accordingly. Planning, procurement, and measurement of capacity might lie on operations' shoulders, but intelligent use of it is everyone's business. I'm also blessed to have product management and customer care teams who are also aware of the state of capacity. As projects pop up that warrant capacity to be part of the considerations, it is. You simply can't launch things like FlickrVideo and the Yahoo Photos migration without capacity being part of the requirements, so we've got a good feedback loop going with respect to operations and capacity.

    Studies say 4/5 of people like a trip to the dentist more than they like capacity planning. Why does capacity planning have such a bad rep?

    Because no one wants to guess and be wrong? Because most of the cap planning literature out there is filled with boring math?

    Does the move to Web 2.0 make traditional approaches to capacity planning more difficult?

    I would say yes, but of course I'm biased. In many ways, use of the site is simply out of our hands. We gather metrics on as many aspects of the site as we can get our hands on, and add more all the time. Having an open API makes things interesting, because the use cases can vary wildly. Out of nowhere, a simple application can reveal an edge case that we hadn't foreseen, so we might have to adjust forecasts quickly. As to it being more difficult: I don't know. When I'm wrong, people can't see photos. If the guy at wellsfargo.com is wrong, then people can't get their money. What's more annoying? :)

    I really like your dashboard. Most dashboards say what happened, yours has an estimated days left before capacity runs out, which makes it actionable. How accurate have those numbers been?

    Well that particular dashboard is still being built here, but the general design is to capture those metrics on a regular basis, and to use it as a guide. The numbers for some clusters have been pretty trustworthy. But they do fluctuate in how accurate they are, since "days left" are obviously affected by things outside of the forecasting process. Sometimes, a new feature is planned that requires pounding the database shards, or bumps CPU on the webservers by a noticeable amount. It's because of these things that the dashboard is only a piece of the puzzle. Awareness and communications with product management and development have to inform planning decisions, as well as the historic metrics. Again: no crystal balls here.

    One of the equations you use is "UR LIMITZ = Ceiling * Factor of Safety". Theo Schlossnagle has noted the evolution of phenomenal spikes in traffic, even for large sites. How do you have enough capacity as a safety factor if traffic is so spikey? What are the implications for forecasting?

    Start with the assumption that no one can accurately predict the future. Capacity planning isn't the only part of successful web operations, and not everything is a capacity issue. Theo nails what happens in the real world, for sure. The answer is: you estimate the best you can with the history that you have. Obviously, one mistake is to make forecasts ignoring the spikes you've experienced in the past. Something important to consider is how expected (or not) your spikes are. If you sell things, you might have a seasonal holiday spike or plateau in traffic. If you're a content site that frequently gains traction with news-related sites (like Theo's example) and from time to time, you get massive unexpected spikes, then your factor of safety should include considerations for those spikes. But of course the flipside of that might mean that you have a boatload of servers doing nothing except wasting money while waiting for massive spikes that may or may not come. So it's a balance. Be reasonable with planning, follow the four guidelines that Theo points out in his post, and you've got yourself a strategy to deal with those unexpected spikes.
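
    For the curious, the arithmetic behind a "days left" dashboard combined with that safety factor equation can be sketched in a few lines: fit a trend to historical daily peaks, apply a factor of safety to the measured ceiling, and see when the trend crosses it. The numbers are made up for illustration; this is not the dashboard Flickr built.

# Sketch of the "days left" arithmetic behind a capacity dashboard.
# Fits a least-squares line to recent daily peaks, applies a factor
# of safety to the measured ceiling, and projects the crossing date.
def days_until_limit(daily_peaks, ceiling, safety_factor=0.8):
    """daily_peaks: recent peak utilization samples (>= 2), one per day."""
    n = len(daily_peaks)
    if n < 2:
        return None
    xs = range(n)
    x_mean = sum(xs) / n
    y_mean = sum(daily_peaks) / n
    slope = sum((x - x_mean) * (y - y_mean)
                for x, y in zip(xs, daily_peaks)) / sum((x - x_mean) ** 2 for x in xs)
    limit = ceiling * safety_factor      # "UR LIMITZ = Ceiling * Factor of Safety"
    if slope <= 0:
        return None                      # flat or shrinking: no projected date
    today = y_mean + slope * (n - 1 - x_mean)  # fitted value for the latest day
    return max(0, (limit - today) / slope)

peaks = [610, 625, 640, 660, 672, 690, 705]   # e.g. peak DB queries/sec each day
print(days_until_limit(peaks, ceiling=1000))  # days before the trend hits 800 qps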

    At what point does it make sense for me to buy my own equipment?

    It's a difficult question. I've seen groups go from self-run colocation to managed hosting and cloud services, and others go the opposite way, for all legit reasons. Just as there are limitations with managed hosting, those limitations might not matter until you're massive. I do believe that there is a point at which owning your own equipment makes a lot of sense, because you've blown past the average needs of a managed hosting or cloud customer. Your TCO might be a lot different than mine, so to state that there's a single point for everyone would just be dumb.

    Quick fire round. What's your quick reaction to:

    1. Let's just go with a working prototype for now. We can change it when we grow big.

    Fine, for some definitions of "working", "change", and "big". I can't complain too much about that approach in some sense because that's how a lot of Flickr's backend evolved. But on the other hand, all you need is to fail a couple of times with prototypes to figure out what sort of homework needs to be done before launching something that is potentially explosive. So again, there's a balance. Don't be lazy, but don't be rigid and too fearful.

    2. Our VC told us that we're worrying about scalability too early. They don't want us to blow our scarce resources on preparing for success.

    Your VC is smart. If they're really smart, they'll also suggest worrying about scalability before it's too late. This answer isn't a cop-out, it's just reality. Realize that being scalable means being able to easily add capacity wherever you need it, whenever you need it. Buying too much equipment too soon is just as insane as hiring people that sit around and do nothing. The trick is knowing what "too much" and "too soon" is, and it's going to differ from company to company, from product to product.

    3. Premature optimization is the root of all evil.

    I've not made it a secret that I think this quote has given many an engineer (both dev and ops) reasoning to make dumb decisions. I actually agree with the idea behind the quote, but I also think that it's another one of those balancing acts. Knuth (or Hoare) also said "We should forget about small efficiencies, say about 97% of the time…" Another good question might be: which 3% are you going to pay attention to?

    4. We plan to throw hardware at the problem.

    Ok. When does that hardware need to come, how much of it, and what will you do when you realize more hardware won't fix a particular problem? The now classic limitations of the single-write-master, many-read-slaves database architecture is a great example of this. Experienced devs and ops people consider the possibility that hardware can't solve all problems.

    Note: these questions were adapted from How Important Is Being Scalable?.

    Now that you have some free time again, what are you going to do with your life?

    Eat. Sleep. Pay attention to my family. ☺

    Related Articles

  • Capacity Management for Web Operations by John Allspaw
