High Scalability -

Entries in problem (4)

Why is your network so slow? Your switch should tell you.

Wednesday, June 1, 2011 at 8:52AM

Who hasn't cursed their network for being slow while waiting for that annoying little hourglass of pain to release all its grains of sand? But what's really going on? Is your network really slow? PacketPushers Show 45 – Arista – EOS Network Software Architecture has a good explanation of what may really be at fault (paraphrased):

Click to read more ...

HighScalability Team

Strategy,

problem

The Implications of Punctuated Scalabilium for Website Architecture

Wednesday, March 11, 2009 at 5:29AM

Update: How do you design and handle peak load on the Cloud? by Cloudiquity gives a formula to try to predict and plan for peak load and talks about how GigaSpaces XAP, Scalr, RightScale and FreedomOSS can be used to handle peak load within EC2.

Theo Schlossnagle, with his usual insight, talks in Dissecting today's surges about how the nature of internet traffic has evolved over time. Traffic now spikes like a heart attack, larger and more quickly than ever, from traffic inflow sources like Digg and The New York Times. Theo relates how "At least eight times in the past month, we've experienced from 100% to 1000% sudden increases in traffic across many of our clients" and how those spikes can happen in as little as 60 seconds. To me this sounds a lot like punctuated equilibrium in evolution, a force that accounts for much creative growth in species...

VMs don't spin up in less than 60 seconds, so your ability to respond to such massive quick spikes is limited. This assumes of course that you've created an architecture that can automatically scale by adding VMs. Such elastic demand is usually met with a reservoir: you keep more VMs in reserve to soak up temporary spikes. But who would do this in reality? Money would be going to non-productive VMs, so you are likely to have already put those VMs into production.

Interestingly, Theo ties handling sudden unexpected spikes back to performance. We are always told performance and scalability are separate issues. And while I accept this notionally, in my heart of hearts I think they have more in common than not, and I think Theo nails why. A well performing system acts as a kind of reservoir for handling spikes before you can ever notice there's a spike. That gives you some time to add more resources to your site if a spike continues. Without that reservoir you are just crushed.

Theo gives four rules for handling spikes: Be alert, Be prepared, Perform triage, and Be calm. Please see his site for more discussion of these rules.

A few things that might help:

Create fast booting VMs. It's easy to create VMs that boot glacially (intentional irony). The more you leave to run-time like software downloads and configuration, the slower your VMs boot and the slower you can react to spikes.

Cloud vendors offer a service to maintain an image cache. It would be useful if a service were offered that could guarantee faster provisioning of VMs and quicker download of images.

Would an in-cloud service to offer stem cell VMs make sense? A stem cell VM is one that could quickly become any one of a number of different images on demand. A service could keep a reservoir of stem cell VMs up and running, shared by a number of customers, and an application could request the low-latency spin-up of one of the reserved VMs.

The idea that internet traffic patterns have evolved such that even our cloud architectures can't easily cope is an interesting one. I find it ironic that many of the techniques needed to build real-time systems are helpful in handling this new world too, when at first glance the problems look nothing alike. Sometimes piling on more resources isn't enough; efficiency matters too.
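The reservoir argument can be made concrete with a little arithmetic. What follows is only a rough sketch, not anything from Theo's post: the function names, the per-VM throughput figure, and the assumption that load spreads evenly across VMs are all made up for illustration.

```python
import math


def warm_vms_needed(baseline_rps, spike_factor, per_vm_rps, running_vms):
    """Return how many extra already-booted VMs are needed to absorb a spike.

    Hypothetical model: every VM serves per_vm_rps requests per second and
    load is spread evenly. spike_factor is a multiplier on baseline traffic,
    so a 100% increase is 2 and a 1000% increase is 11.
    """
    peak_rps = baseline_rps * spike_factor
    required = math.ceil(peak_rps / per_vm_rps)
    return max(0, required - running_vms)


def can_scale_in_time(spike_ramp_seconds, vm_boot_seconds):
    """True only if a freshly requested VM finishes booting before the spike peaks."""
    return vm_boot_seconds < spike_ramp_seconds


if __name__ == "__main__":
    # Assumed numbers: 1000 req/s baseline, 250 req/s per VM, 6 VMs running.
    print(warm_vms_needed(1000, 2, 250, 6))   # a 100% increase
    print(warm_vms_needed(1000, 11, 250, 6))  # a 1000% increase
    print(can_scale_in_time(spike_ramp_seconds=60, vm_boot_seconds=120))
```

If `can_scale_in_time` is false, the only capacity that counts is what is already running, which is the case for both fast-booting images and a shared pool of stem cell VMs.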

Todd Hoff

problem

FaceStat's Rousing Tale of Scaling Woe and Wisdom Won

Monday, June 9, 2008 at 12:58AM

Lukas Biewald shares a fascinating slam-by-slam account of how his FaceStat site (upload your picture and be judged by the masses) was battered by a link on Yahoo's main page that caused an almost instantaneous 650,000 page view jump on the site. Yahoo spends considerable effort making sure its own properties can handle the truly massive flow from the main page. Turning the Great Eye of the Internet towards an unsuspecting newborn site must be quite the diaper-ready experience.

Theo Schlossnagle eerily prophesied such events in The Implications of Punctuated Scalabilium for Website Architecture: massive, unexpected and sudden traffic spikes will become more common as a fickle internet seeks ever for new entertainments (my summary). Exactly FaceStat's situation.

This is also one of our first exposures to an application written on Merb, a popular Ruby on Rails competitor. For those who think Ruby is the problem, their architecture now serves 100 times the original load.

How did our fine FaceStat fellowship fare against Yahoo’s onslaught? Not a lot of details of FaceStat’s architecture are available, so it’s not that kind of post. What interested me is that it’s a timely example of Theo’s traffic spike phenomenon, and I was also taken with how well the team handled the challenge. Few could do better so quickly. In fact, let’s apply Theo’s rubric for how to handle these situations to FaceStat:

Be Alert: build automated systems to detect and pinpoint the cause of these issues quickly (in less than 60 seconds). None initially, but they are building in more monitoring to better handle future situations. Better monitoring would have alerted them to the problems long before they actually were alerted. Perhaps many more potential customers might have been converted to actual customers. You can never have enough monitoring!

Be Prepared: understand the bottlenecks of your service systemically. As the system was relatively simple, new, and quickly changed, my assumption is they were fully aware of their system’s shortcomings; they were just busy adding features rather than worrying about performance and scalability.

Perform Triage: understand the importance of the various services that make up your site. Definitely. They “started ripping out every database intensive feature” in response to the load.

Be Calm: any action that is not analytically driven is a waste of time and energy. They stayed amazingly calm as can be seen from the following quote: “It’s one thing to code scalably and grow slowly under increasing load, but it’s been a blast to crazily rearchitect a live site like FaceStat in a day or two.” I’m not sure how analytically driven they were, however.

All in all, an impressive response to the Great Eye’s undivided attention. But not everyone was as impressed as I was. A commenter named Bernard said: “Sorry, but this is a really dumb story. Given how dirt cheap things like slicehost and linode are, it is crazy that you launched a web app and had not already prepared a redundant, highly-scalable architecture… I’d say you were damn lucky that the disappointed users came back at all.”

Commenter Will thought it was a “Nice problem to be having!” Which it is, of course; being noticed is better than being ignored. But Lukas was spot on when he lamented that being noticed too soon has a downside: “After working so hard to get users to come to your site, it’s amazingly frustrating to see hundreds of thousands of people suddenly locked out.”

Clearly we still don’t have the ability for developers to create scalable systems as simply as they create exploratory systems. Ed from Rackspace posted that they could help with their Auto Scale of Arrays feature. And Rackspace would be an excellent solution, but the cost would be $500/month and a $2500 setup fee. No “let’s put on a show” startup can afford those costs. The mode FaceStat was in is typical: “We find that a Rails-like platform is invaluable for rapidly prototyping a new site, especially since we started FaceStat as a pure experiment with no idea whether people would like it or not, and with a very different feature set in mind compared to what it later became.”

A pay as you grow model is essential for scalability because that’s the only way you can bake scalability in from the start. And even with all the impressive advances in the industry we still don’t have the software infrastructure to make scaling second nature.
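To give a feel for what the Be Alert rule asks for, here is a minimal sketch of a detector that compares each second of traffic against the average of the preceding window. It is an illustration only: the class name, the 60-second window, and the 2x threshold are assumptions, not anything FaceStat or Theo describe.

```python
from collections import deque


class SpikeDetector:
    """Flags a spike when the latest per-second request count exceeds
    `threshold` times the average of the preceding full window."""

    def __init__(self, window_seconds=60, threshold=2.0):
        # Hypothetical defaults; tune both for real traffic.
        self.window = deque(maxlen=window_seconds)
        self.threshold = threshold

    def observe(self, requests_this_second):
        """Record one second of traffic; return True if it looks like a spike."""
        is_spike = False
        # Only judge once a full window of history is available.
        if len(self.window) == self.window.maxlen:
            baseline = sum(self.window) / len(self.window)
            is_spike = baseline > 0 and requests_this_second > self.threshold * baseline
        self.window.append(requests_this_second)
        return is_spike


if __name__ == "__main__":
    detector = SpikeDetector(window_seconds=5, threshold=2.0)
    for count in [100, 100, 100, 100, 100, 110, 500]:
        print(count, detector.observe(count))
```

A detector like this only tells you that something is happening; pinpointing the cause still needs per-feature or per-query metrics.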

Information Sources

Scaling Fast by Lukas Biewald

FaceStat scales! on the Dolores Labs blog

Platform

Merb. Ruby based MVC framework that is ORM-agnostic.

Thin. A fast and very simple Ruby web server.

Slicehost. Hosting service. Able to quickly provision servers as needed.

Amazon’s S3. Image server. Latency is high but it handles the load.

Capistrano. Automated deployment.

Git with GitHub. Source code control system. Supports efficient simultaneous development, quick merging and deployment.

God. Server monitoring and management.

Memcached. Application caching layer.

PostgreSQL

The Stats

Six app servers.

One big database machine.

The Architecture

FaceStat is a write heavy application and performs involved calculations on data.

S3 is used to offload the responsibility for storing images. This freed them from the massive bandwidth requirements and complexity of managing their own images.

Memcached offloads reads from the database to allow the database to have more time for writes.
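The read-offloading idea above is usually implemented as a cache-aside pattern. The sketch below is an assumption about how that might look, not FaceStat's code: `DictCache` is a stand-in for a memcached client, a plain dict stands in for PostgreSQL, and the `profile:` key scheme is invented for the example.

```python
class DictCache:
    """In-memory stand-in for a memcached client (hypothetical interface)."""

    def __init__(self):
        self._data = {}

    def get(self, key):
        return self._data.get(key)

    def set(self, key, value):
        self._data[key] = value

    def delete(self, key):
        self._data.pop(key, None)


class ProfileStore:
    """Cache-aside reads: check the cache first, fall back to the database."""

    def __init__(self, cache, database):
        self.cache = cache
        self.database = database  # a dict standing in for the real database
        self.db_reads = 0         # counts how often a read reaches the database

    def read(self, user_id):
        key = f"profile:{user_id}"
        cached = self.cache.get(key)
        if cached is not None:
            return cached
        self.db_reads += 1
        value = self.database[user_id]
        self.cache.set(key, value)
        return value

    def write(self, user_id, value):
        # Write to the database, then invalidate so the next read refills the cache.
        self.database[user_id] = value
        self.cache.delete(f"profile:{user_id}")


if __name__ == "__main__":
    store = ProfileStore(DictCache(), {1: "first rating"})
    store.read(1)
    store.read(1)
    print(store.db_reads)
```

Repeated reads of the same key hit the database once, which is what leaves the database free to spend its time on writes.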

Lessons Learned

Monitor the site. The sooner you know about a problem the faster it can be fixed. Don't rely on user email or email from exception handlers or you'll never get ahead of problems.

Communicate with your users with an error page. A meaningful error page shows you care and that you are working on the problem. That's enough for a second chance with most people.

Use a cached statically generated homepage. Hard to beat that for performance.

Big sites might want to give a heads up when they mention smaller sites. Just a short polite email saying how your world will soon turn upside down would do.

High-level platform really doesn’t matter compared to overall architecture. How you handle writes, reads, caching, deployment, monitoring, etc. are relatively framework independent, and it's how you solve those problems that matters.

Ruby and Merb supported rapid prototyping to experiment and create a radically different system from the one they intended.
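As one illustration of the cached homepage lesson, the sketch below renders the page at most once per interval and serves the stored HTML in between. It is a simplified in-process version under assumed names (`CachedHomepage`, `max_age_seconds`); a truly static setup would write the HTML to disk and let the web server serve the file directly.

```python
import time


class CachedHomepage:
    """Serves pre-rendered homepage HTML, re-rendering at most once per max_age."""

    def __init__(self, render, max_age_seconds=60, clock=time.monotonic):
        self.render = render          # callable that builds the expensive page
        self.max_age = max_age_seconds
        self.clock = clock            # injectable so the behavior can be tested
        self._html = None
        self._rendered_at = None

    def get(self):
        now = self.clock()
        if self._html is None or now - self._rendered_at >= self.max_age:
            self._html = self.render()
            self._rendered_at = now
        return self._html


if __name__ == "__main__":
    page = CachedHomepage(lambda: "<h1>FaceStat</h1>", max_age_seconds=60)
    print(page.get())
    print(page.get())  # served from the cached copy, no second render
```

The trade-off is staleness: visitors may see a homepage up to `max_age_seconds` old, which is usually acceptable during a spike.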

Todd Hoff

Capistrano,

Example,

Git,

God,

Memcached,

Merb,

S3,

postgresql,

problem,

ruby,

slicehost

Should Twitter be an All-You-Can-Eat Buffet or a Vending Machine?

Tuesday, May 27, 2008 at 7:03AM

Om proposes that one solution to the Twitter Problem is to limit followers to three square meals a day. The reasonable idea being that lower limits should mean fewer scaling problems. And as a kicker, raising those limits is a good way to raise much needed revenue.

Scoble thinks users should consume without limit and will drive to another buffet if all-you-can-eat privileges are revoked. The reasonable idea being that if an internet service can't solve internet scale problems then there's not much use for it.

Dave says comp power users a top floor suite and shower them with free passes to the buffet. Let the good times roll! The reasonable idea being that power users help create popular restaurants, er, services in the first place, and limiting them starves users, and starved users won't come back.

So, should web services like Twitter be a buffet, a fixed eight course fine dining experience, a small plate restaurant, a family style joint, or a vending machine? Or something else entirely?

In a distant, barely remembered past I actually worked at an all-you-can-eat buffet. The food was very good and most customers didn't overindulge. If they had, the place wouldn't have stayed in business long. But some customers did. They were called stackers. Stackers were so named because a large stack of plates would pile up on their table throughout the meal. Stackers followed a power law distribution. Few customers at any one time were stackers, but their effect could be devastating. How devastating depended on their favorite foods...

A stacker who loved potato salad was manageable. We had plenty of potato salad and it was cheap and quick to make. No problem. Stacking itself was not frowned upon and never discouraged. It's an all-you-can-eat buffet after all! But if a stacker's favorite food was roast beef, that was trouble. Not only is roast beef expensive, it comes in a limited supply because it has to be prepared ahead of time. Once you ran out there was no more roast beef for the rest of the night. Good roast beef takes hours to prepare, so it must be planned for.

Management's job was to carefully balance projected demand against waste. The goal was to prepare enough meat to meet demand, yet not have a lot of leftovers. Stackers blow apart the finely balanced calculation of how much roast beef to make, and the carving station is left trying to push the ham while apologizing for an embarrassing lack of roast beef. An ugly, ugly scene. As a carver you are armed with a long scary-looking knife and you are shielded by a medieval chain-mail-looking glove, but hungry customers are mean and fast. You never see it coming.

Unfortunately the distribution of stackers on any given night is unpredictable. You can't always cook a maximum amount of meat or you'll go broke. And if you make too little everyone is unhappy. It needs to be just right. As a person with serious stacker tendencies I try to remember the cost of things and keep a reasonable balance.

The only way to make Goldilocks happy and have just the right balance is to place limits. Eventually the restaurant had to limit the number of trips to the roast beef station to three a meal. Enough that you get value for your dollar, but not so much that the restaurant goes under.

Everyone happy? Of course not. The world doesn't work like that. It's all-you-can-eat, some would say, so I should be able to eat all I can eat! But there are always limits. Would it be fair to back a truck up to the restaurant and start loading up because that's part of your meal? No. Is it fair to stuff your backpack with food on the way out? No. So there are always limits. The question is what are fair limits?

It has been said FriendFeed has no problems handling 10,000 friends so neither should Twitter. Now, imagine I spun up 1000 EC2 servers whose only task was to add more friends to the feed. Would FriendFeed limit me then? Of course. It's basic web site self-defense, a right guaranteed under the constitution and long recognized by the courts in certain situations.

But still, what are fair limits? How much roast beef should you be able to eat?

Limit setting is a strategy we've talked about many times as a way of protecting sites from complete devastation. My favorite example is Mailinator, whose prime directive is surviving attacks and which has deployed many clever practices in its own defense. And most every large web site on earth is busy watching your every move so they can bounce you at the first sign of DDoS Armageddon. Limits aren't inherently bad. But limits don't make you scale, they simply stop you from unscaling. An adequate scalable infrastructure must still be put in place.

In the end I agree with Scoble in that the power of the internet is having interesting conversations with interesting people about interesting topics. For interesting conversations to happen you must be able to freely create relationships. If you or they have to pay for relationships they simply won't form. Would Google's PageRank algorithm work so well if it could only analyze paid relationships? A web formed under a paid relationship model would look totally different and be decidedly less valuable. Similarly, a social network that can't grow naturally through preferential attachment would have much less value. Scaling relationships is a core social network competency. Relationships should be subject to DDoS-type limits, but not limits artificially out of proportion with a user's internet audience. I doubt Twitter would disagree, but they are going through a tough time right now.

I also agree with Om. The Freemium model is a great idea and linking that to site-protecting prophylactics is even better. But limiting a core competency may not be the right target. Fotolog is an example of a service that puts Freemium ideas to good use. They charge extra for adding more photos a day, more comments a day, custom profile abilities, and social status add-ons. What is the equivalent in Twitter? I don't know, but I would try to treat relationships more like potato salad than roast beef.

And I also agree with Dave. It's hard to get noticed on the web. Those who help you storm the attention barrier shouldn't be punished. They should be rewarded with a tasty, appropriately sized meal.
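The three-trips-to-the-roast-beef-station rule maps naturally onto a per-user, fixed-window limit. The sketch below is just one way to picture it; the class name, the one-hour "meal", and the in-memory counter are assumptions for illustration and say nothing about how Twitter, FriendFeed, or Mailinator actually enforce limits.

```python
from collections import defaultdict


class FixedWindowLimiter:
    """Allows each user at most `limit` actions per window of `window_seconds`."""

    def __init__(self, limit, window_seconds):
        self.limit = limit
        self.window_seconds = window_seconds
        # (user, window index) -> actions taken. Old windows are never purged
        # in this sketch; a real service would expire them.
        self._counts = defaultdict(int)

    def allow(self, user, now):
        """Return True and count the action if the user is under the limit."""
        window = int(now // self.window_seconds)
        key = (user, window)
        if self._counts[key] >= self.limit:
            return False
        self._counts[key] += 1
        return True


if __name__ == "__main__":
    # Hypothetical policy: three roast beef trips per one-hour meal.
    roast_beef = FixedWindowLimiter(limit=3, window_seconds=3600)
    for minute in (0, 10, 20, 30):
        print(minute, roast_beef.allow("stacker", now=minute * 60))
```

Potato salad gets no limiter at all, which is the point of the essay: put limits on the scarce resource, not on the thing that makes the service worth visiting.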

Todd Hoff

Strategy,

limit,

problem