Skype Failed the Boot Scalability Test: Is P2P fundamentally flawed?

Skype's 220 million users lost service for a stunning two days. The primary cause of Skype's nightmare (can you imagine the beeper storm that went off?) was a massive global roll-out of a Windows patch that triggered the simultaneous reboot of millions of machines across the globe. The secondary cause was a bug in Skype's software that prevented "self-healing" in the face of such an event. The flood of log-in requests and a lack of "peer-to-peer resources" melted their system.
Whose fault is it? Is Skype to blame? Is Microsoft to blame? Or is the peer-to-peer model itself fundamentally flawed in some way?
Let's be real, how could Skype possibly test booting 220 million servers over a random configuration of resources? Answer: they can't. Yes, it's Skype's responsibility, but they are in a bit of a pickle on this one.
The boot scenario is one of the most basic and one of the most difficult scalability scenarios to plan for and test. You can't simulate the viciousness of real-life conditions in a lab because only real-life has the variety of configurations and the massive resources needed to simulate itself. It's like simulating the universe. How do you simulate the universe if the computational matrix you need is the universe itself? You can't. You end up building smaller models and those models sometimes fail.
I worked at a set-top box company for a while, and our big boot scenario was the restart of entire neighborhoods after a power failure. To make for an easy upgrade path, each set-top downloaded its image from the head-end on boot; only a boot image was in EEPROM.
This is a very stressful scenario for the system. How do you test it? How do you test thousands of booting set-tops when they don't even exist yet? How do you test the network characteristics of a cable system in the lab? How do you design a system not to croak under the load?
Cleverness. One part of the solution was really cool. The boot images were continually broadcast over the network so each set-top would pick up blocks of the boot image. The image would be stitched together from blocks rather than having thousands of boxes individually download images, which would never work. This massively reduced the traffic over the network. Clever tricks like this can get you a long way.
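Here's a minimal sketch of the carousel idea, with made-up block sizes and names rather than the actual set-top code:

```python
# Sketch of a broadcast "carousel" receiver: the head-end repeats the boot
# image as numbered blocks forever, and each box collects whatever blocks
# it sees until the whole image can be stitched together. Block layout and
# numbers are illustrative, not the real set-top protocol.
import itertools
import random

BLOCK_SIZE = 1024          # bytes per broadcast block (assumed)
TOTAL_BLOCKS = 2048        # blocks in one boot image (assumed)


def carousel(image_blocks):
    """Endlessly cycle through the image blocks, as the head-end would."""
    for index in itertools.cycle(range(len(image_blocks))):
        # Simulate a lossy cable plant: some blocks never arrive.
        if random.random() < 0.05:
            continue
        yield index, image_blocks[index]


def receive_boot_image(broadcast):
    """Collect blocks in any order until the full image is assembled."""
    received = {}
    for index, data in broadcast:
        received.setdefault(index, data)
        if len(received) == TOTAL_BLOCKS:
            break
    # Stitch the blocks back into one contiguous image.
    return b"".join(received[i] for i in range(TOTAL_BLOCKS))


if __name__ == "__main__":
    image = [bytes([i % 256]) * BLOCK_SIZE for i in range(TOTAL_BLOCKS)]
    rebuilt = receive_boot_image(carousel(image))
    assert rebuilt == b"".join(image)
```

The key property is that every box listens to the same stream, so adding more booting boxes adds no extra load on the head-end.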
Work. Great pools of workstations were used to simulate set-tops, and software was written to insert drops and simulate asymmetric network communications. But how could we ever simulate 220 million different users? Back then, no way. Maybe now you could use grid services like Amazon's EC2.
Help from your friends. Microsoft is not being a good neighbor. They should roll out updates at a much more gradual rate so these problems don't happen. Booting loads networks, taxes CPUs, fills queues, drops connections, stresses services, increases process switching, drops packets, encourages deadlock, and steals RAM, file descriptors, and other resources. So it would be nice if MS was smarter about their updates. But since you can't rely on such consideration, you always have to handle the load.
I assume they used exponential backoff algorithms to limit login attempts, but with so many people this probably didn't matter. Perhaps they could insert a random wait to smooth out login traffic. But again, with so many people it probably wouldn't matter. Perhaps they could stop automatic logins on boot? That would solve the problem at the expense of user convenience. No go. Perhaps their servers could be tuned to accept connections at a fast rate yet condition how quickly they respond to the rest of the login process? Not good enough, I suppose.
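For illustration, here's roughly what backoff-with-jitter looks like on the client side; the function names, base, and cap are assumptions made for the sketch, not Skype's actual client code:

```python
# Sketch of login retry with exponential backoff plus random jitter.
# attempt_login(), the cap, and the base delay are all assumptions made
# for illustration; they are not Skype's actual client behavior.
import random
import time


def login_with_backoff(attempt_login, base=1.0, cap=300.0, max_tries=10):
    """Retry a login, spreading retries out so millions of clients
    rebooting at once don't hammer the servers in lockstep."""
    for attempt in range(max_tries):
        if attempt_login():
            return True
        # Exponential backoff: 1s, 2s, 4s, ... capped at `cap` seconds.
        delay = min(cap, base * (2 ** attempt))
        # Full jitter: pick a random point in [0, delay] so clients
        # that failed at the same moment don't all retry together.
        time.sleep(random.uniform(0, delay))
    return False
```

Even with well-jittered retries like this, though, 220 million clients may still add up to more than the login servers can take.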
So how did Skype fix their problem? They explain it here:
The parameters of the P2P network have been tuned to be smarter about how similar situations should be handled. Once we found the algorithmic fix to ensure continued operation in the face of high numbers of client reboots, the efforts focused squarely on stabilizing the P2P core. The fix means that we’ve tuned Skype’s P2P core so that it can cope with simultaneous P2P network load and core size changes similar to those that occurred on August 16.
Whenever I see the word "tune" I get the premonition shivers. Tuning means you are just one unexpected problem away from being out of tune, and your perfectly functioning symphony sounding like a band of percussion-happy monkeys. Tuned things break under change. Tweak the cosmological constant just a little and wham, there's no human life. It needs to work by design. Or it needs to be self-adaptive and not finessed by human hands for each new disaster scenario.
And this is where we get into the nature of P2P. Would the same problem have happened in a centralized architecture with resources spread strategically throughout the globe and automatic load balancing between different data centers? In a centralized model would it have been easier to bring more resources on line to handle the load? Would the outage have been easier to diagnose and last a much shorter amount of time?
There are of course no definitive answers to these questions. But many of the web's most successful systems like YouTube, Amazon, Ebay, Google, GoogleTalk, and Flickr use a centralized model. They handle millions of users and massive amounts of content and have pretty good reliability records.
Does P2P bring enough to the architecture that you should build a system around it? That to me is the interesting question that arises out of this incident.
Reader Comments (11)
... or the fault of Skype for not anticipating that their entire network, running on Windows, might go down. All at once.
How do you anticipate such things?
Invite someone from the outside to imagine worst case scenarios for you. That would be a start. Invite them in and ask them, "try and bring down our network."
Microsoft is a platform vendor. They initiated the problem, and it could have been a problem for a whole bunch of services. Over time, as more and more computing shifts into the cloud while retaining a presence on the desktop, this problem will occur over and over again unless Microsoft makes a change.
Why can't they stagger the upgrades? It would reduce the loads on their own servers as well as benefiting the rest of the world.
BTW, better check in with Adobe on Flex and any other platform vendor that could trigger similar problems.
Best,
BW
I think it's thought-provoking, but I'd like to argue that Skype isn't really a true P2P network. :)
The problem with Skype is that though this was a P2P model, it was still very dependent on the central network to keep everyone in sync. Based on what I've read (http://www.royans.net/arch/2007/08/20/how-skype-network-handles-scalability/), a new node in Skype goes through two big steps before it fully joins the network. First, it has to authenticate to the central servers; second, it has to find a good "supernode" to initialize/bootstrap through and maintain connectivity to the central servers. A node which has clear connectivity to the central servers can eventually become a "supernode" and allow others to connect through it. My guess is that due to the synchronized restart of clients, when all the nodes started banging on the central servers, the supernodes started losing connectivity and dropped out of "supernode" mode, creating a chicken-and-egg problem and making the situation worse than before.
Is it possible that the real reason the Skype network failed was its dependence on central servers?
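To make that chicken-and-egg failure concrete, here is a toy model of the join sequence described above; every name, capacity, and demotion rule is invented purely for illustration and is not Skype's real protocol:

```python
# Toy simulation of the chicken-and-egg failure: clients authenticate
# centrally, then attach to a supernode; when supernodes are overloaded
# they demote themselves, pushing yet more load back onto the central
# servers. All names, capacities, and rules here are invented.
import random


class Supernode:
    def __init__(self, capacity=50):
        self.capacity = capacity
        self.attached = 0
        self.demoted = False

    def accept(self):
        if self.demoted or self.attached >= self.capacity:
            # Overloaded supernodes drop back to ordinary-node status.
            self.demoted = True
            return False
        self.attached += 1
        return True


def simultaneous_reboot(num_clients=10_000, num_supernodes=100):
    supernodes = [Supernode() for _ in range(num_supernodes)]
    failed = 0
    for _ in range(num_clients):
        # Step 1: central authentication (assumed to always succeed here).
        # Step 2: try a handful of random supernodes to bootstrap through.
        if not any(random.choice(supernodes).accept() for _ in range(5)):
            failed += 1   # back to banging on the central servers
    return failed


if __name__ == "__main__":
    print("clients left banging on central servers:", simultaneous_reboot())
```

With more clients rebooting than the surviving supernodes can absorb, the failures feed back onto the central servers, which is the spiral the comment describes.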
Microsoft can't
They are in a lose-lose situation. Either they release all at once and get blamed for bringing down non-resilient systems, or they stagger the release and get blamed for letting people get compromised.
I say they are doing the right thing. Get the patch out there, and let people update their systems. If you aren't keeping your systems patched, and you get compromised, then it's YOUR problem and rightfully so.
As for Skype, their system has proven itself to be very resilient over the years. I'd say their track record is at least as good as the other sites mentioned. They've all experienced outages, data loss, and severe degradation in service quality. Maybe none of them have had a total blackout (although some sure came close when they were being DDoS'd a few years back), but they've had enough problems over the years. It all adds up.
Do you guys really think that Amazon, Google, or Yahoo's services are not in some way, shape, or form P2P behind the scenes? They work because the machines are in constant communication with each other, keeping track of who's doing what, who's available, and who's overloaded. Sure sounds a lot like the concepts behind P2P to me.
When you get to be the size of Yahoo, Amazon, or Google, you don't have some magical pixie dust firewall that can magically distribute your load across a number of machines. There is a lot going on behind the scenes, and it's horribly complicated, and the technical choices made blur the architectural lines all the time.
Bryan
As is done in many protocols, a randomized delay is applied to prevent client collisions.
Microsoft could easily delay reboots by minutes between clients, which would solve the problem.
Causing a "global restart" as Microsoft did is bound to cause problems such as this, and it's their responsibility to put some solution into practice.
I'm surprised Microsoft didn't think of any solution or of the possibility of such a crash, which brings me to the thought that they just didn't care...
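A sketch of what such a staggered rollout might look like, purely as an illustration (this is not how Windows Update actually schedules installs):

```python
# Sketch of a staggered rollout: each machine deterministically hashes its
# own ID into a slot inside the rollout window, so reboots spread out over
# hours instead of landing all at once. Purely illustrative; this is not
# how Windows Update actually schedules installs.
import hashlib


def reboot_delay_minutes(machine_id: str, window_minutes: int = 240) -> int:
    """Return this machine's offset (in minutes) into the rollout window."""
    digest = hashlib.sha256(machine_id.encode()).digest()
    return int.from_bytes(digest[:4], "big") % window_minutes


if __name__ == "__main__":
    for mid in ("host-0001", "host-0002", "host-0003"):
        print(mid, "reboots", reboot_delay_minutes(mid), "minutes into the window")
```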
A.M.
Who do you blame? Those who designed against assumptions, or those who proved that the assumptions were invalid? Microsoft never implicitly or explicitly communicated a service contract for Skype to work against, and Skype even took measures knowing this. The only one at fault here is Skype. It's nice that they are finally taking the blame, rather than construing it as another horror by the 'evil' Microsoft.
MS does updates all the time. This isn't the first time MS has done an update since Skype became popular. On top of that, it was not 220 million clients rebooting at the same time. Many people have auto-update turned off, or set to alert them when to initiate the installs. I have to think only a certain percentage of their user base runs MS, has auto-updates turned on, and has them scheduled for a certain time or set to run instantly. Also, not all of their users are in the same time zone, so if they kept the default 3:00 am install time for scheduled updates, the reboots would be distributed across time zones.
I smell something fishy in all of this.
Ultimately it is Skype's responsibility to write good code. For crying out loud, if they can't handle, say, 20 million people logging in at once, then deal with it. Why not blame their users directly for keeping their patching up to date?
It's nice there's a thread to discuss the Skype outage, because the really remarkable thing seems to be that it's the first time something of the sort has happened. I've been using Skype for years and never experienced any sort of outage other than when my own Internet connectivity failed. It's therefore difficult to understand why this one failure has resulted in so many comments about how the P2P architecture is doomed. If anything, it should draw our attention to how well the design has scaled as Skype users have increased. As I write this almost 6,000,000 people are logged in. It's a huge system.
Another comment that does not entirely make sense is the idea that tuning is somehow bad. All big systems are tuned. I work with databases; the larger they are, the more important it is to adjust resources correctly to match the workload. It would be unreasonable to expect that Skype would be any different. The tuning parameters are obviously different but the principle is the same.
In some ways one of the most interesting questions of this whole affair is how Skype monitors their network. As a previous poster pointed out, you can't create the real load in a lab. The true test comes after you deploy. Skype has a problem somewhat analogous to NASA with the space shuttle. You tend to get a lot of foam bouncing off the fuel tank before a piece goes through the wing. Skype must have been close to capacity on previous occasions. They either missed the warning signs or did not know the actual limits that would trigger a meltdown. I would guess there are some people in Estonia looking at that issue pretty carefully right now.
Finally, to the extent this was a bug rather than hitting a resource limit, the Skype outage sounds like another classic software failure where error recovery goes bad. Something bad happened to the supernodes, and in trying to recover, Skype clients seem to have made the problem a lot worse. This is the sort of chain reaction that led to the (in)famous AT&T SS7 network failure in 1990. In that case the problem appears to have been a misplaced 'break' statement. Chain reaction outages are not just a problem with P2P networks. They are a failure mode that can appear in virtually any distributed system.
Robert Hodges
CTO, Continuent, Inc.
It's not just Skype, it's a conversation I've been having, mainly with myself, about the tradeoffs between different architectures. There have been several startups I was interested in until they made a point of their architecture being based on P2P. Customers really just want a service. Does basing a service on P2P matter? Who really cares? But P2P is often given as the reason something is good, and I just don't get that.
> All big systems are tuned.
And the degree to which they are tuned is also a measure of their brittleness. For example, how do you set the queue sizes for communicating actors in a distributed system? Do you tune them based on the expected number of nodes, number of users, and amount of work? Or do you make your system gracefully adapt to anything thrown at it? From hard, painful experience, tuning doesn't work in the real world, because the real world simply laughs at your assumptions. For the queue example, use end-to-end, application-level ACKs to prevent system-crashing tight retry loops. Don't just pick a queue size that is probably big enough.
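Here's a minimal sketch of the queue example, with invented names: cap the messages in flight and let application-level ACKs provide back-pressure, instead of guessing a queue size up front.

```python
# Minimal sketch: instead of guessing a "big enough" queue size, cap the
# messages in flight and block the sender until the receiving application
# ACKs what it has actually processed. Names and window size are invented.
import queue
import threading


class AckedChannel:
    def __init__(self, window=8):
        self._inflight = threading.Semaphore(window)   # max unACKed messages
        self._queue = queue.Queue()

    def send(self, message):
        # Blocks when the window is full, so a slow or crashed receiver
        # applies back-pressure instead of letting retries pile up.
        self._inflight.acquire()
        self._queue.put(message)

    def receive(self):
        return self._queue.get()

    def ack(self):
        # Application-level ACK: only after real processing is the sender
        # allowed to push the next message into the window.
        self._inflight.release()


if __name__ == "__main__":
    channel = AckedChannel(window=2)

    def consumer():
        for _ in range(4):
            msg = channel.receive()
            print("processed", msg)
            channel.ack()

    t = threading.Thread(target=consumer)
    t.start()
    for i in range(4):
        channel.send(i)
    t.join()
```

The point is that the window adapts to how fast the other end actually works; there is no magic number to re-tune when the workload changes.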
Hi Todd,
It seems unwise to think a business would be successful just because it uses P2P, unless the argument somehow has an economic basis. P2P scales well for some applications because users pay for the costs by contributing processing resources.
On tuning, I agree you don't want to do it unnecessarily. The case you are describing seems to involve introducing unnecessary assumptions, which just seems like bad design. But there are many cases where there are trade-offs with no right answers. Network timeouts are a classic example. When building clusters you often have to decide between a short timeout vs. a long timeout to decide that a node has failed. If you pick a long timeout, synchronous operations on all nodes (e.g., replicating changes) may hang for an unacceptably long period of time until the cluster can decide that a member has crashed. On the other hand, short timeouts may cause members to be kicked out unnecessarily if they get busy and temporarily do not respond. Different applications call for different values.
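To make the trade-off concrete, here's a toy heartbeat-based failure detector; the interval and timeout values are arbitrary examples, not recommendations for any particular cluster product:

```python
# Toy heartbeat-based failure detector to make the timeout trade-off
# concrete. The timeout value is an arbitrary example.
import time


class FailureDetector:
    def __init__(self, timeout_seconds):
        self.timeout = timeout_seconds
        self.last_heartbeat = {}        # node -> time of last heartbeat

    def heartbeat(self, node):
        self.last_heartbeat[node] = time.monotonic()

    def suspected_failed(self, node):
        # Long timeout: slow to evict a genuinely dead node, so the cluster
        # may hang waiting on it. Short timeout: a node that is merely busy
        # (GC pause, load spike) gets kicked out unnecessarily.
        last = self.last_heartbeat.get(node)
        return last is None or (time.monotonic() - last) > self.timeout


if __name__ == "__main__":
    detector = FailureDetector(timeout_seconds=5.0)
    detector.heartbeat("node-a")
    print("node-a suspected:", detector.suspected_failed("node-a"))
    print("node-b suspected:", detector.suspected_failed("node-b"))
```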
Robert Hodges
CTO, Continuent, Inc.
Skype uses a central db. The p2p network can't work without this single point of failure. The network team killed the db's network connection, so the p2p network failed. The network team wasn't able to fix this easily. Probably some hidden cyclic routing. That's what my source was saying.
@royans: yes, you are right.
Anyway, p2p scales quite well. Try to download your favorite Linux distribution via Azureus, preferably without a tracker. BitTorrent shows that massive content distribution is possible with p2p. Guess why WoW uses p2p for patch distribution?
It's just Skype that sucks. And its centralized storage/db. BitTorrent avoided these problems by using a DHT and allowing decentralized tracking, peer exchange, and the like. The biggest problem is bootstrapping a p2p network with little to no users and without central servers. Unless you abuse the BitTorrent DHT for your system (there are two incompatible systems: the Azureus-style and the classic BitTorrent-style DHT).
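To show the core idea of decentralized tracking, here's a toy consistent-hashing lookup in the spirit of a DHT; it's nothing like the Kademlia-style routing real BitTorrent DHTs use, just the point that any peer can compute which node is responsible for a key without asking a central tracker:

```python
# Toy consistent-hashing lookup in the spirit of a DHT: any peer can
# compute which node is responsible for a key, so there is no central
# tracker to lose. Only the core idea, not real Kademlia routing.
import bisect
import hashlib


def _hash(value: str) -> int:
    return int.from_bytes(hashlib.sha1(value.encode()).digest(), "big")


class ToyDHT:
    def __init__(self, node_ids):
        # Place nodes on a hash ring, sorted by their hashed position.
        self._ring = sorted((_hash(n), n) for n in node_ids)

    def responsible_node(self, key: str) -> str:
        """Walk clockwise from the key's position to the next node."""
        position = _hash(key)
        index = bisect.bisect(self._ring, (position,)) % len(self._ring)
        return self._ring[index][1]


if __name__ == "__main__":
    dht = ToyDHT(["node-a", "node-b", "node-c", "node-d"])
    print(dht.responsible_node("some-torrent-infohash"))
```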
Hope this helps,
anonymous
PS: Skype uses Postgres: http://en.wikipedia.org/wiki/PostgreSQL