Skype explains last week's crash
Skype has had a tough few weeks with its major outage, but the company has now identified the cause, which apparently was a bug.
What was the cause for the failure?
On Wednesday, December 22, a cluster of support servers responsible for offline instant messaging became overloaded. As a result of this overload, some Skype clients received delayed responses from the overloaded servers. In one version of the Skype for Windows client (version 5.0.0.152), these delayed responses were not properly processed, causing Windows clients running the affected version to crash.
Users running the latest Skype for Windows (version 5.0.0.156), older versions of Skype for Windows (4.0 versions), Skype for Mac, Skype for iPhone, Skype on your TV, or Skype Connect or Skype Manager for enterprises were not affected by this initial problem.
However, around 50% of all Skype users globally were running the 5.0.0.152 version of Skype for Windows, and the crashes caused approximately 40% of those clients to fail. These clients included 25–30% of the publicly available supernodes, which also failed as a result of this problem.
If approximately 20% of total Skype clients failed, why was there a much bigger disruption to Skype functionality?
Although Skype staff responded quickly to disable the overloaded servers and to eliminate client requests to them, a significant number of supernodes had already failed. A supernode is important to the P2P network because it takes on additional responsibilities compared to regular nodes: it acts like a directory, supports other Skype clients, helps to establish connections between them, and creates local clusters, typically of several hundred peer nodes per supernode.
Once a supernode has failed, even when restarted, it takes some time to become available as a resource to the P2P network again. As a result, the P2P network was left with 25–30% fewer supernodes than normal. This caused a disproportionate load on the remaining available supernodes.
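To get a rough feel for what losing 25–30% of the supernodes means for the ones that remain, here is a minimal sketch. It assumes, purely for illustration, that the total load stays constant and is spread evenly over the surviving supernodes; the post does not say how load is actually distributed, and the function name `load_multiplier` is hypothetical.

```python
def load_multiplier(failed_fraction: float) -> float:
    """Relative load per surviving supernode, assuming (hypothetically)
    that constant total load is shared evenly among the survivors."""
    if not 0 <= failed_fraction < 1:
        raise ValueError("failed_fraction must be in [0, 1)")
    return 1.0 / (1.0 - failed_fraction)


# The 25% and 30% figures come from the post; the even-split model does not.
for failed in (0.25, 0.30):
    print(f"{failed:.0%} failed -> {load_multiplier(failed):.2f}x load per survivor")
```

Under that assumption each remaining supernode carries roughly a third to two fifths more load than usual, before counting the extra traffic from clients reconnecting.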
Why weren't the other supernodes available to help?
The failure of 25–30% of supernodes in the P2P network resulted in an increased load on the remaining supernodes. While we expect this kind of increase in the instance of a failure, a significant proportion of users were also restarting crashed Windows clients at this time. This massively increased the load as they reconnected to the peer-to-peer cloud. The initial crashes happened just before our usual daily peak hour (1000 PST/1800 GMT), and very shortly after the initial crash, traffic to the supernodes was about 100 times what would normally be expected at that time of day.
Supernodes have a built-in mechanism to protect themselves and to avoid adverse impact on the systems hosting them when operational parameters do not fall into expected ranges. We believe that increased load in supernode traffic led to some of these parameters exceeding normal limits, and as a result, more supernodes started to shut down. This further increased the load on remaining supernodes and caused a positive feedback loop, which led to the near-complete failure that occurred a few hours after the triggering event.
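The feedback loop described above can be illustrated with a toy simulation. This is a sketch under stated assumptions, not Skype's actual mechanism: every node normally carries 1.0 unit of load, each node has a hypothetical shutdown limit between 1.20 and 2.19 units, load is split evenly among the nodes still alive, and a node shuts itself down as soon as its share exceeds its limit. The name `simulate_cascade` and all numbers are invented for the example.

```python
def simulate_cascade(limits, initially_failed, demand):
    """Return the number of alive nodes after each round of the cascade.

    limits           -- per-node load limit (hypothetical self-protection threshold)
    initially_failed -- set of node indices that crashed at the start
    demand           -- total load, split evenly among alive nodes (an assumption)
    """
    alive = [lim for i, lim in enumerate(limits) if i not in initially_failed]
    history = [len(alive)]
    while alive:
        load = demand / len(alive)
        # Nodes whose limit is below the current per-node load shut down.
        survivors = [lim for lim in alive if lim >= load]
        if len(survivors) == len(alive):
            break  # stable: nobody else shuts down
        alive = survivors
        history.append(len(alive))
    return history


# 100 nodes with limits 1.20, 1.21, ..., 2.19; total demand of 100 units.
limits = [1.2 + 0.01 * i for i in range(100)]

# Losing 10% of nodes is absorbed: the network stays at 90 nodes.
print(simulate_cascade(limits, {i for i in range(100) if i % 10 == 0}, 100))
# -> [90]

# Losing 25% of nodes triggers a runaway cascade down to zero.
print(simulate_cascade(limits, {i for i in range(100) if i % 4 == 0}, 100))
# -> [75, 65, 50, 15, 0]
```

In this toy model a small loss is absorbed, while a loss of a quarter of the nodes pushes each round's survivors past their limits, so the collapse accelerates. That is the same shape as the positive feedback loop the post describes.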
Regrettably, as a result of the confluence of events – server overload, a bug in Skype for Windows clients (version 5.0.0.152), and the decline in available supernodes – Skype's functionality became unavailable to many of our users for approximately 24 hours.
How did Skype help support supernode recovery?
In order to restore Skype functionality, the Skype engineering and operations team introduced hundreds of instances of the Skype software into the P2P network to act as dedicated supernodes, which we nicknamed “mega-supernodes,” to provide enough temporary supernode capacity to accelerate the recovery of the peer-to-peer cloud.
By late Wednesday night (PST) it was evident that only a proportion (about 15–20%) of Skype users' connections were ‘healing’ and the volume of load on the supernodes continued to be unusually high. In response, our team introduced several thousand more mega-supernodes through the night. During Wednesday night, full recovery of the P2P network was underway, and the majority of users were able to connect to the P2P network normally by early morning (California, PST) on December 23rd.
As we reported during the incident, in order to recover the core Skype functionality as quickly as possible, we used resources that normally support Group Video Calling to deploy supernodes. Over the course of Thursday night and Friday morning we returned these to their normal use and restored Group Video Calling functionality in time for Christmas.
The supernodes stabilized overnight on Thursday and by Friday, several tens of thousands of supernodes were supporting the P2P network. During Friday, we withdrew a significant proportion of the mega-supernodes from service, leaving some in operation to ensure stability of the P2P network over Christmas and New Year.
IDG has a nice summary of the whole blog post here: http://www.idg.se/2....a-windowsklient