On 23 February 2013, we were down for 8 hours. We received a monitoring alert (and some e-mails and tweets as well) at around 9:50 PM (CET) notifying us of this event. After looking into it, we discovered there was not much we could do besides sit and wait… We do wish to apologize and clarify the events behind this outage (our first since July 2012).

Windows Azure Storage outage

The root cause of this outage was a global outage of Windows Azure Storage. This service is one of the core building blocks of Windows Azure and had never failed before (that’s 4 years of no issues). The storage system is based on the HTTP protocol and has both http and https endpoints. Most applications built on top of Windows Azure, including MyGet and the platform’s own building blocks, use the https endpoints to prevent transport-level attacks. Unfortunately, most clusters of Windows Azure Storage were running an expired SSL certificate on this https endpoint. That was the reason for this global outage of Windows Azure and every application hosted on the platform, including MyGet, the official NuGet package source, and every service directly depending on www.nuget.org.
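To illustrate the kind of check that catches an expiring certificate ahead of time, here is a minimal Python sketch. It is not how Windows Azure or MyGet monitor certificates; the function names and the timestamp used in the usage note are assumptions made up for illustration. It only relies on the standard library's `ssl.cert_time_to_seconds` and `getpeercert()`.

```python
import socket
import ssl
from datetime import datetime, timezone


def days_until_expiry(not_after: str, now: datetime) -> float:
    """Days from `now` until a certificate's notAfter timestamp.

    `not_after` uses the format returned by ssl's getpeercert(),
    e.g. "Feb 23 00:00:00 2013 GMT". A negative result means the
    certificate has already expired.
    """
    expires = datetime.fromtimestamp(
        ssl.cert_time_to_seconds(not_after), tz=timezone.utc
    )
    return (expires - now).total_seconds() / 86400


def fetch_not_after(host: str, port: int = 443, timeout: float = 5.0) -> str:
    """Read notAfter from the certificate a live https endpoint presents.

    With default verification, the handshake itself raises
    ssl.SSLCertVerificationError once the certificate has expired,
    so this is meant to run *before* that happens.
    """
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]
```

A scheduled job could call `days_until_expiry(fetch_not_after(host), datetime.now(timezone.utc))` for each endpoint it depends on and raise an alert when the result drops below some threshold, for example 30 days.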

MyGet runs in the Windows Azure Europe West region (Amsterdam), with a cold disaster recovery location in the Windows Azure Europe North region (Dublin). We can fail over to this location within hours if compute or storage in the main datacenter location fails. In case of a serious outage, we can restore a disaster recovery copy of our services in any Windows Azure region around the globe. Unfortunately, there’s not much we can do in case of a global outage…

Our status page can always be found at http://status.myget.org. For reference, here are our uptime numbers for the past year.

Month             Uptime (avg.: 99.79%)
February 2013     98.44%
January 2013      100%
December 2012     99.99%
November 2012     99.94%
October 2012      99.99%
September 2012    99.92%
August 2012       99.98%
July 2012         99.56%
June 2012         99.94%
May 2012          100%
April 2012        99.69%
March 2012        99.99%
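For readers who want to check the arithmetic, the average above can be reproduced from the monthly figures with a few lines of Python. This is only a sketch: the function names are ours, and `downtime_hours` assumes uptime is measured against every hour of the period, which may not match how the monitoring service counts.

```python
from statistics import mean

# Monthly uptime percentages, copied from the table above.
UPTIME_PCT = {
    "February 2013": 98.44,
    "January 2013": 100.0,
    "December 2012": 99.99,
    "November 2012": 99.94,
    "October 2012": 99.99,
    "September 2012": 99.92,
    "August 2012": 99.98,
    "July 2012": 99.56,
    "June 2012": 99.94,
    "May 2012": 100.0,
    "April 2012": 99.69,
    "March 2012": 99.99,
}


def average_uptime(values) -> float:
    """Unweighted mean of monthly uptime percentages, rounded to 2 decimals."""
    return round(mean(list(values)), 2)


def downtime_hours(uptime_pct: float, days: int) -> float:
    """Hours of downtime implied by an uptime percentage over `days` days."""
    return (1 - uptime_pct / 100) * days * 24


print(average_uptime(UPTIME_PCT.values()))  # 99.79
```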

 

Again, we apologize for the inconvenience caused, and we are discussing possible fallback scenarios in case a severe platform outage like this occurs again.

Happy packaging!