MyGet Blog

Package management made easier!


Default feed endpoint issues 7 February 2013 - Post mortem

On February 7, 2013, we experienced partial downtime. The default endpoint for consuming NuGet feeds was slow to respond (+/- 2 minutes for a simple request), causing issues for many of our users and their development teams and processes. We share your pain: we use MyGet for developing MyGet, and we hate it when things like this happen as much as you do. Our sincerest apologies for the inconvenience caused!

In this post, we'll have a look at what happened and why. But before we dive in: we've had 3 hours of partial downtime on the default feed endpoint, which most users have configured. We want to apologize by extending all paid subscriptions (Starter, Professional and Enterprise) with one additional month. No action is required on your end; your subscription has already been extended.

What happened?

At around 3:15 PM (CET), our monitoring alerted us that latency on the default feed endpoint was going up. After a quick investigation and some additional time, latency went back to normal. One hour later, around 4:15 PM (CET), we saw the same thing happen again, as well as a number of support e-mails coming in.

MyGet has the following feed endpoints available (see documentation):

  • /F/<your_feed_name> - the default feed endpoint (NuGet v2 API)
  • /F/<your_feed_name>/api/v2 - the NuGet v2 API endpoint for consuming packages
  • /F/<your_feed_name>/api/v2/package - the NuGet v2 API endpoint for pushing packages
  • /F/<your_feed_name>/api/v1 - the NuGet v1 API endpoint for consuming and pushing packages (still in use by Orchard CMS and some others)

In this case, the default endpoint we offer our users was experiencing issues; the other endpoints were not. We decided to reply to all support requests with the advice to switch to the v2 feed endpoint so you could continue working. This requires some reconfiguration, but we figured it's better than very slow access and timeouts connecting to your feeds.
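
For anyone still applying the workaround: switching a consuming machine to the v2 endpoint can be done with a single nuget.exe command (the source name and feed name below are placeholders for your own):

nuget sources update -Name MyFeed -Source https://www.myget.org/F/your_feed_name/api/v2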

We checked logs, diagnostics and monitoring reports, and read and re-read our code, but could not pinpoint any reason for the default feed endpoint having such high latency. At 5:00 PM (CET) we decided to fail over to our secondary datacenter. Being a separate environment, we wanted to see if the issue would happen there as well. It didn't! So we flipped the DR switch and took our primary datacenter out of the loop to be able to investigate the servers without causing additional issues for our users. This fail-over solved the issue for most of our users, who were able to work with MyGet's default feed endpoint from around 6:00 PM (CET), barring some DNS propagation delays for some users.

Digging through more logs, we discovered the issue that caused the default feed endpoint to experience this latency (we'll elaborate in a minute). At 8:00 PM (CET), we rolled out a hotfix to our secondary datacenter to make sure the issue would not resurface there either. We worked on a final fix for the next few hours and at 11:30 PM (CET), we flipped the switch back to our primary datacenter.

Why was the default feed endpoint so slow?

As you know, MyGet supports adding upstream feeds and allows proxying them. This is a very useful feature, as one feed can essentially bundle several others: a development team only has to configure one feed and still gets packages from multiple.

MyGet treats all upstream package sources as external feeds, even if an upstream feed is a MyGet feed. The reason is that security contexts can be different, and we want all security aspects applied at all times. This does mean that a MyGet feed having an upstream MyGet feed as a package source is effectively using the default feed endpoint of this second feed, going over our internal network through the load balancer. Next to security, this also allows us to fan out queries going to multiple package sources over multiple machines.

One of our users configured upstream package sources that were referencing each other: feed A referencing feed B, which referenced feed A in turn. We prohibit feeds from referencing themselves directly, but this indirect scenario was unsupported and resulted in a reference loop on the default feed endpoint. The default feed endpoint was bringing itself down by fanning out queries to itself.

Previous incidents taught us to partition as many things as possible, and luckily we partition all alternative feed endpoints. This allowed us to direct support requests towards the other feed endpoints so users could keep using our service.

Solution

As a quick fix, we disabled the offending package source so our service could resume in a stable fashion. Once that was done, we worked on a permanent solution to this issue. Do we want to block self-referencing feeds? Absolutely, as a feed referencing itself makes no sense. Do we want to block adding other MyGet feeds as upstream package sources? No.

We want to keep supporting having upstream package sources that can be MyGet feeds, so that complex hierarchies of feeds and privileges can be configured. Quite a number of feeds do this today, having one feed proxying a series of upstream feeds.

How do we prevent the same problem from happening again? We are now actively tracking referring feeds and detecting loops in these references. Doing this, we keep supporting all scenarios that were possible before while protecting ourselves from shooting ourselves in the foot. Feed A can reference feed B, which can reference feed A. Users can use feed A as the entry point, as well as feed B, and get all expected packages on the feed. We make sure reference loops are removed from the feed hierarchy.
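
Conceptually, loop removal boils down to walking the graph of upstream feeds with a visited set. Here is a minimal sketch in PowerShell; the function name and the Url/UpstreamFeeds properties are hypothetical, not our actual code:

function Get-EffectiveUpstreamFeeds($rootFeed) {
    # Breadth-first walk over the feed graph; a feed that was already
    # visited is skipped, which silently breaks loops like A -> B -> A.
    $visited = New-Object 'System.Collections.Generic.HashSet[string]'
    $queue = New-Object 'System.Collections.Generic.Queue[object]'
    $queue.Enqueue($rootFeed)
    $effective = @()
    while ($queue.Count -gt 0) {
        $feed = $queue.Dequeue()
        if (-not $visited.Add($feed.Url)) { continue }  # loop detected: skip this feed
        $effective += $feed
        foreach ($upstream in $feed.UpstreamFeeds) { $queue.Enqueue($upstream) }
    }
    return $effective
}

With this in place, querying feed A yields A and B exactly once each, no matter how the references between them loop.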

When a MyGet feed references a non-MyGet feed, we will also be adding an HTTP header to all upstream requests, X-NuGet-Feedchain, which contains the full list of referring feeds. Other NuGet feed implementations can make use of this header to detect reference loops on their end as well.
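
A receiving implementation could, for example, refuse to fan out when it finds itself in the chain. A sketch, assuming the header carries a comma-separated list of feed URLs ($request and $ownFeedUrl stand in for whatever your server stack provides):

$chain = @()
if ($request.Headers['X-NuGet-Feedchain']) {
    $chain = $request.Headers['X-NuGet-Feedchain'] -split ',' | ForEach-Object { $_.Trim() }
}
if ($chain -contains $ownFeedUrl) {
    # We are already part of this request chain: answer from local packages
    # only, instead of querying upstream sources and closing the loop.
}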

While we replied to all support requests in under 5 minutes with a workaround to use one of the other endpoints, we will be working on extending our status page at http://status.myget.org to show details about all endpoints. We also want to be able to post status messages and tips & tricks, like switching to the secondary feed endpoints, on there.

Hardware fails, software has bugs and people make mistakes. We strive to mitigate as many of these factors and maintain a high quality service with fast, good support. We'll keep doing that to provide you with the best experience possible.

Again, we do apologize for the inconvenience caused.

Happy packaging!

Downtime September 11 and 12, 2013 - Root cause

On September 11 and 12, 2013, we experienced some downtime. The website and back-end services crashed a number of times across all instances, bringing our system to a halt a couple of times. We hate downtime as much as you do and want to apologize for the trouble this may have caused you. Let's have a look at the symptoms and root cause.

Symptoms

Twitter and e-mail notified us of this a good 2 minutes before our monitoring kicked in. When we had a look at our status dashboards, we saw high CPU usage on all machines, inexplicable crashes of machines and an error message in the Windows event logs right before the machines died and restarted.

In our back-end queue, where all work is added and waits to be processed, we found 25 messages that seemed to be retried over and over again, having been dequeued over 150 times.

We manually removed these messages from the queue and re-inserted them once the website was running stable again. After monitoring the situation for a couple of hours, all systems seemed to be running stable. Seemed…

A number of hours later, we received some additional monitoring events. The problem was back! This time we decided not to fire-fight but to dig deeper into the issue we were facing. We took the site offline and connected to the machines to analyze event logs and so on. Except for the error message in the Windows event log, nothing was going on. Next, we hooked a debugger to it, waiting for anything to happen.

The issue

Perhaps a root cause analysis is not the best place to blog about the new features and updates in our latest deployment, but we have to in this case. Last Tuesday, we deployed our 1.8 release, which brings an update to retention policies. These allow MyGet to automatically delete packages from a feed after a certain number of packages have been added: an automatic feed cleanup, as it were.

The updated retention policies feature now tries not to break a dependency chain. Before, packages that were depended on by other packages but were subject to the retention policy settings would be deleted. Now, the retention policy handler respects the entire dependency chain for a package and will no longer remove depended-upon packages. This, of course, requires us to parse the entire dependency tree for a package.

Now what happens if package A depends on package B, B depends on C and C depends on B? Right: that’s a potential infinite loop. Guess what was happening: a StackOverflowException, crashing our entire machine without giving it the chance to write output to our error logs.
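
In pseudo-form, the old traversal behaved something like this simplified PowerShell sketch (hypothetical names, not our actual code); with B -> C -> B it recurses until the stack overflows:

function Get-DependencyClosure($package) {
    $closure = @($package)
    foreach ($dep in $package.Dependencies) {
        # No bookkeeping of already-visited packages: a cycle such as
        # B -> C -> B recurses forever, ending in a StackOverflowException.
        $closure += Get-DependencyClosure $dep
    }
    return $closure
}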

Because of the crash, the message in our back-end was requeued over and over. We normally move a message to a dead letter queue after a number of retries (typically 32). However, because of the crash, that logic couldn't kick in and move the message out, resulting in the message being requeued over and over again, triggering the same behavior each time.
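
The guard itself is straightforward; the catch is that it lives in the same process that crashes. A sketch with hypothetical helper names (Azure storage queue messages do expose a DequeueCount):

try {
    Invoke-MessageHandler $message       # hypothetical handler doing the actual work
    Remove-QueueMessage $message         # done: delete the message from the queue
} catch {
    if ($message.DequeueCount -ge 32) {
        Move-ToDeadLetterQueue $message  # hypothetical helper: park the message for inspection
    }                                    # otherwise leave it in the queue to be retried
}
# A StackOverflowException cannot be caught: the process dies before the catch
# block runs, the visibility timeout expires, and the same message is dequeued
# again, and again, and again.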

While we test our application thoroughly before going to production, this was an edge case. Who would create a package with circular dependencies anyway? As it turns out, on one feed, a circular package dependency did exist.

Solution

We have fixed our algorithm so that it now supports this scenario and stops following the circular dependency.
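
Conceptually, the fix adds a visited set to the traversal sketched above (again with hypothetical names, not our production code):

function Get-DependencyClosure($package, [System.Collections.Generic.HashSet[string]]$visited) {
    if (-not $visited.Add($package.Id)) { return @() }  # already expanded: stop following the cycle
    $closure = @($package)
    foreach ($dep in $package.Dependencies) {
        $closure += Get-DependencyClosure $dep $visited
    }
    return $closure
}

Each retention run starts with a fresh, empty set, so every package in the chain is expanded exactly once.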

After verifying the fix, we deployed it at 6 AM (CET) this morning, resolving the issue. The messages that were stuck in our queue (and triggered the initial issue) were requeued to confirm correct behavior, and all were processed successfully.

Our status page can always be found at http://status.myget.org, showing our uptime over the last months as well as the outage of the past hours.

Again, we do apologize for the inconvenience caused.

Happy packaging!

MyGet is running over HTTPS only

A couple of weeks ago we informed you that from July 1st, 2013, MyGet would be switching to HTTPS traffic only. We have just deployed these changes, and from now on your feed URL should be prefixed with https:// instead of http://.

Haven't made the switch yet? No problem: we will be redirecting most traffic to the new HTTPS endpoint. However, this may add a little latency to your requests, so it's best to update your URL now if you haven't done so.

To help update feed URLs on developer machines, you can make use of package source discovery (see http://docs.myget.org/docs/reference/package-source-discovery). In short, every developer can issue the following commands in the Visual Studio Package Manager Console to update feed URLs:

Install-Package DiscoverPackageSources
Discover-PackageSources -Url https://www.myget.org/Discovery/Feed/ -OverwriteExisting
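
If you prefer not to install the discovery package, a single source can also be repointed with plain nuget.exe (the source name is whatever you called the feed locally; the feed name is a placeholder):

nuget sources update -Name MyFeed -Source https://www.myget.org/F/your_feed_name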

Protecting the security and privacy of our users is one of our most important tasks at MyGet. The fact that you can safely store your intellectual property on our servers is the best proof of that. We are confident this one-time change will make the entire MyGet experience even more secure.

Do not hesitate to contact us through the comments below if you have additional questions.

Happy packaging!

PS: If you are a MyGet Enterprise customer and are using your own domain name, no action is required on your side.

Switching to full HTTPS on July 1st, 2013

Important: a change is coming to URLs of MyGet. Please read through this post carefully as there may be some actions required on your side.

Protecting the security and privacy of our users is one of our most important tasks at MyGet. The fact that you can safely store your intellectual property on our servers is the best proof of that.

Currently, MyGet supports both http as well as https to communicate with our applications. To further improve our security, we're removing http access in the near future and will be switching to https only by July 1st, 2013, using a 2048-bit key certificate. By using only https, we can guarantee a secure communication channel between you and our servers.

Unfortunately, this change may require some action on your side. We will be discontinuing the http://www.myget.org URL in favor of https://www.myget.org. This means:

  • Your developers and/or clients may have to update their configuration
  • Your continuous integration servers may have to be reconfigured to make use of this new URL

This transition requires the following action on your end:

  • Before July 1st - All links to MyGet have to be migrated to the https://www.myget.org URL if you are not on the Enterprise plan.

To help update feed URLs on developer machines, you can make use of package source discovery (see http://docs.myget.org/docs/reference/package-source-discovery). In short, every developer can issue the following commands in the Visual Studio Package Manager Console to update feed URLs:

Install-Package DiscoverPackageSources
Discover-PackageSources -Url https://www.myget.org/Discovery/Feed/


We are confident this one-time change will make the entire MyGet experience even more secure.

Do not hesitate to contact us if you have additional questions.

Best regards,
the MyGet team

We were down...

On February 23, 2013, we were down for 8 hours. We received a monitoring alert (and some e-mails and tweets as well) at around 9:50 PM (CET) notifying us of this event. After looking into it, we discovered there was not much to do about it besides sit and wait… We do wish to apologize for this outage (our first since July 2012) and clarify what happened.

The root cause of this outage was a global outage of Windows Azure Storage. This service is one of the core building blocks of Windows Azure and had never failed before (that's 4 years without issues). The storage system works over the HTTP protocol and has both http and https endpoints. Most applications built on top of Windows Azure, including MyGet and the platform's own building blocks, use the https endpoints to prevent transport-level attacks. Unfortunately, most clusters of Windows Azure Storage were running an expired SSL certificate on this https endpoint, which caused the global outage of Windows Azure and of every application hosted on the platform, including MyGet, the official NuGet package source, and every service directly depending on www.nuget.org.

MyGet runs in the Windows Azure Europe West region (Amsterdam), with a cold disaster recovery location in the Windows Azure Europe North region (Dublin). We can fail over to this location within hours if compute or storage in the main datacenter location fail. In case of a serious outage, we can restore a disaster recovery copy of our services in any Windows Azure region around the globe. Unfortunately, there’s nothing much we can do in case of a global outage…

Our status page can always be found at http://status.myget.org. For reference, here are our uptime numbers for the past year.

Month             Uptime (avg.: 99.79%)

February 2013     98.44%
January 2013      100%
December 2012     99.99%
November 2012     99.94%
October 2012      99.99%
September 2012    99.92%
August 2012       99.98%
July 2012         99.56%
June 2012         99.94%
May 2012          100%
April 2012        99.69%
March 2012        99.99%

Again, we do apologize for the inconvenience caused. We are evaluating possible fallback scenarios in case a severe platform outage like this occurs again.

Happy packaging!

MyGet Update: version 1.3.0

We are happy to announce a new version of the MyGet Web site containing a set of improvements, fixes and updates.

The following work items have been addressed in this release:

Product

  • the number of plans has been reduced: we now have Free, Starter (formerly the Small plan), Professional (formerly the Large plan) and a new Enterprise plan
  • introduction of the new Enterprise plan focused on companies that require more performance, integration and security

Usability

  • improved layout and readability on smaller screen resolutions
  • auto-lower-cased routing for feed URLs (helps avoid issues caused by incorrect casing of the feed identifier)

Performance

  • upgraded to MVC4

Features

  • new identity provider: StackExchange
  • upgraded SymbolSource integration to benefit from their improved and freshly released API
  • Gallery feeds now support a README in Markdown format, available in the feed's Gallery Settings
  • Enterprise plan: upgraded administrative & quota management dashboard

More details about the new features will be covered in follow-up blog posts soon.
 
Happy packaging!

Planned maintenance July 26, 2012

We are planning a short maintenance window on Thursday, July 26, between 1:00 PM and 2:00 PM (CET), during which we will roll out an update to the MyGet Web site.

During that timeframe, we will disable the synchronization we have in place with SymbolSource. Synchronization will resume after our deployment.

We expect no other interruptions in service during this time frame, except for maybe a few seconds when switching VIPs (virtual IP addresses).

Stay tuned!