On September 11 and 12, 2013, we have experienced some downtime. The website and back-end services have crashed a number of times across all instances, bringing our system to a halt a couple of times. We hate downtime as much as you do and want to apologize for the trouble this may have caused you. Let’s have a look at the symptoms and root cause.
Twitter and e-mail notified us of this a good 2 minutes before our monitoring kicked in. When we had a look at our status dashboards, we saw high CPU usage on all machines, inexplicable crashes of machines and an error message in the Windows event logs right before the machines died and restarted.
In our back-end queue, where all work is added and waits to be processed, we found 25 messages that seemed to be retried over and over again, having been dequeued over 150 times.
We have manually removed these messages from the queue and inserted them again after the website was running stable again. After monitoring the situation for a couple of hours, all systems seemed to be running stable again. Seemed…
A number of hours later, we received some additional monitoring events. The problem was back! This time we decided not to fire fight but dig deeper into the issue we were facing. We took the site offline and connected to the machines to analyze event logs and so on. Except for the error message in the Windows event log, nothing was going on. Next, hooked a debugger to it, waiting for anything to happen.
Perhaps a root cause analysis is not the best moment to blog about our new features and updates in our latest deployment, but we have to in this case. Last Tuesday, we deployed our 1.8 release which brings an update to retention policies. These allow MyGet to automatically delete packages from a feed after a certain number of packages has been added, an automatic feed cleanup as it were.
The updated retention policies feature now tries not to break a dependency chain. Before, packages that were depended on by other packages but were subject to the retention policy settings would be deleted. Now, the retention policy handler respects the entire dependency chain for a package and will no longer remove depended packages. This, of course, requires us to parse the entire dependency tree for a package.
Now what happens if package A depends on package B, B depends on C and C depends on B? Right: that’s a potential infinite loop. Guess what was happening: a StackOverflowException, crashing our entire machine without giving it the chance to write output to our error logs.
Because of the crash, the message in our backend was requeued over and over. We typically move a message to a dead letter queue after a number of retries (typically 32). However because of the crash, that logic couldn’t kick in and move the message out, resulting in the message being requeued over and over again, triggering the same behavior again.
While we test our application thoroughly before going to production, this was an edge case. Who would create a package with circular dependencies anyway? We found that on one feed, a circular package dependency did exist.
We have made a fix to our algorithm which now supports this scenario and will stop following the circular dependency.
After confirming our algorithm, we have deployed it at 6 AM (CET) this morning, resolving the issue. The messages that were stuck in our queue (and triggered the initial issue) have been requeued to confirm correct working and were all processed successfully.
Our status page can always be found at http://status.myget.org, showing uptime of our last months as well as the outage of the past hours.
Again, we do apologize for the inconvenience caused.