Production 100% CPU - A cautionary tale

Updated 18.8.2016: Added details and clarifications.

I recently had a case where I had to figure out a really weird intermittent problem in a production environment, one that had been pestering the system for months and had some strange qualities to it.

Symptoms

The infrastructure consisted of two load-balanced front end machines, each running an EPiServer site.

Now, both front ends were actually running several EPiServer sites - nearly ten of them. They all ran under a single application pool, but each was set up as its own site.

Both of them were also monitored with New Relic.

Now, the symptoms would only present themselves as the CPUs going to 100% at seemingly random times, staying there for a really long time and then eventually dropping back down on their own. The problem had been in production for months, and as the system was changed during that time, the spikes began to recover slightly faster. There would be constant smaller CPU spikes, and, less often, really heavy spikes that lasted longer.

This problem would also only present itself on one of the front end machines, not both.

Trying to find issues with the system

When thinking about what could be causing a problem like this, one has to consider a lot of things.

In our case, New Relic showed that the intermittent CPU spikes had not always been there (not even tiny spikes following a similar pattern); they had started on their own, on a fairly random day, out of nowhere.

My first point of interest was to find out whether anything external had changed in the infrastructure or the software around the time the problem began. This included requesting system change logs and reports from the hosting provider. I also looked into what had gone to production for the sites during and around the time the problems began.

There was nothing at all. Nothing had been changed, and nothing had been put into production anywhere near that time. Dead end.

The next trail was to figure out whether any content had been changed on the site - whether a feature had been taken into use at that time, or a piece of content on a high-traffic page had been drastically changed. Also a dead end. Absolutely nothing in the logs suggested that the content had been drastically modified.

Next I figured I needed to check external things, such as Google Tag Manager, for clues that a script affecting performance had been added to the site. Unlikely, given how GTM is built, but still worth checking out. Dead end.

During all this prodding and poking at the system, I formed as a side effect some idea of what the peak hours were and how the site behaved under stress. It all pointed at a server that just wasn't coping.

Running out of ideas

I noticed that the only thing out of the ordinary on the day the intermittent issues began was a heavy traffic spike out of the blue, and I started thinking about what the effects of such a spike could be.

Before all this, a change had been introduced on the site in how a huge chunk of data was cached for the most data-intensive operation the site had. Previously, all that data was loaded at once when the first request hit the site. This gave the first user absolutely abysmal load times, but it would not become a problem again during the day. The cache time for all this data was the whole day, so in practice it never expired.

Later this was changed so that each block of data is loaded as it is needed and cached in smaller chunks for 15 minutes. At worst there could be anywhere from 5 to 10 big blocks of this data on a single high-traffic page, and those blocks would be loaded asynchronously as the first user scrolled onto them.

All in all, a good change which made a huge impact on how quickly the site could recover from being rebooted.
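
For illustration, a minimal sketch of this kind of lazily loaded, per-block 15-minute cache might look like the following. This is my own sketch - the block key and loader delegate are made up, and the real site used EPiServer's caching APIs rather than exactly this - but the shape of the idea is the same.

```csharp
using System;
using System.Runtime.Caching;

public static class BlockDataCache
{
    private static readonly MemoryCache Cache = MemoryCache.Default;

    // Loads one block of data on demand and caches it for 15 minutes.
    // "loadBlock" stands in for the expensive data operation described above.
    public static T GetOrLoad<T>(string blockKey, Func<T> loadBlock) where T : class
    {
        var cached = Cache.Get(blockKey) as T;
        if (cached != null)
        {
            return cached;
        }

        var data = loadBlock();
        Cache.Set(blockKey, data, DateTimeOffset.UtcNow.AddMinutes(15));
        return data;
    }
}
```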

We could see from the New Relic transaction logs that this operation, even when asynchronous, still took the most time by a long shot - 40% of the load time was spent here.

Having hit so many dead ends, I decided to move on and instead profile the site at runtime. I'd use perfmon and observe the garbage collector in action, trying to find a correlation between the CPU spikes and the GC. I'd also monitor the number of deadlocked threads and things like that. Aside from some really random spikes in the GC CPU percentage, there was nothing significant here.
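
If you want to script that kind of observation instead of watching the perfmon UI, a sketch along these lines works. The counter category and name are the standard .NET ones, but the "w3wp" instance name is an assumption and has to match your actual worker process instance.

```csharp
using System;
using System.Diagnostics;
using System.Threading;

class GcMonitor
{
    static void Main()
    {
        // "% Time in GC" from the ".NET CLR Memory" category - the same counter perfmon shows.
        // On a server running several worker processes the instance name may be
        // "w3wp#1", "w3wp#2" and so on instead of plain "w3wp".
        using (var gcTime = new PerformanceCounter(".NET CLR Memory", "% Time in GC", "w3wp"))
        {
            while (true)
            {
                Console.WriteLine($"{DateTime.Now:HH:mm:ss}  % Time in GC: {gcTime.NextValue():F1}");
                Thread.Sleep(5000);
            }
        }
    }
}
```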

I set up Windows Performance Analyzer / Recorder on the server and started collecting high CPU profiles and dumps. This ended up being by far the most useful tool for analyzing the performance of the sites.

After collecting multiple separate dumps, I had what looked like multiple separate reasons for the CPU ending up spiking at 100%. Some of the profiles showed most of the time being spent in the garbage collector; in others the time went into loading the blocks of data on the website. Very random.

99 problems...

After stepping back and looking at the dumps and the perfmon data together, a pattern emerged in how the CPU would spike. It seemed random, but whenever one of the smaller CPU spikes occurred, it was still in the ballpark of about 15 minutes after the previous one - and never under 15 minutes. That 15 minutes also happened to be the expiration time of the caches for the data-intensive blocks of data.

Note that there were two different types of CPU spikes occurring. The one occurring every 15 minutes or so was slightly lower and short in duration, and then approximately twice a day there would be a more devastating spike, fully at 100%, which would kill performance completely and push page load times up to tens of seconds.

My theory on why the lower type of CPU spike didn't occur at exactly every 15 minutes, but only around that mark, was that the GC only collects expired items from memory when it has to - basically when it needs more space - not the moment the content expires, at least in theory.

I asked myself:

“What if a person loaded the most data-intensive, not-yet-cached page on the site and scrolled down it really fast, causing all the blocks of data to be cached at the same time - and to expire at the same time?”

You can see I was reaching here, but still: the more traffic the site had, the more common that problem could become.

It shouldn't have been able to become a problem, though, because the website was restarted automatically at night, during off-peak times, and users then slowly trickled back to the site during the morning hours, giving all the content seemingly random cache times that would never end up overlapping.

I started examining the caches more closely anyway, because of the clear correlation with those cache times. Perfmon has nice counters for this. They didn't directly point at a problem, just hinted once again that there was a problem lurking - not the smoking gun I was looking for, though.

At this point I had built a solid understanding of the system, and the feeling that it wasn't exactly thriving grew stronger. I was almost ready to conclude that it was indeed just too little hardware for this software to be running on, but I felt I absolutely needed to find actual proof of that.

Then a breakthrough came when I took the timestamps of each of the really high CPU spikes (the devastating ones) from New Relic and went through the application logs at exactly those times. I could see that ASP.NET had initiated a shutdown at those times due to a configuration change. Wait, what?

This boggled my mind, because it should have been the first thing to look at (shame on me) - but also, just why was this happening?
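
If you want these shutdowns to show up in your own logs instead of having to dig for them, ASP.NET will tell you the reason it is unloading the app domain. A minimal sketch in Global.asax - the log file path is hypothetical, and any logging framework will do:

```csharp
using System;
using System.IO;
using System.Web;
using System.Web.Hosting;

public class Global : HttpApplication
{
    protected void Application_End()
    {
        // HostingEnvironment.ShutdownReason says why ASP.NET is tearing the app domain
        // down - for example ConfigurationChange when something touches web.config.
        var reason = HostingEnvironment.ShutdownReason;

        // Hypothetical log target; replace with your actual logging framework.
        File.AppendAllText(@"C:\logs\app-shutdowns.log",
            $"{DateTime.Now:O} Application_End, reason: {reason}{Environment.NewLine}");
    }
}
```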

It turned out that one of these super high CPU spikes landed spot on during peak hours.

During peak hours...

And that’s when all the clues above came together and hit me:

1. The site would restart and lose all of its caches at the same time, during peak hours.

2. After the restart, the CPU would spike just from trying to get back up to speed, because there's an absolute ton of users hammering the site - and here's the kicker:

3. All the caches would be filled at the exact same time. That also made them all expire 15 minutes later at the same time, and then get reloaded at the exact same time again, in a loop - forever, or at least until the user count dropped significantly. One massive CPU spike followed by smaller ones every 15 minutes?

The perfect storm.

Looking at the dumps and at perfmon with this theory in mind, and from this perspective, it all made a lot of sense.

So why was the site restarting at this peak hour? After a bit of googling around, it turned out that some backup software has had problems with touching web.config files when taking a backup - but I knew for a fact that no backups were taken during peak hours, for obvious reasons.

Then I figured the only other thing it could be was the antivirus software. I ran the virus scan manually on the server, against the site files, and guess what?

It restarted.

Smoking gun, right there. I checked the schedule of the virus scan, and it matched the timing spot on.

A little bit of googling later, I found this:

http://forums.iis.net/post/1926721.aspx

So the hosting people were informed, they patched McAfee, and that problem went away - and so did the high CPU spikes.

Unbelievable? Believe it.

Lessons learned

Enough things had to go wrong for such behaviour to occur, and it took only one link in the chain being fixed to resolve the problem. There is still an issue with the blocks of data having equal cache times and needing optimisation, but it no longer has the right conditions to become a problem, and it can be managed.
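
One common way to keep those block caches from ever lining up again - not something we ended up needing here, but worth sketching as an illustration - is to add a little random jitter to each block's cache duration, so nothing expires in lockstep. A minimal sketch, with the same caveats as the earlier caching example (illustrative names, plain MemoryCache rather than EPiServer's own APIs):

```csharp
using System;
using System.Runtime.Caching;

public static class JitteredBlockCache
{
    private static readonly MemoryCache Cache = MemoryCache.Default;
    private static readonly Random Rng = new Random();

    // Caches a block for roughly 15 minutes plus 0-120 seconds of random jitter,
    // so blocks that get filled at the same moment no longer expire at the same moment.
    public static T GetOrLoad<T>(string blockKey, Func<T> loadBlock) where T : class
    {
        var cached = Cache.Get(blockKey) as T;
        if (cached != null)
        {
            return cached;
        }

        int jitterSeconds;
        lock (Rng)
        {
            jitterSeconds = Rng.Next(0, 121);
        }

        var data = loadBlock();
        Cache.Set(blockKey, data, DateTimeOffset.UtcNow.AddMinutes(15).AddSeconds(jitterSeconds));
        return data;
    }
}
```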

Any production environment is a time bomb. At some point, you're going to have to optimise it.

Looking at what I describe here in hindsight, it is clear that the data I gathered pointed to a system that simply had problems providing adequate resources for the application to run efficiently. That alone would probably have been manageable, but add the problem with the antivirus software on top of it and you end up where we did. When the problems look random, it's definitely a sign that the system needs optimisation or more hardware. If the problem is consistent and can be reproduced, then you know it's a simple bug or an optimisation point. Always remember that the level of randomness in the symptoms is the best clue as to what is going on.

We also moved all the EPiServer sites under one site, so a single application pool and a single site now run all the domains. This meant a lot less GC churn, because the caches are shared across all the domains where the data is the same. This also turned out to ease the CPU problems significantly - this change alone would probably have been enough to make the whole situation resolve itself.

Another thing when it comes to this sort of problem is ASP.NET session state, which essentially locks a single user to a single thread and stops them from using your API endpoints truly asynchronously. Always consider using async action methods, or disabling session state in data-intensive places, to allow your application to serve data without waiting for a thread to free up. If New Relic is showing you that an API endpoint is reporting slow response times, drill into it to find out whether it's waiting on a thread to free up - and if so, disable session state or make the action async to optimise it.
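
In ASP.NET MVC, a minimal sketch of both options looks roughly like this - the controller, action and LoadBlockAsync helper are made up for the example:

```csharp
using System.Threading.Tasks;
using System.Web.Mvc;
using System.Web.SessionState;

// Opt this controller out of session state entirely so its requests are not
// serialised behind the per-session lock. SessionStateBehavior.ReadOnly is the
// milder option if the actions still need to read session data.
[SessionState(SessionStateBehavior.Disabled)]
public class BlockDataController : Controller
{
    // Async action method: the request thread is released while the data loads.
    public async Task<ActionResult> GetBlock(string blockKey)
    {
        // LoadBlockAsync is a hypothetical stand-in for the expensive
        // block-loading operation discussed above.
        var data = await LoadBlockAsync(blockKey);
        return Json(data, JsonRequestBehavior.AllowGet);
    }

    private Task<object> LoadBlockAsync(string blockKey)
    {
        // Placeholder implementation for the sketch.
        return Task.FromResult<object>(new { key = blockKey });
    }
}
```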

If you know exactly what goes into your production environment at all times, you will always know what causes a problem. If you buy hosting from another provider, and they silently apply a broken patch to the antivirus software without a trace, then you're left to figure it out the hard way.