Hi all,

We're running a custom application on our intranet, and after a recent upgrade we've found a problem where IIS hangs with 100% CPU usage, requiring a restart.

Rather than subject users to the hangs, we've rolled back to the previous release while we determine a solution. The first step is to reproduce the problem -- but we can't.

Here's some background:

Prod has a single virtualized (VMware) web server with two CPUs and 2 GB of RAM. The database server has 4 GB and two CPUs as well. It's also on VMware, but on separate physical hardware.

During normal usage the application runs fine. The w3wp.exe process normally uses between 5-20% CPU and around 200 MB of RAM. CPU and RAM fluctuate slightly under normal use, but nothing unusual.

However, when we start running into problems, RAM climbs dramatically and the CPU pegs at 98% (or as much as it can get). The site becomes unresponsive, necessitating an IIS restart. Recycling the app pool does nothing in this situation; a full IIS restart is required.
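Since the hang is unpredictable, it can help to have a watchdog catch the state at the moment the CPU pegs rather than after someone notices. Below is a minimal Python sketch of the idea, with a canned sampler standing in for a real counter read; the sampler and trigger action are assumptions, not your setup. In practice you would sample the w3wp.exe "% Processor Time" counter (e.g. via `typeperf`) and have the trigger launch a dump tool such as Sysinternals procdump or adplus before restarting IIS, so there is something to analyze offline.

```python
def watch(sample_cpu, threshold=90.0, sustain_samples=6, on_trigger=lambda cpu: None):
    """Poll sample_cpu() until it returns None (process gone) or the CPU
    stays at/above threshold for sustain_samples consecutive samples,
    then fire on_trigger once. Returns "triggered" or "exited"."""
    consecutive = 0
    while True:
        cpu = sample_cpu()
        if cpu is None:
            return "exited"
        consecutive = consecutive + 1 if cpu >= threshold else 0
        if consecutive >= sustain_samples:
            on_trigger(cpu)
            return "triggered"

# Demo with a canned list of readings; a real sampler would read the
# perf counter and sleep between polls, and on_trigger would capture a dump.
readings = iter([12.0, 95.0, 40.0, 96.0, 97.0, 98.0, 99.0, 99.0, 99.0])
result = watch(lambda: next(readings, None), sustain_samples=5)
# result == "triggered": five consecutive readings were at/above 90
```

Requiring the spike to be sustained avoids firing on the normal short CPU bursts you see under load.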

It does not happen during the night (no usage). It happens more often when the site is under load, but it has also happened during off-peak periods.
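That pattern (correlated with load but not reliably caused by it) is typical of a concurrency race: the bug depends on a specific interleaving of requests, not on total throughput, which is also why a load test can exceed production volume and still miss it. A tiny Python sketch of the idea (illustrative only, nothing to do with your actual code) showing the same two increments either landing or losing an update depending purely on interleaving:

```python
def worker(counter):
    """One read-modify-write step, split so a 'scheduler' can interleave."""
    v = counter["value"]       # read
    yield                      # possible preemption point
    counter["value"] = v + 1   # write back

def run(counter, schedule):
    """Drive two workers in the order given by schedule (list of 0/1)."""
    threads = [worker(counter), worker(counter)]
    for i in schedule:
        try:
            next(threads[i])
        except StopIteration:
            pass

# Serial interleaving: thread 0 finishes before thread 1 starts.
serial = {"value": 0}
run(serial, [0, 0, 1, 1])
# serial["value"] == 2: both increments land

# Overlapped interleaving: both threads read 0, both write back 1.
overlapped = {"value": 0}
run(overlapped, [0, 1, 0, 1])
# overlapped["value"] == 1: one update is lost
```

The practical upshot: a replay tool that preserves request overlap (many identical requests in flight at once) has a better chance of hitting the window than one that merely matches average requests per second.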

The first step to solving this problem is reproducing it. We started using JMeter to simulate usage, with a load script based on actual usage around the time of the crash. Using JMeter, we can ramp the load up quite high (2-3 times the load during the crash) and the site still behaves fine: CPU is high and the site does become sluggish, but memory usage is reasonable and nothing hangs.

Does anyone have any tips on how to reproduce a problem like this in a non-production environment? We'd really like to reproduce the error, determine a solution, then test again to make sure we've resolved it. During the process we've found a number of small things that we've improved that might solve the problem, but I'd really feel a lot more confident if we could reproduce the problem and test the improved version.

Any tools, techniques or theories much appreciated!



I encountered something very similar earlier this year while working on the QA team. We had to drop a debug version of our program (along with some extra logging messages crafted specifically for this purpose) onto the server to catch it. It turned out to be a hardware issue: they had swapped out the dual-core processor for a quad-core, and that was causing timing issues. We were under the impression that they were using our standard server model.

Ed Swangren

You can find some information about troubleshooting this kind of problem at this blog entry. Her blog is generally a good debugging resource.

Curt Hagenlocher

I have an article about debugging ASP.NET in production which may provide some pointers.

Jeff Atwood

I'm assuming you have logging capabilities from IIS and/or your application?

When the problems start, what exactly is happening that causes this? How is this different to what your load-testing simulates?

Andrew Grant

Is your test environment really the same as live? i.e. two separate VM instances on two physical servers, with the same network connection and account types?

Are there any other instances on the database server?

Are there any other web applications in IIS?

Is the .NET config right?

Is the app pool configured with the right service accounts? Take a look at this: MS article on optimising IIS 6 for performance.

Lots of tricks.


I have the same problem. Take a look at your VMware performance logs and watch the memory over a month or two: for us this happens every 8-10 days, and when we had less RAM it was every six days. It looks like there is a memory leak. We had the exact same website running on a real physical server with no problems, so it appears to be an issue with IIS and VMware.
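If it is a slow leak, the performance logs should show a roughly linear memory climb between restarts, and you can extrapolate when the box will fall over. A small Python sketch of that check (the sample readings are made up for illustration, not measurements from either site):

```python
def leak_rate(samples):
    """Least-squares slope of (day, MB) samples: MB leaked per day."""
    n = len(samples)
    sx = sum(d for d, _ in samples)
    sy = sum(m for _, m in samples)
    sxx = sum(d * d for d, _ in samples)
    sxy = sum(d * m for d, m in samples)
    return (n * sxy - sx * sy) / (n * sxx - sx * sx)

def days_until(samples, limit_mb):
    """Rough days until memory hits limit_mb at the current trend."""
    last_day, last_mb = samples[-1]
    return (limit_mb - last_mb) / leak_rate(samples)

# Hypothetical daily w3wp working-set readings as (day, MB) pairs
samples = [(0, 200), (1, 350), (2, 500), (3, 650)]
# leak_rate(samples) == 150 MB/day; from 650 MB, a 2048 MB box
# fills in about 9.3 more days at this trend
```

If the estimated interval matches your observed 8-10 day crash cycle, that's decent evidence the hang is memory exhaustion rather than a CPU bug, and doubling RAM should stretch the interval proportionally, which is a cheap way to test the theory.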

To answer your question: try running your load tool for a while (weeks) and set up tracing on the test server. If you find a solution, please post it; I have gotten nowhere.