Some background info:
- We have several websites running on a 64-bit machine with IIS6
- These websites all have the same core code, but different skins and content
- We have a SQL 2005 database which is fairly heavily used throughout the site
- Historically we've used SQL stored procs, but have been gradually transitioning to NHibernate. The majority of our code uses NHibernate now, but not all.
- These sites have been running fine on our live web server for a while, although we get a few errors a day regarding SQL connectivity / deadlocking.
Last Thursday we noticed the sites going very slow, then checking task manager revealed one of the websites was hogging over 1.6Gb of memory. Ever since then we've been restarting the app and watching it slowly increase in size over the course of the day.
We apparently have a memory leak (or at least, that's the effect), but I'm losing hair trying to work out how to trace it.
It only appears to be happening on this one website, even though as far as I am aware nothing had changed in the code before it started happenning. It is, however, our busiest website so it could be a traffic issue.
Debug Diagnostics hasn't revealed any issues.
Refreshing certain pages very quickly causes the memory to jump up rapidly, then fall slightly, but all the time the gradual progression is upwards.
I cannot replicate the issue on our test servers or locally. Probably because the traffic has something to do with it.
My suspicion is that the problem lies in database connectivity / locking. However, I'm not sure how that would cause the problem specified.
Any ideas?
Edit
Okay so not exactly sure I've found the problem but we're getting closer. It's definately SQL related. The error log reveals lots of errors since last thursday.
It all happened after we ran some windows updates on our servers. One of the updates failed on the SQL server so not sure if this caused some problems.
The warnings we're getting are:
- SQL Server has encountered XX occurence(s) of I/O requests taking longer than 15 seconds to complete on file .. tempdb.mdf
Where XX is anything between 17 and 90! Does that sound like a deadlocking issue?
Followed by the following erors:
- Unable to complete login process due to delay in opening server connection
These coincide with our log times for when the websites have been "blipping".
We've increased the page file size on SQL server to the recommended size, as it was set to a max of 4Gb, but recommended was 12Gb. I think we may need to roll back the windows updates we did on Thursday if that doesn't fix it.
Unfortunately I can't get into Activity monitor as it tells me Timeout expired!
Edit
Okay after a reboot I'm into Activity monitor. How many sleeping processes would you say would be normal? We have roughly 127 sleeping. That's serving over 10 websites.
If there is a deadlock or timeout issue, will NHibernate not clean up its connections properly?