We have a really strange problem. One of the servers in the server farm becomes really slow. We see a number of timeouts in the logs, and overall response time is worse than it should be (and worse than on the other servers in the farm).

What is also strange is that it is not just the web app: just logging into the server takes up to 1.5 minutes before the desktop appears. Once you are in, the system is as responsive as ever, unless you try to launch something, e.g. Notepad: it takes another minute to launch, and after launch it works fine.

I checked a number of things: memory utilization is reasonable, CPU is below 15%, and neither the Windows handle counts nor the event logs show anything unusual.

Recycling the ASP.NET process does not fix it - it still takes over a minute to log in. Rebooting the server helped, but now it has started to slow down again.

After a closer look we found that the Windows Temp directory is full of temp files - over 65k of them. This is certainly something to take care of. But my question is: could it be the root cause of the sluggishness, or is there still something else lurking in the shadows?
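For anyone wanting to run the same check on other servers in a farm, here is a minimal sketch that counts the files sitting directly in a temp directory. It is an illustration, not what we actually ran: `count_files` is a hypothetical helper, and `tempfile.gettempdir()` returns the temp directory of the current user, which may not be the Windows Temp directory mentioned above, so pass the path you care about explicitly.

```python
import os
import tempfile


def count_files(directory):
    """Count regular files directly inside `directory` (non-recursive)."""
    count = 0
    # os.scandir streams directory entries, so it copes with very large directories.
    with os.scandir(directory) as entries:
        for entry in entries:
            if entry.is_file(follow_symlinks=False):
                count += 1
    return count


if __name__ == "__main__":
    # Assumption: the current user's temp directory is the one of interest.
    temp_dir = tempfile.gettempdir()
    print(f"{temp_dir}: {count_files(temp_dir)} files")
```

Comparing this number between the slow server and a healthy one is a quick way to see whether the temp directory is an outlier.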

Edit

After more digging I am zeroing in on the issue related to the size of temp directories. This article: describes something very similar. I am still not too sure because the fact that the server is slow to open even Notepad remains unexplained.

Is it possible that under such conditions creating a new temp file takes over a minute?
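One way to test this directly is to time temp file creation on the slow server and on a healthy one. The sketch below is an assumption-laden illustration: `time_temp_file_creation` is a hypothetical helper, and Python's `tempfile` module picks random file names itself rather than going through the Win32 `GetTempFileName` call, so an application that uses a different API to create temp files may behave differently from what this measures.

```python
import os
import tempfile
import time


def time_temp_file_creation(directory=None, samples=5):
    """Return the seconds taken to create each of `samples` temp files.

    `directory=None` means the default temp directory of the current user.
    Each file is deleted right after it has been timed.
    """
    timings = []
    for _ in range(samples):
        start = time.perf_counter()
        fd, path = tempfile.mkstemp(dir=directory)
        os.close(fd)
        timings.append(time.perf_counter() - start)
        os.remove(path)  # clean up so the test does not add to the clutter
    return timings


if __name__ == "__main__":
    for seconds in time_temp_file_creation():
        print(f"{seconds:.4f} s")
```

If the timings are similar on both servers, file creation through this path is probably not the bottleneck, though that alone would not rule out the temp directory as the cause.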

A: 

Did you check virtual memory as well? Paging? Does your app log a lot of data to different files? Also check whether the utilization is happening in kernel mode rather than user mode.

Dani
Yep - all of it. We took the server out of rotation; there was no activity on the web app, and even the process running the web app was no longer there, but it was still taking over a minute to log in.
mfeingold
A: 

You might want to check how many threads you're using in the ASP.NET thread pool when the timeouts occur. Another idea might be to look at the GC information in perfmon and see whether the GC is running gen2 collections.

Martin Clarke
We checked the number of threads (both managed and physical) as well as GC stats. They are very similar between the bad server and the good ones. Besides, nothing .NET-related explains the server's sluggishness AFTER the web app has shut down.
mfeingold
Might be worth asking on/moving to Server Fault.
Martin Clarke
A: 

OK, it is official: all of this grief was caused by this issue. When one of our servers was again behaving badly, we cleaned the temp directory and that fixed the problem, including the slow login.

This last part still baffles me - I do not understand how an excessive number of files in a temp directory can cause login to take over a minute, let alone launching a program. But whatever the mechanism is, clearing the directory fixed it, and I can live with that.
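To keep the directory from filling up again, a scheduled cleanup along these lines might help. This is only a sketch under assumptions, not the procedure we used: `remove_old_files` and the 7-day threshold are hypothetical, and it defaults to a dry run so you can review the list before anything is deleted.

```python
import os
import time


def remove_old_files(directory, max_age_days=7, dry_run=True):
    """Delete regular files in `directory` last modified over `max_age_days` ago.

    Returns the list of matching paths. With dry_run=True nothing is deleted.
    """
    cutoff = time.time() - max_age_days * 86400
    # Collect candidates first so the directory is not modified while it is scanned.
    candidates = []
    with os.scandir(directory) as entries:
        for entry in entries:
            try:
                is_old = entry.stat(follow_symlinks=False).st_mtime < cutoff
                if entry.is_file(follow_symlinks=False) and is_old:
                    candidates.append(entry.path)
            except OSError:
                continue  # entry vanished or is inaccessible; skip it

    removed = []
    for path in candidates:
        try:
            if not dry_run:
                os.remove(path)
            removed.append(path)
        except OSError:
            continue  # file is in use or access was denied; leave it alone
    return removed
```

Calling `remove_old_files(path)` first and inspecting the returned list, then repeating with `dry_run=False`, keeps the risk of deleting something still in use low.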

mfeingold
