views: 370
answers: 4

So I have an issue in our production environment where 2 threads have been running for roughly 9 hours and 5 hours respectively, and they are keeping CPU usage pinned at around 99%.

I've included the stack traces from !clrstack and kb 2000 below. I've been trawling Google and elsewhere forever, and I can't find anything that helps me figure out what these threads are doing and why they are consuming so many resources.

0:048> !clrstack
OS Thread Id: 0x345c (48)
ESP       EIP     
01e5f068 7c8285ec [HelperMethodFrame_1OBJ: 01e5f068] System.Threading.WaitHandle.WaitOneNative(Microsoft.Win32.SafeHandles.SafeWaitHandle, UInt32, Boolean, Boolean)
01e5f114 792b687f System.Threading.WaitHandle.WaitOne(Int64, Boolean)
01e5f130 792b6835 System.Threading.WaitHandle.WaitOne(Int32, Boolean)
01e5f144 7a9390a2 System.Net.ConnectionPool.CleanupCallback()
01e5f154 7a938fc3 System.Net.ConnectionPool.CleanupCallbackWrapper(Timer, Int32, System.Object)
01e5f184 7aa97f5f System.Net.TimerThread+TimerNode.Fire()
01e5f1cc 7a584c84 System.Net.TimerThread+TimerQueue.Fire(Int32 ByRef)
01e5f20c 7a55db8b System.Net.TimerThread.ThreadProc()
01e5f25c 792d6cf6 System.Threading.ThreadHelper.ThreadStart_Context(System.Object)
01e5f268 792f5611 System.Threading.ExecutionContext.runTryCode(System.Object)
01e5f698 79e71b4c [HelperMethodFrame_PROTECTOBJ: 01e5f698] System.Runtime.CompilerServices.RuntimeHelpers.ExecuteCodeWithGuaranteedCleanup(TryCode, CleanupCode, System.Object)
01e5f700 792f5507 System.Threading.ExecutionContext.RunInternal(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object)
01e5f71c 792e0175 System.Threading.ExecutionContext.Run(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object)
01e5f734 792d6c74 System.Threading.ThreadHelper.ThreadStart()
01e5f960 79e71b4c [GCFrame: 01e5f960] 
01e5fc50 79e71b4c [ContextTransitionFrame: 01e5fc50] 



0:048> kb 2000
ChildEBP RetAddr  Args to Child              
01e5edf8 7c827cfb 77e6202c 00000001 01e5ee48 ntdll!KiFastSystemCallRet
01e5edfc 77e6202c 00000001 01e5ee48 00000000 ntdll!NtWaitForMultipleObjects+0xc
01e5eea4 79f4c88a 00000001 01e5f0e4 00000001 kernel32!WaitForMultipleObjectsEx+0x11a
01e5ef0c 79f4c4bb 00000001 01e5f0e4 00000001 mscorwks!WaitForMultipleObjectsEx_SO_TOLERANT+0x6f
01e5ef2c 79f4c5c4 00000001 01e5f0e4 00000001 mscorwks!Thread::DoAppropriateAptStateWait+0x3c
01e5efb0 79f4c659 00000001 01e5f0e4 00000001 mscorwks!Thread::DoAppropriateWaitWorker+0x13c
01e5f000 79f159e8 00000001 01e5f0e4 00000001 mscorwks!Thread::DoAppropriateWait+0x40
01e5f104 792b687f 00000000 00000000 00000000 mscorwks!WaitHandleNative::CorWaitOneNative+0x156
01e5f120 792b6835 00000000 00000000 7aa3488c mscorlib_ni+0x1f687f
01e5f138 7a9390a2 00000000 21b09738 01e5f168 mscorlib_ni+0x1f6835
01e5f14c 7a938fc3 041c7bcc 00000000 00000000 System_ni+0x4f90a2
01e5f178 7aa97f5f 041c7bcc 1b790a40 1b790a40 System_ni+0x4f8fc3
01e5f1c4 7a584c84 00000000 21b09738 01e5f224 System_ni+0x657f5f
01e5f204 7a55db8b 0a62018c 0574ea00 00000000 System_ni+0x144c84
01e5f254 792d6cf6 22124c7c 01e5f270 792f5611 System_ni+0x11db8b
01e5f260 792f5611 00000000 1b790a40 01e5f280 mscorlib_ni+0x216cf6
01e5f270 79e71b4c 00000000 00000000 01e5f300 mscorlib_ni+0x235611
01e5f280 79e821b1 01e5f350 00000000 01e5f320 mscorwks!CallDescrWorker+0x33
01e5f300 79e96501 01e5f350 00000000 01e5f320 mscorwks!CallDescrWorkerWithHandler+0xa3
01e5f444 79e96534 79241ff0 01e5f578 01e5f498 mscorwks!MethodDesc::CallDescr+0x19c
01e5f460 79e96552 79241ff0 01e5f578 01e5f498 mscorwks!MethodDesc::CallTargetWorker+0x1f
01e5f478 79f8a3e1 01e5f498 57d102af 1b790a40 mscorwks!MethodDescCallSite::CallWithValueTypes+0x1a
01e5f644 79f8a536 01e5f6d4 57d1021f 22124cc4 mscorwks!ExecuteCodeWithGuaranteedCleanupHelper+0x9f
01e5f6f4 792f5507 01e5f698 0574ea6c 06cc1310 mscorwks!ReflectionInvocation::ExecuteCodeWithGuaranteedCleanup+0x10f
01e5f710 792e0175 041c7828 01e5f76c 0574ea6c mscorlib_ni+0x235507
01e5f728 792d6c74 041c7828 00000000 1b790a40 mscorlib_ni+0x220175
01e5f740 79e71b4c 77e40000 00000000 01e5f7d0 mscorlib_ni+0x216c74
01e5f750 79e821b1 01e5f820 00000000 01e5f7f0 mscorwks!CallDescrWorker+0x33
01e5f7d0 79e96501 01e5f820 00000000 01e5f7f0 mscorwks!CallDescrWorkerWithHandler+0xa3
01e5f90c 79e96534 7924290c 01e5fa68 01e5f9a0 mscorwks!MethodDesc::CallDescr+0x19c
01e5f928 79e96552 7924290c 01e5fa68 01e5f9a0 mscorwks!MethodDesc::CallTargetWorker+0x1f
01e5f940 79f3d803 01e5f9a0 57d10fc3 1b790a40 mscorwks!MethodDescCallSite::CallWithValueTypes+0x1a
01e5fb28 79e9845f 01e5fe50 1b790a40 00000000 mscorwks!ThreadNative::KickOffThread_Worker+0x192
01e5fb3c 79e983fb 01e5fdc4 01e5fbc4 79f7759b mscorwks!Thread::DoADCallBack+0x32a
01e5fbd0 79e98321 01e5fdc4 57d108e7 1b790a40 mscorwks!Thread::ShouldChangeAbortToUnload+0xe3
01e5fc0c 79fd876a 01e5fdc4 1b790a40 01e5fccc mscorwks!Thread::ShouldChangeAbortToUnload+0x30a
01e5fc1c 79fd96f9 01e5fdc4 01e5fcc0 79f7759b mscorwks!Thread::RaiseCrossContextException+0x434
01e5fccc 79fd878b 00000003 79fd8756 01e5fdc4 mscorwks!Thread::DoADCallBack+0xda
01e5fce8 79e983fb 01e5fdc4 01e5fd70 79f7759b mscorwks!Thread::DoADCallBack+0x310
01e5fd7c 79e98321 01e5fdc4 57d10953 00000000 mscorwks!Thread::ShouldChangeAbortToUnload+0xe3
01e5fdb8 79e984ad 01e5fdc4 00000003 00000000 mscorwks!Thread::ShouldChangeAbortToUnload+0x30a
01e5fde0 79f3d5d4 00000003 79f3d6e9 01e5fe50 mscorwks!Thread::ShouldChangeAbortToUnload+0x33e
01e5fdf8 79f3d6ae 00000003 79f3d6e9 01e5fe50 mscorwks!ManagedThreadBase::KickOff+0x13
01e5fe94 79f92015 1bb9e468 80a5e56d 80865927 mscorwks!ThreadNative::KickOffThread+0x269
01e5ffb8 77e64829 0014d9c0 00000000 00000000 mscorwks!Thread::intermediateThreadProc+0x49
01e5ffec 00000000 79f91fcf 0014d9c0 00000000 kernel32!BaseThreadStart+0x34
A: 

If you can attach a debugger, the misbehaving thread will usually be the one that comes up when you 'Break All'.

Otherwise I would take a bunch of those thread-location snapshots and look for any threads that are consistently not in a wait (i.e. not sitting in WaitForMultipleObjectsEx). That should give you an idea of which threads are misbehaving and what code they're generally running.
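
One way to take those snapshots in WinDbg/cdb with SOS loaded (a rough sketch, not specific to your dump) would be:

    * user-mode CPU time per thread; the spinning threads float to the top
    !runaway
    * managed stack for every thread
    ~*e !clrstack
    * native stack for every thread
    ~*kb

Repeat that a few times, a few seconds apart; a thread that keeps showing the same non-wait frames in each snapshot is the one burning the CPU.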

And make sure you don't have any code like:

while(1)
  ;

:)

Aaron
I know the 2 threads causing the issue. When I do !runaway they are at the top of the list:

!runaway
 41:3420   0 days 5:31:57.781
 48:345c   0 days 1:23:23.421

But I'm at a loss figuring out why those 2 threads are sticking around, what exactly they are doing, and what line of code created them. If there's anything else you know that I can do to figure out what created these threads, that'd be awesome :-)
Shane
+1  A: 

You can always stop the process with the debugger and check the stack traces a couple of times. If a thread is often not idling, and is in the same spot each time, you'll know a bit more about where it spends all its time.

In the stuff you pasted I only see the stack trace for one thread; can you get the stack traces for all threads? (Sorry if this is way off, I'm mostly used to doing this on Unix.)

Mattias Nilsson
Yeah, I've done that. All the other threads are running, exiting, and doing what they do just fine. It's just these 2 threads that are sticking around and using up 100% of the CPU. The site is currently load balanced and I have all traffic directed to a different site, so the offending site has zero load right now and only these 2 threads are active and using up resources.
Shane
A: 

Use ProcDump to get a memory dump while the CPU is spiking. Then check the call stacks of all the threads. Also run perfmon and keep checking for the thread that is using up most of the CPU. Hope this helps.
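
For reference, a typical ProcDump command line for a CPU-triggered dump (a sketch; the PID, threshold, and counts are placeholders to adjust) looks something like:

    procdump -ma -c 90 -s 10 -n 3 <pid>

Here -ma writes a full memory dump, -c and -s trigger when CPU stays above 90% for 10 consecutive seconds, and -n 3 collects three dumps so you can compare the stacks across snapshots.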

Naveen
Also check the % Time in GC counter in perfmon.
Naveen
Yeah, I have a memory dump. The CPU is spiking constantly; it's just sitting at 99 percent. I used perfmon to check the running threads and there are just 2 threads that are running busily (the !clrstack output for those is posted above). Each thread has the exact same memory dump and !clrstack output (posted above), and each thread is taking up 50 percent of the CPU respectively. But I can't figure out what to do next to get any hints about what these two threads are doing or where they came from.
Shane
% Time in GC isn't that high; it stays around 0.2, which I don't think is a problem...
Shane
I only see the call stack of thread 48. What about the other one?
Naveen
Yeah, they both look identical, so I figured I'd just post one of them.
Shane
It looks exactly the same as any of the other threads waiting around, except for these frames:

1789e944 7a9390a2 System.Net.ConnectionPool.CleanupCallback()
1789e954 7a938fc3 System.Net.ConnectionPool.CleanupCallbackWrapper(Timer, Int32, System.Object)
1789e984 7aa97f5f System.Net.TimerThread+TimerNode.Fire()
1789e9cc 7a584c84 System.Net.TimerThread+TimerQueue.Fire(Int32 ByRef)
1789ea0c 7a55db8b System.Net.TimerThread.ThreadProc()

So that's where I'm at.
Shane
+2  A: 

Alright, so I found the issue. I did a !clrstack -p and then a !do on the System.Net objects on the stack, and it revealed that the offending thread was working with a System.Net.ServicePoint pointed at our SMTP server.
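
For anyone following along, the sequence in WinDbg was roughly this (a sketch; the thread number matches the dump above, but the object address has to come from your own !clrstack -p output):

    * switch to the busy thread and show managed frames with argument addresses
    ~48s
    !clrstack -p
    * dump the object behind one of the System.Net arguments
    !do <object address from the -p output>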

I googled around and found that this is the issue: http://connect.microsoft.com/VisualStudio/feedback/ViewFeedback.aspx?FeedbackID=337557 (also described here: http://www.vbforums.com/showthread.php?t=584384). It's an issue with the ServicePoint not correctly sending the QUIT command and disconnecting, which they are going to resolve in .NET 4.0.

Now I just have to code in the workarounds to ensure the ServicePoint closes its connection, and that should work out.

Thanks for everyone's help.

Shane
+1 ... thanks for following up!
Adam Liss
No problem. And if anyone is curious what the magic fix was, I found 2 ways:
1) Just use IIS pickup.
2) client.ServicePoint.MaxIdleTime = 1; client.ServicePoint.ConnectionLimit = 1;
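
In context, a minimal C# sketch of those two workarounds (the host, port, and addresses are placeholders, and using SmtpClient.ServicePoint follows the snippet above):

    using System.Net.Mail;

    class SmtpWorkaroundSketch
    {
        static void Main()
        {
            SmtpClient client = new SmtpClient("smtp.example.com", 25);  // placeholder host/port

            // Workaround 1: hand mail to the local IIS pickup directory instead of
            // talking SMTP directly, so no pooled connection is left idling.
            // client.DeliveryMethod = SmtpDeliveryMethod.PickupDirectoryFromIis;

            // Workaround 2: keep the SMTP connection from sitting idle in the pool.
            client.ServicePoint.MaxIdleTime = 1;      // close idle connections after 1 ms
            client.ServicePoint.ConnectionLimit = 1;  // don't keep extra pooled connections

            client.Send(new MailMessage("from@example.com", "to@example.com", "subject", "body"));
        }
    }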
Shane