views:

671

answers:

4

I have a multithreaded .NET Windows Service that hangs intermittently -- maybe once every two weeks of 24/7 operation. When the hangs occurs the threadpool is completely saturated because calls to our custom tracelistener start blocking for some reason. There aren't any locks in the offending code nor anything blocking according to windbg, but they're definitely blocking somewhere. There aren't any exceptions on the stack either. There is a Thread.Sleep(1) that will occasionally be hit in the BufferedStream.Write code, but my question is what is the ReOpenMetaDataWithMemory, CreateApplicationContext, and DllCanUnloadNow mean?

Nearly all of the 2000 hung up worker threads (not normal operation!) on the ThreadPool have a stack similar to the following:

0:027> !dumpstack
OS Thread Id: 0x1638 (27)
Child-SP         RetAddr          Call Site
000000001d34df58 0000000077d705d6 ntdll!ZwDelayExecution+0xa
000000001d34df60 000006427f88901d kernel32!SleepEx+0x96
000000001d34e000 000006427f454379 mscorwks!DllCanUnloadNowInternal+0xf53d
000000001d34e080 000006427fa34749 mscorwks!CreateApplicationContext+0x41d
000000001d34e0e0 0000064280184902 mscorwks!ReOpenMetaDataWithMemory+0x1ff59
000000001d34e290 0000064280184532 Company_Common_Diagnostics!Company.Common.Diagnostics.BufferedStream.Write(Byte[], Int32, Int32)+0x1b2
000000001d34e300 00000642801831fd Company_Common_Diagnostics!Company.Common.Diagnostics.XmlRollingTraceListener+TraceWriter.Write(System.String)+0x52
000000001d34e350 00000642801b3304 Company_Common_Diagnostics!Company.Common.Diagnostics.XmlRollingTraceListener.InternalWrite(System.Text.StringBuilder)+0x3d
000000001d34e390 0000064274e9d7ec Company_Common_Diagnostics!Company.Common.Diagnostics.XmlRollingTraceListener.TraceTransfer(System.Diagnostics.TraceEventCache, System.String, Int32, System.String, System.Guid)+0xc4
000000001d34e410 00000642801b2f59 System_ni!System.Diagnostics.TraceSource.TraceTransfer(Int32, System.String, System.Guid)+0x2ec
A: 
A: 

You're right.

There are about 2000 hung up threads and the stack trace on them is nearly identical. There is a KeepAlive function in the code that dings every 1 second. There is a monitor.tryenter around the ding code. If it can't acquire the lock it emits a trace. That's what all these threads are doing. The thread that has the lock (18) has the same stack trace pretty much.

Is there a way to see local variables in BufferedStream class to see if it's stuck in a loop?

0:047> kb
RetAddr           : Args to Child                                                           : Call Site
00000000`77d705d6 : ffffffff`fffffffe 00000000`00000000 00000000`00000000 00000000`1ef7f160 : ntdll!ZwDelayExecution+0xa
00000642`7f88901d : 00000000`00000029 00000642`00000001 00000000`00000000 00000000`1c00f820 : kernel32!SleepEx+0x96
00000642`7f454379 : 00000000`00c70128 00000642`7fa2b159 00000000`00000001 00000000`00000001 : mscorwks!EESleepEx+0x2d
00000642`7fa34749 : 06000000`00000001 00000000`1c00f820 00000000`00c6ff08 00000000`00000001 : mscorwks!Thread::UserSleep+0x71
00000642`80184902 : 00000000`00000001 00000642`782f1950 00000000`015695e0 00000000`00000000 : mscorwks!ThreadNative::Sleep+0xf9
00000642`80184532 : 00000000`00c6fe90 00000642`783924e2 00000000`00db11d8 00000000`00000008 : 0x642`80184902
00000642`801831fd : 00000000`00c6fe90 00000000`00000008 00000642`80043680 00000000`00c02d58 : 0x642`80184532
00000642`74e9e494 : ffffffff`fffffffe 00000000`00000001 0000c0c9`0263a71f 00000642`7f361300 : 0x642`801831fd
00000642`8018d3d3 : 00000000`00c02d58 00000000`00000008 00000000`00000000 00000000`00db11d8 : System_ni+0x61e494
00000642`782f173b : 00000000`00c036a0 00000642`7834901c 00000642`78431598 00000000`77ef9971 : 0x642`8018d3d3
00000642`7834d696 : 00000000`00a061b0 00000000`00400080 00000000`00000000 00000642`7f529408 : mscorlib_ni+0x2f173b
00000642`7f602672 : 00000000`00db0cb8 000007ff`7d370000 000007ff`7d370000 000007ff`fffdb000 : mscorlib_ni+0x34d696
00000642`7f50c053 : 00000642`7f39f3f6 00000000`77dbc178 00000000`00000000 00000000`00000000 : mscorwks!CallDescrWorker+0x82
00000642`7f51449a : 00000000`1ef7f9f8 ffffffff`fffffffe 00000000`00000000 00000000`77d6f447 : mscorwks!CallDescrWorkerWithHandler+0xd3
00000642`7f46b023 : 00000000`00000001 00000000`77d6f491 00000000`00000000 00000000`1ee81000 : mscorwks!DispatchCallDebuggerWrapper+0x3e
00000642`7f41729e : 00000000`1ef7fb88 00000000`1c00f801 00000000`00000001 00000000`1c00f820 : mscorwks!DispatchCallNoEH+0x5f
00000642`7f447ee8 : 00000000`00db0cb8 00000000`00db0cb8 00000000`00000001 00000642`7f50946a : mscorwks!AddTimerCallback_Worker+0x92
00000642`7f556aa9 : 00000000`00000001 00000000`00000000 ffffffff`fffffffe 00000000`1ef7fba8 : mscorwks!Thread::DoADCallBack+0x488
00000642`7f43afdd : 00000000`001597d0 00000000`1c00f820 00000000`1ef7fab0 00000000`0019fcb0 : mscorwks!CNgenEntryBind::Create+0x15d
00000642`7f435296 : 00000000`1ef7fba8 ffffffff`ffffffff 00000000`1c00f820 00000642`7f885bdb : mscorwks!MethodTable::IsAbstract+0x49
0:047> !dumpstack
OS Thread Id: 0x15b0 (47)
Child-SP         RetAddr          Call Site
000000001ef7f138 0000000077d705d6 ntdll!ZwDelayExecution+0xa
000000001ef7f140 000006427f88901d kernel32!SleepEx+0x96
000000001ef7f1e0 000006427f454379 mscorwks!EESleepEx+0x2d
000000001ef7f260 000006427fa34749 mscorwks!Thread::UserSleep+0x71
000000001ef7f2c0 0000064280184902 mscorwks!ThreadNative::Sleep+0xf9
000000001ef7f470 0000064280184532 Company_Common_Diagnostics!Company.Common.Diagnostics.BufferedStream.Write(Byte[], Int32, Int32)+0x1b2
000000001ef7f4e0 00000642801831fd Company_Common_Diagnostics!Company.Common.Diagnostics.XmlRollingTraceListener+TraceWriter.Write(System.String)+0x52
000000001ef7f530 0000064274e9e494 Company_Common_Diagnostics!Company.Common.Diagnostics.XmlRollingTraceListener.InternalWrite(System.Text.StringBuilder)+0x3d
000000001ef7f570 000006428018d3d3 System_ni!System.Diagnostics.TraceSource.TraceEvent(System.Diagnostics.TraceEventType, Int32, System.String)+0x2b4
000000001ef7f620 00000642782f173b Company_Common_Routing_Service!Company.Common.RouterService.KeepAlive(System.Object)+0x2a3
000000001ef7f6f0 000006427834d696 mscorlib_ni!System.Threading.ExecutionContext.Run(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object)+0x9b
000000001ef7f740 000006427f602672 mscorlib_ni!System.Threading._TimerCallback.PerformTimerCallback(System.Object)+0x86
000000001ef7f790 000006427f50c053 mscorwks!CallDescrWorker+0x82
000000001ef7f7e0 000006427f51449a mscorwks!CallDescrWorkerWithHandler+0xd3
000000001ef7f880 000006427f46b023 mscorwks!DispatchCallDebuggerWrapper+0x3e
000000001ef7f8e0 000006427f41729e mscorwks!DispatchCallNoEH+0x5f
000000001ef7f960 000006427f447ee8 mscorwks!AddTimerCallback_Worker+0x92
000000001ef7f9f0 000006427f556aa9 mscorwks!Thread::DoADCallBack+0x488
000000001ef7fa40 000006427f43afdd mscorwks!CNgenEntryBind::Create+0x15d
000000001ef7fb10 000006427f435296 mscorwks!MethodTable::IsAbstract+0x49
000000001ef7fb50 000006427f4162bb mscorwks!AddTimerCallbackEx+0xba
000000001ef7fc10 000006427f495fa7 mscorwks!ThreadpoolMgr::AsyncTimerCallbackCompletion+0x53
000000001ef7fc70 000006427f4aad0a mscorwks!UnManagedPerAppDomainTPCount::DispatchWorkItem+0x157
000000001ef7fd10 000006427f41f9a0 mscorwks!ThreadpoolMgr::WorkerThreadStart+0x1ba
000000001ef7fdb0 0000000077d6b6da mscorwks!Thread::intermediateThreadProc+0x78
000000001ef7ff80 0000000000000000 kernel32!BaseThreadStart+0x3a
jsw
Glad you figured it out. As to your question on locals, [!DumpStackObjects] with [!CLRStack -a] will do a pretty good job of describing the locals in corresponding stack frames.
A: 

Figured it out I believe. I got into the BufferStream and saw that it was in a state where anything that called into the TraceListener would just get stuck in a Thread.Sleep(1) loop. I hope this is the fix because I can't for the life of me recreate the issue.

I had usegloballock=false and autoflush=true in the trace configuration. The flush method on the TraceListener was not thread-safe -- the listener is meant to use data buffering, so on occasion the TraceListener would get in a bad state when there was concurrency of flushes and writes. The fix was to simply set autoflush=false. I can't believe I didn't catch this sooner.

jsw
+2  A: 
AaronBa
I updated my last answer. I had my bases covered on the DefaultTraceListener, but I had autoflush=true -- and the Flush method is not thread-safe! Duh.
jsw