Diagnosing application hang in a production .NET desktop program

views:

1553

answers:

+3 Q:

I have a problem. One of the users of an application I'm developing is occasionally, but regularly, experiencing an application hang.

When this happens, we find an entry with a source of "Application Hang" in the machine's Event Log, with the informative message "Hanging application [my app], version [the right version], hang module hungapp, version 0.0.0.0, hang address 0x00000000."

I'm logging all unhandled exceptions that my application throws, and there aren't any entries in my log files when this happens.

My current working hypothesis is that this hang is occurring during the application's call to an unsafe legacy API. This wouldn't astonish me; I've been working with this API for years and while I haven't seen it hang before, it's genuinely crappy code. Also, the user's reporting that the program seems to hang at random times. I don't think this is really true. Not that I don't believe her, but that the code that talks to the legacy API is running inside a method called by a BackgroundWorker. If the background thread were making the application hang, this could very much look to the user like it were happening randomly.

So, I have two questions, one specific, one general.

The specific question: I would expect that if a method running on a non-UI thread were to hang, it would just kill the thread. Would it actually kill the whole application?

The general question:

I'm already logging all unhandled exceptions. My program's already set up to use tracing (though I'm going to need to add instrumentation code to trace activity in suspect methods). Are there other things I should be doing? Are there diagnostic tools that allow some kind of post-crash analysis when a .NET application hangs? Are there mechanisms inside the .NET framework that I can invoke to capture more (and more usable) data?

EDIT: On closer examination of my code, I'm remembering that all of its usage of BackgroundWorker is through a utility class I implemented that wraps the method called in an exception handler. This handler logs the exception and then returns it as a property of the utility object. The completion event handler in the UI thread re-throws the exception (less than ideal, since I lose the call stack, but it's already been logged), causing the UI's main exception handler to report the exception in a message box and then terminate the app.

Since none of that is happening, I'm pretty confident that there's no exception being thrown in the background thread. Well, no .NET exception, anyway.
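On the lost call stack mentioned in the edit: one common way to keep it is to wrap the logged exception in a new one instead of re-throwing it directly, so the original (with its stack trace) survives as InnerException. This is only a sketch of that idea, not the utility class described above; the class and method names are hypothetical.

```csharp
using System;

public static class WorkerErrors
{
    // Hypothetical helper: wraps an exception captured on the background
    // thread so it can be thrown on the UI thread without overwriting
    // the original stack trace (it stays available via InnerException).
    public static Exception WrapForRethrow(Exception logged)
    {
        if (logged == null)
        {
            throw new ArgumentNullException("logged");
        }
        return new InvalidOperationException(
            "Background operation failed: " + logged.Message, logged);
    }
}
```

In the completion handler this would be used as `throw WorkerErrors.WrapForRethrow(capturedException);`, where `capturedException` stands for whatever property the wrapper exposes.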

Further followup:

Mercifully, I've now gotten enough data from the users to be certain that the hang isn't occurring inside the legacy API. This means it's clearly something I'm doing wrong, which means that I can fix it, so, win. It also means that I can isolate the problem through tracing, which is another win. I'm very happy with the answers I got to this question; I'm even happier that I probably don't need them for this problem.

Also: PostSharp is outstanding. If you need to add instrumentation code to an existing application, you almost certainly should be using it.

Thought 1) Step into .NET Framework code (from a KB at my work):

If you’ve installed VS2008 SP1, all you need to do is go to Tools -> Options -> Debugging

Uncheck Enable Just My Code
Check Enable .NET Framework Source Stepping
Check Enable source server support
Under Debugging -> Symbols, add a new location of http://referencesource.microsoft.com/symbols

Now when debugging something that’s got greyed-out framework code in the call stack, just right click the call line and choose Load Symbols.

Thought 2) Set up remote debugging: http://msdn.microsoft.com/en-us/library/y7f5zaaa.aspx

Solracnapod 2008-10-14 19:07:47

Remote debugging might be a way to go. I'm sort of amazed that in order to enable debugging on a remote machine I have to use my VS DVD. What's the dominant characteristic of a remote machine? It's *remote*. This one's about 80 miles away.

Robert Rossney 2008-10-14 19:53:38

+3 A:

In answer to your specific question, when a background/worker thread blocks or hangs, the effect on the rest of the application would depend a lot on the synchronization happening between the threads in the app. There's no particular reason why it would necessarily hang the whole app, but it's entirely possible that it would.
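To make the synchronization point concrete: if the UI thread waits on a worker with no timeout, a worker that blocks forever takes the UI with it. Waiting with a timeout at least lets the caller notice and log the stall. A minimal sketch, assuming a plain Thread rather than the BackgroundWorker from the question; the class and method names are made up for illustration.

```csharp
using System;
using System.Threading;

public static class WorkerHangDemo
{
    // Starts `work` on a background thread and waits up to `timeout`.
    // Returns true if the work finished in time, false if it is still
    // running (possibly hung) when the timeout expires.
    public static bool FinishedInTime(ThreadStart work, TimeSpan timeout)
    {
        Thread worker = new Thread(work);
        // Background threads do not keep the process alive on exit.
        worker.IsBackground = true;
        worker.Start();
        // Join(TimeSpan) returns false on timeout instead of blocking
        // forever, which is what a bare Join() would do with a hung worker.
        return worker.Join(timeout);
    }
}
```

A call like `WorkerHangDemo.FinishedInTime(CallLegacyApi, TimeSpan.FromSeconds(30))` (with `CallLegacyApi` standing in for the suspect method) returning false is a good place to write a trace entry. Note that an unhandled exception inside `work` would still bring the process down; this sketch only addresses blocking.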

One possible way to diagnose this would be to generate a dump of the process while it's hung (assuming someone is around to notice when it happens). This would be done using MiniDumpWriteDump, from dbghelp.dll. It's fairly straightforward to write a simple tool that can dump a process (based on its pid), which could be provided to the customer experiencing the issue. Since this is a managed app, a full memory dump is preferable (MiniDumpWithFullMemory), but a normal dump should still have some useful info. Once you have the dump, you can use windbg or your post-mortem debugger of choice to see what might be going on.

If you go this route, this msdn article is a good starting point for managed dump debugging.
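For reference, a small dump tool along those lines could look roughly like this in C#, using P/Invoke to call MiniDumpWriteDump. Treat it as a sketch: the class and method names are hypothetical, error handling is minimal, and it only works on Windows, run under an account that is allowed to open the target process.

```csharp
using System;
using System.Diagnostics;
using System.IO;
using System.Runtime.InteropServices;

public static class HangDumper
{
    // MINIDUMP_TYPE values from dbghelp.h.
    public const uint MiniDumpNormal = 0x00000000;
    public const uint MiniDumpWithFullMemory = 0x00000002;

    [DllImport("dbghelp.dll", SetLastError = true)]
    private static extern bool MiniDumpWriteDump(
        IntPtr hProcess,
        uint processId,
        SafeHandle hFile,
        uint dumpType,
        IntPtr exceptionParam,   // no exception info: the process is hung, not crashed
        IntPtr userStreamParam,
        IntPtr callbackParam);

    // Full memory is preferable for managed apps, as noted above.
    public static uint DumpTypeFor(bool fullMemory)
    {
        return fullMemory ? MiniDumpWithFullMemory : MiniDumpNormal;
    }

    // Writes a dump of the process with the given pid to `path`.
    public static void WriteDump(int pid, string path, bool fullMemory)
    {
        using (Process target = Process.GetProcessById(pid))
        using (FileStream file = new FileStream(
            path, FileMode.Create, FileAccess.ReadWrite, FileShare.None))
        {
            bool ok = MiniDumpWriteDump(
                target.Handle,
                (uint)target.Id,
                file.SafeFileHandle,
                DumpTypeFor(fullMemory),
                IntPtr.Zero,
                IntPtr.Zero,
                IntPtr.Zero);

            if (!ok)
            {
                int error = Marshal.GetLastWin32Error();
                throw new InvalidOperationException(
                    "MiniDumpWriteDump failed, error 0x" + error.ToString("X8"));
            }
        }
    }
}
```

The user would run it while the app is hung, e.g. `HangDumper.WriteDump(pid, @"C:\temp\hang.dmp", true)`, and send back the resulting file. Full-memory dumps can be large.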

Charlie 2008-10-14 19:07:49

+1 A:

I would suggest adding more detailed logging around the calls you believe are the source of the problem.

If you're on Vista you can use a new Vista API (Application Recovery and Restart) to have Windows call into your code when your app crashes. This is what's happening when you see MS products like Office/IE say they are "Attempting to recover your data".

Eric Haskins 2008-10-14 19:08:39

I'm already going to add more instrumentation code to the methods I suspect. Regrettably, while I'm on Vista, this user's on XP.

Robert Rossney 2008-10-14 19:48:27

If you have an unhandled exception on a thread you control, it will bring down your entire application. There's no way to "handle" this once the thread dies. You might want to look into how you can use the APM with delegates. This provides a layer of protection from exceptions thrown on other threads, as the exception is captured and brought forward when you call EndInvoke().
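A minimal sketch of the delegate APM pattern mentioned here, assuming the .NET Framework (2.0 through 4.x). Delegate BeginInvoke/EndInvoke is not supported on .NET Core and later, where BeginInvoke throws PlatformNotSupportedException. The class and method names are hypothetical.

```csharp
using System;

public static class ApmDemo
{
    // Runs `work` on a thread-pool thread through the delegate's
    // BeginInvoke, then collects the outcome with EndInvoke.
    // Returns the exception thrown by `work`, or null if it succeeded.
    public static Exception RunAndCapture(Action work)
    {
        // Assumption: running on the .NET Framework.
        IAsyncResult pending = work.BeginInvoke(null, null);
        try
        {
            // Blocks until the work completes. An exception thrown on the
            // pool thread is captured by the runtime and surfaces here, on
            // the calling thread, instead of killing the process.
            work.EndInvoke(pending);
            return null;
        }
        catch (Exception ex)
        {
            return ex;
        }
    }
}
```

In a real UI you would pass a callback to BeginInvoke and call EndInvoke there rather than blocking. Note this protects against exceptions, not against a call that never returns.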

As for what else you can do, I second Charlie's answer.

Will 2008-10-14 19:10:05

If possible, replace the background worker thread with a SafeThread and see if that catches the suspected exception. If it doesn't, then the exception being thrown is not a CLR exception and you may be unable to handle it from 'pure' .NET code [SEH from C++ might work though].

EDIT: OK, that's not it. Maybe this or this might help. Good luck!

Steven A. Lowe 2008-10-14 19:14:07

As I noted in my edit, I'm already using something that's roughly equivalent to a SafeThread, and I'm now pretty sure that whatever is happening isn't a CLR exception.

Robert Rossney 2008-10-14 19:55:31

@[Robert Rossney]: hmmm...no clue then; see my edits for new links. Good luck!

Steven A. Lowe 2008-10-14 20:12:20

Robert, if all these solutions fail you, and you're still thinking the legacy API is the culprit, perhaps the answer is to sandbox the legacy API into its own AppDomain or process.

The .NET 3.5 framework makes this pretty easy to do using the System.AddIn APIs.

Judah Himango 2008-10-14 20:28:32

I'd suggest attaching WinDbg (yeah, one of those hardcore things) and using SOS (Son Of Strike) and SOSEx to analyze deadlocks (!dlk) or manually check sync blocks (!syncblk) to find mutually waiting locks.
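A typical session for this, whether attached live or opened on a dump, might look something like the following. It assumes the CLR 2.0 runtime (.NET 2.0 to 3.5), where SOS is loaded next to mscorwks; the SOSEX path is only a placeholder for wherever you unpacked it.

```
$$ Load the SOS that matches the CLR in the process
$$ (mscorwks for .NET 2.0-3.5; for .NET 4 and later use "clr" instead)
.loadby sos mscorwks

$$ List managed threads, then dump the managed stack of every thread
!threads
~*e !clrstack

$$ Show sync blocks: which thread owns which monitor and who is waiting
!syncblk

$$ Load SOSEX (path is an assumption) and run its deadlock detection
.load C:\debuggers\sosex.dll
!dlk
```

If !dlk reports nothing, the per-thread stacks from !clrstack usually still show what the UI thread is blocked on.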

Ilya Ryzhenkov 2008-10-14 21:27:39
