I'm working with a somewhat unreliable (Qt/Windows) application partly written for us by a third party (just trying to shift the blame there). Their latest version is more stable. Sort of. We're getting fewer reports of crashes, but lots of reports of it just hanging and never coming back. The circumstances vary, and with the little information we can gather, we haven't been able to reproduce the problems.

So ideally, I'd like to create some sort of watchdog which notices that the application has locked up, and offers to send a crash report back to us. Nice idea, but there are problems:

  • How does the watchdog know the process has hung? Presumably we instrument the application to periodically say "all ok" to the watchdog, but where do we put that call so that it's guaranteed to happen frequently enough, yet isn't likely to be on a code path that the app still runs when it's locked up?

  • What information should the watchdog report when a hang or crash happens? Windows has a decent debug API, so I'm confident that all the interesting data is accessible, but I'm not sure what would be useful for tracking down the problems.

+1  A: 

I think a separate app to do the watchdogging is likely to produce more problems than it solves. I'd suggest that instead, you first create handlers to generate minidumps when the app crashes, then add a watchdog thread to the application, which will DELIBERATELY crash if the app goes off the rails. The advantage to the watchdog thread (vs a different app) is that it should be easier for the watchdog to know for sure that the app has gone off the rails.

Once you have the MiniDumps, you can poke around to find out the app's state when it dies. This should give you enough clues to figure out the problem, or at least where to look next.

There's some stuff at CodeProject about MiniDumps, which could be a useful example. MSDN has more information about them as well.
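The watchdog thread described above could be sketched roughly as follows. This is a minimal, portable illustration rather than anything from the application in question: `HeartbeatWatchdog`, `beat()` and the `onHang` callback are hypothetical names. It assumes the monitored thread (in a Qt app, most likely the GUI thread, for example from a timer running on its event loop) calls `beat()` regularly, and that `onHang` is where you would write the minidump or crash deliberately.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <functional>
#include <thread>
#include <utility>

// Hypothetical helper: a watchdog thread that fires a callback once when
// the monitored thread stops calling beat() for longer than `timeout`.
class HeartbeatWatchdog {
public:
    using Clock = std::chrono::steady_clock;

    HeartbeatWatchdog(std::chrono::milliseconds timeout,
                      std::function<void()> onHang)
        : timeout_(timeout),
          onHang_(std::move(onHang)),
          lastBeat_(Clock::now().time_since_epoch().count()),
          worker_([this] { run(); }) {
        assert(timeout.count() > 0);
    }

    ~HeartbeatWatchdog() {
        stop_.store(true);
        worker_.join();  // returns within roughly one polling interval
    }

    // Called by the monitored thread to say "all ok".
    void beat() {
        lastBeat_.store(Clock::now().time_since_epoch().count());
    }

    // True once the hang callback has been invoked.
    bool fired() const { return fired_.load(); }

private:
    void run() {
        while (!stop_.load()) {
            // Poll a few times per timeout period.
            std::this_thread::sleep_for(timeout_ / 4);
            Clock::time_point last{Clock::duration{lastBeat_.load()}};
            // exchange() ensures the callback runs at most once.
            if (Clock::now() - last > timeout_ && !fired_.exchange(true)) {
                onHang_();  // e.g. write a dump, then terminate
            }
        }
    }

    std::chrono::milliseconds timeout_;
    std::function<void()> onHang_;
    std::atomic<Clock::rep> lastBeat_;
    std::atomic<bool> stop_{false};
    std::atomic<bool> fired_{false};
    std::thread worker_;  // declared last so it starts after the other members
};
```

Putting the `beat()` call on the main event loop means a blocked or deadlocked GUI thread stops the heartbeat, while a long but legitimate operation on a worker thread does not. The timeout still has to be generous enough that slow I/O on the monitored thread isn't mistaken for a hang.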

Michael Kohne
You don't have to crash the app in order to create the minidumps. You can call MiniDumpWriteDump() at any time.
John Dibling
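As a rough illustration of that comment, a dump of the running process can be written without crashing it. The sketch below is an assumption-laden example, not code from the thread: `writeMiniDump` is a hypothetical helper, the dump type `MiniDumpNormal` is just the simplest choice, and the non-Windows branch is only a stub so the snippet compiles elsewhere. On Windows it needs to link against Dbghelp.lib.

```cpp
#include <cassert>

#ifdef _WIN32
#include <windows.h>
#include <dbghelp.h>
#pragma comment(lib, "Dbghelp.lib")  // MSVC only; link manually otherwise
#endif

// Hypothetical helper: writes a minidump of the current process to `path`.
// Returns true on success; always returns false on non-Windows builds.
bool writeMiniDump(const char* path) {
    assert(path != nullptr);
#ifdef _WIN32
    HANDLE file = CreateFileA(path, GENERIC_WRITE, 0, nullptr,
                              CREATE_ALWAYS, FILE_ATTRIBUTE_NORMAL, nullptr);
    if (file == INVALID_HANDLE_VALUE) {
        return false;
    }
    // No exception information is passed, because nothing has crashed.
    BOOL ok = MiniDumpWriteDump(GetCurrentProcess(), GetCurrentProcessId(),
                                file, MiniDumpNormal,
                                nullptr, nullptr, nullptr);
    CloseHandle(file);
    return ok != FALSE;
#else
    (void)path;  // stub: minidumps are a Windows facility
    return false;
#endif
}
```

A watchdog thread could call something like this when it decides the app has hung. Microsoft's documentation recommends calling MiniDumpWriteDump from a separate process where possible, so treat the in-process call as a convenience rather than the most robust option.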
+4  A: 

You want a combination of minidumps (use Dr. Watson to create these if you don't want to add your own minidump generation code) and userdump to trigger the creation of a minidump on a hang.

The thing about automatically detecting a hang is that it's difficult to decide when something's hung and when it's just slow or blocked waiting on I/O. I personally prefer to allow the user to crash the app deliberately when they think it's hung. Apart from being a lot easier (my apps don't tend to hang often, if at all :) ), it also helps them to "be part of the solution". They like that.

First, check out the classic Bugslayer article concerning crash dumps and symbols, which also has some excellent information about what's going on with these things.

Second, get userdump, which allows you to create the dumps, along with the instructions for setting it up to generate them.

When you have the dump, open it in WinDbg, and you will be able to inspect the entire program state, including threads and callstacks, registers, memory and parameters to functions. I think you'll be particularly interested in using the "~*kp" command in WinDbg to get the callstack of every thread, and the "!locks" command to show the locks (critical sections) that are held.

I think you'll find that the hang is due to a deadlock on synchronisation objects. That can be difficult to track down, as all threads tend to be waiting in a WaitForSingleObject call, so look further down the callstacks to pick out the application threads (rather than 'framework' threads like background notifications and network routines). Once you've narrowed them down, you can see what calls were being made, and possibly add some logging instrumentation to the app to give you more information the next time it fails.

Good luck.

P.S. A quick Google search reminded me of this: Debugging deadlocks. (CDB is the command-line equivalent of WinDbg.)

gbjbaanb
+2  A: 

You can use ADPlus from Microsoft's Debugging Tools for Windows to identify the hangs. It will attach to your process and create a dump (mini or full) when the process hangs or crashes.

WinDbg is portable, and does not have to be installed (you do have to configure the symbols, though). You can create a special installation that launches your app from a batch file, which also runs ADPlus after your app starts (ADPlus is a command-line tool, so you should be able to find a way to incorporate it somehow).

BTW, if you do find a way to recognize the hang internally and are able to crash the process, you can register with Windows Error Reporting so that the crash dump will be sent to you (should the user allow it).

eran
+1  A: 

Don't bother with a watchdog. Subscribe to Microsoft's Windows Error Reporting (winqual.microsoft.com). They'll collect the stack traces for you. In fact, it's quite likely they're already doing so today; they just don't share them until you sign up.

MSalters