I'm working with a somewhat unreliable (Qt/Windows) application partly written for us by a third party (just trying to shift the blame there). Their latest version is more stable. Sort of. We're getting fewer reports of crashes, but we're getting lots of reports of it just hanging and never coming back. The circumstances are varied, and with the little information we can gather, we haven't been able to reproduce the problems.
So ideally, I'd like to create some sort of watchdog which notices that the application has locked up, and offers to send a crash report back to us. Nice idea, but there are problems:
How does the watchdog know the process has hung? Presumably we instrument the application to periodically say "all ok" to the watchdog, but where do we put that so that it's guaranteed to happen frequently enough, yet isn't likely to be on a code path the app keeps executing once it has locked up?
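To make the "all ok" idea concrete, here is a rough sketch of the heartbeat I have in mind. Everything in it is hypothetical (`Heartbeat`, `beat`, `isStale` are names I made up), and it is simplified to run inside one process: in the real thing the watchdog would be a separate process, and the timestamp would have to travel over shared memory or a pipe. The time point is passed in explicitly only so the logic can be checked without sleeping.

```cpp
#include <atomic>
#include <chrono>
#include <cstdint>

// Hypothetical heartbeat record: the application calls beat(),
// the watchdog polls isStale().
class Heartbeat {
public:
    using Clock = std::chrono::steady_clock;

    // Called by the application to say "all ok".
    void beat(Clock::time_point now = Clock::now()) {
        last_.store(toMs(now), std::memory_order_relaxed);
    }

    // Called by the watchdog: true if no beat arrived within `timeout`.
    bool isStale(std::chrono::milliseconds timeout,
                 Clock::time_point now = Clock::now()) const {
        const std::int64_t elapsed =
            toMs(now) - last_.load(std::memory_order_relaxed);
        return elapsed > timeout.count();
    }

private:
    // Convert a time point to milliseconds since the clock's epoch.
    static std::int64_t toMs(Clock::time_point t) {
        return std::chrono::duration_cast<std::chrono::milliseconds>(
                   t.time_since_epoch()).count();
    }

    // Starts out "fresh" so a newly launched app is not reported as hung.
    std::atomic<std::int64_t> last_{toMs(Clock::now())};
};
```

One guess at placement would be to call `beat()` from a timer that runs on the GUI thread's event loop, so the beats stop as soon as that loop stops spinning. That is an assumption on my part: it would not notice a worker thread that has deadlocked while the GUI stays responsive.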
What information should the watchdog report when a hang is detected? Windows has a decent debug API, so I'm confident that all the interesting data is accessible, but I'm not sure what would be useful for tracking down the problems.