My application is deployed at customer sites that I cannot access and that have no internet connection.

There are complaints that at several sites, about once a week, the application becomes unresponsive, so the operators have to kill and restart it.

We have been unable to reproduce it at our own site.

Is there something I can do that may help me find the problem?

It is a VC2008 Win32 MFC application.

The application is quite complex, and includes many threads, synchronization mechanisms, database access, HMI, communication channels...

Note: The customer can send us log files.

Note: The application does not crash. It just hangs. Since I don't know the nature of the problem, I have no way to detect programmatically that something went wrong (or do I?)

+1  A: 

In a similar situation on a non-Windows platform, we have the capability to gather system dumps: we take a thread dump of the entire system for off-site analysis. This lets us find deadlocks quite easily. For problems where the application slows down rather than stops, a single dump is not enough. Then we need a sequence of dumps, and some good luck.

Another, rather messier technique is to build enough trace into the app, with fine-grained control over which trace is active. Then turn on some trace and hope to spot where the delays are happening.
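
As a rough illustration of what "fine-grained control of trace" could look like, here is a minimal sketch of a trace facility with per-category switches. The `Trace` class and the category names are hypothetical, not something from the application in question, and the sketch is not thread-safe: a multi-threaded application would need a lock around `write`.

```cpp
#include <cassert>
#include <map>
#include <ostream>
#include <sstream>
#include <string>

// Hypothetical trace facility: each category can be switched on or off at
// run time, for example from a settings file the customer edits.
class Trace {
public:
    explicit Trace(std::ostream& out) : out_(&out) {}

    // Turns tracing for one category on or off.
    void enable(const std::string& category, bool on) { enabled_[category] = on; }

    // Categories that were never enabled count as switched off.
    bool is_enabled(const std::string& category) const {
        std::map<std::string, bool>::const_iterator it = enabled_.find(category);
        return it != enabled_.end() && it->second;
    }

    // Writes "[category] message" only when the category is switched on.
    void write(const std::string& category, const std::string& message) {
        assert(!category.empty());
        if (is_enabled(category)) {
            *out_ << '[' << category << "] " << message << '\n';
        }
    }

private:
    std::ostream* out_;                    // destination, e.g. a log file stream
    std::map<std::string, bool> enabled_;  // per-category switches
};
```

With something like this, the shipped build can stay quiet by default, and the customer only enables the categories (say, "db" or "comm") around the area you suspect.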

djna
+1  A: 

My experience with finding bugs in installations on the other side of the planet shows three helpful techniques: Logging, logging, and logging.

What do the log files your customers sent you say? If they aren't detailed enough, send them a version that logs more. Use binary approximation to home in on the error: each round of added logging should roughly halve the region of code where the hang can be.

sbi
+1  A: 

To find out where the process is hung, it is best to start with the stack trace at that instant.

Since your program is installed remotely and you can't access it, you can write a monitoring program that periodically checks the stack of your program and logs it. This information, together with your logging mechanism, will make the problem easier to identify and debug.

Since I am not a Windows programmer, I don't know much about which such tools are available on Windows, but I think you need something similar to this: http://www.codeproject.com/KB/threads/StackWalker.aspx
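
One possible way for a monitor to decide when a stack is worth logging is a heartbeat check, which is a different (and simpler) technique than walking the stack itself. The sketch below is only an assumption about how that could be structured: the `HeartbeatMonitor` class and the thread names are hypothetical, the caller supplies the timestamps (on Win32, `GetTickCount` would be one source), and a real version would need a lock because several threads call `beat`.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Hypothetical heartbeat table. Each worker thread calls beat() from its main
// loop; a watchdog thread calls stalled() periodically. Times are milliseconds
// from a monotonic clock supplied by the caller.
class HeartbeatMonitor {
public:
    // Records that the named thread was alive at now_ms.
    void beat(const std::string& thread_name, unsigned long now_ms) {
        last_beat_[thread_name] = now_ms;
    }

    // Returns the names of threads whose last beat is older than timeout_ms.
    // Assumes now_ms is not earlier than any recorded beat.
    std::vector<std::string> stalled(unsigned long now_ms,
                                     unsigned long timeout_ms) const {
        assert(timeout_ms > 0);
        std::vector<std::string> result;
        std::map<std::string, unsigned long>::const_iterator it;
        for (it = last_beat_.begin(); it != last_beat_.end(); ++it) {
            if (now_ms - it->second > timeout_ms) {
                result.push_back(it->first);
            }
        }
        return result;
    }

private:
    std::map<std::string, unsigned long> last_beat_;  // thread name -> last beat
};
```

When `stalled` returns a non-empty list, the watchdog could write the names to the log and trigger whatever stack capture or dump mechanism is available, so the log the customer sends back shows which thread stopped making progress and when.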

aeh
+2  A: 

I would start with some questions: is the CPU hogged during these unresponsive periods? Is there a specific process that's hogging it? (You can use PerfMon to get the answers.) Depending on the answers, I would probably proceed by taking a dump of the process while it is hung (ProcDump by Sysinternals is great for this purpose) and investigating it offline.

On Freund
+3  A: 

I have had great success with ADPlus and WinDbg in the past. Check them out, especially the hang mode in ADPlus.

Chubsdad
+1. ADPlus's hang mode is made for this kind of problem.
the_mandrill