I'm working with a single-threaded native C++ application. It has a bug that is very hard to trigger and that I cannot reproduce locally. I enabled full page heap and debug information in the release executable, and obtained dumps from a client (who has to use the application for many days before the bug shows up).

What the client reports: the application hangs and never recovers. It has to be killed from the task manager. What I see from the dumps: the application is stuck in an infinite loop.

The loop comes from walking a doubly linked list which has become cyclic. There are signs of memory corruption: many data members have strange values, such as enum members with no matching enumerant or pointers with values under 0000FFFF, and the linked list itself is reported as having 300 million+ elements, which is not normal.

The only other information I can get from the dumps is that a socket read operation failed with 0 data read. This causes the walking of the (now cyclic) list.

I have several dumps all hanging in the same infinite loop. I've tried to get the allocation stack trace, but !heap -p -a gives me "ReadMemory error for address eeddccee Use `!address eeddccee' to check validity of the address." for all addresses I try.

Currently I'm looking into fixing the L4 warnings (except I don't know which of them can be related to this; I have a bunch of C4100, C4511 and C4512 which I don't know how to fix, so I'm mostly fixing no-brainers like C4244). DebugDiag did not find anything, except for giving me a "This thread is not fully resolved and may or may not be a problem. Further analysis of these threads may be required." on the single thread.

From what I see, my options are fixing more warnings, re-reading the code until something jumps at me or learning something new from here.

Is this really a memory corruption? Why does it hang in the same structure every time? How can I find the cause?

A: 

Fixing the warnings is a good idea: it may help you feel better and will certainly reduce confusion in the build. It's unlikely to resolve the present issue, though, so it may be better left as an out-of-band task for the future.

Socket read failure with 0 data may imply the socket got closed down. Perhaps you have a timing problem here where socket closedown logic is resulting in concurrent access to some shared data structure that is not properly locked. Take a good look at the socket code to make sure locking is correct and watertight. Make sure that all possible error codes are handled correctly in your sockets API calls (Winsock, presumably?). You can be sure that even the slightest window for concurrent access on a container or "that can't happen" error paths will eventually be hit in your production environment. I know you said the app is single-threaded but Windows has a funny habit of giving you extra threads that you did not start up yourself, for example if you are using DLL services that themselves kick off new threads.
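To make the "handle every result" point concrete, here is a minimal sketch of how the result of a Winsock `recv()` call could be classified before any shared state is touched. The names `RecvOutcome` and `classify_recv` are hypothetical, and the sketch assumes a stream socket, where a return value of 0 means the peer closed the connection gracefully rather than "no data yet".

```cpp
#include <cassert>

// Hypothetical names for illustration; none of this is part of Winsock itself.
enum class RecvOutcome { Data, PeerClosed, WouldBlock, Error };

// bytes:            the value returned by recv()
// last_error:       the value of WSAGetLastError(), read right after a failed call
// would_block_code: pass WSAEWOULDBLOCK here when building against Winsock
RecvOutcome classify_recv(int bytes, int last_error, int would_block_code) {
    // recv() returns a byte count, 0, or SOCKET_ERROR (-1).
    assert(bytes >= -1);
    if (bytes > 0) return RecvOutcome::Data;
    if (bytes == 0) return RecvOutcome::PeerClosed;  // graceful close on a stream socket
    if (last_error == would_block_code) return RecvOutcome::WouldBlock;
    return RecvOutcome::Error;
}
```

With a helper like this, the caller can switch over the four outcomes and keep the teardown path (`PeerClosed`, `Error`) separate from the "try again later" path, which makes it easier to audit what happens to the list on each branch.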

It's hard when you cannot get good production diagnostics, but if you can narrow down the problem to a particular area, try to isolate the failing code in a unit test application that mimics the usage in real life, and stress the heck out of it on your desktop. I have had intermittent bugs like this that, even under heavy load in a specialized test app, took hours to reproduce. Running in this mode (release build of course) in the debugger may expose the issue more quickly than you would think.

Another option may be to install the Process Dumper on the failing machine and instruct it to dump a full memory image (debuggable as per standard Windbg DMP file) on access violation and process exit. This may provide better information than a minidump postmortem debug. If your client is cooperative they can instruct the dump to be generated when the problem next occurs. This is the closest you can get to a live debug without being on the machine or having remote access to it.

You may want to consider generating extra diagnostics in the socket closedown logic as well to verify whether or not this is the proximate cause of the error condition.

Make sure your client's OS and other system software are up to date with all required patches. Maybe this is not even your fault (though it seems likely that you have a bug, to be sure).

Steve Townsend
I'm using Winsock. Also the client is using userdump.exe already, but it manually triggers the dump when the application hangs (it never crashes, just hangs). Also using gflags to toggle full page heap for it. I've also got a test case running 24/7 for weeks on a local PC, but no reproduction.
adrian8400
hangs = 'high CPU', or 'low CPU'? High CPU would be consistent with an infinite loop, say on a corrupt data structure. In this case dumping the process to obtain a sequence of call stacks should be instructive. Low CPU would be consistent with loss of control in event-driven code following a mishandled error path, e.g. a socket failure. In this case, a dump of the process end state plus more diagnostics to isolate the behaviour leading up to the failure are the best bet imo.
Steve Townsend
I don't have this information (low cpu/high cpu on hang). I'll add it if/when I can get it. But since the client can dump the process easily, it might be that the CPU is not at 100%?
adrian8400
It's vital to understand whether you are in a tight loop or just non-responsive. Approach going forward depends on which.
Steve Townsend
I will get this information the next time the bug reproduces; I already tried to get the callstacks from the full page heap dump but as I've said I get "ReadMemory error for address eeddccee Use `!address eeddccee' to check validity of the address."
adrian8400
A: 

This can be pretty much anything.

If it is heap corruption, try to insert heap checks into the code at strategic places. Make sure your binaries are compiled with the run-time checks that the Visual C++ compiler offers. If possible, obtain a test case from your users. If this is not possible, try to get them to run a debug binary and/or debug the live application. Fixing the warnings is a good idea, though I find most of VC's level 4 warnings less than useful. Sprinkle your code liberally with assert-like checks. Make sure all your pre-conditions and post-conditions are checked. Make sure you are really handling each and every return value of all function calls. Also avoid any questionable practices in code, like C-style casts and type punning.
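Since a release build normally defines `NDEBUG` and loses its `assert` calls, one way to keep assert-like checks in the shipped binary is a small macro that is never compiled out. The following is only a sketch; `ALWAYS_CHECK`, `check_failure_handler` and `checked_index` are made-up names, and the default reaction (print and abort) is an assumption you would likely replace with whatever triggers a dump in your setup.

```cpp
#include <cassert>
#include <cstdio>
#include <cstdlib>

// Hypothetical names throughout; this is an illustration, not an existing library.
using CheckFailureHandler = void (*)(const char* expr, const char* file, int line);

// Default reaction: report the failed expression and terminate, so that the
// process stops close to the point where the invariant was first violated.
inline void default_check_failure(const char* expr, const char* file, int line) {
    std::fprintf(stderr, "CHECK failed: %s (%s:%d)\n", expr, file, line);
    std::abort();
}

// Replaceable handler, e.g. to log instead of abort on a production server.
inline CheckFailureHandler& check_failure_handler() {
    static CheckFailureHandler handler = default_check_failure;
    return handler;
}

// Unlike assert(), this does not depend on NDEBUG and stays in release builds.
#define ALWAYS_CHECK(cond) \
    do { if (!(cond)) check_failure_handler()(#cond, __FILE__, __LINE__); } while (0)

// Example use: a debug-only check next to a check that survives in release.
inline int checked_index(int i, int size) {
    assert(size >= 0);                 // vanishes when NDEBUG is defined
    ALWAYS_CHECK(i >= 0 && i < size);  // still evaluated in release builds
    return i;
}
```

The handler is a plain function pointer so that the reaction can be changed without touching the call sites; the cost per check is one comparison, which is usually acceptable outside hot loops.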

wilx
The debug build is off limits to the client per my knowledge (I'd have to take out some functionality from it to let them run it on a production server). I've added debug information to the release build, but still I'm missing asserts and the heap checks you recommend. I have a testcase but I cannot reproduce it with automated testing. Live debugging is out of the question, the reproduction rate is much too low. C-style casts/type punning are used in some places though, so I'll try to see about that.
adrian8400
A: 

If it is some kind of heap corruption, then Application Verifier could help detect that in your own environment.

Set full page heap validation. If your application has any heap overrun or underrun, it will be caught immediately.

If Application Verifier or some other tool does not easily uncover the problem, then it may come down to deducing what could have led to the problem. Focus on a specific issue such as the circular list. What could cause that? The obvious places to look are at all pieces of code that touch the list (it is possible that some random bad memory write could cause it but more often the culprit is closer to the scene of the crime).

If the list is only accessed through well-defined methods, then your job is easier. If it is through a global pointer that everyone can touch, then it is harder but still possible to examine if you search through all references (any good editor can do that). If you find, for example, an error case that maybe doesn't clean up nicely and fill in a back link correctly, then you might be half way there. You then work backwards from there. What could cause that specific error? And so on. Deducing a "possible" chain of events that can lead to a certain situation can often resolve a problem like this (and can make you feel like a magician in the process especially if it is someone else's bug that you find).
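One way to narrow down which piece of code breaks the list is to validate it after every operation that touches it. Below is a minimal sketch of such a validator. The `Node` layout, `ListStatus` and `validate_list` are hypothetical, and the sketch assumes a null-terminated doubly linked list whose head has no predecessor (not an intentionally circular list with a sentinel). Under that assumption, a forward cycle always shows up as a mismatched back link, so the walk stops instead of spinning; the `max_nodes` bound is an extra guard against absurd lengths.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical node layout; the real list will carry a payload as well.
struct Node {
    Node* prev = nullptr;
    Node* next = nullptr;
};

enum class ListStatus { Ok, BrokenBackLink, TooLong };

// Walks the list from head and checks that every forward link is mirrored by
// the matching back link. Gives up after max_nodes elements.
ListStatus validate_list(const Node* head, std::size_t max_nodes) {
    assert(max_nodes > 0);
    if (head == nullptr) return ListStatus::Ok;
    if (head->prev != nullptr) return ListStatus::BrokenBackLink;

    std::size_t count = 0;
    for (const Node* n = head; n != nullptr; n = n->next) {
        if (++count > max_nodes) return ListStatus::TooLong;
        // A link back into an earlier node fails here, because that node's
        // prev still points at its original predecessor (or is null for head).
        if (n->next != nullptr && n->next->prev != n) {
            return ListStatus::BrokenBackLink;
        }
    }
    return ListStatus::Ok;
}
```

Calling this right after each insert, remove and error-path cleanup, and stopping the process on the first non-`Ok` result, tends to move the failure from "hang days later" to "stop next to the code that broke the link". It cannot protect against pointers that lead into unmapped memory; full page heap is still the tool for that.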

Mark Wilkins
The bug is definitely someone else's, unfortunately. Running Application Verifier in my environment did not help, but since the reproduction rate is very low, I'd probably have to keep at it for many days. I'd prefer to think my way out of the problem instead.
adrian8400
@adrian8400, then I would clean up and modularize as much as possible. Double check every possible return from system calls, like sockets etc.
Amigable Clark Kant