I'm working with a single-threaded native C++ application. It has a bug that is very hard to trigger and that I cannot reproduce locally. I enabled full page heap and debug information in the release executable, and obtained dumps from a client (who has to use the application for many days before the bug appears).
What the client reports: the application hangs and never recovers. It has to be killed from the task manager. What I see from the dumps: the application is stuck in an infinite loop.
The loop comes from walking a doubly linked list that has become cyclic. There are signs of memory corruption: many data members have strange values, such as values with no matching enumerant or values under 0000FFFF, and the linked list itself is reported as having a size of 300 million+, which is not normal.
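To make the failure mode concrete, here is a minimal sketch of the kind of integrity check that could be called at various points to catch the list going bad earlier than the hang. The `Node` layout, the `ListStatus` enum and the `checkList` function are all hypothetical names for illustration (the real list type is different), and the code assumes a C++11 compiler.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical node layout; the real list node has more members.
struct Node {
    Node* prev = nullptr;
    Node* next = nullptr;
};

enum class ListStatus { Ok, BrokenBackLink, TooLong };

// Walks forward from head and checks two invariants:
//  - every node's prev pointer matches the node we just came from
//  - the walk terminates within maxNodes steps (guards against cycles
//    and absurd sizes)
inline ListStatus checkList(const Node* head, std::size_t maxNodes) {
    const Node* prev = nullptr;
    std::size_t count = 0;
    for (const Node* n = head; n != nullptr; prev = n, n = n->next) {
        if (n->prev != prev) return ListStatus::BrokenBackLink;
        if (++count > maxNodes) return ListStatus::TooLong;
    }
    return ListStatus::Ok;
}
```

Something like `assert(checkList(head, 100000) == ListStatus::Ok);` after each operation that modifies the list would, under these assumptions, stop the program close to the point where the links first become inconsistent, rather than days later in the walk.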
The only other information I can get from the dumps is that a socket read operation failed with 0 data read. This causes the walking of the (now cyclic) list.
I have several dumps all hanging in the same infinite loop. I've tried to get the allocation stack trace, but !heap -p -a gives me "ReadMemory error for address eeddccee Use `!address eeddccee' to check validity of the address." for all addresses I try.
Currently I'm working on fixing the level 4 (/W4) compiler warnings, although I don't know which of them could be related to this. I have a bunch of C4100, C4511 and C4512 warnings which I don't know how to fix, so I'm mostly fixing no-brainers like C4244. DebugDiag did not find anything, except to report "This thread is not fully resolved and may or may not be a problem. Further analysis of these threads may be required." for the single thread.
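For reference, a minimal sketch of how these three warnings are commonly addressed. C4100 is an unreferenced formal parameter. C4512 is typically reported when an assignment operator cannot be generated (for example because of a reference or const member), and C4511 when a copy constructor cannot be generated. The `onEvent` function and `Session` class are hypothetical, and `= delete` assumes C++11; on older compilers the usual equivalent is to declare both members private and leave them undefined.

```cpp
#include <cassert>

// C4100: the second parameter is intentionally unused, so it is left
// unnamed (the name survives only as a comment).
inline int onEvent(int code, void* /*context*/) {
    return code;
}

// C4511 / C4512: state explicitly that the class is not copyable instead
// of letting the compiler fail to generate the copy operations.
class Session {
public:
    explicit Session(int& counter) : counter_(counter) {}

    Session(const Session&) = delete;             // addresses C4511
    Session& operator=(const Session&) = delete;  // addresses C4512

    void touch() { ++counter_; }

private:
    int& counter_;  // a reference member prevents a generated operator=
};
```

These changes only make the existing behavior explicit; they do not alter what the program does at run time.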
From what I see, my options are fixing more warnings, re-reading the code until something jumps out at me, or learning something new from here.
Is this really memory corruption? Why does it hang in the same structure every time? How can I find the cause?