ansaurus

Question

C++/msvc6 application crashes due to heap corruption, any hints?

Answer 1

+1 A:

You can try peppering your code with calls to the debug heap checking routines to see if you can locate the corruption closer to the source (you're using the debug CRT to track down this problem, right?):

http://msdn.microsoft.com/en-us/library/aa271695(VS.60).aspx
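To illustrate the "peppering" idea, here is a minimal sketch of a checkpoint helper built on the debug CRT's `_CrtCheckMemory()`. The helper name `HeapLooksIntact` and the checkpoint labels are made up for this example. The check only does real work in an MSVC debug build; anywhere else it is a no-op that returns true.

```cpp
#include <cassert>
#include <cstdio>

#if defined(_MSC_VER) && defined(_DEBUG)
#include <crtdbg.h>   // debug CRT heap routines (MSVC debug builds only)
#endif

// Hypothetical helper: validates the debug heap and reports the checkpoint
// label if the CRT finds a damaged block. Outside an MSVC debug build there
// is nothing to check, so it simply returns true.
inline bool HeapLooksIntact(const char* checkpoint)
{
#if defined(_MSC_VER) && defined(_DEBUG)
    if (!_CrtCheckMemory())
    {
        std::fprintf(stderr, "Heap damaged at or before: %s\n", checkpoint);
        return false;
    }
#else
    (void)checkpoint;  // unused when the debug CRT is not available
#endif
    return true;
}
```

Calling it before and after a suspicious operation, for example `assert(HeapLooksIntact("after message parsing"));`, narrows the window in which the damage happened. Setting `_CRTDBG_CHECK_ALWAYS_DF` through `_CrtSetDbgFlag` runs the same validation on every allocation and deallocation, at a much higher cost.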

Michael Burr 2010-04-07 15:31:54

Thank you for your advice. Currently we can only run a debug build under very light workloads, which does not seem to be the condition under which the crash happens. Maybe there is some intermediate option?

David Alfonso 2010-04-07 15:49:14

@David: that's a pretty harsh restriction for debugging a corruption problem... Off the top of my head, the next step I'd take is to examine the heap memory to see whether the corrupted area holds clues about what is corrupting it. Sometimes the data has a recognizable pattern, or contains a pointer to something that hints at what was doing the writes. That's a pretty labor-intensive technique, though, and there might be no payoff.
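When looking for patterns in a corrupted region, a hex view with an ASCII column next to it often makes stray strings or pointer-like values stand out. The following is only a sketch of such a dump routine; `HexDump` is a hypothetical name, and in practice the debugger's memory window or a dump file gives you the same view.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical helper: renders a memory range as hex bytes followed by the
// printable ASCII view, e.g. "41 42 01 |AB.|".
inline std::string HexDump(const void* data, std::size_t size)
{
    static const char kDigits[] = "0123456789abcdef";
    const unsigned char* bytes = static_cast<const unsigned char*>(data);
    std::string hex;
    std::string ascii;
    for (std::size_t i = 0; i < size; ++i)
    {
        hex += kDigits[bytes[i] >> 4];     // high nibble
        hex += kDigits[bytes[i] & 0x0f];   // low nibble
        hex += ' ';
        // Non-printable bytes are shown as '.' in the ASCII column.
        ascii += std::isprint(bytes[i]) ? static_cast<char>(bytes[i]) : '.';
    }
    return hex + "|" + ascii + "|";
}
```

Readable text in the ASCII column usually points at a string buffer overrun, while repeated 4-byte groups that look like addresses suggest an array of pointers written past its end.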

Michael Burr 2010-04-07 16:23:15

We've been doing some code review and dump analysis, but there is no good news yet. Thank you for your suggestion.

David Alfonso 2010-04-12 14:05:10

Answer 2

+1 A:

Use Application Verifier from Debugging Tools for Windows. Sometimes it helps.

Try to set up VS to download the OS debug symbols, and make sure that OMIT FRAME POINTERS is off in your application. The stack trace may then be more informative.

Highly multithreaded

A long time ago I discovered that there is a limit on the thread count per process in WinXP. My test snippet could create only a few thousand threads. The problem was resolved by using a thread pool.

EDIT:

For my purposes it was enough just to check the “Application Verifier” checkbox in gflags.exe. Unfortunately, I have no experience with the other options. As for the thread limit, the test snippet was simple:

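#include <windows.h>
#include <process.h>   // _beginthreadex
#include <stdio.h>
#include <tchar.h>     // _tprintf, _T, _tmain

// Thread body: prints a message and exits immediately.
unsigned __stdcall ThreadProc(LPVOID)
{
  _tprintf(_T("Thread started\n"));
  return 0;
}

int _tmain(int argc, _TCHAR* argv[])
{
  // Deliberately unbounded: keeps starting threads to probe the per-process limit.
  while (TRUE)
  {
    unsigned threadId = 0;
    _tprintf(_T("Start thread\n"));
    // The returned handle is never passed to CloseHandle, so the handle count keeps growing.
    _beginthreadex( NULL, 0, &ThreadProc, NULL, 0, &threadId);
  }
  return 0;
}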

I didn’t wait long this time, but the handle count in Task Manager was increasing very fast. My real-world application showed this effect only after 12 hours. I must say, though, that the symptom was not a crash; new threads simply were not created.

Eugene 2010-04-08 14:56:00

Thank you, Eugene! Let me comment on your suggestions:
- Application Verifier also slows down the application to a great extent. Would you recommend any specific flag?
- We can't generate symbols because we're linking with a library which doesn't link if we're using them.
- FPO is not being used.
- Could you elaborate on the thread limit? Do you mean the thread count at any one time, or over the whole life span of the application? I'll investigate this, as it seems very promising.

David Alfonso 2010-04-09 07:28:36

Answer 3

+1 A:

The key here is that this only happens on multiprocessor machines (for this purpose, cores count as processors). When a threaded program runs on a single processor, two threads never execute at the same instant: the OS time-slices the processor to simulate concurrency. On a multiprocessor system, multiple threads really can run at the same time, so you are probably now accessing shared resources from different threads simultaneously. These resources can be connections to external systems, global variables and data structures, or even Singleton classes.

Unfortunately, you now have one of the hardest kinds of problem to find. If you can find the memory being corrupted, you then need to find what else is using it on a different thread and synchronize access to it (with a Semaphore or CriticalSection). There is no easy way to find the problem.
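As a rough sketch of what "synchronize access" means in code, the class below routes every access to a shared container through one lock. It uses `std::mutex`, which assumes a C++11 compiler; VC6 predates that, so there the same shape would be built on a Win32 `CRITICAL_SECTION` with `EnterCriticalSection` and `LeaveCriticalSection`. The class name `SharedLog` is invented for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <mutex>
#include <vector>

// Hypothetical shared structure: every access to items_ goes through mutex_,
// so two threads can never touch the vector at the same time.
class SharedLog
{
public:
    void Append(int value)
    {
        std::lock_guard<std::mutex> lock(mutex_);
        items_.push_back(value);   // may reallocate: unsafe without the lock
    }

    std::size_t Size() const
    {
        std::lock_guard<std::mutex> lock(mutex_);
        return items_.size();
    }

private:
    mutable std::mutex mutex_;     // mutable so const readers can lock too
    std::vector<int> items_;
};
```

The important part is that readers take the lock as well. An unsynchronized `push_back` racing with another thread is a classic way to end up with a freed or half-updated buffer, which then shows up later as heap corruption somewhere unrelated.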

You might be able to set the processor affinity temporarily so that the application runs on only one processor until you find the problem. See http://msdn.microsoft.com/en-us/library/ms684251(VS.85).aspx

Here is a way to set affinity on Windows XP/Vista/7: open the Windows Task Manager (CTRL+ALT+DEL, or right-click the task bar), select the "Processes" tab, right-click the application process you wish to isolate, then select "Set Affinity." In the Processor Affinity dialog, un-check the CPUs/cores you do not want to use. This confines the application to the selected CPUs/cores, which prevents cache spanning, reduces process switching, and makes it easier to supervise CPU/core allocation across multiple programs.
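The same restriction can also be applied from inside the program at startup, which avoids redoing the Task Manager steps after every restart. This is a sketch only: `SingleCpuMask` and `PinProcessToCpu` are hypothetical helper names, the Win32 call is `SetProcessAffinityMask`, and on other platforms the helper just reports failure.

```cpp
#include <cassert>

#ifdef _WIN32
#include <windows.h>
#endif

// Builds an affinity mask with only the given zero-based CPU index set.
// Valid for indices 0..31, since unsigned long is 32 bits on Windows.
inline unsigned long SingleCpuMask(unsigned cpuIndex)
{
    return 1UL << cpuIndex;
}

// Hypothetical helper: restricts the current process to one CPU.
// Returns false on failure or on platforms without the Win32 API.
inline bool PinProcessToCpu(unsigned cpuIndex)
{
#ifdef _WIN32
    return SetProcessAffinityMask(GetCurrentProcess(),
                                  SingleCpuMask(cpuIndex)) != 0;
#else
    (void)cpuIndex;
    return false;
#endif
}
```

Calling `PinProcessToCpu(0)` early in `main` confines all threads of the process to the first CPU. Keep in mind this hides the race rather than fixing it, so it is a stopgap while the real cause is tracked down.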

Romain Hippeau 2010-04-12 01:02:27

Romain, thank you for your suggestion. We might consider trying this in order to reduce the exception's frequency, as long as it doesn't impact performance.

David Alfonso 2010-04-13 07:18:18

@David Alfonso performance is nothing if it keeps crashing

Romain Hippeau 2010-04-14 13:09:35

You're absolutely right, Romain :-)

David Alfonso 2010-04-14 19:57:37

Answer 4

+1 A:

Can you post what exceptions you are getting?

If this is a memory corruption bug, then the crash occurs some time after the memory is corrupted, so it will be challenging to track down the root cause. You should:

- Travel (or remotely log on) to the production system, install Visual Studio, have the .pdb and .map files ready (and the Windows symbols as well), attach the debugger to the release build and wait for the crash. If you set it up correctly, though, you can instead use the minidump file on your dev machine, where you would already have your app's and Windows' symbols set up. Then you can see which free call is throwing, and try to figure out which object is being freed, to see whether that object or nearby objects in memory are corrupted.
- Somehow find a way to reproduce the bug in your office. Can you create high enough volumes to duplicate what the customer is doing?

Your posted callstacks don't look particularly illuminating.

Since you are using VS 6 with SP6, its STL is OK.

Can you tell if the app on the production system is leaking any resources? Running perfmon can help with this.

Another thing: you're not calling new/delete very frequently from different threads, are you? I've found that if you do this fast enough, you'll crash your app rather quickly (I saw this on XP). I had to replace the new/delete calls in my app with VirtualAlloc (the Windows virtual memory API), and that worked great for me. Of course, the STL could be allocating from the heap as well.
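For reference, a minimal sketch of what swapping a heap allocation for the virtual memory API can look like. The wrapper names `AllocBuffer` and `FreeBuffer` are hypothetical; on Windows they call `VirtualAlloc` and `VirtualFree`, and elsewhere they fall back to `calloc`/`free` purely so the sketch still builds.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>

#ifdef _WIN32
#include <windows.h>
#endif

// Hypothetical helpers for large buffers that bypass the CRT heap on Windows.
// On other platforms they fall back to calloc/free so the sketch still builds.
inline void* AllocBuffer(std::size_t bytes)
{
#ifdef _WIN32
    // Reserve and commit in one call; committed pages come back zero-filled.
    return VirtualAlloc(NULL, bytes, MEM_RESERVE | MEM_COMMIT, PAGE_READWRITE);
#else
    return std::calloc(1, bytes);
#endif
}

inline void FreeBuffer(void* buffer)
{
    if (buffer == NULL)
        return;
#ifdef _WIN32
    // With MEM_RELEASE the size argument must be 0.
    VirtualFree(buffer, 0, MEM_RELEASE);
#else
    std::free(buffer);
#endif
}
```

Note that `VirtualAlloc` works in whole pages and reserves address space in larger granules (typically 64 KB), so this suits a small number of large buffers rather than many small objects. It takes pressure off the shared heap, but it does not by itself fix whatever code is writing out of bounds.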

Chris O 2010-04-12 01:34:44

Hello Chris, I've updated my question with the exception code. We're trying to duplicate the crash in our office but it's proven itself a difficult task. On the other hand, we are doing new/delete rather frequently in different threads so I'm trying to substitute some of them with VirtualAllocs/Frees as you propose. I'll let you know if this improves the situation. Thank you very much for your suggestions!

David Alfonso 2010-04-13 07:46:46

Answer 5

+1 A:

Use a performance profiler that can hook into CPU events, such as VTune. Set it up in sampling mode and tell it to wait for events related to cache line sharing. These are identified by a HITM event from the SNOOP phase.

If you run this on a multiprocessor machine with a realistic workload, it will find places in your code where threads actively contend for a single piece of data. You will need to analyze the profiler hot spots found this way and look for something that is not wrapped in an appropriate mutex.

I'm not an expert on CPU architecture, but my understanding is this: when a CPU is about to access a piece of data, the system checks whether any other CPU is accessing the same data. It does this by watching the memory fetches and writes coming out of each CPU, a process called snooping. Snooping makes sure that if two or more CPUs hold the same data in their caches, the duplicate copies are removed when one of them is modified. A HIT-Modified (HITM) event means that the system detected this situation and had to flush one of the CPUs' cache lines.

See this document for more information on using VTune this way:

http://software.intel.com/en-us/articles/using-intel-vtune-performance-analyzer-events-ratios-optimizing-applications/

I don't have a copy of VTune in front of me right now, so maybe this won't work, but it seems like the lowest-impact way of getting some data. VTune in sampling mode should not cause a lot of performance problems.

Brian Sandlin 2010-04-12 02:45:40

Brian, thank you very much for your suggestion. I'll give it a try as soon as I have time and let you know how it does.

David Alfonso 2010-04-12 12:55:02

Answer 6

+1 A:

As your second stack trace shows, your application is corrupting the heap. The header of a heap block has been overwritten, so the crash occurs in the heap manager when it coalesces free blocks or walks the free list (as in the first stack trace). The code you identified as freeing memory may be the victim of other code overflowing or underflowing a memory block.

The easiest way to debug this kind of crash is to use the debugging help built into Windows, through pageheap or appverifier. Depending on the application, however, it may slow things down too much or push memory usage too high to be usable, which seems to be the case here. You may try light pageheap, which has less impact.
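If even the lighter tool modes are too heavy, the underlying idea can be applied by hand to a few suspect buffers: surround the payload with a known fill pattern and check later whether it is still there. The sketch below shows that general guard-byte technique only; it is not how pageheap itself is implemented, and `GuardedBlock`, the guard size and the fill byte are all arbitrary choices for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

const std::size_t kGuardSize = 8;        // bytes of padding on each side
const unsigned char kGuardByte = 0xAB;   // arbitrary fill pattern

// Hypothetical wrapper: surrounds a payload with guard bytes so an overflow
// or underflow can be detected later by checking whether the guards changed.
class GuardedBlock
{
public:
    explicit GuardedBlock(std::size_t payloadSize)
        : storage_(payloadSize + 2 * kGuardSize, kGuardByte),
          payloadSize_(payloadSize)
    {
        if (payloadSize_ > 0)
            std::memset(Payload(), 0, payloadSize_);
    }

    // Pointer to the usable area between the two guard regions.
    unsigned char* Payload() { return &storage_[0] + kGuardSize; }

    // True while both guard regions still hold the fill pattern.
    bool IsIntact() const
    {
        for (std::size_t i = 0; i < kGuardSize; ++i)
        {
            if (storage_[i] != kGuardByte)
                return false;   // something wrote before the payload
            if (storage_[kGuardSize + payloadSize_ + i] != kGuardByte)
                return false;   // something wrote past the payload
        }
        return true;
    }

private:
    std::vector<unsigned char> storage_;
    std::size_t payloadSize_;
};
```

Checking `IsIntact()` at a few points in the buffer's lifetime tells you between which two checkpoints the overrun happened, which is usually much closer to the culprit than the eventual crash inside the heap manager.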

You need to identify which part of the application is overflowing. One way to do this is to look at the information contained in the overflowed block. If the crash is in RtlpCoalesceFreeBlocks, I seem to remember that one of the registers (@esi) points to the start of the corrupted block (I am not on a Windows system at the time of writing and cannot check that). Or, if you have a dump, the windbg command !heap -a will dump all memory and display corrupted blocks (better to log it to a file, since the full heap listing can be long). Once the corrupted blocks are known, their content may help to identify the code.

Another aid is to enable stack backtraces (using gflags). This can be done in production, as it is lighter than pageheap. It adds some information to heap blocks and may move the crash to another place in your application, but the stack traces will help identify what code allocated the blocks that are being overflowed.

plodoc 2010-04-14 22:07:19

Answer 7

A:

I would focus on getting the issue to happen on a build for which you have proper debugging symbols, at least for your main application. You seem to gloss over this with "sorry we don't have symbols", but when symbols are applied, the stacktraces may show you more information.

What exactly does this mean: "We can't generate symbols because we're linking with a library which doesn't link if we're using them."? This seems odd.

pj4533 2010-04-22 15:23:57

Thank you for your answer, pj4533. As a matter of fact, we are moving to a new compiler (VS2008) and we do have symbols now, but the crash doesn't reproduce. We can't generate symbols because we're linking with a static library which doesn't allow debug information (I don't remember the exact flag that causes the error when linking the application).

David Alfonso 2010-04-22 18:31:39

Answer 8

A:

Check whether you have any thread-locking mechanisms that may not be working correctly. You can do this by adding delays near the crash zone.
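One way to add such delays without scattering raw sleeps through the code is a small switchable helper. This is only a sketch: `RaceDelay` and `g_raceDelayEnabled` are hypothetical names, and it assumes a C++11 compiler. With VC6 the body would be a call to the Win32 `Sleep` function instead.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <thread>

// Hypothetical switch: turn it on only in diagnostic builds or test runs.
static std::atomic<bool> g_raceDelayEnabled(false);

// Sleeps briefly when enabled, widening the window in which another thread
// can interleave with the code that follows the call.
inline void RaceDelay(unsigned milliseconds)
{
    if (g_raceDelayEnabled.load())
        std::this_thread::sleep_for(std::chrono::milliseconds(milliseconds));
}
```

Placing `RaceDelay(10);` between a check and the use that depends on it (for example, between testing a pointer and dereferencing it) tends to make a missing or broken lock fail far more often, which turns a once-a-day crash into something reproducible.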

neatsun 2010-05-29 12:55:54
