Guys, could you please recommend a tool for spotting memory corruption on a production multithreaded server built with C++ and running on Linux x86_64? I'm currently facing the following problem: every several hours my server crashes with a segfault, and the core dump shows that the error happens in malloc/calloc, which is a strong sign of heap memory being corrupted somewhere.

Actually, I have already tried some tools without much luck. Here is my experience so far:

  • Valgrind is a great (I'd even say the best) tool, but it slows down the server too much, making it unusable in production. I tried it on a stage server and it really helped me find some memory-related issues, but even after fixing them I still get crashes on the production server. I ran my stage server under Valgrind for several hours but still couldn't spot any serious errors.

  • ElectricFence is said to be a real memory hog, but I couldn't even get it working properly. It segfaults almost immediately on the stage server in random weird places where Valgrind didn't show any issues at all. Maybe ElectricFence doesn't support threading well? I have no idea.

  • DUMA - same story as ElectricFence but even worse. While EF produced core dumps with readable backtraces, DUMA shows me only "?????" (and yes, the server is built with the -g flag for sure).

  • dmalloc - I configured the server to use it instead of the standard malloc routines, but it hangs after several minutes. Attaching gdb to the process reveals it's hung somewhere in dmalloc :(

I'm gradually going crazy and simply don't know what to do next. I still have the following tools left to try: mtrace and mpatrol, but maybe someone has a better idea?

I'd greatly appreciate any help on this issue.

Update: I managed to find the source of the bug. However, I found it on the stage server, not the production one, using helgrind/DRD/tsan: there was a data race between several threads which resulted in memory corruption. The key was to use proper Valgrind suppressions, since these tools showed too many false positives. Still, I don't really know how this could be discovered on the production server without a significant slowdown...

+1  A: 

You can try IBM Purify, but I'm afraid it is not open source.

Neeraj
Well, if nothing else works... But I still believe there should be an open source solution to this.
pachanga
Also, Purify slows down the application considerably and cannot be used on a production machine.
steve
+4  A: 

Yes, C/C++ memory corruption problems are tough. I have also used Valgrind several times; sometimes it revealed the problem and sometimes not.

When examining Valgrind output, don't be too quick to dismiss its results. Sometimes, after spending a considerable amount of time, you realize that Valgrind gave you the clue in the first place, but you ignored it.

Another piece of advice is to compare the code against the last known stable release. That's easy if you use some sort of version control system (e.g. svn). Examine all uses of memory-related functions and operators (e.g. memcpy, memset, sprintf, new, delete/delete[]).

dimba
+1 for ignoring valgrind striking back
LiraNuna
As for examining all memory-related functions - I don't use them directly anywhere; all pointers are shared_ptrs or weak_ptrs and all containers are from the STL...
pachanga
The STL is good, but even with the STL you can run into memory corruption problems, for example when using an invalidated iterator. See http://www.angelikalanger.com/Conferences/Slides/CppInvalidIterators-DevConnections-2002.pdf
dimba
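To illustrate the invalidated-iterator point, here is a minimal sketch (the function name `remove_even` is made up for this example and is not taken from the linked slides). Erasing from a `std::vector` invalidates the iterator at the erased position, so continuing to use the old iterator is undefined behaviour and can silently corrupt memory. One safe pattern is to continue from the iterator that `erase()` returns:

```cpp
#include <cassert>
#include <vector>

// Hypothetical helper: removes all even values from a copy of the input.
// The buggy variant would call values.erase(it) and then ++it, which uses
// an invalidated iterator. Here we continue from erase()'s return value.
std::vector<int> remove_even(std::vector<int> values) {
    for (auto it = values.begin(); it != values.end(); /* no ++ here */) {
        if (*it % 2 == 0) {
            it = values.erase(it);  // erase() returns the next valid iterator
        } else {
            ++it;
        }
    }
    // Sanity check: nothing even is left behind.
    for (int v : values) {
        assert(v % 2 != 0);
    }
    return values;
}
```

Calling `remove_even({1, 2, 3, 4, 6, 7})` should leave only the odd values. The same rule applies to `push_back` on a vector that has to reallocate: every iterator, pointer, and reference into it becomes invalid.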
Yep, I know it's always possible to shoot oneself in the foot even with such high-level libraries
pachanga
+2  A: 

Google Perftools, which is open source, may be of help; see the heap checker documentation.

Dirk Eddelbuettel
Thanks, going to try it right now
pachanga
Unfortunately, the heap checker is pretty limited: it can detect only memory leaks, not memory overruns. It could not even detect a mismatched new[]/delete :(
pachanga
+4  A: 

Compile your program with gcc 4.1 or later and the -fstack-protector-all switch. If the memory corruption is caused by stack smashing, this should be able to detect it. You might need to play with some of the additional SSP parameters.

steve
+2  A: 

Have you tried -fmudflap? (scroll up a few lines to see the options available).

David Wilson
Thanks, I also found this link http://gcc.gnu.org/wiki/Mudflap_Pointer_Debugging
pachanga
I'm currently fighting with "error: mudflap cannot track unknown size extern ‘__prime_list’" errors :( Any idea why they happen? I have no __prime_list symbol anywhere in the code...
pachanga
It does rely on libmudflap being installed. Maybe it's not?
supercheetah
it's installed for sure
pachanga
+1  A: 

Try this one: http://www.hexco.de/rmdebug/ I have used it extensively. It has a low impact on performance (it mostly affects the amount of RAM used), and the allocation algorithm stays the same. It has always proven enough to find any allocation bugs. Your program will crash as soon as the bug occurs, and it will leave a detailed log.

daniel
Thanks, I'll have a look at it. I wonder if it works fine in a multithreaded C++ app...
pachanga
Yes, threading should have no impact
daniel
+1  A: 

Folks, I managed to find the source of the bug. However, I found it on the stage server using helgrind/DRD/tsan: there was a data race between several threads which resulted in memory corruption. The key was to use proper Valgrind suppressions, since these tools showed too many false positives. Still, I don't really know how this could be discovered on the production server without a significant slowdown...

pachanga
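For readers wondering how a data race turns into heap corruption, here is a minimal sketch. It is not the actual server code; the names `SharedLog` and `run_writers` are hypothetical. If several threads call `push_back` on the same `std::vector` without synchronization, two of them can reallocate the buffer at the same time, and the crash typically shows up later inside malloc/free, much like the symptoms in the question. Guarding the container with a mutex removes the race:

```cpp
#include <cassert>
#include <cstddef>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical shared container. Without the mutex, concurrent push_back
// calls race on the vector's internal buffer and can corrupt the heap.
class SharedLog {
public:
    void append(int value) {
        std::lock_guard<std::mutex> lock(mutex_);
        items_.push_back(value);  // may reallocate, so it must be serialized
    }

    std::size_t size() const {
        std::lock_guard<std::mutex> lock(mutex_);
        return items_.size();
    }

private:
    mutable std::mutex mutex_;
    std::vector<int> items_;
};

// Starts `threads` writers that each append `per_thread` values and
// returns the final number of stored items.
std::size_t run_writers(int threads, int per_thread) {
    assert(threads >= 0 && per_thread >= 0);
    SharedLog log;
    std::vector<std::thread> workers;
    for (int t = 0; t < threads; ++t) {
        workers.emplace_back([&log, per_thread] {
            for (int i = 0; i < per_thread; ++i) {
                log.append(i);
            }
        });
    }
    for (auto& worker : workers) {
        worker.join();
    }
    return log.size();
}
```

With the lock in place, `run_writers(4, 1000)` should report 4000 items. Removing the `lock_guard` from `append` gives a small reproducer that helgrind, DRD, or tsan should flag on a test machine.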
+1  A: 

I'm not sure if it would have caught your particular bug, but the MALLOC_CHECK_ environment variable (malloc man page) turns on additional checking in the default Linux malloc implementation, and typically doesn't have a significant runtime cost.

Dave Rigby
Thanks, I've tried it as well (MALLOC_CHECK_=3); however, it didn't show me any source of memory corruption since (as I wrote earlier) the memory was corrupted by a data race, not by improper usage of malloc/free...
pachanga