Tracing memory corruption on a production linux server

views:

1235

+7 Q:

Guys, could you please recommend a tool for spotting memory corruption on a production multithreaded server built with C++ and running under Linux x86_64? I'm currently facing the following problem: every few hours my server crashes with a segfault, and the core dump shows that the error happens in malloc/calloc, which is a strong sign that memory is being corrupted somewhere.

Actually I have already tried some tools without much luck. Here is my experience so far:

Valgrind is a great (I'd even say the best) tool, but it slows the server down too much, making it unusable in production. I tried it on a stage server and it really helped me find some memory-related issues, but even after fixing them I still get crashes on the production server. I ran my stage server under Valgrind for several hours but still couldn't spot any serious errors.
ElectricFence is said to be a real memory hog, but I couldn't even get it working properly. It segfaults almost immediately on the stage server in random weird places where Valgrind didn't show any issues at all. Maybe ElectricFence doesn't support threading well?.. I have no idea.
DUMA - same story as ElectricFence but even worse. While EF produced core dumps with readable backtraces, DUMA shows me only "?????" (and yes, the server is built with the -g flag for sure).
dmalloc - I configured the server to use it instead of the standard malloc routines, but it hangs after several minutes. Attaching gdb to the process reveals it's hung somewhere in dmalloc :(

I'm slowly going crazy and simply don't know what to do next. I still have mtrace and mpatrol left to try, but maybe someone has a better idea?

I'd greatly appreciate any help on this issue.

Update: I managed to find the source of the bug. However, I found it on the stage server, not the production one, using helgrind/DRD/tsan: there was a data race between several threads which resulted in memory corruption. The key was to use proper valgrind suppressions, since these tools reported too many false positives. Still, I don't really know how this could be discovered on the production server without a significant slowdown...
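The post does not say which data the threads were racing on. Purely as an illustration, here is a minimal sketch of one common way a data race turns into heap corruption, and how a mutex removes it. The names SharedLog and run_workers are hypothetical and not taken from the server in question.

```cpp
#include <cassert>
#include <cstddef>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical shared container. std::vector::push_back may reallocate its
// buffer, so concurrent unsynchronized calls can write through a stale
// pointer and damage the heap. Here every access takes the mutex.
class SharedLog {
public:
    void append(int value) {
        std::lock_guard<std::mutex> lock(mutex_);
        entries_.push_back(value);
    }

    std::size_t size() const {
        std::lock_guard<std::mutex> lock(mutex_);
        return entries_.size();
    }

private:
    mutable std::mutex mutex_;
    std::vector<int> entries_;
};

// Starts `threads` workers that each append `per_thread` values,
// waits for all of them, and returns the final number of entries.
std::size_t run_workers(unsigned threads, int per_thread) {
    assert(per_thread >= 0);
    SharedLog log;
    std::vector<std::thread> pool;
    for (unsigned t = 0; t < threads; ++t) {
        pool.emplace_back([&log, per_thread] {
            for (int i = 0; i < per_thread; ++i) {
                log.append(i);
            }
        });
    }
    for (auto& worker : pool) {
        worker.join();
    }
    return log.size();
}
```

Removing the lock_guard from append turns this into the kind of race that helgrind, DRD or tsan would typically report on a stage server.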

+1 A:

You can try IBM Purify, but I'm afraid it is not open source.

Neeraj 2009-07-25 19:57:11

Well, if nothing else works... But I still believe there should be an open source solution to this.

pachanga 2009-07-25 20:06:16

Also, Purify slows down the application considerably and cannot be used on a production machine.

steve 2009-07-25 20:13:49

+4 A:

Yes, C/C++ memory corruption problems are tough. I have also used valgrind several times; sometimes it revealed the problem and sometimes not.

While examining valgrind output, don't be too quick to dismiss its results. Sometimes, after spending a considerable amount of time, you realize that valgrind gave you the clue in the first place, but you ignored it.

Another piece of advice is to compare the code changes against the last known stable release. That is not a problem if you use some sort of version control system (e.g. svn). Examine all memory-related functions (e.g. memcpy, memset, sprintf, new, delete/delete[]).

dimba 2009-07-25 20:04:06

+1 for ignoring valgrind striking back

LiraNuna 2009-07-25 20:05:20

As for examining all memory-related functions - I don't use them directly anywhere; all pointers are shared_ptrs or weak_ptrs and all containers are from the STL...

pachanga 2009-07-25 20:13:42

STL is good, but even with STL you can run into memory corruption problems, for example when using an invalidated iterator. See http://www.angelikalanger.com/Conferences/Slides/CppInvalidIterators-DevConnections-2002.pdf
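To make the invalidated-iterator case concrete, here is a small sketch that is not taken from the linked slides. The function name remove_even is hypothetical. The buggy pattern is shown only in a comment; the function itself uses the iterator returned by erase.

```cpp
#include <vector>

// Buggy pattern, shown for comparison only:
//
//   for (auto it = values.begin(); it != values.end(); ++it)
//       if (*it % 2 == 0)
//           values.erase(it);   // `it` is invalid after erase, ++it is UB
//
// Safe version: vector::erase returns an iterator to the element that
// followed the erased one, so the loop continues from a valid position.
void remove_even(std::vector<int>& values) {
    for (auto it = values.begin(); it != values.end(); ) {
        if (*it % 2 == 0) {
            it = values.erase(it);
        } else {
            ++it;
        }
    }
}
```

The buggy variant may appear to work for a long time and only occasionally damage memory, which matches the "crashes every few hours" symptom in general, although nothing in the thread says this was the actual cause.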

dimba 2009-07-25 20:22:45

Yep, I know it's always possible to shoot oneself in the foot even with such high-level libraries

pachanga 2009-07-25 20:29:35

+2 A:

Google Perftools, which is open source, may be of help; see the heap checker documentation.

Dirk Eddelbuettel 2009-07-25 20:15:44

Thanks, going to try it right now

pachanga 2009-07-25 20:18:59

Unfortunately, the heap checker is pretty limited: it can detect only memory leaks, not memory overruns. It could not even detect a mismatched new[]/delete :(
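For readers unfamiliar with the new[]/delete mismatch mentioned here, a minimal sketch follows. The function name sum_first is hypothetical. The mismatch itself is shown only in a comment, and the function uses std::unique_ptr<int[]>, which always calls delete[].

```cpp
#include <cstddef>
#include <memory>
#include <numeric>

// Mismatch, shown for comparison only (undefined behaviour):
//
//   int* data = new int[n];
//   delete data;        // should be: delete[] data;
//
// Safe version: unique_ptr<int[]> pairs new[] with delete[] automatically.
// Fills an array with 1..n and returns the sum of its elements.
int sum_first(std::size_t n) {
    std::unique_ptr<int[]> data(new int[n]);
    std::iota(data.get(), data.get() + n, 1);
    return std::accumulate(data.get(), data.get() + n, 0);
}
```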

pachanga 2009-07-26 11:14:03

+4 A:

Compile your program with gcc 4.1 or later and the -fstack-protector-all switch. If the memory corruption is caused by stack smashing, this should be able to detect it. You might need to play with some of the additional parameters of SSP.
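As a rough illustration of the kind of bug this switch targets, the sketch below puts an overflow-prone copy into a stack buffer next to a bounded one. Both function names are hypothetical, and whether the canary check actually fires for a given overflow depends on the compiler and stack layout.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <cstring>

// Overflow-prone: strcpy writes past `buf` for inputs of 16 or more
// characters. When built with -fstack-protector-all, the canary check on
// return typically aborts the process with "stack smashing detected".
// Do not call this with long input outside of a deliberate experiment.
std::size_t unsafe_length(const char* input) {
    assert(input != nullptr);
    char buf[16];
    std::strcpy(buf, input);
    return std::strlen(buf);
}

// Bounded: snprintf never writes more than sizeof(buf) bytes and always
// null-terminates, so long input is truncated to 15 characters.
std::size_t bounded_length(const char* input) {
    assert(input != nullptr);
    char buf[16];
    std::snprintf(buf, sizeof(buf), "%s", input);
    return std::strlen(buf);
}
```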

steve 2009-07-25 22:11:12

+2 A:

Have you tried -fmudflap? (scroll up a few lines to see the options available).

David Wilson 2009-07-25 22:49:59

Thanks, I also found this link http://gcc.gnu.org/wiki/Mudflap_Pointer_Debugging

pachanga 2009-07-26 06:26:33

I'm currently fighting with "error: mudflap cannot track unknown size extern ‘__prime_list’" errors :( Any idea why they happen? I have no __prime_list symbol anywhere in the code...

pachanga 2009-07-26 07:10:37

It does rely on libmudflap being installed. Maybe it's not?

supercheetah 2009-07-31 21:41:49

it's installed for sure

pachanga 2009-08-13 14:01:57

+1 A:

Try this one: http://www.hexco.de/rmdebug/ I have used it extensively; it has a low impact on performance (it mostly affects the amount of RAM used), but the allocation algorithm is the same. It has always proven enough to find any allocation bugs. Your program will crash as soon as the bug occurs, and it will have a detailed log.

daniel 2009-07-30 05:59:42

Thanks, I'll have a look at it. I wonder if it works fine in a multithreaded C++ app...

pachanga 2009-07-30 09:56:00

Yes, threading should have no impact

daniel 2009-08-01 06:25:06

+1 A:

Folks, I managed to find the source of the bug. However, I found it on the stage server using helgrind/DRD/tsan: there was a data race between several threads which resulted in memory corruption. The key was to use proper valgrind suppressions, since these tools reported too many false positives. Still, I don't really know how this could be discovered on the production server without a significant slowdown...

pachanga 2009-07-31 20:33:11

+1 A:

I'm not sure if it would have caught your particular bug, but the MALLOC_CHECK_ environment variable (malloc man page) turns on additional checking in the default Linux malloc implementation, and typically doesn't have a significant runtime cost.

Dave Rigby 2009-08-02 18:27:20

Thanks, I've tried it as well (MALLOC_CHECK_=3); however, it didn't show me any source of memory corruption since (as I wrote earlier) the memory was corrupted by a data race, not by improper use of malloc/free...

pachanga 2009-08-03 04:41:06
