views: 770
answers: 15

I am getting random crashes in my C++ application: it may not crash for a month and then crash 10 times in an hour; sometimes it crashes on launch, sometimes after several hours of operation (or it doesn't crash at all).

I use GCC on GNU/Linux and MinGW on Windows, so I can't use the Visual Studio JIT debugger...

I have no idea how to proceed. Looking randomly through the code would not work: the code is HUGE (a good part of it was not my work, and it also has a fair amount of legacy stuff in it), and I also don't have a clue how to reproduce the crash.

EDIT: Lots of people mentioned that... but how do I make a core dump, minidump or whatever-dump? This is the first time I need postmortem debugging.

EDIT2: Actually, DrMingw captured a call stack, but no memory info... Unfortunately, the call stack didn't help me much, because near the end it suddenly goes into some library (or something) that I don't have debug info for, resulting in nothing but hexadecimal numbers... So I still need a decent dump that gives more information (especially about what was in memory; specifically, what was at the location that triggered the "access violation" error).

Also, my application uses Lua and Luabind, so maybe the error is being caused by a .lua script, but I have no idea how to debug that.

+7  A: 

Start the program under a debugger (I'm sure there is a debugger that works with GCC and MinGW) and wait until it crashes under the debugger. At the point of the crash you will be able to see what specific action is failing, and look into the assembly code, registers, and memory state - this will often help you find the cause of the problem.

sharptooth
gdb runs under mingw
sje397
I can't do that: the performance under the debugger is too slow to make the program useful at all, and it may take a LOOOOONG time before crashing. So it would require me to use GDB all the time, and for this project that is totally unreasonable.
speeder
@speeder: I personally have never seen any difference in speed when running under a debugger. I don't mean step-by-step; I mean just run it and leave it running until it crashes.
sharptooth
I usually don't either, but my program's debug-build binary is 140 MB; it also loads another 100 MB of data (a good part generated on the fly), and GDB itself, when loaded with my program, takes another 200 MB... This results in the OS going nuts with page files. Also, my memory is not the greatest out there (in fact it is quite old, and 2 GB in total...)
speeder
@speeder: You could do the following: compile the program with debug symbols but with optimizations enabled. This way it will not be as bloated as a full debug build, and you will still be able to see the call stack when it crashes under the debugger.
sharptooth
+23  A: 

Try Valgrind (it's free, open-source):

The Valgrind distribution currently includes six production-quality tools: a memory error detector, two thread error detectors, a cache and branch-prediction profiler, a call-graph generating cache profiler, and a heap profiler. It also includes two experimental tools: a heap/stack/global array overrun detector, and a SimPoint basic block vector generator. It runs on the following platforms: X86/Linux, AMD64/Linux, PPC32/Linux, PPC64/Linux, and X86/Darwin (Mac OS X).

Valgrind Frequently Asked Questions

The Memcheck part of the package is probably the place to start:

Memcheck is a memory error detector. It can detect the following problems that are common in C and C++ programs.

  • Accessing memory you shouldn't, e.g. overrunning and underrunning heap blocks, overrunning the top of the stack, and accessing memory after it has been freed.

  • Using undefined values, i.e. values that have not been initialised, or that have been derived from other undefined values.

  • Incorrect freeing of heap memory, such as double-freeing heap blocks, or mismatched use of malloc/new/new[] versus free/delete/delete[]

  • Overlapping src and dst pointers in memcpy and related functions.

  • Memory leaks.

Mitch Wheat
+1. Valgrind can often hand you the line number of your bug for zero effort. It's like magic.
Jason Orendorff
I'll +1 this as well. Having only recently started using this, I find it damn-near indispensable.
paxdiablo
Valgrind is great, but unfortunately it won't catch errors on Windows/MinGW because it does not exist there. Possible replacements: http://stackoverflow.com/questions/413477/is-there-a-good-valgrind-substitute-for-windows
Luther Blissett
@Luther Blissett : poster is also running on Linux
Mitch Wheat
@Mitch: My comment does not deny that.
Luther Blissett
There is windbg for Windows.
Mitch Wheat
My question exists exactly because I can't use, for example, Valgrind all the time... Valgrind makes the program SLOOOOOOW, INCREDIBLY SLOOOOW. And it may take hours to crash, or months... I can't work an entire month with the program running under Valgrind...
speeder
+8  A: 

Where I work, a crashing program usually generates a core dump file that can be loaded into windbg.

We then have an image of the memory at the time the program crashed. There's not much you can do with it, but at least it gives you the last call stack. Once you know which function crashed, you might then be able to track down the problem, or at least reduce it to a more reproducible test case.

ereOn
Could you give some details? My latest info regarding mingw is that mingw-gcc binaries can't generate core dumps and windbg has very little to say about mingw binaries because they use the stabs debugging format which windbg doesn't understand.
Luther Blissett
@Luther Blissett: unfortunately, the core dump files seem to be generated by the system (I work for a very big company, and I'm not part of the team that actually set this up). However, I'm sure that my test binaries (created with MinGW) are "core-dumped" on crashes, and I highly doubt the team in charge added a special case for this.
ereOn
I believe these are called (MS term) "Minidumps." windbg has a setting to read these "post-mortem" and can reveal "stuff."
JustBoo
@JustBoo: Just asked my colleagues, and yes, you're right.
ereOn
+4  A: 

Run the application on Linux under valgrind to look for memory errors. Random crashes are usually down to memory corruption.

Fix every error you find with valgrind's memcheck tool, and then hopefully the crash will go away.

If the whole program takes too long to run under valgrind, then split functionality off into unit tests and run those under valgrind; hopefully you'll find the memory errors that are causing the problems.

If it doesn't, then make sure core dumps are enabled (`ulimit -c unlimited`; check the current limits with `ulimit -a`), and then when it crashes you'll be able to find out where with gdb.

Douglas Leeder
Does valgrind finally run on Windows? I've been looking for that for years now.
ereOn
@ereOn: Unfortunately no, it does not, but the OP is also using Linux, so it should be an option for him. Only Linux and OS X are really supported right now, though there are unofficial ports for FreeBSD and NetBSD. see http://www.valgrind.org/info/platforms.html
Nicholas Knight
Valgrind is useless for this project, because it runs so slowly that, by the time an error happens that valgrind can catch, I might be dead of old age...
speeder
Unfortunately `valgrind` or some other memory checker is the best thing I can suggest. Otherwise you pretty much have to rewrite the application.
Douglas Leeder
+3  A: 

That sounds like something tricky, like a race condition.

I'd suggest you create a debug build and use that. You should also make sure that a core dump is created when the program crashes.

The next time the program crashes, you can launch gdb on the core dump and see where the problem lies. It'll probably be a downstream fault (the crash site being a consequence of earlier corruption), but this should get you started.

fhd
It **could** be a race condition, or anything else that results in undefined behavior. We don't have enough information to make educated guesses.
ereOn
Yeah, that's why the core dump should help. He said he didn't want to look at the source code randomly and I agree. The core dump should get him started.
fhd
+12  A: 

If all else fails (particularly if performance under the debugger is unacceptable), extensive logging. Start with the entry points -- is the app transactional? Log each transaction as it comes in. Log all the constructor calls for your key objects. Since the crash is so intermittent, log calls to all the functions that might not get called every day.

You'll at least start narrowing down where the crash could be.

Nicholas Knight
+1 ... I see this in my current project.
Kedar
I used to do the same, but I noticed that logging often causes the program to do I/O, which *sometimes* prevents some bugs/race conditions from happening. I believe logging is a more effective technique when you have a bug that occurs deterministically.
ereOn
Yeah, ya gotta love those Heisenbugs.
paxdiablo
I *hate* Heisenbugs/Schrödingbugs. Getting rid of them so that behavior is predictable (possibly leading to a crash, but then with a known cause) is very important, since that almost always leads shortly after to fully working code…
Donal Fellows
A: 

You have probably made a memory error somewhere, writing values into unallocated space; that is a common cause of random crashes. For a long time nothing touches that memory, so there are no visible errors. Take a look at the places where you allocate memory, and check where you make heavy use of pointers. Other than that, as others pointed out, you should use extensive logging, both to screen and to files.

LostMohican
+5  A: 

These sorts of bugs are always tricky - unless you can reproduce the error then your only option is to make changes to your application so that extra information is logged, and then wait until the error happens again in the wild.

There is an excellent tool called Process Dumper that you can use to obtain a crash dump of a process that experiences an exception or exits unexpectedly - you could ask users to install that and configure rules for your application.

Alternatively if you don't want to ask users to install other applications you could have your application monitor for exceptions and create a dump itself by calling MiniDumpWriteDump.

The other option is to improve the logging, however figuring out what information to log (without just logging everything) can be tricky, and so it can take several iterations of crash - change logging to hunt down the problem.

As I said, these sorts of bugs are always tricky to diagnose - in my experience it generally involves hours and hours of peering through logs and crash dumps until suddenly you get that eureka moment where everything makes sense - the key is collecting the right information.

Kragen
+6  A: 

First, you are lucky that your process crashes multiple times in a short time period. That should make it easy to proceed.

This is how you proceed.

  • Get a crash dump
  • Isolate a set of potential suspicious functions
  • Tighten up state checking
  • Repeat

Get a crash dump

First, you really need to get a crash dump.

If you don't get crash dumps when it crashes, start with writing a test that produces reliable crash dumps.

Re-compile the binary with debug symbols or make sure that you can analyze the crash dump with debug symbols.

Find suspicious functions

Given that you have a crash dump, look at it in gdb or your favorite debugger, and remember to show all threads (`thread apply all bt` in gdb)! The thread you first see in gdb might not be the buggy one.

Looking at where gdb says your binary crashed, isolate some set of functions you think might cause the problem.

Looking at multiple crashes and isolating code sections that are commonly active in all of the crashes is a real time-saver.

Tighten up state checking

A crash usually happens because of some inconsistent state. The best way to proceed is often to tighten the state requirements. You do this the following way.

For each function you think might cause the problem, document what legal state the input or the object must have on entry to the function. (Do the same for the legal state it must have on exit from the function, but that's less important.)

If the function contains a loop, document the legal state it needs to have at the beginning of each loop iteration.

Add asserts for all such expressions of legal state.

Repeat

Then repeat the process. If it still crashes outside of your asserts, tighten the asserts further. At some point the process will crash on an assert and not because of some random crash. At this point you can concentrate on trying to figure out what made your program go from a legal state on entry to the function, to an illegal state at the point where the assert happened.

If you pair the asserts with verbose logging it should be easier to follow what the program does.

A: 

Two more pointers/ideas (besides core dump and valgrind on Linux):

1) Try Nokia's "Qt Creator". It supports MinGW and can act as a post-mortem debugger.

2) If it's feasible, maybe just run the application in gdb constantly?

Frank
+3  A: 

The first thing I would do is debug the core dump with gdb (on both Windows and Linux). The second would be running a program like Lint, Prefast (Windows), Clang Analyzer or some other static analysis tool (be prepared for a lot of false positives). The third would be some kind of runtime checker, like Valgrind (or its close variants), Microsoft Application Verifier, or Google Perftools.

And logging, which doesn't have to go to disk. You could, for instance, log to a global std::list&lt;std::string&gt; that is pruned to the last 100 entries. When an exception is caught, display the contents of that list.

Max Lybbert
+1 for Application Verifier. I'd actually start there, if valgrind is too slow.
leander
A: 

If your application is not Windows-specific, you may try compiling and running your program on other platforms, such as Linux (different distributions, 32/64 bits, ... if you have the luxury). That may help trigger the bugs in your program. Of course, you should use the tools mentioned in other posts, such as gdb, valgrind, etc.

tofu
+5  A: 

It sounds like your program is suffering from memory corruption. As already said your best option on Linux is probably valgrind. But here are two other options:

  • First of all, use a debug malloc. Nearly all C libraries offer a debug malloc implementation that initializes memory (normal malloc leaves "old" contents in memory), checks the boundaries of an allocated block for corruption, and so on. And if that's not enough, there is a wide choice of third-party implementations.

  • You might want to have a look at VMware Workstation. I have not set it up that way, but from their marketing materials they support a rather interesting way of debugging: run the debuggee in a "recording" virtual machine. When memory corruption occurs, set a memory breakpoint at the corrupted address and then turn back time in the VM to exactly the moment when that piece of memory was overwritten. See this PDF on how to set up replay debugging with Linux/gdb. I believe there is a 15- or 30-day demo for Workstation 7, which might be enough to shake those bugs out of your code.

froh42
+3  A: 

You've already heard how to handle this under Linux: inspect core dumps and run your code under valgrind. So your first step could be to find the errors under Linux and then check whether they vanish under MinGW. Since nobody has mentioned mudflap here, I will: use mudflap if your Linux distribution supplies it. Mudflap helps you catch pointer misuse and buffer overflows by tracking where each pointer is actually allowed to point.

And for Windows: there is a JIT debugger for MinGW, called DrMingw.

Luther Blissett
Ooooh... That stuff actually worked! There is one particular crash that I know how to cause (but not how to fix), which I used to test DrMingw. Too bad it offers no information about memory, only the call stack... :(
speeder
+3  A: 
  1. Start logging. Put logging statements in the places where you think the code is flaky. Focus on testing the code, and repeat until you narrow the problem down to a module or a function.

  2. Put asserts everywhere!

  3. While you are at it, only put one expression in each assert.

  4. Write a unit test for the code you think is failing. That way you can exercise the code in isolation from the rest of your runtime environment.

  5. Write more automated tests that exercise the problematic code.

  6. Do not add more code on top of the bad code that is failing. That's just a dumb idea.

  7. Learn how to write out mini-dumps and do post-mortem debugging. It looks like others here have explained that quite well.

  8. Exercise the bad code in as many different ways as you can, so that you can isolate the bug.

  9. Use a debug build. Run the debug build under the debugger if possible.

  10. Trim down your application by removing binaries, modules, etc. if possible, so that you can have an easier time reproducing the bug.

C Johnson