I'm writing a C application which is run across a compute cluster (using condor). I've tried many methods to reveal the offending code but to no avail.
Clues:
- On Average when I run the code on 15 machines for 2 days, I get two or three segfaults (signal 11).
- When I run the code locally I do not get a segfault. I ran it for nearly 3 weeks on my home machine.
Attempts:
- I ran the code in valGrind for four days locally with no memory errors.
- I captured the segfault signal by defining my own signal handler so that I can output some of the program state.
- Now when a segfault happens I can print out the current stack using backtrace.
- I can print out variable values.
- I created a variable which is set to the current line number.
- Have also tried commenting chunks of the code out, hoping that if the problem goes away I will discover the segfault.
Sadly the line number outputted is fairly random. I'm not entirely sure what I can do with the stacktrace. Am I correct in assuming that it only records the address of the function in which the segfault occurs?
Suspicions:
- I suspect that the check pointing system which condor uses to move jobs across machines is more sensitive to memory corruption and this is why I don't see it locally.
- That indices are being corrupted by the bug, and that these indices are causing the segfault. This would explain the fact that the segfaults are occurring on fairly random line numbers.
UPDATE
Researching this some more I've found the following links:
LibSegFault - a library for automatically catching and printing state data about segfaults.
Stack unwinding (stack trace) with GCC tutorial on catching segfaults and get the line numbers of the offending instructions.
UPDATE 2
Greg suggested looking at the condor log and to 'correlate the segfaults to when condor restarts the executable from a checkpoint'. Looking at the logs the segfaults all occur immediately after a restart. All of the failures appear to occur when a job switches from one type of machine to another type.
UPDATE 3
The segfault was being caused by differences between hosts, by setting the 'requiremets' field in the condor submit file to problem completely disappeared.
One can set individual machines:
requirements = machine == "hostname1" || machine == "hostname2"
or an entire class of machines:
requirements = classOfMachinesName
See requirements example here