ansaurus

Question

Methods/Tools for solving a Mystery Segfault while running on condor

Answer 1

A:

You've tried most of what I'd think of. The only other thing I'd suggest is to start adding a lot of logging code and hope you can narrow down where the error is happening.
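As a rough sketch of what such logging could look like in C: the helper below writes one timestamped line and flushes it right away, so the last message is still on disk or in the job's stderr when the process dies. The names `log_checkpoint` and `LOG_HERE` are made up for illustration and are not part of Condor or any library.

```c
#include <stdio.h>
#include <time.h>

/* Hypothetical helper: write one timestamped line to `out` and flush it
 * immediately, so the message is not lost in a stdio buffer if the
 * process segfaults right afterwards.
 * Returns the number of characters written (negative on error). */
int log_checkpoint(FILE *out, const char *file, int line, const char *msg)
{
    int n = fprintf(out, "[%ld] %s:%d %s\n",
                    (long)time(NULL), file, line, msg);
    fflush(out);
    return n;
}

/* Convenience macro that records the call site automatically. */
#define LOG_HERE(msg) log_checkpoint(stderr, __FILE__, __LINE__, (msg))
```

Sprinkling calls such as `LOG_HERE("finished update step");` through the main loop narrows the crash down to the code between the last logged line and the next one that never appeared.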

Colin 2010-09-10 01:39:38

Answer 2

+2 A:

If you can, compile with debugging symbols and run under gdb. Alternatively, get a core dump and load that into the debugger.

MPICH has a built-in debugger, or you can buy a commercial parallel debugger.

Then you can step through the code in the debugger to see what is happening.

http://nmi.cs.wisc.edu/node/1610

http://nmi.cs.wisc.edu/node/1611

aaa 2010-09-10 01:54:57

I can't run it under gdb since I only see the issue when it is running on the cluster and it has to run for two days before the issue occurs.

e5 2010-09-10 02:02:15

@e5 You can use gdb on the cluster; it's a very flexible tool.

aaa 2010-09-10 02:04:49

How? Doesn't all code run on condor standard have to be relinked? Care to link to an example of someone doing this?

e5 2010-09-10 02:08:04

@e5 https://nmi.cs.wisc.edu/node/1610, https://nmi.cs.wisc.edu/node/1611

aaa 2010-09-10 02:13:23

@aaa carp thanks so much, maybe you should post this as part of your answer.

e5 2010-09-10 02:18:37

Answer 3

+2 A:

Can you create a core dump when your segfault happens? You can then debug this dump to try to figure out the state of the code when it crashed.

Look at what instruction caused the fault. Was it even a valid instruction, or are you trying to execute data? If valid, what memory is it trying to access? Where did this pointer come from? You need to narrow down the kind of fault (stack corruption, heap corruption, uninitialized pointer, accessing invalid memory). If it's a corruption, see if there's any tell-tale data in the corrupted area (pointers to symbols, data that looks like something in your structures, ...). Your memory allocator may already have built-in features to debug some corruption (see MALLOC_CHECK_ on Linux or MallocGuardEdges on Mac OS). A common case for these is using memory that has been free()'d, so logging your malloc() / free() pairs might help.
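If no core file shows up at all, one possible cause is a core-size limit of zero in the job's environment. A minimal sketch, assuming a POSIX system: the program can raise its own soft limit to the hard limit at startup using `getrlimit`/`setrlimit`. The function name `enable_core_dumps` is hypothetical, and whether Condor actually hands the core file back to you depends on the job setup, which this does not address.

```c
#include <sys/resource.h>

/* Hypothetical helper: raise the soft core-file size limit to the hard
 * limit so the kernel is allowed to write a core file on SIGSEGV.
 * Returns 0 on success, -1 on failure (errno is set by the failing call). */
int enable_core_dumps(void)
{
    struct rlimit rl;

    if (getrlimit(RLIMIT_CORE, &rl) != 0)
        return -1;

    /* An unprivileged process may raise the soft limit up to the hard one. */
    rl.rlim_cur = rl.rlim_max;
    return setrlimit(RLIMIT_CORE, &rl);
}
```

Calling `enable_core_dumps()` as the first statement of `main()` is enough. If the hard limit itself is zero, the call still succeeds but no core will be written, so that limit has to be raised by whoever administers the machine.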

Variable Length Coder 2010-09-10 01:56:49

I should note that I'm not using malloc or free. The program is fairly simple and just uses some globally defined variables. I will look into getting the core dump, but it is not currently provided.

e5 2010-09-10 02:01:28

Answer 4

A:

The one thing you do not say is how much flexibility you have to solve the problem. Can you, for example, have the system come to a halt and just run your application? Also how important are these crashes to solve?

I am assuming that for the most part you do have that flexibility. What follows may require a lot of resources.

The short-term step is to put in tons of "asserts" ( semi-handwritten ) on each variable to make sure it hasn't changed when you don't want it to. These can continue to work as you go through the long-term process.
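One way to sketch those handwritten checks for a program built around global arrays is to put a known "cookie" value on each side of the data and verify it periodically. Everything named here (`guarded_table`, `GUARD_COOKIE`, `TABLE_SIZE`, `check_guards`) is invented for the example; the array type and size are placeholders for whatever the real program uses.

```c
#include <stdint.h>

#define GUARD_COOKIE 0xDEADBEEFu
#define TABLE_SIZE   1024          /* placeholder size for illustration */

/* Keeping the guards and the data in one struct keeps them adjacent in
 * memory; separate globals could be placed anywhere by the linker. */
struct guarded_table {
    uint32_t front_guard;
    double   values[TABLE_SIZE];
    uint32_t back_guard;
};

struct guarded_table table = { GUARD_COOKIE, { 0 }, GUARD_COOKIE };

/* Returns 0 if both guards are intact. Otherwise returns a bitmask:
 * bit 0 set = front guard overwritten, bit 1 set = back guard overwritten. */
int check_guards(const struct guarded_table *t)
{
    int damaged = 0;

    if (t->front_guard != GUARD_COOKIE)
        damaged |= 1;
    if (t->back_guard != GUARD_COOKIE)
        damaged |= 2;
    return damaged;
}
```

Calling `check_guards(&table)` once per iteration of the main loop, and logging or aborting on a nonzero result, shows roughly when a stray write ran off either end of the array. It only catches writes that land on the guards themselves, so a clean result does not rule out corruption elsewhere.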

Long term-- try running it on a cluster of two ( maybe your home computer and a VM ). Do you still see the segfaults? If not, increase the cluster size until you start seeing segfaults.

Run it on the minimum configuration that still gives segfaults and record all your inputs until a crash. Automate running the system with the inputs that you recorded, tweaking them until you can consistently get a crash with minimal input.

At that point, look around. If you still can't find the bug, you will have to ask again with the extra data you gathered from those runs.

HandyGandy 2010-09-10 03:07:59

Answer 5

+1 A:

If you have used the condor_compile tool to relink your code with the condor checkpointing code, it does a few things differently than a normal link. Most importantly, it statically links your code and uses its own malloc. Another big difference is that condor will then run it on a foreign machine, where the environment may be different enough from what you expect to cause problems.

The executable generated by condor_compile is runnable as a standalone binary outside of the condor system. If you run the binary emitted from condor_compile locally, outside of condor, do you still see the segfaults?

If you don't, can you correlate the segfaults with when condor restarts the executable from a checkpoint? (The user log will tell you when this happens.)

Greg 2010-09-15 00:58:14

The segfaults do not correlate with when condor restarts. Looking at the core file in gdb, the line and operation which causes the segfault is perfectly valid and within the bounds of a global array. I added cookies around all my variables and none of them have been corrupted. Checking all my variables (global and otherwise) I see no corruption.

e5 2010-09-15 03:18:51

Oops! The segfaults do correlate with when condor restarts! I think condor is causing the problems.

e5 2010-09-15 03:40:01

OK, so that's progress. You could also try to checkpoint and restart the executable outside of Condor, to see if it is the checkpointing code going awry. You might also want to post on the condor-users email list for more direct support.

Greg 2010-09-16 02:26:54

Above you say that you have more than one type of machine -- can you explain further? How many machines, how many types? If there's some problematic type of machine, you can constrain the condor job not to run there with the Requirements expression.

Greg 2010-09-17 00:59:56

I'm not sure exactly how the machines are different, but they have different-sounding host names. The host names that break are like y1,y2,y3... whereas the host names that work are x1,x2,x3. Both sets run the same Linux kernel (according to nmap -O).

e5 2010-09-17 01:06:10

OK, well, at least to get you through this problem, you can use the condor requirements expression to restrict your jobs to running only on those machines, e.g. requirements = name == "machine1" || name == "machine2" || name == "machine3"

Greg 2010-09-22 00:35:54
