I have a C application that we have deployed to a customer's site. It was compiled and runs on HP-UX. The user has reported a crash and we have obtained a core dump. So far, I've been unable to duplicate the crash in house.

As you would suspect, the core file/deployed executable is completely devoid of any sort of symbols. When I load it up in gdb and do a bt, the best I get is this:

(gdb) bt
#0  0xc0199470 in ?? ()

I can do a 'strings core' on the file, but my understanding is that all I get there is all the strings in the executable, so it seems semi-impossible to track down anything there.

I do have a debug version (compiled with -g) of the executable, which is unfortunately a couple of months newer than the released version. If I try to start gdb with that executable, I see this:

warning: exec file is newer than core file.
Core was generated by `program_name'.
Program terminated with signal 11, Segmentation fault.
__dld_list is not valid according to __dld_flags.

#0  0xc0199470 in ?? ()
(gdb) bt
#0  0xc0199470 in ?? ()

While it would be feasible to compile a debug version and deploy it at the customer's site and then wait for another crash, it would be relatively difficult and undesirable for a number of reasons.

I am quite familiar with the code and have a relatively good idea of where in code it is crashing based on the customer's bug report.

Is there ANY way I can glean any more information from this core dump? Via strings or another debugger or anything? Thanks.

+1  A: 

Do you have the exact source that you used to compile the old version (e.g., through a tag in the source tree or something like that)? Maybe you could rebuild using that, and possibly get an insight into where the crash occurred?

EightyEight
I do have the exact source, but this particular piece of code hasn't changed much (if at all) from that point to what I have now.
Morinar
A: 

There is not much information here. The binary is stripped. But since this is a segmentation fault, you should look for places where you might be overwriting a piece of memory.

This is just a suggestion. There can be many problems.

BTW, if you are not able to reproduce it on your local machine, then the volume of data at the customer's site might be a factor.

Vaibhav
+2  A: 

For the future:

  1. Make sure that you always build with an external symbols database (this is not a debug build -- it's a release build, but you store the symbol table separately)
  2. keep it around for versions you deploy

For this situation:

You know the general area, so to see if you are right, go to the stack trace and find the assembly code -- eyeball it and see if you think it matches your source (this is easier if you have some idea what source generated this assembly). If it looks right, then you have some verification on your hypothesis. You might be able to figure out the values of the local variables by looking at the stack (since you know what you passed in and declared).

Lou Franco
How do I find the assembly code and/or get to the stack trace? All of the stack trace I've seen so far I pasted in up above...
Morinar
The command is 'disassemble' -- see this http://www.unknownroad.com/rtfm/gdbtut/gdbadvanced.html
Lou Franco
I did this and got:

(gdb) disassemble
No function contains program counter for selected frame.

Which seems to me like it favors the smashed stack as suggested by Sufian below.
Morinar
+1  A: 

This type of response from gdb:

(gdb) bt
#0  0xc0199470 in ?? ()

can also happen in the case that the stack was smashed by a buffer overrun, where the return address was overwritten in memory, so the program counter gets set to a seemingly random area.
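
To make the mechanism concrete, here is a minimal C sketch that simulates it safely. It does not touch the real stack: a hypothetical 12-byte `frame` array stands in for a simplified stack frame, with an 8-byte local buffer followed by a 4-byte saved return address. The layout, the byte order, and every name here are assumptions chosen for illustration, and real frames on HP-UX will look different.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define BUF_SIZE   8   /* hypothetical local buffer size */
#define FRAME_SIZE 12  /* buffer plus a 4-byte saved return address */

/* Store a 32-bit value at p, most significant byte first
 * (byte order picked arbitrarily for this simulation). */
static void put_u32(unsigned char *p, uint32_t v)
{
    p[0] = (unsigned char)(v >> 24);
    p[1] = (unsigned char)(v >> 16);
    p[2] = (unsigned char)(v >> 8);
    p[3] = (unsigned char)v;
}

/* Read back a 32-bit value written by put_u32. */
static uint32_t get_u32(const unsigned char *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/*
 * Model a function that copies `len` input bytes into its local buffer
 * without checking against BUF_SIZE. Returns the "return address" that
 * would be used when the function returns.
 */
uint32_t simulate_unchecked_copy(const unsigned char *input, size_t len,
                                 uint32_t saved_return)
{
    unsigned char frame[FRAME_SIZE];

    memset(frame, 0, BUF_SIZE);
    put_u32(frame + BUF_SIZE, saved_return);

    /* Clamp so the simulation itself never leaves the frame array. */
    if (len > FRAME_SIZE)
        len = FRAME_SIZE;

    /* The modeled bug: nothing compares len with BUF_SIZE. */
    memcpy(frame, input, len);

    return get_u32(frame + BUF_SIZE);
}
```

With 8 bytes of input the saved address comes back unchanged. With 12 bytes of `'A'` it comes back as 0x41414141, which is the kind of value that leaves a debugger with nothing but `?? ()` to print.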

This is one of the ways that even a build with a corresponding symbol database can cause a symbol lookup error (or strange looking backtraces). If you still get this after you have the symbol table, your problem is likely that your customer's data is causing some issues with your code.

Sufian
This answer seems ridiculously likely to me. I'll definitely look through the code for potentially overrun areas.
Morinar
If debugging with a "duplicate" copy doesn't show anything, it's time to start looking at register and stack dumps to try to infer how you got off into the middle of nowhere. It can also be a blown (or uninitialized) function pointer, allocation overrun, or perhaps an incorrect buffer size or "bad" input blowing a buffer (using sprintf()/sscanf() with uncontrolled input, etc).
jesup
I never did figure anything out here, but I'm accepting this as it still seems like the most likely explanation.
Morinar
+2  A: 
  1. Always use source control (CVS/GIT/Subversion/etc), even for test releases
  2. Tag all releases
  3. Consider (in the future) making a build with debugging (-g) and stripping the executable before shipping. NOTE: Don't make two builds with and without -g; they may well not match up, since -g can on occasion cause different code to be generated even at the same optimization level. In super-performance-critical code you can forgo the -g for critical files; for most files it won't make a difference.
  4. If you're really stuck, dump the stack and dump relevant parts of the heap to hex and look at it by hand; perhaps taking an instrumented copy and looking for similar "signatures" in the generated code and on the stack. This is real "old-school" debugging... :-)
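
For item 4, if you end up with raw bytes copied out of the core (for example a chunk of stack saved to a file), a small formatter makes them easier to read by hand. The following is a minimal sketch, not a replacement for gdb's own examine command; the function names `format_hex_line` and `hex_dump` and the line layout are made up for this example.

```c
#include <ctype.h>
#include <stdio.h>
#include <string.h>

#define BYTES_PER_LINE 16

/*
 * Format up to 16 bytes as "address: hex bytes  |ascii|" into out.
 * Returns 0 on success, -1 if out is too small or n is out of range.
 */
int format_hex_line(char *out, size_t outsz, unsigned long addr,
                    const unsigned char *bytes, size_t n)
{
    size_t pos, i;
    int w;

    if (n == 0 || n > BYTES_PER_LINE)
        return -1;

    w = snprintf(out, outsz, "%08lx:", addr);
    if (w < 0 || (size_t)w >= outsz)
        return -1;
    pos = (size_t)w;

    /* Hex column: pad short lines so the ASCII column stays aligned. */
    for (i = 0; i < BYTES_PER_LINE; i++) {
        if (i < n)
            w = snprintf(out + pos, outsz - pos, " %02x", (unsigned)bytes[i]);
        else
            w = snprintf(out + pos, outsz - pos, "   ");
        if (w < 0 || (size_t)w >= outsz - pos)
            return -1;
        pos += (size_t)w;
    }

    /* ASCII column needs "  |" + n characters + "|" + terminating NUL. */
    if (outsz - pos < n + 5)
        return -1;
    out[pos++] = ' ';
    out[pos++] = ' ';
    out[pos++] = '|';
    for (i = 0; i < n; i++)
        out[pos++] = isprint(bytes[i]) ? (char)bytes[i] : '.';
    out[pos++] = '|';
    out[pos] = '\0';
    return 0;
}

/* Dump len bytes to fp, labeling each line with base + offset. */
int hex_dump(FILE *fp, unsigned long base, const unsigned char *data, size_t len)
{
    char line[128];
    size_t off;

    for (off = 0; off < len; off += BYTES_PER_LINE) {
        size_t n = len - off < BYTES_PER_LINE ? len - off : BYTES_PER_LINE;
        if (format_hex_line(line, sizeof line, base + off, data + off, n) != 0)
            return -1;
        fprintf(fp, "%s\n", line);
    }
    return 0;
}
```

Calling `hex_dump(stdout, base, buf, len)` with the address the bytes came from keeps the left-hand column comparable with addresses seen in gdb. The ASCII column is handy for spotting input strings that have spilled into places they should not be.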
jesup
Definitely solid advice. We pretty much do steps 1-3 here, but regardless, they are handled by a completely different set of people (we have a team in charge of those things here) than myself.
Morinar
A: 

Under gdb, "info registers" should give you enough of the execution state at the time of the crash to use with a disassembly of the executable and relevant shared libraries. I usually use objdump to disassemble, redirect output to a file, then bring up the file in my favorite editor - this is useful for keeping notes as things are figured out. Also gdb's "info target" and "info sharedlib" can be useful for figuring out where shared libraries are loaded.

With register state, stack contents, and disassembly in hand along with a little luck, it should be straightforward (if tedious) to reconstruct the callstack (unless, of course, the stack has been trashed by a buffer overrun or similar catastrophe... might need an Ouija board or crystal ball in that case.)
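
One way to start that reconstruction is to scan the dumped stack words for values that fall inside the program's text range, since those are candidate return addresses. Below is a rough sketch of that filter, assuming 32-bit stack words and a text range you have read off from "info target"; the function name and parameters are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Scan `count` 32-bit words copied from a stack dump and record the
 * index of each word whose value lies in [text_start, text_end).
 * At most max_out indices are stored in out_idx.
 * Returns the number of indices stored.
 */
size_t find_return_candidates(const uint32_t *stack_words, size_t count,
                              uint32_t text_start, uint32_t text_end,
                              size_t *out_idx, size_t max_out)
{
    size_t found = 0;
    size_t i;

    for (i = 0; i < count && found < max_out; i++) {
        /* A hit is only a candidate: plain data can look like an address. */
        if (stack_words[i] >= text_start && stack_words[i] < text_end)
            out_idx[found++] = i;
    }
    return found;
}
```

Each hit still has to be checked against the disassembly: a genuine return address should point just past a call instruction. Running the same scan once per shared library range picks up frames that live outside the main executable.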

You might also be able to correlate a disassembly of the newer version built with -g against the disassembly of the stripped version.

Lance Richardson
+1  A: 

Try running "pmap" against the core file (if HP-UX has this tool). This should report the starting addresses of all modules in the core file. With this info, you should be able to take the address of the failure location and figure out what library crashed. Further address comparison between the crash address and the addresses of the known functions in the library ("nm" against the library should get that) may help you determine what function crashed.
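
The address comparison itself is simple arithmetic: subtract the module's load address from the crash address, then find the last symbol that starts at or below that offset. Here is a small sketch of the lookup, assuming you have typed the "nm" output into a table sorted by address and that those addresses are relative to the load base (check this for your library). The struct and function names are invented for the example.

```c
#include <stddef.h>

struct symbol {
    unsigned long addr;   /* symbol start, relative to the module base */
    const char *name;
};

/*
 * Binary search over syms (sorted ascending by addr) for the last
 * symbol whose start address is <= offset. Returns NULL if offset
 * lies below the first symbol.
 */
const struct symbol *nearest_symbol(const struct symbol *syms, size_t n,
                                    unsigned long offset)
{
    size_t lo = 0, hi = n;

    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (syms[mid].addr <= offset)
            lo = mid + 1;
        else
            hi = mid;
    }
    return lo == 0 ? NULL : &syms[lo - 1];
}

/*
 * Translate an absolute crash address into a symbol name, given the
 * address at which the module was loaded. Returns NULL if the address
 * is below the module or below its first symbol.
 */
const char *resolve_address(const struct symbol *syms, size_t n,
                            unsigned long load_base, unsigned long crash_addr)
{
    const struct symbol *s;

    if (crash_addr < load_base)
        return NULL;
    s = nearest_symbol(syms, n, crash_addr - load_base);
    return s ? s->name : NULL;
}
```

Because the table carries no symbol sizes, the last entry will match any higher address, so make sure the crash address really falls inside the module before trusting the answer.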

Even if you do manage to identify the function at the top of the stack, it isn't very likely that this function is the source of the problem... hopefully it has actually crashed in your code and not, say, the standard C string library. Rebuilding the stack trace is the next-best thing at that point.

veefu
A: 

I don't think the core file is supposed to contain symbols. You need to be able to build a version of your program that is exactly the same as what you shipped to your customer, but with -g. If you strip your debug executable, it should be identical to the shipped version. Only then can gdb give you anything useful.

sigjuice