I'm investigating a deadlock bug. I took a core dump with gcore and found that one of my functions seems to have called itself, even though it does not make a recursive function call.

Here's a fragment of the stack from gdb:

Thread 18 (Thread 4035926944 (LWP 23449)):
#0  0xffffe410 in __kernel_vsyscall ()
#1  0x005133de in __lll_mutex_lock_wait () from /lib/tls/libpthread.so.0
#2  0x00510017 in _L_mutex_lock_182 () from /lib/tls/libpthread.so.0
#3  0x080d653c in ?? ()
#4  0xf7c59480 in ?? () from LIBFOO.so
#5  0x081944c0 in ?? ()
#6  0x081944b0 in ?? ()
#7  0xf08f3b38 in ?? ()
#8  0xf7c3b34c in FOO::Service::releaseObject ()
   from LIBFOO.so
#9  0xf7c3b34c in FOO::Service::releaseObject ()
   from LIBFOO.so
#10 0xf7c36006 in FOO::RequesterImpl::releaseObject ()
   from LIBFOO.so
#11 0xf7e2afbf in BAR::BAZ::unsubscribe (this=0x80d0070, sSymbol=@0xf6ded018)
    at /usr/lib/gcc/x86_64-redhat-linux/3.4.6/../../../../include/c++/3.4.6/bits/stl_tree.h:176
...more stack

I've elided some of the names: FOO and BAR are namespaces, and BAZ is a class.

The interesting part is frames #8 and #9: the call to Service::releaseObject(). This function does not call itself, nor does it call any function that calls it back; it is not recursive. Why, then, does it appear in the stack twice?

Is this an artefact created by the debugger, or could it be real?

You'll notice that the innermost call is waiting for a mutex - I think this could be my deadlock. Service::releaseObject() locks a mutex, so if it magically teleported back inside itself, then a deadlock most certainly could occur.
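One way to test the "locked twice by the same thread" theory is to make the mutex an error-checking one in a debug build, so that a relock by the owning thread reports an error instead of hanging. A minimal sketch, assuming POSIX threads (the stack above shows libpthread); `relock_same_thread` is a hypothetical helper, not part of LIBFOO, and applying this to Service would mean changing how its mutex is initialised.

```cpp
#include <cassert>
#include <cerrno>
#include <pthread.h>

// Hypothetical helper: locks an error-checking mutex twice from the same
// thread and returns the result of the second lock attempt.
int relock_same_thread() {
    pthread_mutexattr_t attr;
    pthread_mutexattr_init(&attr);
    // ERRORCHECK: a relock by the owning thread fails instead of blocking.
    pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_ERRORCHECK);

    pthread_mutex_t m;
    pthread_mutex_init(&m, &attr);
    pthread_mutexattr_destroy(&attr);

    int first = pthread_mutex_lock(&m);   // 0: this thread now owns the mutex
    int second = pthread_mutex_lock(&m);  // EDEADLK rather than a hang
    if (first == 0) {
        pthread_mutex_unlock(&m);         // release the single successful lock
    }
    pthread_mutex_destroy(&m);
    return second;
}
```

Here `relock_same_thread()` returns `EDEADLK`. With a default glibc mutex the second `pthread_mutex_lock` would instead block forever, which is the kind of self-deadlock suspected above.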

Some background:

This is compiled using g++ v3.4.6 on RHEL4. It's a 64-bit OS, but this is 32-bit code, compiled with -m32. It's optimised at -O3. I can't guarantee that the application code was compiled with exactly the same options as the LIBFOO code.

Class Service has no virtual functions, so there's no vtable. Class RequesterImpl inherits from a fully-virtual interface, so it does have a vtable.

+4  A: 

Stack traces are unreliable on x86 at any optimization level: -O1 and higher enable -fomit-frame-pointer.

ephemient
So, will gdb interpret a missing frame pointer as a duplicate function in the stack?
alex tingle
Yes, it's quite likely that the apparent second instance of the same function is actually a frameless function.
Mark Bessey
This answer is incorrect: for 32-bit code -fomit-frame-pointer is *not* selected by any -O optimization level.
Employed Russian
MSVC's debugger can often deal with missing frame pointers, though.
Crashworks
Hmm, reading the GCC source code shows that you are indeed correct: x86 leaves frame pointers enabled at all optimization levels. x86_64, on the other hand, generally (on every non-Mach-O target) omits frame pointers at any non-zero optimization level, as do several other architectures; that must be where I got the notion from.
ephemient
+3  A: 

The reason you get a "bad" stack is that __lll_mutex_lock_wait has an incorrect unwind descriptor (it is written in hand-coded assembly). I believe this was fixed fairly recently (in 2008), but I can't find the exact patch.

Once the GDB stack unwinder goes "off balance", it creates bogus frames (#2 through #8), but it eventually stumbles on a frame that uses a frame pointer and produces a correct stack trace for the rest of the stack.
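The "off balance, then resync" behaviour can be illustrated with a toy model. This is a minimal sketch, not GDB's actual algorithm: the function names, addresses, and the fallback rule for unknown code ("the return address is the next word on the stack") are all hypothetical. Frame-pointer functions are unwound through the saved %ebp chain, the hand-written routine is unwound with whatever offset its descriptor claims, and because that rule never touches %ebp, the walk recovers as soon as it lands in a function that uses a frame pointer.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>
#include <vector>

// Toy model of a 32-bit x86 stack; names, addresses and rules are hypothetical.
struct Func {
    std::string name;
    std::uint32_t lo, hi;     // code range [lo, hi)
    bool uses_fp;             // prologue: push %ebp; mov %esp,%ebp
    std::uint32_t ra_offset;  // frameless only: bytes from sp to the return
                              // address, as claimed by its unwind descriptor
};

struct Stack {
    std::map<std::uint32_t, std::uint32_t> words;  // address -> 32-bit value
    std::uint32_t load(std::uint32_t addr) const {
        auto it = words.find(addr);
        return it == words.end() ? 0 : it->second;  // unmapped reads as 0
    }
};

const Func* lookup(const std::vector<Func>& funcs, std::uint32_t pc) {
    for (const Func& f : funcs)
        if (pc >= f.lo && pc < f.hi) return &f;
    return nullptr;
}

// Unwind from the register state (pc, sp, fp); one name per frame.
std::vector<std::string> unwind(const std::vector<Func>& funcs, const Stack& st,
                                std::uint32_t pc, std::uint32_t sp, std::uint32_t fp) {
    std::vector<std::string> frames;
    while (pc != 0 && frames.size() < 32) {
        const Func* f = lookup(funcs, pc);
        frames.push_back(f ? f->name : "??");
        if (f && f->uses_fp) {
            // Frame-pointer frame: saved %ebp at fp, return address at fp+4.
            pc = st.load(fp + 4);
            sp = fp + 8;
            fp = st.load(fp);
        } else {
            // Frameless or unknown code: trust the claimed offset (0 if unknown).
            // %ebp is left alone, so it still points at the newest real frame.
            std::uint32_t off = f ? f->ra_offset : 0;
            pc = st.load(sp + off);
            sp = sp + off + 4;
        }
    }
    return frames;
}

// Hypothetical chain: main -> outer -> release -> lock_wait, where lock_wait
// pushed two registers, so its real return address sits at sp+8.
std::vector<std::string> demo(std::uint32_t lock_wait_ra_offset) {
    std::vector<Func> funcs = {
        {"main", 0x1000, 0x1100, true, 0},
        {"outer", 0x2000, 0x2100, true, 0},
        {"release", 0x3000, 0x3100, true, 0},
        {"lock_wait", 0x4000, 0x4100, false, lock_wait_ra_offset},
    };
    Stack st;
    st.words = {
        {0x9000, 0x0},        {0x9004, 0x0},         // main: end of the chain
        {0x8f00, 0x9000},     {0x8f04, 0x1050},      // outer: saved fp, ret into main
        {0x8e00, 0x8f00},     {0x8e04, 0x2050},      // release: saved fp, ret into outer
        {0x8de0, 0x081944c0}, {0x8de4, 0xf08f3b38},  // lock_wait's saved registers
        {0x8de8, 0x3050},                            // lock_wait's real return address
    };
    // Stopped inside lock_wait; %ebp still holds release's frame pointer.
    return unwind(funcs, st, 0x4010, 0x8de0, 0x8e00);
}
```

Here `demo(0)` models the wrong descriptor and yields `lock_wait, ??, ??, release, outer, main`, while `demo(8)` gives the correct `lock_wait, release, outer, main`. The toy only reproduces the "bogus frames, then a correct tail" pattern; it does not attempt to explain the duplicated releaseObject frame.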

Employed Russian
Brilliant. That explains it perfectly. Thank you.
alex tingle