Hi

I have an application linked against libpthread. Its core is two FIFOs shared by four threads (two threads per FIFO). The FIFO class is synchronized with pthread mutexes and stores pointers to large objects (each containing buffers of about 4 KB) that are allocated in static memory through overloaded new and delete operators (no dynamic allocation here).

The program itself usually works fine, but from time to time it segfaults for no visible reason. The problem is that I can't debug the segfaults properly, because I'm working on an embedded system with an old Linux kernel (2.4.29) and g++ (gcc version egcs-2.91.66 19990314/Linux (egcs-1.1.2 release)).

There's no gdb on the system, and I can't run the application elsewhere (it's too hardware-specific).

I compiled the application with the -g and -rdynamic flags, but an external gdb tells me nothing when I examine the core file (only hex addresses). I can still print a backtrace from the program after catching SIGSEGV, and it always looks like this:

Backtrace for process with pid: 6279
-========================================-
[0x8065707]
[0x806557a]
/lib/libc.so.6(sigaction+0x268) [0x400bfc68]
[0x8067bb9]
[0x8067b72]
[0x8067b25]
[0x8068429]
[0x8056cd4]
/lib/libpthread.so.0(pthread_detach+0x515) [0x40093b85]
/lib/libc.so.6(__clone+0x3a) [0x4015316a]
-========================================-
End of backtrace

So it seems to be pointing to libpthread...

I ran some of the modules through valgrind, but I didn't find any memory leaks (I'm barely using any dynamic allocation).

I thought the mutexes might be causing trouble (they are locked/unlocked about 200 times a second), so I switched my simple mutex class:

class AGMutex {

    public:

        AGMutex( void ) {
            pthread_mutex_init( &mutex1, NULL );
        }

        ~AGMutex( void ) {
            pthread_mutex_destroy( &mutex1 );
        }

        void lock( void ) {
            pthread_mutex_lock( &mutex1 );
        }

        void unlock( void ) {
            pthread_mutex_unlock( &mutex1 );
        }

    private:

        pthread_mutex_t mutex1;

};

to a dummy mutex class:

class AGMutex {

    public:

        AGMutex( void ) : mutex1( false ) {
        }

        ~AGMutex( void ) {
        }

        volatile void lock( void ) {
            if ( mutex1 ) {
                while ( mutex1 ) {
                    usleep( 1 );
                }
            }
            mutex1 = true;
        }

        volatile void unlock( void ) {
            mutex1 = false;
        }

    private:

        volatile bool mutex1;

};

but it changed nothing and the backtrace looks the same...

After an old-school put-cout-between-every-line-and-see-where-it-segfaults-plus-remember-the-pids-and-stuff debugging session, it seems that it segfaults during usleep (?).

I have no idea what else could be wrong. It can work for an hour or so, and then suddenly segfault for no apparent reason.

Has anybody ever encountered a similar problem?

+1  A: 

From my answer to How to generate a stacktrace when my gcc C++ app crashes:

    The first two entries in the stack frame chain when you get into the 
    signal handler contain a return address inside the signal handler and
    one inside sigaction() in libc.  The stack frame of the last function
    called before the signal (which is the location of the fault) is lost.

This may explain why you are having difficulties determining the location of your segfault via a backtrace from a signal handler. My answer also includes a workaround for this limitation.
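As a rough sketch of that kind of workaround (not necessarily the exact code from the linked answer): a handler installed with SA_SIGINFO can read the faulting program counter from the saved context and put it in place of the sigaction frame. This assumes glibc on x86 or x86-64. The names fault_pc, segv_handler and install_segv_handler are illustrative, and backtrace() is not formally async-signal-safe, so treat this as a debugging aid only.

```cpp
// Sketch only: glibc on x86 / x86-64 Linux is assumed; all names are illustrative.
#include <cassert>      // assert, for the usage check mentioned below
#include <execinfo.h>   // backtrace, backtrace_symbols_fd
#include <signal.h>     // sigaction, siginfo_t, SA_SIGINFO
#include <string.h>     // memset
#include <ucontext.h>   // ucontext_t, REG_EIP / REG_RIP
#include <unistd.h>     // STDERR_FILENO, _exit

// Return the program counter saved at the moment of the fault,
// or 0 on architectures this sketch does not know about.
static void* fault_pc(void* context) {
    ucontext_t* uc = static_cast<ucontext_t*>(context);
#if defined(__i386__)
    return reinterpret_cast<void*>(uc->uc_mcontext.gregs[REG_EIP]);
#elif defined(__x86_64__)
    return reinterpret_cast<void*>(uc->uc_mcontext.gregs[REG_RIP]);
#else
    (void)uc;
    return 0;
#endif
}

static void segv_handler(int sig, siginfo_t* info, void* context) {
    (void)sig;
    (void)info;  // info->si_addr would hold the memory address whose access faulted

    void* frames[64];
    int count = backtrace(frames, 64);

    // frames[0] is this handler; frames[1] is the signal trampoline that
    // shows up as sigaction() in the trace. Replace it with the faulting
    // PC (when known) and skip the handler's own frame.
    int skip = 0;
    if (count > 1) {
        void* pc = fault_pc(context);
        if (pc != 0) {
            frames[1] = pc;
        }
        skip = 1;
    }
    backtrace_symbols_fd(frames + skip, count - skip, STDERR_FILENO);
    _exit(1);  // the process state is suspect after a segfault; do not return
}

// Install the handler; returns true on success.
bool install_segv_handler() {
    struct sigaction sa;
    memset(&sa, 0, sizeof(sa));
    sigemptyset(&sa.sa_mask);
    sa.sa_sigaction = segv_handler;
    sa.sa_flags = SA_SIGINFO;  // request the three-argument handler form
    return sigaction(SIGSEGV, &sa, 0) == 0;
}
```

Calling something like `assert(install_segv_handler());` early in main would be enough to arm it. The first address printed is then the faulting instruction, which can be looked up in a map file or with nm.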

If you want to see how your application is actually laid out in memory (i.e. the 0x80..... addresses), you should be able to generate a map file from gcc. This is typically done via -Wl,-Map,output.map, which passes -Map output.map to the linker.

You may also have a hardware-specific version of objdump or nm with your toolchain/cross-toolchain that may be helpful in deciphering your 0x80..... addresses.

jschmier
Thanks, I will try this tomorrow.
zajcev
A: 

Do you have access to Helgrind on your platform? It's a Valgrind tool for detecting POSIX thread errors such as races and threads holding mutexes when they exit.
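If Helgrind is not available on the target, one class of error it reports (a mutex left locked on some exit path) can be ruled out by construction with a scoped lock. The following is only a sketch built on the pthread-based AGMutex interface from the question. AGScopedLock, Counter and increment_below_limit are hypothetical names, and nothing in the question shows that its code actually leaks a lock this way.

```cpp
// Sketch only: AGScopedLock, Counter and increment_below_limit are hypothetical.
#include <cassert>     // assert, for the usage check mentioned below
#include <pthread.h>   // pthread_mutex_*

// Same interface as the pthread-based AGMutex in the question,
// with copying disabled so the mutex is never duplicated.
class AGMutex {
public:
    AGMutex() { pthread_mutex_init(&mutex1, NULL); }
    ~AGMutex() { pthread_mutex_destroy(&mutex1); }
    void lock() { pthread_mutex_lock(&mutex1); }
    void unlock() { pthread_mutex_unlock(&mutex1); }
private:
    pthread_mutex_t mutex1;
    AGMutex(const AGMutex&);             // not implemented
    AGMutex& operator=(const AGMutex&);  // not implemented
};

// Locks in the constructor and unlocks in the destructor, so every
// return path out of the enclosing scope releases the mutex.
class AGScopedLock {
public:
    explicit AGScopedLock(AGMutex& m) : mutex(m) { mutex.lock(); }
    ~AGScopedLock() { mutex.unlock(); }
private:
    AGMutex& mutex;
    AGScopedLock(const AGScopedLock&);             // not implemented
    AGScopedLock& operator=(const AGScopedLock&);  // not implemented
};

// Stand-in for a shared structure such as one of the FIFOs.
struct Counter {
    AGMutex mutex;
    int value;
    int limit;
    explicit Counter(int lim) : value(0), limit(lim) {}
};

// Increments the counter unless the limit has been reached.
bool increment_below_limit(Counter& c) {
    AGScopedLock guard(c.mutex);
    if (c.value >= c.limit) {
        return false;  // early exit: the guard still unlocks the mutex
    }
    ++c.value;
    return true;
}
```

If the early return left the mutex locked, a second call to increment_below_limit on the same Counter would block forever. With the guard, repeated calls simply keep returning false once the limit is reached.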

TheJuice