I'm using pthreads in a Windows application. I noticed my program was deadlocking--a quick inspection showed that the following had occurred:

Thread 1 spawned Thread 2. Thread 2 spawned Thread 3. Thread 2 waited on a mutex from Thread 3, which wasn't unlocking.

So, I went to debug in gdb and got the following when backtracing the third thread:

Thread 3 (thread 3456.0x880):
#0  0x7c8106e9 in KERNEL32!CreateThread ()
   from /cygdrive/c/WINDOWS/system32/kernel32.dll
Cannot access memory at address 0x131

It was stuck, deadlocked, somehow, in the Windows CreateThread function! Obviously it couldn't unlock the mutex when it wasn't even able to start executing code. Yet, despite the fact that it was apparently stuck here, pthread_create returned zero (success).

What makes this particularly odd is that the same application on Linux has no such issues. What in the world would cause a thread to hang during the creation process (!?) but return successfully as if it had been created properly?

Edit: in response to the request for code, here's some code (simplified):

The creation of the thread:

if ( pthread_create( &h->lookahead->thread_handle, NULL, (void *)lookahead_thread, (void *)h->thread[h->param.i_threads] ) )
{
    log( LOG_ERROR, "failed to create lookahead thread\n");
    return ERROR;
}
while ( !h->lookahead->b_thread_active )
    usleep(100);
return SUCCESS;

Note that it *waits until b_thread_active is set*, so somehow b_thread_active is being set, so the thread being called has to have done something...

... here's the lookahead_thread function:

void lookahead_thread( mainstruct *h )
{
    h->lookahead->b_thread_active = 1;
    while( !h->lookahead->b_exit_thread && h->lookahead->b_thread_active )
    {
        if ( synch_frame_list_get_size( &h->lookahead->next ) > delay )
            _lookahead_slicetype_decide (h);
        else
            usleep(100);  // Arbitrary number to keep thread from spinning
    }
    while ( synch_frame_list_get_size( &h->lookahead->next ) )
        _lookahead_slicetype_decide (h);
    h->lookahead->b_thread_active = 0;
}

_lookahead_slicetype_decide (h) is the actual work the thread does.

The function that takes the mutex, synch_frame_list_get_size:

int   synch_frame_list_get_size( synch_frame_list_t *slist )
{
    int fno = 0;

    pthread_mutex_lock( &slist->mutex );
    while (slist->list[fno]) fno++;
    pthread_mutex_unlock( &slist->mutex );
    return fno;
}

The backtrace of thread 2:

Thread 2 (thread 332.0xf18):
#0  0x00478853 in pthread_mutex_lock ()
#1  0x004362e8 in synch_frame_list_get_size (slist=0x3ef3a8)
    at common/frame.c:1078
#2  0x004399e0 in lookahead_thread (h=0xd33150)
    at encoder/lookahead.c:288
#3  0x0047c5ed in ptw32_threadStart@4 ()
#4  0x77c3a3b0 in msvcrt!_endthreadex ()
   from /cygdrive/c/WINDOWS/system32/msvcrt.dll
#5  0x7c80b713 in KERNEL32!GetModuleFileNameA ()
   from /cygdrive/c/WINDOWS/system32/kernel32.dll
#6  0x00000000 in ??
A:

I would try double-checking your mutexes in thread 2 and thread 3. Pthreads on Windows is implemented on top of the standard Windows API, so there will be slight differences between the Windows and Linux versions. This is a bizarre problem, but then again, that happens a lot in threading.

Could you try posting a snippet of the code where the locking is done in thread 2, and in the function that thread 3 should start in?

Edit, in response to the code:

Does thread 2 ever unlock the mutex? Your trace shows it locking a mutex and then creating a thread to do the work, which also tries to lock that mutex. I'm guessing it unlocks after thread 2 returns SUCCESS? Also, why use flags and sleeping? Barriers or condition variables would be a more robust way to synchronize the threads.
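To illustrate the condition-variable suggestion, here is a minimal sketch of a startup handshake that replaces the flag-and-usleep loop. The worker_t struct and the worker_main / worker_start names are hypothetical stand-ins, not the names from the question's code, and error handling for the init calls is left out for brevity.

```c
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical stand-in for the lookahead state in the question. */
typedef struct {
    pthread_mutex_t lock;
    pthread_cond_t  started;
    bool            active;   /* only read or written with lock held */
    pthread_t       thread;
} worker_t;

static void *worker_main(void *arg)
{
    worker_t *w = arg;

    /* Announce that the thread is running, then wake the creator. */
    pthread_mutex_lock(&w->lock);
    w->active = true;
    pthread_cond_signal(&w->started);
    pthread_mutex_unlock(&w->lock);

    /* ... the real work would go here ... */
    return NULL;
}

/* Returns 0 on success, or the pthread_create error code. */
static int worker_start(worker_t *w)
{
    int rc;

    pthread_mutex_init(&w->lock, NULL);
    pthread_cond_init(&w->started, NULL);
    w->active = false;

    rc = pthread_create(&w->thread, NULL, worker_main, w);
    if (rc != 0)
        return rc;

    /* Block until the worker has set active; the loop guards
       against spurious wakeups. */
    pthread_mutex_lock(&w->lock);
    while (!w->active)
        pthread_cond_wait(&w->started, &w->lock);
    pthread_mutex_unlock(&w->lock);
    return 0;
}
```

Because the flag is only touched with the mutex held, the creator cannot miss the update, and no sleep interval has to be tuned.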

Another note: is the b_thread_active flag marked volatile? If it isn't, the compiler may keep the value in a register, so the waiting loop never sees the update and never breaks out.
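As a rough sketch of the flag issue: a plain int read in a tight loop can legally be loaded once and reused, while a volatile or atomic flag has to be reloaded on every iteration. The version below assumes a C11 compiler and uses atomic_int from <stdatomic.h> rather than volatile, since atomics also give ordering guarantees between threads; that substitution and the flag_setter / wait_for_flag names are my own assumptions, not part of the original code.

```c
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>

/* Hypothetical stand-in for h->lookahead->b_thread_active. */
static atomic_int b_thread_active = 0;

/* Worker side: publish that the thread is up and running. */
static void *flag_setter(void *arg)
{
    (void)arg;
    atomic_store(&b_thread_active, 1);
    return NULL;
}

/* Creator side: spin until the worker reports in.
   Each atomic_load is a real read of the flag, so the compiler
   cannot hoist it out of the loop. Returns the observed value. */
static int wait_for_flag(void)
{
    while (!atomic_load(&b_thread_active))
        sched_yield();   /* give the worker a chance to run */
    return atomic_load(&b_thread_active);
}
```

With a toolchain that lacks C11 atomics, declaring the flag volatile int at least forces the reload on each pass through the loop.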

Nicholas Mancuso
Added a bunch of specific code.
Dark Shikari
Good catch on the volatile--I'll go test that and see if it is the problem. According to another dev, gcc behaves inconsistently across platforms with non-volatile variables in threaded code, which could explain Linux working fine while Windows breaks.
Dark Shikari
And you were right; while there were other bugs in the code, the missing volatile was what caused this particular bug.
Dark Shikari