Follow-up to my previous question: http://stackoverflow.com/questions/3579860/conditional-wait-with-pthreads

I changed my code to use semaphores instead of mutex locks and condition signals. However, I seem to have run into a condition that I cannot explain.

Here is the outline in pseudocode:

function thread work {
   while (true) {
      sem_wait(new_work)
      if (condition to exit) {
         exit
      }
      while (work based condition) {
         if (condition to exit)
            exit
         do work
         if (condition to exit) {
            exit
         }
      }
      sem_post(work_done)
      set condition to ready
   }
   exit
}

function start_thread() {
   sem_wait(work_done)
   setup thread work
   create work
   sem_post(new_work)
   return to main()
}

function end_thread() {
   set condition to exit
   sem_post(new_work)
   pthread_join(thread)
   clean up
}

Explanation of the control flow: the main thread calls start_thread to create a thread and hand over some work. Main and worker then continue in parallel, and either may finish first. If main finishes before the worker, the worker's work is no longer valid and it must be told to abort what it's doing. This is "condition to exit". start_thread does not create a thread every time it's called, only the first time. On later calls it just updates the work for the existing thread.

The thread is reused and given new work parameters to avoid the overhead of creating and destroying threads. Once main decides that it no longer needs the worker thread, it calls end_thread. This function tells the thread it is no longer needed, waits for it to exit, and then cleans up the pointers, semaphores and work structure.

The thread will always wait for the semaphore (new_work) before starting its work. I am using sem new_work to signal the thread that new work is now available and it should start. The thread signals the control function (start_thread) that it has finished / aborted the work using the semaphore work_done.
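For anyone who wants something more concrete than the pseudocode, the handshake described above could be sketched in C roughly as follows. This is not my actual code: every name here (struct worker, worker_start, worker_submit, worker_end, wait_sem) is hypothetical, the "work" is a placeholder that doubles an integer, and most error handling is left out.

```c
#include <errno.h>
#include <pthread.h>
#include <semaphore.h>
#include <stdbool.h>

/* Hypothetical names throughout; this only illustrates the handshake. */
struct worker {
    pthread_t       thread;
    sem_t           new_work;        /* posted by main: work (or exit) is pending */
    sem_t           work_done;       /* posted by worker: previous job finished */
    pthread_mutex_t lock;            /* protects exit_requested */
    bool            exit_requested;
    int             input;           /* stand-in for the real work parameters */
    int             result;
};

/* Wait on a semaphore, restarting if a signal interrupts the call. */
static void wait_sem(sem_t *s)
{
    while (sem_wait(s) == -1 && errno == EINTR)
        ;
}

static bool should_exit(struct worker *w)
{
    pthread_mutex_lock(&w->lock);
    bool stop = w->exit_requested;
    pthread_mutex_unlock(&w->lock);
    return stop;
}

static void *worker_main(void *arg)
{
    struct worker *w = arg;
    for (;;) {
        wait_sem(&w->new_work);      /* sleep until main posts */
        if (should_exit(w))
            break;
        w->result = w->input * 2;    /* placeholder for "do work" */
        sem_post(&w->work_done);     /* tell main the job is finished */
    }
    return NULL;
}

int worker_start(struct worker *w)
{
    w->exit_requested = false;
    pthread_mutex_init(&w->lock, NULL);
    sem_init(&w->new_work, 0, 0);
    sem_init(&w->work_done, 0, 1);   /* starts at 1 so the first submit does not block */
    return pthread_create(&w->thread, NULL, worker_main, w);
}

void worker_submit(struct worker *w, int input)
{
    wait_sem(&w->work_done);         /* wait until the previous job is finished */
    w->input = input;
    sem_post(&w->new_work);
}

void worker_end(struct worker *w)
{
    pthread_mutex_lock(&w->lock);
    w->exit_requested = true;
    pthread_mutex_unlock(&w->lock);
    sem_post(&w->new_work);          /* wake the worker so it sees the flag */
    pthread_join(w->thread, NULL);
    sem_destroy(&w->new_work);
    sem_destroy(&w->work_done);
    pthread_mutex_destroy(&w->lock);
}
```

In this simplified version the exit flag is only ever read, never written, by the worker, so it cannot show the problem described below. The real code also updates a shared condition from the worker side.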

Everything works well except in some seemingly random cases, where end_thread is blocked at pthread_join and the thread is blocked at sem_wait(new_work).

"condition to exit" is protected by a mutex.

I can't seem to figure out what is causing this condition.

Here is the output from a trace:

 thread 1: sem NEW count, before wait : 0
 thread 1: sem NEW count, before wait : 0
 end post: sem NEW count, before post : 0
 end post: sem NEW count, after post : 1
 thread 1 exit.
 thread exited, cleanup 1

 Entered initialization for thread: 2

 created a thread: 2
 thread: 2 started.

.....


 thread 2: sem NEW count, before wait : 0
 thread 2: sem NEW count, before wait : 0
 thread 2: sem NEW count, before wait : 0
 end post: sem NEW count, before post : 0
 thread 2 exit.
 end post: sem NEW count, after post : 0
 thread exited, cleanup 2

 Entered initialization for thread: 3

 created a thread: 3
 thread: 3 started.

 .....

 thread 3: sem NEW count, before wait : 0
 thread 3: sem NEW count, before wait : 0
 end post: sem NEW count, before post : 0
 end post: sem NEW count, after post : 1
 thread 3: sem NEW count, before wait : 0

At this point, the thread is waiting at the semaphore and end_thread is waiting at pthread_join.

Thank you for your time.

+1  A: 

The POSIX functions on sem_t can be interrupted by signals: sem_wait may return early with an error whenever your process or thread receives a signal, in particular one caused by I/O.

  • Always check the return values of system calls (in general, not only for sem_wait).
  • In particular, put sem_wait in a while loop and check for an error condition. If the error is EINTR, call sem_wait again.

The same applies to the other sem_ functions. Look up their specific error conditions and handle them specifically.
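As a minimal sketch of the retry loop described above (the helper name sem_wait_retry is made up for illustration, it is not part of POSIX):

```c
#include <errno.h>
#include <semaphore.h>

/* Hypothetical helper: wait on a semaphore, restarting the call
 * whenever it was interrupted by a signal (errno == EINTR). */
int sem_wait_retry(sem_t *sem)
{
    int rc;
    do {
        rc = sem_wait(sem);
    } while (rc == -1 && errno == EINTR);
    return rc;  /* 0 on success, -1 with errno set for any other error */
}
```

The caller still has to check the return value: a -1 here means a real error such as EINVAL, not an interruption.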

Jens Gustedt
A: 

I found the bug. I was testing the condition outside the mutex but changing its value inside the mutex lock. Depending on scheduling, the main thread and the worker thread can compete for the lock at the same time and both change the value of the condition. If the worker writes last, it overwrites the exit request and continues when it's supposed to exit. Both threads then block forever: the worker waits at sem_wait for new work that will never be posted, while the main thread waits at pthread_join for the worker to exit because it already set the condition to exit.

I moved the test to inside the mutex lock and it works fine now.

Here is a snippet of the modified code:

    }
   sem_post(work_done)
   enter mutex lock
   test condition
       set condition to ready if test is satisfied
   exit lock
   }
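
In C, the test-and-set inside the lock could look roughly like the sketch below. The names (shared_state, worker_state, mark_ready_unless_exiting, request_exit) are hypothetical and only stand in for the real condition variable in my code.

```c
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical states standing in for the "condition" in the pseudocode. */
enum worker_state { STATE_WORKING, STATE_READY, STATE_EXIT };

struct shared_state {
    pthread_mutex_t   lock;   /* protects state */
    enum worker_state state;
};

/* Worker side: test and update in ONE critical section, so an exit
 * request from main can never be overwritten with "ready".
 * Returns false if the worker should exit. */
bool mark_ready_unless_exiting(struct shared_state *s)
{
    bool keep_running;
    pthread_mutex_lock(&s->lock);
    keep_running = (s->state != STATE_EXIT);
    if (keep_running)
        s->state = STATE_READY;
    pthread_mutex_unlock(&s->lock);
    return keep_running;
}

/* Main side: request exit under the same lock. */
void request_exit(struct shared_state *s)
{
    pthread_mutex_lock(&s->lock);
    s->state = STATE_EXIT;
    pthread_mutex_unlock(&s->lock);
}
```

The point is that the check and the write happen while holding the lock, so there is no window in which main can set the exit condition between the worker's test and its update.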
powerrox