I am trying to create a wrapper on Linux which controls how many concurrent executions of something are allowed at once. To do so, I am using a system-wide counting semaphore. I create the semaphore, do a sem_wait(), launch the child process, and then do a sem_post() when the child terminates. That is fine.

The problem is how to safely handle signals sent to this wrapper. If it doesn't catch signals, the command might terminate without doing a sem_post(), causing the semaphore count to permanently decrease by one. So, I created a signal handler which does the sem_post(). But still, there is a problem.

If the handler is attached before the sem_wait() is performed, a signal could arrive before the sem_wait() completes, causing a sem_post() to occur without a sem_wait(). The reverse is possible if I do the sem_wait() before setting up the signal handler.

The obvious next step was to block signals during the setup of the handler and the sem_wait(). This is pseudocode of what I have now:

void handler(int sig)
{
  sem_post(sem);
  exit(1);
}

...
sigprocmask(...);   /* Block signals */
sigaction(...);     /* Set signal handler */
sem_wait(sem);
sigprocmask(...);   /* Unblock signals */
RunChild();
sem_post(sem);
exit(0);

The problem now is that the sem_wait() can block, and signals stay blocked for as long as it does. A user attempting to kill the process may end up resorting to "kill -9", which is behaviour I don't want to encourage since I cannot handle that case no matter what. I could call sem_trywait() in a short polling loop and test sigpending(), but that impacts fairness because there is no longer a guarantee that the process waiting on the semaphore the longest will get to run next.

Is there a truly safe solution here which allows me to handle signals during semaphore acquisition? I am considering resorting to a "Do I have the semaphore" global and removing the signal blocking. That is not 100% safe, since acquiring the semaphore and setting the global are not one atomic step, but it might be better than blocking signals while waiting.

+2  A: 

Are you sure sem_wait() causes signals to be blocked? I don't think this is the case. The man page for sem_wait() says that the EINTR error code is returned from sem_wait() if it is interrupted by a signal.

You should be able to handle this error code and then your signals will be received. Have you run into a case where signals have not been received?

I would make sure you handle the error codes that sem_wait() can return. Such errors may be rare, but if you want to be 100% sure, you need to cover all of your bases.

Kekoa
Sorry if I am being unclear. sem_wait() doesn't cause signals to be blocked as you say, but I am using sigprocmask() to block them. But, I think you have the right solution which is to look at the EINTR error code which means the handler should not exit but set some flag to say "time to quit". I'll test that out. Thanks.
Jeremy
That worked. I removed the blocking of signals and changed the signal handler to just set a global meaning it was time to quit. If the sem_wait() returns an error, then I quit which is fine and preserves the semaphore count. If I am waiting for the child, I test the timeToQuit global also. If it is time to quit, I kill the child and do a sem_post() which preserves the semaphore count. Thanks. That should be signal safe, ignoring kill -9.
Jeremy
A: 

Are you sure you are approaching the problem correctly? If you want to wait for a child to terminate, you may want to use the waitpid() system call. As you observed, it is not reliable to expect the child to do the sem_post() if it may receive signals.

Juliano
Actually, the child is unaware of the semaphore, and the parent is doing a waitpid() in the RunChild() function in the pseudocode, which I didn't show. That part was fine: it was handling termination of the child and signals sent to the parent after the sem_wait() completed. I was concerned about the critical section between setting the signal handler and the sem_wait(), but checking the sem_wait() error code was the simple solution I couldn't see.
Jeremy
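The body of RunChild() is not shown in the thread, so the following is only a guess at its shape: a fork/exec followed by a waitpid() loop that forwards a quit request to the child. The names run_child, on_quit_signal and time_to_quit are hypothetical, and sending SIGTERM to the child is an assumption about what "kill the child" means here.

```c
#include <errno.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical flag, set from a handler installed without SA_RESTART. */
static volatile sig_atomic_t time_to_quit = 0;

void on_quit_signal(int sig)
{
    (void)sig;
    time_to_quit = 1;
}

/* Run argv[0] with arguments argv and wait for it to finish.
   Returns 0 and stores the waitpid() status in *status on success,
   or -1 if fork() or waitpid() fails. */
int run_child(char *const argv[], int *status)
{
    int forwarded = 0;
    pid_t pid = fork();

    if (pid < 0)
        return -1;
    if (pid == 0) {
        execvp(argv[0], argv);
        _exit(127);             /* exec failed */
    }

    for (;;) {
        /* Pass the quit request on to the child exactly once. */
        if (time_to_quit && !forwarded) {
            kill(pid, SIGTERM);
            forwarded = 1;
        }
        pid_t r = waitpid(pid, status, 0);
        if (r == pid)
            return 0;           /* child has been reaped */
        if (r < 0 && errno != EINTR)
            return -1;
        /* EINTR: a signal arrived, loop to re-check time_to_quit */
    }
}
```

Because the function always returns to its caller after the child is reaped, the wrapper can do its sem_post() in exactly one place, after run_child() returns, whether or not a signal arrived.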
A: 

I have a similar problem with my application. It runs on an embedded Linux machine and is basically a multiprotocol gateway that connects two industrial Ethernet networks.

We are experiencing some kind of application hang every 2 to 3 days. Both threads still seem to be running, but the SIGALRM signal appears to get lost. I'm not completely sure this is the case, because the project is in its testing phase and I can't make any application changes until the end of the week. SIGALRM is used to implement the counters and fires every second.

I use semaphores inside the signal handler and inside a thread. I install the SIGALRM handler with sigaction() and the SA_RESTART flag. I suspect this is the reason we are experiencing the counter (or application) hang.

void start_timer(void)
{
   struct sigaction sa;
   struct itimerval timer;

   memset(&sa, 0, sizeof (sa));

   sa.sa_handler = &timer_handler;
   sa.sa_flags = SA_RESTART;  
   sigaction(SIGALRM, &sa, NULL);

   timer.it_value.tv_sec = 1;
   timer.it_value.tv_usec = 0;

   timer.it_interval.tv_sec = 1;
   timer.it_interval.tv_usec = 0;

   setitimer(ITIMER_REAL, &timer, NULL);
}


int main(int  argc, char *argv[])
{
    //  write log entry
    //  check arguments to main
    //  read files containing parameters  of remote devices 

    //  initialise semaphores 
    sem_init(&semCounters, 0, 1);
    sem_init(&semPackets, 0, 1);
    sem_init(&semEvents, 0, 1);

    start_timer();
    pthread_create(&thread1, NULL, firstThread, NULL);
    pthread_create(&thread2, NULL, secondThread, NULL);

    while(1)
    {            
        //  infinite loop
    }

    return 0;
}

Inside the second thread (secondThread) I perform multiple math operations on time counters. All counters are protected with a semaphore (semCounters). I'm using the same counters inside the signal handler! The semaphore is taken in both places. Below is an example that resets a protected variable. This code appears both in the signal handler and in the thread.

sem_wait(&semCounters);
counter.t4 = 0;
sem_post(&semCounters);

One more thing: I'm creating other threads inside the signal handler, but I'm not waiting for those threads to finish processing. The application has been restarted 4 times so far. The longest running time without a hang is three days.
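One thing that may be worth checking, offered as a sketch rather than a diagnosis: sem_wait() and pthread_create() are not on the POSIX list of async-signal-safe functions, while sem_post() is. If SIGALRM is delivered to a thread that is currently holding semCounters, a sem_wait(&semCounters) inside the handler can never succeed, because the interrupted thread cannot reach its sem_post() until the handler returns. A common way around this is to let the handler do nothing but post a "tick" semaphore and to do the counter work in an ordinary thread. The names semTick, tick_init and handle_one_tick are invented for this example, and the layout of counter is assumed.

```c
#include <errno.h>
#include <semaphore.h>
#include <signal.h>

sem_t semTick;       /* hypothetical: posted once per SIGALRM        */
sem_t semCounters;   /* protects counter, as in the post             */
struct { long t4; } counter;   /* assumed layout, only t4 is shown   */

int tick_init(void)
{
    if (sem_init(&semTick, 0, 0) != 0)
        return -1;
    return sem_init(&semCounters, 0, 1);
}

/* Signal handler: sem_post() is async-signal-safe, so this is the
   only call made here. errno is saved because sem_post() may set it. */
void timer_handler(int sig)
{
    int saved_errno = errno;

    (void)sig;
    sem_post(&semTick);
    errno = saved_errno;
}

/* One iteration of a hypothetical timer thread: wait for a tick, then
   update the counter under semCounters in normal thread context. */
int handle_one_tick(void)
{
    while (sem_wait(&semTick) == -1)
        if (errno != EINTR)
            return -1;
    while (sem_wait(&semCounters) == -1)
        if (errno != EINTR)
            return -1;
    counter.t4++;
    sem_post(&semCounters);
    return 0;
}
```

A dedicated thread would call handle_one_tick() in a loop, and any thread creation that currently happens in the handler could move into that loop as well.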