I have multiple apps compiled with g++, running in Ubuntu. I'm using named semaphores to co-ordinate between different processes.

All works fine except in the following situation: If one of the processes calls sem_wait() or sem_timedwait() to decrement the semaphore and then crashes or is killed -9 before it gets a chance to call sem_post(), then from that moment on, the named semaphore is "unusable".

By "unusable", what I mean is the semaphore count is now zero, and the process that should have incremented it back to 1 has died or been killed.

I cannot find a sem_*() API that might tell me the process that last decremented it has crashed.

Am I missing an API somewhere?

Here is how I open the named semaphore:

sem_t *sem = sem_open( "/testing",
    O_CREAT     |   // create the semaphore if it does not already exist
    O_CLOEXEC   ,   // close on execute
    S_IRWXU     |   // permissions:  user
    S_IRWXG     |   // permissions:  group
    S_IRWXO     ,   // permissions:  other
    1           );  // initial value of the semaphore

Here is how I decrement it:

struct timespec timeout = { 0, 0 };
clock_gettime( CLOCK_REALTIME, &timeout );
timeout.tv_sec += 5;

if ( sem_timedwait( sem, &timeout ) )   // returns -1 on timeout (errno == ETIMEDOUT) or on any other error
{
    throw "timeout while waiting for semaphore";
}
+1  A: 

You should be able to find it from the shell using lsof. Then possibly you can delete it?

Update

Ah yes... man -k semaphore to the rescue.

It seems you can use ipcrm to get rid of a semaphore. Seems you aren't the first with this problem.

Carl Smotricz
Yes, I know about ipcrm, but it doesn't help. If I knew the semaphore had been lost, I could just as easily sem_post() to "get it back". The problem seems to be there is no event triggered to indicate that the application that last decremented it has been killed.
Stéphane
In addition, just noticed on the man page that ipcrm only works on the old SysV semaphores, not POSIX semaphores. Same with ipcs.
Stéphane
+1  A: 

If the process was KILLed then there won't be any direct way to determine that it has gone away.

You could operate some kind of periodic integrity check across all the semaphores you have - use semctl (cmd=GETPID) to find the PID for the last process that touched each semaphore in the state you describe, then check whether that process is still around. If not, perform clean up.

martin clayton
Something along these lines is what I was looking for, but of course for the POSIX semaphores you'd find in #include <semaphore.h>. From what I can tell, the semctl() style of calls are specific to the old SysV semaphores from <sys/sem.h>.
Stéphane
+3  A: 

You'll need to double check but I believe sem_post can be called from a signal handler. If you are able to catch some of the situations that are bringing down the process this might help.

Unlike a mutex any process or thread (with permissions) can post to the semaphore. You can write a simple utility to reset it. Presumably you know when your system has deadlocked. You can bring it down and run the utility program.
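A reset utility along those lines might look like the sketch below. The function name reset_semaphore and the target parameter are hypothetical; it assumes the clients have been stopped first, because reading the count and then posting is not atomic:

```cpp
#include <fcntl.h>
#include <semaphore.h>

// Hypothetical helper: post to the named semaphore until its count reaches
// `target`. Returns the resulting count, or -1 on error.
int reset_semaphore(const char *name, int target)
{
    sem_t *sem = sem_open(name, 0);      // open an existing semaphore only
    if (sem == SEM_FAILED)
        return -1;

    int value = 0;
    while (sem_getvalue(sem, &value) == 0 && value < target) {
        if (sem_post(sem) != 0)
            break;                       // give up; report whatever count we reached
    }

    int rc = sem_getvalue(sem, &value);
    sem_close(sem);
    return rc == 0 ? value : -1;
}
```

For the semaphore in the question, the call would be reset_semaphore("/testing", 1). On older glibc versions this needs to be linked with -pthread.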

Also the semaphore is usually listed under /dev/shm and you can remove it.

SysV semaphores are more accommodating for this scenario. You can specify SEM_UNDO, in which the system will back out changes to the semaphore made by a process if it dies. They also have the ability to tell you the last process id to alter the semaphore.

Duck
Some signals, like kill -9, bypass signal handlers, which is the situation I've run into. I do have a signal handler for the ones I can catch, and in a destructor for a scope-based object I do call sem_post() as the stack unwinds. But those few lingering uncatchable signals are what I was hoping to solve.
Stéphane
I think a fair question is to ask who are the users and why are they killing the app that way? You can try the SysV route or even file locks, which should revert when the process dies.
Duck
Actually, that is what I decided to do last night. Since locks taken with lockf() on files opened with open() are automatically released when applications are killed -9, this method of "communication" actually works more reliably than semaphores considering what I need to coordinate.
Stéphane
+1  A: 

This is a typical problem when managing semaphores. Some programs use a single process to manage the initialization/deletion of the semaphore. Usually this process does just this and nothing else. Your other applications can wait until the semaphore is available. I've seen this done with the SysV type API, but not with POSIX. As 'Duck' mentioned, the SysV approach relies on the SEM_UNDO flag in your semop() call.


But, with the information that you've provided, I would suggest that you do not use semaphores, especially if your process is in danger of being killed or crashing. Try to use something that the OS will clean up automagically for you.

Steve Lazaridis
+1  A: 

Turns out there isn't a way to reliably recover the semaphore. Sure, anyone can sem_post() to the named semaphore to get the count to increase past zero again, but how to tell when such a recovery is needed? The API provided is too limited and doesn't indicate in any way when this has happened.

Beware of the ipc tools also available -- the common tools ipcmk, ipcrm, and ipcs are only for the outdated SysV semaphores. They specifically do not work with the new POSIX semaphores.

But it looks like there are other things that can be used to lock things, which the operating system does automatically release when an application dies in a way that cannot be caught in a signal handler. Two examples: a listening socket bound to a particular port, or a lock on a specific file.

I decided the lock on a file is the solution I needed. So instead of a sem_wait() and sem_post() call, I'm using:

lockf( fd, F_LOCK, 0 )

and

lockf( fd, F_ULOCK, 0 )

When the application exits in any way, the file is automatically closed which also releases the file lock. Other client apps waiting for the "semaphore" are then free to proceed as expected.

Thanks for the help, guys.

Stéphane