I have multiple apps compiled with g++, running in Ubuntu. I'm using named semaphores to co-ordinate between different processes.
All works fine except in the following situation: if one of the processes calls sem_wait() or sem_timedwait() to decrement the semaphore and then crashes or is killed with kill -9 before it gets a chance to call sem_post(), then from that moment on, the named semaphore is "unusable".
By "unusable" I mean that the semaphore count is now zero, and the process that should have incremented it back to 1 has died or been killed.
I cannot find a sem_*() API that can tell me whether the process that last decremented it has crashed.
Am I missing an API somewhere?
Here is how I open the named semaphore:
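```cpp
sem_t *sem = sem_open( "/testing",
        O_CREAT |   // create the semaphore if it does not already exist
        O_CLOEXEC , // close on execute (POSIX leaves flags other than O_CREAT/O_EXCL unspecified for sem_open)
        S_IRWXU |   // permissions: user
        S_IRWXG |   // permissions: group
        S_IRWXO ,   // permissions: other
        1 );        // initial value of the semaphore
if ( sem == SEM_FAILED )
{
    throw "sem_open failed";
}
```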
Here is how I decrement it:
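```cpp
struct timespec timeout = { 0, 0 };
clock_gettime( CLOCK_REALTIME, &timeout );  // sem_timedwait() expects an absolute CLOCK_REALTIME deadline
timeout.tv_sec += 5;
int rc;
while ( ( rc = sem_timedwait( sem, &timeout ) ) == -1 && errno == EINTR )
{
    // interrupted by a signal handler: retry with the same absolute deadline
}
if ( rc == -1 )
{
    throw ( errno == ETIMEDOUT ) ? "timeout while waiting for semaphore"
                                 : "sem_timedwait failed";
}
```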
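The failure is easy to reproduce. The following is a minimal sketch, assuming Linux. The helper semaphore_stuck_after_kill and the semaphore name passed to it are hypothetical. A forked child takes the only token and kills itself with SIGKILL, just as kill -9 would. The parent then times out even though no living process holds the semaphore.

```cpp
#include <cerrno>
#include <csignal>
#include <ctime>
#include <fcntl.h>
#include <semaphore.h>
#include <sys/wait.h>
#include <unistd.h>

// Hypothetical demo: returns true if, after a child takes the semaphore and is
// SIGKILLed before posting, a later sem_timedwait() in the parent times out.
bool semaphore_stuck_after_kill(const char *name)
{
    sem_unlink(name);                 // start clean; "does not exist" is fine
    sem_t *sem = sem_open(name, O_CREAT | O_EXCL, 0600, 1);
    if (sem == SEM_FAILED)
        return false;

    pid_t pid = fork();
    if (pid == -1)
    {
        sem_close(sem);
        sem_unlink(name);
        return false;
    }
    if (pid == 0)
    {
        sem_wait(sem);                // child decrements the count to 0...
        raise(SIGKILL);               // ...and dies like kill -9, never posting
        _exit(1);                     // not reached
    }
    waitpid(pid, nullptr, 0);         // the child is now definitely gone

    struct timespec timeout = { 0, 0 };
    clock_gettime(CLOCK_REALTIME, &timeout);
    timeout.tv_sec += 1;              // short deadline for the demo

    int rc;
    while ((rc = sem_timedwait(sem, &timeout)) == -1 && errno == EINTR)
        ;                             // retry if interrupted by a signal
    bool stuck = (rc == -1 && errno == ETIMEDOUT);

    sem_close(sem);
    sem_unlink(name);                 // remove the demo semaphore
    return stuck;
}
```

Calling semaphore_stuck_after_kill("/stuck_demo") should return true. The kernel does not undo the dead child's decrement.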
It turns out there is no reliable way to recover the semaphore. Anyone can call sem_post() on the named semaphore to raise the count above zero again, but how do you tell when such a recovery is needed? The API is too limited: nothing in it indicates that the process holding the semaphore has died.
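To make the limitation concrete, here is a sketch of a naive "recovery" helper. The name force_release is hypothetical and it is not a recommended fix. It opens the existing semaphore and posts it if the count reads 0. The weakness is that a count of 0 looks identical whether the holder is alive or dead.

```cpp
#include <fcntl.h>
#include <semaphore.h>

// Hypothetical manual-recovery helper. Returns the count seen before any
// recovery, or -1 on error. It cannot tell WHY the count is 0: a live process
// may legitimately hold the semaphore, and posting then lets two processes
// into the critical section. The getvalue/post pair is also racy.
int force_release(const char *name)
{
    sem_t *sem = sem_open(name, 0);   // open an existing semaphore only; no O_CREAT
    if (sem == SEM_FAILED)
        return -1;
    int value = -1;
    if (sem_getvalue(sem, &value) == 0 && value == 0)
        sem_post(sem);                // bump 0 -> 1: a guess, not a diagnosis
    sem_close(sem);
    return value;
}
```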
Beware of the IPC tools that are also available: the common tools ipcmk, ipcrm, and ipcs only work with the older System V semaphores. They do not work with POSIX semaphores.
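For POSIX named semaphores, the rough equivalents of ipcs and ipcrm are the filesystem and sem_unlink(). This is a sketch with hypothetical helper names. The /dev/shm/sem.NAME location is a Linux/glibc implementation detail, not something POSIX guarantees.

```cpp
#include <cerrno>
#include <fcntl.h>
#include <semaphore.h>
#include <string>

// Linux/glibc detail (not POSIX): a named semaphore "/name" is backed by the
// file /dev/shm/sem.name, so `ls -l /dev/shm` plays the role ipcs plays for
// System V semaphores.
std::string linux_sem_path(const char *name)
{
    return std::string("/dev/shm/sem.") + (name[0] == '/' ? name + 1 : name);
}

// POSIX counterpart of ipcrm: remove the name. Processes that already have
// the semaphore open keep using it until they sem_close() it; a later
// sem_open() with O_CREAT creates a brand-new semaphore.
bool remove_named_semaphore(const char *name)
{
    return sem_unlink(name) == 0 || errno == ENOENT;  // already gone counts as success
}
```

From a shell, deleting /dev/shm/sem.testing with rm has the same effect on Linux.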
However, there are other locking mechanisms that the operating system releases automatically when an application dies, even when it dies in a way that cannot be caught in a signal handler (such as kill -9). Two examples: a listening socket bound to a particular port, or a lock on a specific file.
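The socket variant can be sketched as follows. The helper try_port_lock and the choice of a TCP port on 127.0.0.1 are assumptions for illustration. Only one socket can listen on the port at a time, and the kernel closes it however the process exits. Unlike sem_wait(), this is a try-lock: waiting means retrying.

```cpp
#include <arpa/inet.h>
#include <cstdint>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>

// Hypothetical helper: "take the lock" by listening on a loopback port.
// Returns the socket fd (keep it open while holding the lock; close() it to
// release) or -1 if another socket already listens there (EADDRINUSE).
int try_port_lock(uint16_t port)
{
    int fd = socket(AF_INET, SOCK_STREAM | SOCK_CLOEXEC, 0);
    if (fd == -1)
        return -1;
    sockaddr_in addr{};
    addr.sin_family = AF_INET;
    addr.sin_port = htons(port);
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    // No SO_REUSEADDR/SO_REUSEPORT: we want a second bind to fail.
    if (bind(fd, reinterpret_cast<sockaddr *>(&addr), sizeof addr) == -1 ||
        listen(fd, 1) == -1)
    {
        close(fd);
        return -1;
    }
    return fd;
}
```

The drawbacks are that a port number is a machine-wide resource that some unrelated program might already use, and that there is no blocking wait. Both push toward the file lock.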
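I decided the lock on a file is the solution I needed. So instead of calling sem_wait() and sem_post(), I'm using the following, where fd is a descriptor for the lock file opened for writing (lockf() requires write access):
```cpp
lockf( fd, F_LOCK, 0 );   // acquire: blocks until no other process holds the lock
lockf( fd, F_ULOCK, 0 );  // release
```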
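Wrapped in a small RAII class, the pair of calls might look like this. It is a sketch, not the author's code. The class name FileLock and the lock-file path are hypothetical; any file that all cooperating processes can open read/write will do.

```cpp
#include <cerrno>
#include <fcntl.h>
#include <stdexcept>
#include <string>
#include <unistd.h>

// Hypothetical cross-process lock built on lockf(). Constructing it blocks
// until the lock is free; destroying it releases the lock.
class FileLock
{
public:
    explicit FileLock(const std::string &path)
        : fd_(open(path.c_str(), O_RDWR | O_CREAT | O_CLOEXEC, 0666))
    {
        if (fd_ == -1)
            throw std::runtime_error("cannot open lock file " + path);
        // len 0 = from the current offset (0, freshly opened) to end of file,
        // including any future growth.
        while (lockf(fd_, F_LOCK, 0) == -1)
        {
            if (errno != EINTR)       // EINTR: interrupted by a signal, retry
            {
                close(fd_);
                throw std::runtime_error("lockf(F_LOCK) failed");
            }
        }
    }
    ~FileLock()
    {
        lockf(fd_, F_ULOCK, 0);       // explicit unlock; close() would also release it
        close(fd_);
    }
    FileLock(const FileLock &) = delete;
    FileLock &operator=(const FileLock &) = delete;

private:
    int fd_;
};
```

Usage would be scoped, e.g. `{ FileLock lock("/tmp/myapp.lock"); /* critical section */ }`. Two caveats of lockf() on Linux are worth keeping in mind. First, the lock belongs to the process, so it does not exclude other threads of the same process. Second, closing any other descriptor the process holds for the same file also drops the lock.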
When the application exits in any way, including kill -9, the kernel closes its file descriptors, which also releases the file lock. Other client apps waiting for the "semaphore" are then free to proceed as expected.
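That claim can be checked with a sketch that mirrors the earlier semaphore experiment. The helper lock_released_on_kill and the lock-file path are hypothetical. A child takes the lock and never releases it. The parent confirms the lock is held, kills the child with SIGKILL, and then acquires the lock itself.

```cpp
#include <csignal>
#include <fcntl.h>
#include <sys/wait.h>
#include <unistd.h>

// Hypothetical demo: true if a lockf() lock held by a child (a) blocks other
// processes while the child lives and (b) is released by the kernel once the
// child is killed with SIGKILL, without any cleanup code running.
bool lock_released_on_kill(const char *path)
{
    int pipefd[2];
    if (pipe(pipefd) == -1)
        return false;

    pid_t pid = fork();
    if (pid == -1)
    {
        close(pipefd[0]);
        close(pipefd[1]);
        return false;
    }
    if (pid == 0)
    {
        close(pipefd[0]);
        int fd = open(path, O_RDWR | O_CREAT, 0666);
        char ok = (fd != -1 && lockf(fd, F_LOCK, 0) == 0) ? 1 : 0;
        if (write(pipefd[1], &ok, 1) != 1)  // tell the parent we hold the lock
            _exit(1);
        for (;;)
            pause();                         // never unlock; wait to be killed
    }

    close(pipefd[1]);                        // so read() sees EOF if the child dies early
    char ok = 0;
    if (read(pipefd[0], &ok, 1) != 1)        // wait until the child has locked
        ok = 0;
    close(pipefd[0]);

    int fd = open(path, O_RDWR | O_CREAT, 0666);
    bool held = ok && fd != -1 && lockf(fd, F_TLOCK, 0) == -1;

    kill(pid, SIGKILL);                      // the "kill -9"
    waitpid(pid, nullptr, 0);

    bool released = fd != -1 && lockf(fd, F_TLOCK, 0) == 0;
    if (fd != -1)
        close(fd);
    return held && released;
}
```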
Thanks for the help, guys.