I tried to implement a simple barrier in my code that looks like this:
void waitOnBarrier(int* barrier, int numberOfThreads) {
    atomicIncrement(barrier); // atomic increment implemented in assembly
    while (*barrier < numberOfThreads);
}
And then there is a barrier usage in the code:
int g_barrier = 0; // a global variable
waitOnBarrier(&g_barrier, someKnownNumberOfThreads);
So far so good, but where should I reset my g_barrier variable back to zero? If I write something like
g_barrier = 0;
right after the waitOnBarrier call, I have a problem: if one thread is released from the barrier sooner than the others and resets g_barrier while the other threads are still executing the loop, those threads will be stuck on the barrier forever.
Explanation: waitOnBarrier will compile into something like this (pseudocode):
1: mov rax, numberOfThreads
2: mov rbx, [barrier]
3: cmp rbx, rax
4: jmp(if smaller) to 2
So if we have 2 threads syncing on the barrier, suppose thread_1 is slow and sits at instruction 3 or 4 while the faster thread_2 reaches the barrier, passes it, and goes on to reset g_barrier. When thread_1 gets back to instruction 2, it sees a zero value at [barrier] and is stuck on the barrier forever!
The question is: how should I reset g_barrier? What place in the code is "far enough" that I can be sure all the threads have left the barrier by then? Or is there a more correct way to implement a barrier?
Barriers are actually quite difficult to implement. The main reason is that new waiters can begin arriving before all the old waiters have had a chance to execute, which precludes any kind of simple count-based implementation. My preferred solution is to have the barrier object itself simply point to a "current barrier instance" that exists on the stack of the first thread arriving at the barrier, and which will also be the last thread to leave (since it cannot leave while other threads are still referencing its stack). A very nice sample implementation in terms of pthread primitives (which could be adapted to C11 locking primitives or whatever you have to work with) is included in Michael Burr's answer to my past question on the topic:
https://stackoverflow.com/a/5902671/379897
Yes, it looks like a lot of work, but writing a barrier implementation that actually satisfies the contract of a barrier is non-trivial.
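To make the "instance on the first arriver's stack" idea concrete, here is a minimal sketch of how it could look with a pthread mutex and condition variables. This is my own illustration and not the code from the linked answer. All the names (barrier_t, struct instance, barrier_init, barrier_wait, barrier_demo) are hypothetical, and return values of the pthread calls are ignored for brevity.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>

/* One round of the barrier. It lives on the stack of the first arriver. */
struct instance {
    unsigned arrived;          /* threads that have reached this round */
    unsigned inside;           /* non-owner threads still using this struct */
    pthread_cond_t released;   /* broadcast when the last thread arrives */
    pthread_cond_t drained;    /* signalled when the last non-owner leaves */
};

typedef struct {
    pthread_mutex_t lock;      /* protects the fields below and the instance */
    unsigned count;            /* threads required per round */
    struct instance *current;  /* NULL when no round is in progress */
} barrier_t;

void barrier_init(barrier_t *b, unsigned count) {
    assert(count > 0);
    pthread_mutex_init(&b->lock, NULL);
    b->count = count;
    b->current = NULL;
}

void barrier_wait(barrier_t *b) {
    pthread_mutex_lock(&b->lock);
    struct instance *inst = b->current;
    if (inst == NULL) {
        /* First arriver: publish an instance that lives on this stack. */
        struct instance mine = { .arrived = 1, .inside = 0 };
        pthread_cond_init(&mine.released, NULL);
        pthread_cond_init(&mine.drained, NULL);
        if (b->count > 1) {
            b->current = &mine;
            while (mine.arrived < b->count)
                pthread_cond_wait(&mine.released, &b->lock);
            /* Must not return while anyone still points at `mine`. */
            while (mine.inside > 0)
                pthread_cond_wait(&mine.drained, &b->lock);
        }
        pthread_cond_destroy(&mine.released);
        pthread_cond_destroy(&mine.drained);
    } else {
        inst->arrived++;
        inst->inside++;
        if (inst->arrived == b->count) {
            b->current = NULL;  /* later arrivals start a fresh round */
            pthread_cond_broadcast(&inst->released);
        } else {
            while (inst->arrived < b->count)
                pthread_cond_wait(&inst->released, &b->lock);
        }
        if (--inst->inside == 0)
            pthread_cond_signal(&inst->drained);
    }
    pthread_mutex_unlock(&b->lock);
}

/* Hypothetical stress test: reuse one barrier for many rounds, no reset. */
#define DEMO_MAX_THREADS 16

struct demo {
    barrier_t barrier;
    unsigned nthreads, rounds;
    atomic_uint total;         /* arrivals summed over all rounds */
    atomic_uint violations;    /* times a thread got through too early */
};

static void *demo_worker(void *arg) {
    struct demo *d = arg;
    for (unsigned r = 0; r < d->rounds; r++) {
        atomic_fetch_add(&d->total, 1);
        barrier_wait(&d->barrier);
        /* Everyone must have checked in for round r before we got here. */
        if (atomic_load(&d->total) < (r + 1) * d->nthreads)
            atomic_fetch_add(&d->violations, 1);
    }
    return NULL;
}

/* Returns the number of violations seen, or (unsigned)-1 on bad input. */
unsigned barrier_demo(unsigned nthreads, unsigned rounds) {
    struct demo d = { .nthreads = nthreads, .rounds = rounds };
    pthread_t tids[DEMO_MAX_THREADS];
    if (nthreads == 0 || nthreads > DEMO_MAX_THREADS)
        return (unsigned)-1;
    atomic_init(&d.total, 0);
    atomic_init(&d.violations, 0);
    barrier_init(&d.barrier, nthreads);
    for (unsigned i = 0; i < nthreads; i++)
        pthread_create(&tids[i], NULL, demo_worker, &d);
    for (unsigned i = 0; i < nthreads; i++)
        pthread_join(tids[i], NULL);
    pthread_mutex_destroy(&d.barrier.lock);
    return atomic_load(&d.violations);
}
```

Nothing ever has to be reset here. The last arriver clears b->current before releasing anyone, so a fast thread that loops around and hits the barrier again starts a new instance instead of touching the old one. Calling barrier_demo(4, 1000) from main (build with -pthread) should report no violations if the sketch is right.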