I have a question about CUDA synchronization, specifically about synchronization inside if statements. If I put a __syncthreads() inside the scope of an if statement that is hit by only a fraction of the threads in a block, what happens? I thought those threads would remain "forever" waiting for the other threads that never reach the synchronization point. So I wrote and ran some sample code to investigate:
__global__ void kernel(float* vett, int n)
{
    int index = blockIdx.x*blockDim.x + threadIdx.x;
    int gridSize = blockDim.x*gridDim.x;
    while( index < n )
    {
        vett[index] = 2;
        if(threadIdx.x < 10)
        {
            vett[index] = 100;
            __syncthreads();
        }
        __syncthreads();
        index += gridSize;
    }
}
Surprisingly enough, I observed that the output was pretty "normal" (64 elements, block size 32):
100 100 100 100 100 100 100 100 100 100 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
100 100 100 100 100 100 100 100 100 100 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
So I slightly modified my code in the following way:
__global__ void kernel(float* vett, int n)
{
    int index = blockIdx.x*blockDim.x + threadIdx.x;
    int gridSize = blockDim.x*gridDim.x;
    while( index < n )
    {
        vett[index] = 2;
        if(threadIdx.x < 10)
        {
            vett[index] = 100;
            __syncthreads();
        }
        __syncthreads();
        vett[index] = 3;
        __syncthreads();
        index += gridSize;
    }
}
And the output was:
3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
Again, I was wrong: I thought that the threads inside the if statement, after modifying the element of the vector, would remain in a wait state and never leave the if scope. So, could you please clarify what happened? Does a thread that gets past a synchronization point unblock the threads waiting at the barrier? If you need to reproduce my situation, I used CUDA Toolkit 5.0 RC with SDK 4.2. Thanks a lot in advance.
In short, the behavior is undefined. So it may sometimes do what you want, or it may not, or (quite likely) it will just hang or crash your kernel.
If you are really curious how things work internally, you need to remember that threads do not execute independently; they execute a warp (a group of 32 threads) at a time.
This of course creates a problem with conditional branches where the condition does not evaluate uniformly throughout the warp. The problem is solved by executing both paths, one after the other, each with those threads disabled that are not supposed to execute that path. IIRC on existing hardware the branch is taken first, then the path where the branch is not taken is executed, but this behavior is undefined and thus not guaranteed.
This separate execution of paths continues up to some point for which the compiler can determine that it is guaranteed to be reached by all threads of the two separate execution paths (the "reconvergence point" or "synchronization point"). When execution of the first code path reaches this point, it is stopped and the second code path is executed instead. When the second path reaches the synchronization point, all threads are enabled again and execution continues uniformly from there.
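To make this concrete, here is a minimal illustrative sketch (the kernel name and values are made up, not taken from your code) of a warp hitting a non-uniform branch:

__global__ void divergent_example(int* out)
{
    int tid = threadIdx.x;
    if(tid < 16)
    {
        out[tid] = 1;   // path A: executed with threads 16..31 disabled
    }
    else
    {
        out[tid] = 2;   // path B: executed with threads 0..15 disabled
    }                   // (the order of the two paths is not guaranteed)
    // reconvergence point: all 32 threads of the warp are enabled again
    out[tid] += 10;     // executed uniformly by the whole warp
}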
The situation gets more complicated if another conditional branch is encountered before the synchronization. This problem is solved with a stack of paths that still need to be executed (luckily the growth of the stack is limited as we can have at most 32 different code paths for one warp).
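As an illustration (again a made-up sketch), even a single level of nested non-uniform branching already requires keeping several pending paths on that stack:

__global__ void nested_example(int* out)
{
    int tid = threadIdx.x;
    if(tid < 16)
    {
        if(tid < 8) out[tid] = 1;   // path A1
        else        out[tid] = 2;   // path A2
        // inner reconvergence for threads 0..15
    }
    else
    {
        if(tid < 24) out[tid] = 3;  // path B1
        else         out[tid] = 4;  // path B2
        // inner reconvergence for threads 16..31
    }
    // outer reconvergence: the full warp executes together again
}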
Where the synchronization points are inserted is undefined and even varies slightly between architectures, so again there are no guarantees. The only (unofficial) comment you will get from Nvidia is that the compiler is pretty good at finding optimal synchronization points. However there are often subtle issues that may move the optimal point further down than you might expect, particularly if threads exit early.
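A common example of such an early exit (an illustrative sketch, not your code) is the usual bounds-check return, which can push the reconvergence point all the way to the end of the kernel:

__global__ void early_exit_example(const float* in, float* out, int n)
{
    int tid = blockIdx.x*blockDim.x + threadIdx.x;
    if(tid >= n)
        return;             // these threads leave the kernel here

    out[tid] = 2.0f*in[tid];
    // a __syncthreads() placed here would only be reached by the surviving
    // threads, i.e. it would sit in divergent control flow
}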
Now, to understand the behavior of the __syncthreads() directive (which translates into a bar.sync instruction in PTX), it is important to realize that this instruction is not executed per thread but for the whole warp at once (regardless of whether any threads are disabled), because only the warps of a block need to be synchronized with each other. The threads of a warp already execute in sync, and further synchronization will either have no effect (if all threads are enabled) or lead to a deadlock when trying to sync the threads from different conditional code paths.
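Applied to your kernel, the safe pattern is to keep the divergent work inside the if but place the barrier only in uniform control flow, something like the following sketch. (It assumes every thread of a block runs the same number of loop iterations, e.g. n is a multiple of the total thread count as in your test; otherwise the barrier inside the loop is itself divergent.)

__global__ void kernel(float* vett, int n)
{
    int index = blockIdx.x*blockDim.x + threadIdx.x;
    int gridSize = blockDim.x*gridDim.x;
    while( index < n )
    {
        vett[index] = 2;
        if(threadIdx.x < 10)
            vett[index] = 100;  // divergent work, but no barrier inside the if

        __syncthreads();        // uniform: reached by all threads of the block
        index += gridSize;
    }
}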
You can work your way from this description to how your particular piece of code behaves. But keep in mind that all this is undefined, there are no guarantees, and relying on a specific behavior may break your code at any time.
You may want to look at the PTX manual for more details, particularly on the bar.sync instruction that __syncthreads() compiles to. Henry Wong's paper "Demystifying GPU Microarchitecture through Microbenchmarking", referenced below by ahmad, is also well worth reading. Even though it covers a now-outdated architecture and CUDA version, the sections about conditional branching and __syncthreads() appear to still be generally valid.