I am new to CUDA and am trying to implement a reduction. I am currently studying a sample code from NVIDIA.
I am not sure how to set up the block size and grid size, especially when my input array (512 x 512) is larger than a single block.
Here is the code:
template <unsigned int blockSize>
__global__ void reduce6(int *g_idata, int *g_odata, unsigned int n)
{
    extern __shared__ int sdata[];
    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x*(blockSize*2) + tid;
    unsigned int gridSize = blockSize*2*gridDim.x;
    sdata[tid] = 0;
    while (i < n)
    {
        sdata[tid] += g_idata[i] + g_idata[i+blockSize];
        i += gridSize;
    }
    __syncthreads();
    if (blockSize >= 512) { if (tid < 256) { sdata[tid] += sdata[tid + 256]; } __syncthreads(); }
    if (blockSize >= 256) { if (tid < 128) { sdata[tid] += sdata[tid + 128]; } __syncthreads(); }
    if (blockSize >= 128) { if (tid < 64) { sdata[tid] += sdata[tid + 64]; } __syncthreads(); }
    if (tid < 32)
    {
        if (blockSize >= 64) sdata[tid] += sdata[tid + 32];
        if (blockSize >= 32) sdata[tid] += sdata[tid + 16];
        if (blockSize >= 16) sdata[tid] += sdata[tid + 8];
        if (blockSize >= 8) sdata[tid] += sdata[tid + 4];
        if (blockSize >= 4) sdata[tid] += sdata[tid + 2];
        if (blockSize >= 2) sdata[tid] += sdata[tid + 1];
    }
    if (tid == 0) g_odata[blockIdx.x] = sdata[0];
}
However, it seems to me that g_odata[blockIdx.x] holds the partial sum from each block, so to get the final result I need to sum all the elements of the g_odata array.
Is there a kernel that does the whole summation, or am I misunderstanding something here? I would really appreciate it if anyone could explain this. Thanks very much.
Your understanding is correct. The reductions demonstrated here end up with a sequence of block-sums deposited in global memory.
Summing all of these block sums together requires some form of global synchronization: you must wait until all the blocks are complete before adding their sums together. You have a number of options at this point.
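One commonly used option is to let each block add its own sum to a single global accumulator with atomicAdd, so no separate pass over the block sums is needed. The following is only a minimal sketch of that idea, not the NVIDIA sample: the kernel reduceAtomic and the host helper sumWithAtomics are hypothetical names, the launch configuration (64 blocks of 256 threads) is an arbitrary assumption, the block size is assumed to be a power of two, and error checking is omitted for brevity.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Hypothetical kernel: each block reduces its share of the input in shared
// memory, then thread 0 adds the block sum to one global accumulator.
__global__ void reduceAtomic(const int *in, int *total, unsigned int n)
{
    extern __shared__ int sdata[];
    unsigned int tid = threadIdx.x;
    unsigned int stride = blockDim.x * gridDim.x;

    // Grid-stride loop, so any n works with any grid size.
    int sum = 0;
    for (unsigned int i = blockIdx.x * blockDim.x + tid; i < n; i += stride)
        sum += in[i];
    sdata[tid] = sum;
    __syncthreads();

    // Tree reduction in shared memory (assumes blockDim.x is a power of two).
    for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s) sdata[tid] += sdata[tid + s];
        __syncthreads();
    }

    // One atomic per block combines the block sums without a second kernel.
    if (tid == 0) atomicAdd(total, sdata[0]);
}

// Hypothetical host helper: copies the data over, runs the kernel once,
// and returns the total. Error checking is omitted.
int sumWithAtomics(const std::vector<int> &h_in)
{
    const unsigned int n = static_cast<unsigned int>(h_in.size());
    const unsigned int threads = 256;  // assumed block size (power of two)
    const unsigned int blocks = 64;    // assumed grid size

    int *d_in = nullptr, *d_total = nullptr;
    int h_total = 0;
    cudaMalloc(&d_in, n * sizeof(int));
    cudaMalloc(&d_total, sizeof(int));
    cudaMemcpy(d_in, h_in.data(), n * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemset(d_total, 0, sizeof(int));  // accumulator must start at zero

    reduceAtomic<<<blocks, threads, threads * sizeof(int)>>>(d_in, d_total, n);

    // cudaMemcpy on the default stream waits for the kernel to finish.
    cudaMemcpy(&h_total, d_total, sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_total);
    return h_total;
}
```

Calling sumWithAtomics on a 512 x 512 array of ones should return 262144. The trade-off is that the atomics serialize at the end of each block, which is usually cheap when there is only one atomic per block.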
If you search around the CUDA tag you can find examples of all these, and discussions of their pros and cons. To see how the main kernel you posted is used for a complete reduction, look at the parallel reduction sample code.
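To make the launch configuration concrete for a 512 x 512 input, here is a hedged sketch of another common approach: run a block-sum kernel once over the input, then run it again with a single block over the partial sums. Kernel launches issued to the same stream execute in order, so the second launch only starts after every block of the first has finished, which provides the global synchronization. This is not the NVIDIA sample code: the kernel blockSums and the helper sumTwoPass are hypothetical names, a simpler grid-stride kernel is swapped in for reduce6, the 256-thread block size and the 1024-block cap are assumptions, and error checking is omitted.

```cuda
#include <cuda_runtime.h>
#include <algorithm>
#include <vector>

// Hypothetical kernel: writes one partial sum per block to out[blockIdx.x].
__global__ void blockSums(const int *in, int *out, unsigned int n)
{
    extern __shared__ int sdata[];
    unsigned int tid = threadIdx.x;
    unsigned int stride = blockDim.x * gridDim.x;

    // Grid-stride loop, so the grid does not have to cover n exactly.
    int sum = 0;
    for (unsigned int i = blockIdx.x * blockDim.x + tid; i < n; i += stride)
        sum += in[i];
    sdata[tid] = sum;
    __syncthreads();

    // Tree reduction in shared memory (assumes blockDim.x is a power of two).
    for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s) sdata[tid] += sdata[tid + s];
        __syncthreads();
    }
    if (tid == 0) out[blockIdx.x] = sdata[0];
}

// Hypothetical host helper: two launches of the same kernel.
int sumTwoPass(const std::vector<int> &h_in)
{
    const unsigned int n = static_cast<unsigned int>(h_in.size());
    const unsigned int threads = 256;  // assumed block size (power of two)
    // Enough blocks to cover the input, capped at an assumed 1024.
    const unsigned int blocks =
        std::max(1u, std::min((n + threads - 1) / threads, 1024u));

    int *d_in = nullptr, *d_partial = nullptr, *d_out = nullptr;
    int h_out = 0;
    cudaMalloc(&d_in, n * sizeof(int));
    cudaMalloc(&d_partial, blocks * sizeof(int));
    cudaMalloc(&d_out, sizeof(int));
    cudaMemcpy(d_in, h_in.data(), n * sizeof(int), cudaMemcpyHostToDevice);

    // Pass 1: one partial sum per block.
    blockSums<<<blocks, threads, threads * sizeof(int)>>>(d_in, d_partial, n);
    // Pass 2: a single block sums the partial sums into d_out[0].
    blockSums<<<1, threads, threads * sizeof(int)>>>(d_partial, d_out, blocks);

    cudaMemcpy(&h_out, d_out, sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_partial);
    cudaFree(d_out);
    return h_out;
}
```

For a 512 x 512 input (262144 elements) this configuration yields 1024 blocks of 256 threads in the first pass, and a single block of 256 threads in the second pass to add up the 1024 partial sums.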