I'm new to CUDA, and I can't understand loop unrolling. I've written a piece of code to understand the technique
__global__ void kernel(float *b, int size)
{
    int tid = blockDim.x * blockIdx.x + threadIdx.x;
    #pragma unroll
    for (int i = 0; i < size; i++)
        b[i] = i;
}
Above is my kernel function. In main
I call it like below
int main()
{
    float *a; // host array
    float *b; // device array
    int size = 100;
    a = (float*)malloc(size * sizeof(float));
    cudaMalloc((void**)&b, size * sizeof(float));
    cudaMemcpy(b, a, size * sizeof(float), cudaMemcpyHostToDevice);
    kernel<<<1, size>>>(b, size); // size = 100
    cudaMemcpy(a, b, size * sizeof(float), cudaMemcpyDeviceToHost);
    for (int i = 0; i < size; i++)
        cout << a[i] << "\t";
    _getch();
    return 0;
}
Does it mean I have size*size = 10000 threads running to execute the program? Are 100 of them created when the loop is unrolled?
No. It means you have launched a CUDA kernel with one block, and that one block has 100 active threads. You're passing size as the second parameter to your kernel. In your kernel, each of those 100 threads executes the for loop 100 times.
#pragma unroll is a compiler directive that can, for example, replace a piece of code like
for (int i = 0; i < 5; i++)
    b[i] = i;
with
b[0] = 0;
b[1] = 1;
b[2] = 2;
b[3] = 3;
b[4] = 4;
by putting the #pragma unroll directive right before the loop. The good thing about the unrolled version is that it involves less processing load for the processor. In the for-loop version, the processing, in addition to assigning each i to b[i], involves initializing i, evaluating i < 5 six times, and incrementing i five times. In the unrolled case, it only involves filling up the contents of b (perhaps plus int i = 5; if i is used later). Another benefit of loop unrolling is the enhancement of instruction-level parallelism (ILP). In the unrolled version, there are potentially more independent operations for the processor to push into the processing pipeline without worrying about the loop condition on every iteration.
Posts like this explain that runtime loop unrolling cannot happen in CUDA. In your case, the CUDA compiler doesn't have any clue that size is going to be 100, so compile-time loop unrolling will not occur; if you force unrolling anyway, you may end up hurting performance.
If you are sure that size is 100 for all executions, you can unroll your loop like below:
#pragma unroll
for (int i = 0; i < SIZE; i++) // or simply for (int i = 0; i < 100; i++)
    b[i] = i;
in which SIZE is known at compile time via #define SIZE 100.
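Putting that together, a minimal sketch of a fully compile-time-unrollable version might look like the following. This is an illustration, not a tested program (it needs nvcc and a GPU), and it launches a single thread since, as noted above, every thread in your original kernel writes the entire array anyway:

```cuda
#include <cstdio>

#define SIZE 100  // known at compile time, so the loop below can unroll

__global__ void kernel(float *b)
{
    // The trip count is the compile-time constant SIZE, so the compiler
    // can honor #pragma unroll and emit 100 straight-line assignments.
    #pragma unroll
    for (int i = 0; i < SIZE; i++)
        b[i] = i;
}

int main()
{
    float a[SIZE]; // host array
    float *b;      // device array

    cudaMalloc((void**)&b, SIZE * sizeof(float));
    kernel<<<1, 1>>>(b); // one thread suffices: each thread writes all of b
    cudaMemcpy(a, b, SIZE * sizeof(float), cudaMemcpyDeviceToHost);

    for (int i = 0; i < SIZE; i++)
        printf("%g\t", a[i]);

    cudaFree(b);
    return 0;
}
```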
I also suggest you add proper CUDA error checking to your code (explained here).