Is it necessary to include some CUDA header file in order to use atomic operations in CUDA? The CUDA programming guide seems to be tight-lipped on this.
The code glmax.cu given below gives me the following compilation error.
gaurish108 MyPractice: nvcc glmax.cu -o glmax
glmax.cu(11): error: identifier "atomicMax" is undefined
1 error detected in the compilation of "/tmp/tmpxft_000010fa_00000000-4_glmax.cpp1.ii".
Here is the code. It calculates the maximum value of an array on the GPU using the atomic operation atomicMax. Since I am new to CUDA, I am sure this is quite naive code, but I wrote it to help myself understand atomic operations.
#include<stdio.h>
#include<stdlib.h>
#include<math.h>
__global__ void global_max(int* values, int* gl_max)
{
int i=threadIdx.x + blockDim.x * blockIdx.x;
int val=values[i];
atomicMax(gl_max,val);
}
int main(void)
{
int array_size=5;
int num_bytes=array_size*sizeof(int);
int *device_array=0;
int *host_array=0;
int *device_max=0;
int *host_max=0;
//Allocate memory on the host
host_array=(int*)malloc(num_bytes);
//Allocate memory on the device
cudaMalloc((void**)&device_array,num_bytes);
cudaMalloc((void**)&device_max,sizeof(int));
//If any memory allocation failed, report an error message
if(host_array == 0 || device_array == 0 || device_max == 0)
{
printf("couldn't allocate memory\n");
return 1;
}
//Assign a random integer in the interval [0,25] to host_array members
for(int i=0;i<array_size;++i)
{
*(host_array+i)=rand()%26;
}
//Print the host array members
printf("Host Array\n");
for(int i=0;i<array_size;++i)
{
printf("%d ",*(host_array+i));
}
printf("\n");
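//Copy array from host to device.
cudaMemcpy(device_array,host_array,num_bytes,cudaMemcpyHostToDevice);
//Initialize the device-side maximum to 0 (all array values lie in [0,25]); atomicMax compares against the value already stored there.
cudaMemset(device_max,0,sizeof(int));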
//Configure and launch the kernel which calculates the maximum element in the device array.
int grid_size=1;//Only 1 block of threads is used
int block_size=5;//One block contains only 5 threads
//Device array passed to the kernel as data.
global_max<<<grid_size,block_size>>>(device_array,device_max);
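//Transfer the maximum value so calculated into the CPU and print it
host_max=(int*)malloc(sizeof(int));//Allocate host storage for the result before copying into it
cudaMemcpy(host_max,device_max,sizeof(int),cudaMemcpyDeviceToHost);
printf("\nMaximum value is %d\n",*host_max);
free(host_max);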
// deallocate memory
free(host_array);
cudaFree(device_array);
cudaFree(device_max);
return 0;
}
I don't believe an #include is necessary. Atomic operations are not available on "Compute Capability" 1.0 (sm_10) devices, which is what you're asking nvcc to compile for (by default).
To use atomicMax in your code, specify at least -arch=sm_11 on the command line:
$ nvcc -arch=sm_11 glmax.cu -o glmax
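For what it's worth, here is a minimal sketch of the same max computation that should build with an -arch setting of sm_11 or later. It is not the only way to write it. The cpu_max helper, the extra n parameter, and the sample data are my own assumptions for illustration. The sketch keeps the result in an ordinary host variable and seeds the device-side maximum before the launch, since atomicMax compares against whatever is already stored at that address.

```cuda
#include <stdio.h>
#include <limits.h>

// Host-side reference (hypothetical helper), used only to cross-check the GPU result.
static int cpu_max(const int* values, int n)
{
    int best = INT_MIN;
    for (int i = 0; i < n; ++i)
        if (values[i] > best) best = values[i];
    return best;
}

// Each in-range thread folds one element into the running maximum.
__global__ void global_max(const int* values, int n, int* gl_max)
{
    int i = threadIdx.x + blockDim.x * blockIdx.x;
    if (i < n)  // guard against threads past the end of the array
        atomicMax(gl_max, values[i]);
}

int main(void)
{
    const int n = 5;
    int host_array[n] = {3, 17, 8, 25, 11};  // illustrative data, not from the question
    int host_max = INT_MIN;                  // seed below any possible element
    int* device_array = 0;
    int* device_max = 0;

    if (cudaMalloc((void**)&device_array, n * sizeof(int)) != cudaSuccess ||
        cudaMalloc((void**)&device_max, sizeof(int)) != cudaSuccess) {
        printf("couldn't allocate device memory\n");
        return 1;
    }

    cudaMemcpy(device_array, host_array, n * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(device_max, &host_max, sizeof(int), cudaMemcpyHostToDevice);

    global_max<<<1, n>>>(device_array, n, device_max);

    // cudaMemcpy waits for the kernel to finish before copying the result back.
    cudaError_t err = cudaMemcpy(&host_max, device_max, sizeof(int), cudaMemcpyDeviceToHost);
    if (err != cudaSuccess) {
        printf("CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }

    printf("GPU maximum is %d, CPU reference is %d\n", host_max, cpu_max(host_array, n));

    cudaFree(device_array);
    cudaFree(device_max);
    return 0;
}
```

The bounds check in the kernel is not needed when the block size equals the array size, as it does here, but it keeps the kernel safe if you later round the launch up to more threads than elements.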
For future reference, you can consult Appendix F of the CUDA C Programming Guide for information on what atomic operations are available on platforms of a particular Compute Capability.
Of course, you'll need an sm_11-compatible GPU in order to execute the code. My impression is that these are common by now.
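If you're not sure what your card supports, a small sketch like the following can report it using cudaGetDeviceProperties from the runtime API. The has_global_int_atomics helper is a hypothetical name of mine; it just encodes the "compute capability 1.1 or higher" requirement mentioned above.

```cuda
#include <stdio.h>

// Hypothetical helper: 32-bit integer atomics on global memory
// (atomicMax among them) need compute capability 1.1 or higher.
static bool has_global_int_atomics(int major, int minor)
{
    return major > 1 || (major == 1 && minor >= 1);
}

int main(void)
{
    int device = 0;
    cudaDeviceProp prop;

    // Ask the runtime which device is current, then read its properties.
    cudaError_t err = cudaGetDevice(&device);
    if (err == cudaSuccess)
        err = cudaGetDeviceProperties(&prop, device);
    if (err != cudaSuccess) {
        printf("CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }

    printf("%s: compute capability %d.%d\n", prop.name, prop.major, prop.minor);
    if (has_global_int_atomics(prop.major, prop.minor))
        printf("atomicMax on global memory should be usable\n");
    else
        printf("this device predates global-memory atomics\n");
    return 0;
}
```

The deviceQuery sample that ships with the CUDA SDK prints the same information, so this is only worth having if you want the check inside your own program.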