Atomic Operations in CUDA? Which header file to include?

smilingbuddha · Nov 3, 2011

To use atomic operations in CUDA, do I need to include a particular CUDA header file? The CUDA programming guide seems to be tight-lipped on this.

Compiling the code in glmax.cu, given below, produces the following error.

gaurish108 MyPractice: nvcc glmax.cu -o glmax
glmax.cu(11): error: identifier "atomicMax" is undefined

1 error detected in the compilation of "/tmp/tmpxft_000010fa_00000000-4_glmax.cpp1.ii".

Here is the code. It calculates the maximum value of an array on the GPU using the atomic operation atomicMax. I am new to CUDA, so I am sure this code is quite naive, but I wrote it to help myself understand atomic operations.

#include<stdio.h>
#include<stdlib.h>
#include<math.h>

__global__ void global_max(int* values, int* gl_max)
{

  int i=threadIdx.x + blockDim.x * blockIdx.x;
  int val=values[i];

  atomicMax(gl_max,val);

}


int main(void)
{
  int array_size=5;
  int num_bytes=array_size*sizeof(int);
  int *device_array=0;
  int *host_array=0;

  int *device_max=0;
  int *host_max=0;

  //Allocate memory on the host
  host_array=(int*)malloc(num_bytes);
  host_max=(int*)malloc(sizeof(int)); //Holds the result copied back from the device

  //Allocate memory on the device
  cudaMalloc((void**)&device_array,num_bytes);
  cudaMalloc((void**)&device_max,sizeof(int));


  //If either memory allocation failed, report an error message
  if(host_array == 0 || device_array == 0)
  {
    printf("couldn't allocate memory\n");
    return 1;
  }

  //Assign a random integer in the  interval [0,25] to host_array members
  for(int i=0;i<array_size;++i)
    {
      *(host_array+i)=rand()%26;
    }

  //Print the host array members
  printf("Host Array\n");
  for(int i=0;i<array_size;++i)
    {
      printf("%d  ",*(host_array+i));
    }
  printf("\n");

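  //Copy array from host to device.
  cudaMemcpy(device_array,host_array,num_bytes,cudaMemcpyHostToDevice);

  //Start the device-side maximum at 0, a valid lower bound because every element lies in [0,25].
  cudaMemset(device_max,0,sizeof(int));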

  //Configure and launch the kernel which calculates the maximum element in the device array.
  int grid_size=1;//Only 1 block of threads is used
  int block_size=5;//One block contains only 5 threads

  //Device array passed to the kernel as data. 
  global_max<<<grid_size,block_size>>>(device_array,device_max);

  //Transfer the maximum value so calculated into the CPU and print it
  cudaMemcpy(host_max,device_max,sizeof(int),cudaMemcpyDeviceToHost);
  printf("\nMaximum value is %d\n",*host_max);


  // deallocate memory
  free(host_array);
  free(host_max);
  cudaFree(device_array);
  cudaFree(device_max);
  return 0;
}

Answer

Jared Hoberock · Nov 3, 2011

I don't believe an #include is necessary. Atomic operations are not available on compute capability 1.0 (sm_10) devices, and sm_10 is what nvcc compiles for by default, which is why atomicMax shows up as undefined.

To use atomicMax in your code, specify at least -arch=sm_11 on the command line:

$ nvcc -arch=sm_11 glmax.cu -o glmax
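Once the file compiles, it can help to check the launch result, so that a mismatch between the compiled architecture and the installed GPU shows up as a reported error rather than a wrong answer. The following is only a sketch, not the code from the question: the helper name max_on_device, the bounds check, the block size of 128 and the INT_MIN seed are all assumptions made for illustration.

```cuda
#include <cstdio>
#include <climits>
#include <cuda_runtime.h>

// Each thread folds one element into *result with atomicMax.
__global__ void global_max(const int* values, int n, int* result)
{
  int i = threadIdx.x + blockDim.x * blockIdx.x;
  if (i < n) {
    atomicMax(result, values[i]);
  }
}

// Hypothetical helper (not part of the question's code): computes the maximum
// of host_values[0..n-1] on the GPU. Returns INT_MIN if the launch or the
// copy back reports an error.
int max_on_device(const int* host_values, int n)
{
  int* d_values = 0;
  int* d_result = 0;
  int result = INT_MIN;  // identity element for max

  cudaMalloc((void**)&d_values, n * sizeof(int));
  cudaMalloc((void**)&d_result, sizeof(int));
  cudaMemcpy(d_values, host_values, n * sizeof(int), cudaMemcpyHostToDevice);
  // Seed the device-side result before any thread touches it.
  cudaMemcpy(d_result, &result, sizeof(int), cudaMemcpyHostToDevice);

  int block_size = 128;  // assumed value, chosen only for this sketch
  int grid_size = (n + block_size - 1) / block_size;
  global_max<<<grid_size, block_size>>>(d_values, n, d_result);

  // cudaGetLastError reports launch failures; the blocking cudaMemcpy
  // reports errors raised while the kernel was running.
  cudaError_t err = cudaGetLastError();
  if (err == cudaSuccess) {
    err = cudaMemcpy(&result, d_result, sizeof(int), cudaMemcpyDeviceToHost);
  }
  if (err != cudaSuccess) {
    fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
    result = INT_MIN;
  }

  cudaFree(d_values);
  cudaFree(d_result);
  return result;
}

int main(void)
{
  int values[5] = {3, 17, 8, 25, 0};
  printf("Maximum value is %d\n", max_on_device(values, 5));
  return 0;
}
```

Seeding the result with INT_MIN rather than 0 keeps the sketch correct for negative inputs as well.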

For future reference, you can consult Appendix F of the CUDA C Programming Guide to see which atomic operations are available at each compute capability.

Of course, you'll need an sm_11-compatible GPU in order to execute the code. My impression is that these are common by now.
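If you are unsure what your GPU supports, you can ask the runtime directly. Below is a minimal sketch that lists each device's compute capability using cudaGetDeviceCount and cudaGetDeviceProperties. The helper supports_global_int_atomics is a hypothetical name introduced here; it encodes the requirement that 32-bit integer atomics on global memory, atomicMax among them, need compute capability 1.1 or higher.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: true when the compute capability major.minor is at
// least 1.1, the minimum for 32-bit integer atomics on global memory.
bool supports_global_int_atomics(int major, int minor)
{
  return major > 1 || (major == 1 && minor >= 1);
}

int main(void)
{
  int device_count = 0;
  cudaError_t err = cudaGetDeviceCount(&device_count);
  if (err != cudaSuccess) {
    fprintf(stderr, "cudaGetDeviceCount failed: %s\n", cudaGetErrorString(err));
    return 1;
  }

  for (int dev = 0; dev < device_count; ++dev) {
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, dev) != cudaSuccess) {
      continue;  // skip devices whose properties cannot be read
    }
    printf("Device %d: %s, compute capability %d.%d, atomicMax on int: %s\n",
           dev, prop.name, prop.major, prop.minor,
           supports_global_int_atomics(prop.major, prop.minor) ? "yes" : "no");
  }
  return 0;
}
```

The reported major and minor numbers map onto the -arch flag, so a device reporting 1.1 corresponds to sm_11.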