CUDA: sum of all elements in array using linearized 2D shared memory

Gswffye · Aug 21, 2014 · Viewed 8.3k times

I am new to CUDA and to algorithms in general. Can someone tell me whether I am doing this correctly, or whether there is a better way of doing it? One requirement is that the input and output of the code stay on the GPU, so that there is no memory copying between the host and device.

#include "cuda_runtime.h"
#include "device_launch_parameters.h"

#include <stdio.h>
#include <stdint.h>

#include <iostream>

#define TILE_WIDTH 8

__global__ void gpu_sumElements(int height, int width, float *in, float *out){

    extern __shared__ float cache[];

    int w = blockIdx.x * blockDim.x + threadIdx.x; // column index
    int h = blockIdx.y * blockDim.y + threadIdx.y; // row index

    int index = h * width + w;
    int cacheIndex = threadIdx.y * blockDim.x + threadIdx.x;

    float temp = 0;

    if ((w < width) && (h < height)){
        temp += in[index];
        //index += (height * width);
    }

    cache[cacheIndex] = temp;
    __syncthreads();

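    // Tree reduction in shared memory; assumes blockDim.x * blockDim.y is a power of two.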
    int i = (blockDim.x * blockDim.y) / 2;
    while (i != 0){
        if (cacheIndex < i)
            cache[cacheIndex] += cache[cacheIndex + i];
        __syncthreads();
        i /= 2;
    }

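    // Thread 0 writes this block's partial sum to the per-block output slot.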
    if (cacheIndex == 0)
        out[blockIdx.y * gridDim.x + blockIdx.x] = cache[0];
}


int main(){
                                                                        // Initial Parameters       
    int width = 2363;
    int height = 781;
    float my_sum = 0;

    int block_height = (height - 1) / TILE_WIDTH + 1;
    int block_width = (width - 1) / TILE_WIDTH + 1;

    dim3 dimGrid(block_width, block_height, 1);
    dim3 dimBlock(TILE_WIDTH, TILE_WIDTH, 1);

    int sharedMemSize = TILE_WIDTH * TILE_WIDTH * sizeof(float);
                                                                        // Initialize host arrays
    float *test_array = new float[height * width];
    float *out = new float[height * width];
    for (int i = 0; i < (height * width); i++)
        test_array[i] = 1.0f;
                                                                        // Initialize device arrays 
    float *gpu_temp_array;
    float *gpu_out;
    cudaMalloc((void **)&gpu_temp_array, (height * width * sizeof(float)));
    cudaMalloc((void **)&gpu_out, (height * width * sizeof(float)));
    cudaMemcpy(gpu_out, test_array, (height * width * sizeof(float)), cudaMemcpyHostToDevice);

                                                                        // Copy these, need them elsewhere
    float sum_height = height; 
    float sum_width = width;
    dim3 sum_dimGrid = dimGrid;
    int i = (height * width); 

                                                                        // Launch kernel, get & print results
    while (i != 0){
        gpu_sumElements<<<sum_dimGrid, dimBlock, sharedMemSize>>>(sum_height, sum_width, gpu_out, gpu_temp_array);
        cudaMemcpy(gpu_out, gpu_temp_array, (sum_height * sum_width * sizeof(float)), cudaMemcpyDeviceToDevice);
        cudaMemset(gpu_temp_array, 0, (height * width * sizeof(float)));

        sum_height = ceil(sum_height/TILE_WIDTH);
        sum_width = ceil(sum_width/TILE_WIDTH);
        sum_dimGrid.x = (sum_width - 1) / TILE_WIDTH + 1; 
        sum_dimGrid.y = (sum_height - 1) / TILE_WIDTH + 1;
        i /= TILE_WIDTH*TILE_WIDTH;
    }

    cudaMemcpy(out, gpu_out, (height * width * sizeof(float)), cudaMemcpyDeviceToHost);
    std::cout << out[0] << std::endl << std::endl;  

    delete[] test_array;
    delete[] out;
    cudaFree(gpu_out);
    cudaFree(gpu_temp_array);

    system("pause");
    return 0;
}

Answer

Robert Crovella · Aug 21, 2014

In general, a parallel reduction that uses many kernel launches to produce one (final) result, as in the code above, is not necessary. The process of generating a well-organized parallel reduction that requires only two kernel launches for arbitrary data sizes is well documented in the CUDA parallel reduction sample code and its accompanying PDF.
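
For reference, here is a rough sketch of the building block that two-pass approach uses (my own simplified version, with made-up names such as block_sum, d_in, d_partial, and d_result, not the sample code itself). Each block reduces its portion of the input to one partial sum; launching the kernel once over the data and then once more over the partials produces the final result in two kernel calls.

__global__ void block_sum(const float *in, float *out, int n)
{
    extern __shared__ float sdata[];

    // Grid-stride load: each thread accumulates a private sum first,
    // so a fixed-size grid can handle arbitrary n.
    float sum = 0.0f;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += gridDim.x * blockDim.x)
        sum += in[i];

    sdata[threadIdx.x] = sum;
    __syncthreads();

    // Shared-memory tree reduction; assumes blockDim.x is a power of two.
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s)
            sdata[threadIdx.x] += sdata[threadIdx.x + s];
        __syncthreads();
    }

    if (threadIdx.x == 0)
        out[blockIdx.x] = sdata[0];
}

A possible two-pass usage (256 is a placeholder block/grid size):

// First pass: n elements -> 256 partial sums; second pass: 256 -> 1.
block_sum<<<256, 256, 256 * sizeof(float)>>>(d_in, d_partial, n);
block_sum<<<1, 256, 256 * sizeof(float)>>>(d_partial, d_result, 256);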

To create a parallel reduction that uses only a single kernel launch, there are at least two common approaches:

  1. Using the so-called "threadfence reduction" method. This is also demonstrated by the CUDA threadFenceReduction sample code. In this approach, the final reduction stage is carried out by keeping track of "kernel draining". Specifically, each threadblock atomically updates a "completed count" variable as it finishes its work. Since the number of threadblocks launched is known, each threadblock can determine whether it is the last one to finish. If it is, that threadblock adds up the intermediate results produced by all the other threadblocks, which by then have been written to global memory. The "threadfence" moniker comes from the fact that each threadblock must ensure its partial result is visible in global memory before exiting (using a threadfence intrinsic). This method can handle "arbitrary" reductions (a sketch follows this list).

  2. Have (a single thread in) each threadblock atomically update a final kernel-wide result, using its own partial result. This is conveniently realizable only for reductions for which a corresponding atomic function is provided, e.g. sum reduction, max finding, min finding, etc. (a sketch of this approach appears at the end of this answer).
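
A minimal sketch of the threadfence idea, assuming blockDim.x is a power of two (the names below are illustrative, not taken from the threadFenceReduction sample):

__device__ unsigned int d_retirement_count = 0;

__global__ void threadfence_sum(const float *in, float *partial, float *result, int n)
{
    extern __shared__ float sdata[];
    __shared__ bool is_last_block;

    // Grid-stride accumulation followed by a shared-memory tree reduction.
    float sum = 0.0f;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += gridDim.x * blockDim.x)
        sum += in[i];
    sdata[threadIdx.x] = sum;
    __syncthreads();

    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s)
            sdata[threadIdx.x] += sdata[threadIdx.x + s];
        __syncthreads();
    }

    if (threadIdx.x == 0) {
        partial[blockIdx.x] = sdata[0];
        __threadfence();                 // make this block's partial visible to all blocks
        unsigned int done = atomicInc(&d_retirement_count, gridDim.x);
        is_last_block = (done == gridDim.x - 1);
    }
    __syncthreads();

    // The last block to "drain" sums all the per-block partials.
    if (is_last_block) {
        float total = 0.0f;
        for (int i = threadIdx.x; i < gridDim.x; i += blockDim.x)
            total += partial[i];
        sdata[threadIdx.x] = total;
        __syncthreads();
        for (int s = blockDim.x / 2; s > 0; s >>= 1) {
            if (threadIdx.x < s)
                sdata[threadIdx.x] += sdata[threadIdx.x + s];
            __syncthreads();
        }
        if (threadIdx.x == 0) {
            *result = sdata[0];
            d_retirement_count = 0;      // reset so the kernel can be launched again
        }
    }
}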

Either of the above methods will benefit from the basic techniques covered in the CUDA parallel reduction sample code, in particular reducing the number of threadblocks to the minimum value that still allows full utilization of the GPU. This optimization also minimizes the number of atomic operations. With these optimizations in mind, the reduction can be faster, and "simpler" (e.g. a single kernel call, without much host management of intermediate results), than a corresponding 2-kernel or multi-kernel reduction.
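
As one rough illustration of the second approach combined with that optimization (again my own simplified sketch, with made-up names like atomic_sum, d_in, and d_result, not code from the sample): launch only as many blocks as are needed to fill the GPU, let each block stride over the input, and finish with one atomicAdd per block. Note that atomicAdd on float requires compute capability 2.0 or later, and *result must be zeroed before the launch.

__global__ void atomic_sum(const float *in, float *result, int n)
{
    extern __shared__ float sdata[];

    // A small fixed-size grid strides over the whole input.
    float sum = 0.0f;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += gridDim.x * blockDim.x)
        sum += in[i];
    sdata[threadIdx.x] = sum;
    __syncthreads();

    // Shared-memory tree reduction; assumes blockDim.x is a power of two.
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s)
            sdata[threadIdx.x] += sdata[threadIdx.x + s];
        __syncthreads();
    }

    // One atomic operation per block, not per element.
    if (threadIdx.x == 0)
        atomicAdd(result, sdata[0]);
}

A possible launch, sizing the grid to roughly fill the device (64 and 256 are placeholder numbers; tune them to your GPU):

cudaMemset(d_result, 0, sizeof(float));
atomic_sum<<<64, 256, 256 * sizeof(float)>>>(d_in, d_result, n);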