Limiting register usage in CUDA: __launch_bounds__ vs maxrregcount

Question 1

Limiting register usage in CUDA: __launch_bounds__ vs maxrregcount

cuda gpu nvidia nvcc

Kelsius · Jun 22, 2017 · Viewed 8.1k times · Source

Answer

Answer

The preface of this question is that, quoting the CUDA C Programming Guide,

the fewer registers a kernel uses, the more threads and thread blocks are likely to reside on a multiprocessor, which can improve performance.

Now, __launch_bounds__ and maxregcount limit register usage by two different mechanisms.

__launch_bounds__

nvcc decides the number of registers to be used by a __global__ function through balancing the performance and the generality of the kernel launch setup. Saying it differently, such a choice of the number of used registers "guarantees effectiveness" for different numbers of threads per block and of blocks per multiprocessor. However, if an approximate idea of the maximum number of threads per block and (possibly) of the minimum number of blocks per multiprocessor is available at compile-time, then this information can be used to optimize the kernel for such launches. In other words

#define MAX_THREADS_PER_BLOCK 256
#define MIN_BLOCKS_PER_MP     2

__global__ void
__launch_bounds__(MAX_THREADS_PER_BLOCK, MIN_BLOCKS_PER_MP)
fooKernel(int *inArr, int *outArr)
{
    // ... Computation of kernel
}

informs the compiler of a likely launch configuration, so that nvcc can select the number of registers for such a launch configuration in an "optimal" way.

The MAX_THREADS_PER_BLOCK parameter is mandatory, while the MIN_BLOCKS_PER_MP parameter is optional. Also note that if the kernel is launched with a number of threads per block larger than MAX_THREADS_PER_BLOCK, the kernel launch will fail.

The limiting mechanism is described in the Programming Guide as follows:

If launch bounds are specified, the compiler first derives from them the upper limit L on the number of registers the kernel should use to ensure that minBlocksPerMultiprocessor blocks (or a single block if minBlocksPerMultiprocessor is not specified) of maxThreadsPerBlock threads can reside on the multiprocessor. The compiler then optimizes register usage in the following way:

If the initial register usage is higher than L, the compiler reduces it further until it becomes less or equal to L, usually at the expense of more local memory usage and/or higher number of instructions;

Accordingly, __launch_bounds__ can lead to register spill.

maxrregcount

maxrregcount is a compiler flag that simply hardlimits the number of employed registers to a number set by the user, at variance with __launch_bounds__, by forcing the compiler to rearrange its use of registers. When the compiler can't stay below the imposed limit, it will simply spill it to local memory which is in fact DRAM. Even this local variables are stored in global DRAM memory variables can be cached in L1, L2.

Question 2

From the NVIDIA CUDA C Programming Guide:

Register usage can be controlled using the maxrregcount compiler option or launch bounds as described in Launch Bounds.

From my understanding (and correct me if I'm wrong), while -maxrregcount limits the number of registers the entire .cu file may use, the __launch_bounds__ qualifier defines the maxThreadsPerBlock and minBlocksPerMultiprocessor for each __global__ kernel. These two accomplish the same task, but in two different ways.

My usage requires me to have 40 registers per thread to maximize the performance. Thus, I can use -maxrregcount 40. I can also force 40 registers by using __launch_bounds__(256, 6) but this causes load & store register spills.

What is the difference between the two to cause these register spills?

Limiting register usage in CUDA: __launch_bounds__ vs maxrregcount

Answer

Related questions