From the NVIDIA CUDA C Programming Guide:
Register usage can be controlled using the
maxrregcount
compiler option or launch bounds as described in Launch Bounds.
From my understanding (and correct me if I'm wrong), while -maxrregcount
limits the number of registers the entire .cu
file may use, the __launch_bounds__
qualifier defines the maxThreadsPerBlock
and minBlocksPerMultiprocessor
for each __global__
kernel. These two accomplish the same task, but in two different ways.
My usage requires me to have 40
registers per thread to maximize the performance. Thus, I can use -maxrregcount 40
. I can also force 40
registers by using __launch_bounds__(256, 6)
but this causes load & store register spills.
What is the difference between the two to cause these register spills?
The preface of this question is that, quoting the CUDA C Programming Guide
,
the fewer registers a kernel uses, the more threads and thread blocks are likely to reside on a multiprocessor, which can improve performance.
Now, __launch_bounds__
and maxregcount
limit register usage by two different mechanisms.
__launch_bounds__
nvcc
decides the number of registers to be used by a __global__
function through balancing the performance and the generality of the kernel launch setup. Saying it differently, such a choice of the number of used registers "guarantees effectiveness" for different numbers of threads per block and of blocks per multiprocessor. However, if an approximate idea of the maximum number of threads per block and (possibly) of the minimum number of blocks per multiprocessor is available at compile-time, then this information can be used to optimize the kernel for such launches. In other words
#define MAX_THREADS_PER_BLOCK 256
#define MIN_BLOCKS_PER_MP 2
__global__ void
__launch_bounds__(MAX_THREADS_PER_BLOCK, MIN_BLOCKS_PER_MP)
fooKernel(int *inArr, int *outArr)
{
// ... Computation of kernel
}
informs the compiler of a likely launch configuration, so that nvcc
can select the number of registers for such a launch configuration in an "optimal" way.
The MAX_THREADS_PER_BLOCK
parameter is mandatory, while the MIN_BLOCKS_PER_MP
parameter is optional. Also note that if the kernel is launched with a number of threads per block larger than MAX_THREADS_PER_BLOCK
, the kernel launch will fail.
The limiting mechanism is described in the Programming Guide
as follows:
If launch bounds are specified, the compiler first derives from them the upper limit
L
on the number of registers the kernel should use to ensure thatminBlocksPerMultiprocessor
blocks (or a single block ifminBlocksPerMultiprocessor
is not specified) ofmaxThreadsPerBlock
threads can reside on the multiprocessor. The compiler then optimizes register usage in the following way:
- If the initial register usage is higher than
L
, the compiler reduces it further until it becomes less or equal toL
, usually at the expense of more local memory usage and/or higher number of instructions;
Accordingly, __launch_bounds__
can lead to register spill.
maxrregcount
maxrregcount
is a compiler flag that simply hardlimits the number of employed registers to a number set by the user, at variance with __launch_bounds__
, by forcing the compiler to rearrange its use of registers. When the compiler can't stay below the imposed limit, it will simply spill it to local memory which is in fact DRAM
. Even this local variables are stored in global DRAM
memory variables can be cached in L1, L2.