I cannot figure out by myself what the best way is to ensure that the memory used in my kernel is constant. There is a similar question at http://stackoverflow...r-pleasant-way. I am working with a GTX 580 and compiling only for 2.0 capability. My kernel looks like
__global__ void Foo(const int *src, float *result) {...}
I execute the following code on the host:
cudaMalloc((void **)&src, size);
cudaMemcpy(src, hostSrc, size, cudaMemcpyHostToDevice);
Foo<<<...>>>(src, result);
The alternative way is to add
__constant__ int src[size];
to the .cu file, remove the src pointer from the kernel, and execute
cudaMemcpyToSymbol(src, hostSrc, size, 0, cudaMemcpyHostToDevice);
Foo<<<...>>>(result);
Are these two ways equivalent, or does the first one not guarantee the use of constant memory instead of global memory? size changes dynamically, so the second way is not handy in my case.
The second way is the only way to ensure that the array is compiled to CUDA constant memory and accessed correctly via the constant memory cache. But you should ask yourself how the contents of that array are going to be accessed within a block of threads. If every thread will access the array uniformly, then there will be a performance advantage in using constant memory, because there is a broadcast mechanism from the constant memory cache (it also saves global memory bandwidth, because constant memory is stored in off-chip DRAM and the cache reduces the DRAM transaction count). But if access is random, then there can be serialisation of access to constant memory, which will negatively affect performance.
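To make the distinction concrete, here is a minimal sketch (the array name coeffs and both kernel names are hypothetical). The first kernel reads the same element in every thread of the warp and gets the broadcast; the second reads a different element per thread and gets serialised:

__constant__ float coeffs[16];

__global__ void FooUniform(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float acc = 0.0f;
    // Every thread reads coeffs[j] with the same j at the same time,
    // so the constant cache broadcasts one value to the whole warp.
    for (int j = 0; j < 16; ++j)
        acc += coeffs[j];
    out[i] = acc * in[i];
}

__global__ void FooDivergent(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    // Threads in the same warp hit different elements, so the reads
    // are serialised by the constant cache.
    out[i] = coeffs[threadIdx.x % 16] * in[i];
}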
Typical things which might be good fits for __constant__ memory would be model coefficients, weights, and other constant values which need to be set at runtime. On Fermi GPUs, the kernel argument list is stored in constant memory, for example. But if the contents are accessed non-uniformly, or if the type or size of members isn't constant from call to call, then normal global memory is preferable.
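A sketch of that coefficients pattern (modelWeights and setWeights are illustrative names): the array has a fixed compile-time size, but its contents are set from the host at runtime, and kernels read it without taking it as an argument:

__constant__ float modelWeights[64];  // fixed-size, contents set at runtime

// Host side: write the coefficients into the symbol once, then launch
// kernels that read modelWeights directly instead of via a pointer.
void setWeights(const float *hostWeights, size_t count)
{
    cudaMemcpyToSymbol(modelWeights, hostWeights, count * sizeof(float));
}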
Also keep in mind that there is a limit of 64kB of constant memory per GPU context, so it is not practical to store very large amounts of data in constant memory. If you need a lot of read-only storage with a cache, it might be worth binding the data to a texture and seeing what the performance is like. On pre-Fermi cards it usually yields a handy performance gain; on Fermi the results can be less predictable compared to global memory, because of the improved cache layout in that architecture.
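If you want to try the texture route, a sketch using the legacy texture-reference API of that era might look like this (srcTex, FooTex, and the host variables are illustrative, and error checking is omitted):

texture<int, 1, cudaReadModeElementType> srcTex;  // module-scope texture reference

__global__ void FooTex(float *result, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        result[i] = (float)tex1Dfetch(srcTex, i);  // fetch goes through the texture cache
}

// Host side: bind an ordinary device allocation to the texture
int *src;
cudaMalloc((void **)&src, size);
cudaMemcpy(src, hostSrc, size, cudaMemcpyHostToDevice);
cudaBindTexture(NULL, srcTex, src, size);
FooTex<<<grid, block>>>(result, n);
cudaUnbindTexture(srcTex);

Unlike __constant__ memory, this places no 64kB limit on the data and tolerates non-uniform access patterns much better.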