I downloaded the CUDA 6.0 RC and tested the new unified memory by using cudaMallocManaged in my application. However, I found that the kernel is slowed down.
Using cudaMalloc followed by cudaMemcpy is faster (~0.56) than using cudaMallocManaged (~0.63). Is this expected?
One website claims that cudaMallocManaged is for "faster prototyping of CUDA kernels", so I was wondering which is the better option for an application in terms of performance?
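The two patterns I am comparing look roughly like this (simplified; the kernel, sizes, and error checking are placeholders for what my application actually does):

```
#include <cuda_runtime.h>
#include <cstdlib>

__global__ void myKernel(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

int main()
{
    const int N = 1 << 20;
    const size_t bytes = N * sizeof(float);

    // Pattern A: explicit allocation plus copies
    float *h_a = (float *)malloc(bytes);
    float *d_a;
    cudaMalloc((void **)&d_a, bytes);
    cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);
    myKernel<<<(N + 255) / 256, 256>>>(d_a, N);
    cudaMemcpy(h_a, d_a, bytes, cudaMemcpyDeviceToHost);

    // Pattern B: managed (unified) memory, no explicit copies
    float *m_a;
    cudaMallocManaged((void **)&m_a, bytes);
    myKernel<<<(N + 255) / 256, 256>>>(m_a, N);
    cudaDeviceSynchronize();   // required before the host touches m_a

    cudaFree(d_a);
    free(h_a);
    cudaFree(m_a);
    return 0;
}
```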
Thanks.
cudaMallocManaged() is not about speeding up your application (with a few exceptions or corner cases, some of which are suggested below).
Today's implementation of Unified Memory and cudaMallocManaged will not be faster than intelligently written code from a proficient CUDA programmer doing the same thing. The machine (the CUDA runtime) is not smarter than you are as a programmer. cudaMallocManaged does not magically make the PCIe bus or general machine architectural limitations disappear.
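As a rough sketch of what "intelligently written" can mean in practice (my own illustration, not part of the original point; the kernel and sizes are placeholders): pinned host memory, chunked asynchronous copies, and two streams let transfers overlap with kernel execution, which the current managed-memory implementation will not arrange for you.

```
#include <cuda_runtime.h>

__global__ void process(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

int main()
{
    const int N = 1 << 22;
    const int CHUNK = 1 << 20;
    const size_t chunkBytes = CHUNK * sizeof(float);

    // Pinned host buffer so cudaMemcpyAsync can actually run asynchronously
    float *h_data;
    cudaHostAlloc((void **)&h_data, N * sizeof(float), cudaHostAllocDefault);

    // Two device buffers and two streams so the copy of one chunk
    // can overlap with the kernel working on the previous chunk
    float *d_buf[2];
    cudaStream_t stream[2];
    for (int i = 0; i < 2; ++i) {
        cudaMalloc((void **)&d_buf[i], chunkBytes);
        cudaStreamCreate(&stream[i]);
    }

    for (int off = 0, s = 0; off < N; off += CHUNK, s ^= 1) {
        cudaMemcpyAsync(d_buf[s], h_data + off, chunkBytes,
                        cudaMemcpyHostToDevice, stream[s]);
        process<<<(CHUNK + 255) / 256, 256, 0, stream[s]>>>(d_buf[s], CHUNK);
        cudaMemcpyAsync(h_data + off, d_buf[s], chunkBytes,
                        cudaMemcpyDeviceToHost, stream[s]);
    }

    for (int i = 0; i < 2; ++i) {
        cudaStreamSynchronize(stream[i]);
        cudaStreamDestroy(stream[i]);
        cudaFree(d_buf[i]);
    }
    cudaFreeHost(h_data);
    return 0;
}
```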
Fast prototyping refers to the time it takes you to write the code, not the speed of the code.
cudaMallocManaged may be of interest to a proficient CUDA programmer in situations like the following:

- it could be faster than a naive or inefficient zero-copy approach (see the sketch below).

cudaMallocManaged may be of interest to a non-proficient CUDA programmer in that it allows you to get your feet wet with CUDA along a possibly simpler learning curve.
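For the zero-copy point above, here is a minimal sketch (again my own illustration; kernel and sizes are placeholders) contrasting a naive zero-copy mapping, where every one of the repeated accesses crosses the PCIe bus, with a managed allocation that is migrated into device memory and then reused from device DRAM:

```
#include <cuda_runtime.h>

__global__ void scale(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

int main()
{
    const int N = 1 << 20;
    const size_t bytes = N * sizeof(float);

    cudaSetDeviceFlags(cudaDeviceMapHost);   // enable mapped (zero-copy) host memory

    // Naive zero-copy: the kernel reads and writes host memory over PCIe
    // on every access, which hurts when the data is reused many times
    float *zc, *d_zc;
    cudaHostAlloc((void **)&zc, bytes, cudaHostAllocMapped);
    cudaHostGetDevicePointer((void **)&d_zc, zc, 0);
    for (int iter = 0; iter < 100; ++iter)
        scale<<<(N + 255) / 256, 256>>>(d_zc, N);
    cudaDeviceSynchronize();

    // Managed memory: the data is migrated to device memory, so the
    // repeated kernel accesses hit device DRAM instead of the bus
    float *um;
    cudaMallocManaged((void **)&um, bytes);
    for (int iter = 0; iter < 100; ++iter)
        scale<<<(N + 255) / 256, 256>>>(um, N);
    cudaDeviceSynchronize();

    cudaFreeHost(zc);
    cudaFree(um);
    return 0;
}
```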
Although Maxwell is mentioned in the comments, CUDA Unified Memory will gain major new features with the Pascal generation of GPUs, in some settings and for some GPUs. In particular, Unified Memory in those settings will no longer be limited to the available GPU device memory, and the memory handling granularity will drop to the page level even while a kernel is running. You can read more about it here.
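As a hedged sketch of the oversubscription aspect (my own, not from the linked material): on pre-Pascal GPUs a managed allocation larger than device memory should fail at allocation time, whereas Pascal-class demand paging can let it succeed and migrate pages on demand.

```
#include <cuda_runtime.h>
#include <cstdio>

int main()
{
    size_t freeMem = 0, totalMem = 0;
    cudaMemGetInfo(&freeMem, &totalMem);

    // Request 1.5x the GPU's physical memory as managed memory.
    // Pre-Pascal: managed allocations are limited to device memory, so expect failure.
    // Pascal and later (with demand paging): the allocation can succeed,
    // with pages migrated between host and device on demand.
    size_t oversub = totalMem + totalMem / 2;
    char *p = NULL;
    cudaError_t err = cudaMallocManaged((void **)&p, oversub);
    printf("Requested %zu bytes of managed memory: %s\n",
           oversub, cudaGetErrorString(err));
    if (err == cudaSuccess) cudaFree(p);
    return 0;
}
```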