I downloaded the CUDA 6.0 RC and tested the new unified memory by using cudaMallocManaged in my application. However, I found that the kernel is slowed down.
Using cudaMalloc followed by cudaMemcpy is faster (~0.56) than using cudaMallocManaged (~0.63). Is this expected?
One website claims that cudaMallocManaged is for "faster prototyping of CUDA kernels", so I was wondering which is the better option for an application in terms of performance?
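The two patterns I am comparing look roughly like this (simplified; the kernel, sizes, and error checking are placeholders for what my application actually does):

```
#include <cuda_runtime.h>
#include <cstdlib>

__global__ void myKernel(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

int main()
{
    const int N = 1 << 20;
    const size_t bytes = N * sizeof(float);

    // Pattern A: explicit allocation plus copies
    float *h_a = (float *)malloc(bytes);
    float *d_a;
    cudaMalloc((void **)&d_a, bytes);
    cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);
    myKernel<<<(N + 255) / 256, 256>>>(d_a, N);
    cudaMemcpy(h_a, d_a, bytes, cudaMemcpyDeviceToHost);

    // Pattern B: managed (unified) memory, no explicit copies
    float *m_a;
    cudaMallocManaged((void **)&m_a, bytes);
    myKernel<<<(N + 255) / 256, 256>>>(m_a, N);
    cudaDeviceSynchronize();   // required before the host touches m_a

    cudaFree(d_a);
    free(h_a);
    cudaFree(m_a);
    return 0;
}
```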
Thanks.
cudaMallocManaged() is not about speeding up your application (with a few exceptions or corner cases, some of which are suggested below).
Today's implementation of Unified Memory and cudaMallocManaged will not be faster than intelligently written code from a proficient CUDA programmer doing the same thing. The machine (the CUDA runtime) is not smarter than you are as a programmer. cudaMallocManaged does not magically make the PCIe bus or general machine architectural limitations disappear.
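As a rough sketch of what "intelligently written" can mean in practice (my own illustration, not part of the original point; the kernel and sizes are placeholders): pinned host memory, chunked asynchronous copies, and two streams let transfers overlap with kernel execution, which the current managed-memory implementation will not arrange for you.

```
#include <cuda_runtime.h>

__global__ void process(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

int main()
{
    const int N = 1 << 22;
    const int CHUNK = 1 << 20;
    const size_t chunkBytes = CHUNK * sizeof(float);

    // Pinned host buffer so cudaMemcpyAsync can actually run asynchronously
    float *h_data;
    cudaHostAlloc((void **)&h_data, N * sizeof(float), cudaHostAllocDefault);

    // Two device buffers and two streams so the copy of one chunk
    // can overlap with the kernel working on the previous chunk
    float *d_buf[2];
    cudaStream_t stream[2];
    for (int i = 0; i < 2; ++i) {
        cudaMalloc((void **)&d_buf[i], chunkBytes);
        cudaStreamCreate(&stream[i]);
    }

    for (int off = 0, s = 0; off < N; off += CHUNK, s ^= 1) {
        cudaMemcpyAsync(d_buf[s], h_data + off, chunkBytes,
                        cudaMemcpyHostToDevice, stream[s]);
        process<<<(CHUNK + 255) / 256, 256, 0, stream[s]>>>(d_buf[s], CHUNK);
        cudaMemcpyAsync(h_data + off, d_buf[s], chunkBytes,
                        cudaMemcpyDeviceToHost, stream[s]);
    }

    for (int i = 0; i < 2; ++i) {
        cudaStreamSynchronize(stream[i]);
        cudaStreamDestroy(stream[i]);
        cudaFree(d_buf[i]);
    }
    cudaFreeHost(h_data);
    return 0;
}
```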
Fast prototyping refers to the time it takes you to write the code, not the speed of the code.
cudaMallocManaged may be of interest to a proficient CUDA programmer in situations like the following:

- it could be faster than a naive or inefficient zero-copy approach (see the sketch below).

cudaMallocManaged may be of interest to a non-proficient CUDA programmer in that it allows you to get your feet wet with CUDA along a possibly simpler learning curve.
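For the zero-copy point above, here is a minimal sketch (again my own illustration; kernel and sizes are placeholders) contrasting a naive zero-copy mapping, where every one of the repeated accesses crosses the PCIe bus, with a managed allocation that is migrated into device memory and then reused from device DRAM:

```
#include <cuda_runtime.h>

__global__ void scale(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

int main()
{
    const int N = 1 << 20;
    const size_t bytes = N * sizeof(float);

    cudaSetDeviceFlags(cudaDeviceMapHost);   // enable mapped (zero-copy) host memory

    // Naive zero-copy: the kernel reads and writes host memory over PCIe
    // on every access, which hurts when the data is reused many times
    float *zc, *d_zc;
    cudaHostAlloc((void **)&zc, bytes, cudaHostAllocMapped);
    cudaHostGetDevicePointer((void **)&d_zc, zc, 0);
    for (int iter = 0; iter < 100; ++iter)
        scale<<<(N + 255) / 256, 256>>>(d_zc, N);
    cudaDeviceSynchronize();

    // Managed memory: the data is migrated to device memory, so the
    // repeated kernel accesses hit device DRAM instead of the bus
    float *um;
    cudaMallocManaged((void **)&um, bytes);
    for (int iter = 0; iter < 100; ++iter)
        scale<<<(N + 255) / 256, 256>>>(um, N);
    cudaDeviceSynchronize();

    cudaFreeHost(zc);
    cudaFree(um);
    return 0;
}
```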
Although Maxwell is mentioned in the comments, CUDA Unified Memory will gain major new features with the Pascal generation of GPUs, in some settings and for some GPUs. In particular, Unified Memory in those settings will no longer be limited to the available GPU device memory, and the memory handling granularity will drop to the page level even while a kernel is running. You can read more about it here.
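As a hedged sketch of the oversubscription aspect (my own, not from the linked material): on pre-Pascal GPUs a managed allocation larger than device memory should fail at allocation time, whereas Pascal-class demand paging can let it succeed and migrate pages on demand.

```
#include <cuda_runtime.h>
#include <cstdio>

int main()
{
    size_t freeMem = 0, totalMem = 0;
    cudaMemGetInfo(&freeMem, &totalMem);

    // Request 1.5x the GPU's physical memory as managed memory.
    // Pre-Pascal: managed allocations are limited to device memory, so expect failure.
    // Pascal and later (with demand paging): the allocation can succeed,
    // with pages migrated between host and device on demand.
    size_t oversub = totalMem + totalMem / 2;
    char *p = NULL;
    cudaError_t err = cudaMallocManaged((void **)&p, oversub);
    printf("Requested %zu bytes of managed memory: %s\n",
           oversub, cudaGetErrorString(err));
    if (err == cudaSuccess) cudaFree(p);
    return 0;
}
```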