cudaMemcpy too slow

Callahan picture Callahan · Sep 15, 2011 · Viewed 14.4k times · Source

I use cudaMemcpy() one time to copy exactly 1GB of data to the device. This takes 5.9s. The other way round it takes 5.1s. Is this normal?
Does the function itself have so much overhead before copying? Theoretical there should be a throughput of at least 4GB/s for the PCIe bus.
There are no memory transfers overlapping because the Tesla C870 just does not support it. Any hints?

EDIT 2: my test program + updated timings; I hope it is not too much to read!
The cutCreateTimer() functions wont compile for me: 'error: identifier "cutCreateTimer" is undefined' - this could be related to the old cuda version (2.0) installed on the machine

 __host__ void time_int(int print){
static struct timeval t1; /* var for previous time stamp */
static struct timeval t2; /* var of current time stamp */
double time;
if(gettimeofday(&t2, 0) == -1) return;
if(print != 0){
  time = (double) (t2.tv_sec - t1.tv_sec) + ((double) (t2.tv_usec - t1.tv_usec)) / 1000000.0;
  printf(...);
}
t1 = t2;
}

main:
time(0);
void *x;
cudaMallocHost(&x,1073741824);
void *y;
cudaMalloc(&y, 1073741824);
time(1);
cudaMemcpy(y,x,1073741824, cudaMemcpyHostToDevice);
time(1);
cudaMemcpy(x,y,1073741824, cudaMemcpyDeviceToHost);
time(1);

Displayed timings are:
0.86 s allocation
0.197 s first copy
5.02 s second copy
The weird thing is: Although it displays 0.197s for first copy it takes much longer if I watch the program run.

Answer

osgx picture osgx · Sep 15, 2011

Yes, This is normal. cudaMemcpy() does a lot of checks and works (if host memory was allocated by usual malloc() or mmap()). It should check that every page of data is in memory, and move the pages (one-by-one) to the driver.

You can use cudaHostAlloc function or cudaMallocHost for allocating memory instead of malloc. It will allocate pinned memory which is always stored in RAM and can be accessed by GPU's DMA directly (faster cudaMemcpy()). Citing from first link:

Allocates count bytes of host memory that is page-locked and accessible to the device. The driver tracks the virtual memory ranges allocated with this function and automatically accelerates calls to functions such as cudaMemcpy().

Only limiting factor is that total amount of pinned memory in system is limited (not more than RAM size; it is better to use not more than RAM - 1Gb):

Allocating excessive amounts of pinned memory may degrade system performance, since it reduces the amount of memory available to the system for paging. As a result, this function is best used sparingly to allocate staging areas for data exchange between host and device.