Is there a way in CUDA to allocate memory dynamically in device-side functions? I could not find any examples of doing this.
From the CUDA C Programming Guide:
B.15 Dynamic Global Memory Allocation
void* malloc(size_t size); void free(void* ptr);
allocate and free memory dynamically from a fixed-size heap in global memory.
The CUDA in-kernel malloc() function allocates at least size bytes from the device heap and returns a pointer to the allocated memory or NULL if insufficient memory exists to fulfill the request. The returned pointer is guaranteed to be aligned to a 16-byte boundary.
The CUDA in-kernel free() function deallocates the memory pointed to by ptr, which must have been returned by a previous call to malloc(). If ptr is NULL, the call to free() is ignored. Repeated calls to free() with the same ptr has undefined behavior.
The memory allocated by a given CUDA thread via malloc() remains allocated for the lifetime of the CUDA context, or until it is explicitly released by a call to free(). It can be used by any other CUDA threads even from subsequent kernel launches. Any CUDA thread may free memory allocated by another thread, but care should be taken to ensure that the same pointer is not freed more than once.
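To illustrate the lifetime rules quoted above, here is a minimal sketch (not from the manual) in which one kernel allocates, a later kernel writes to the allocation, and a third kernel reads it and frees it exactly once. The names g_buffer, allocKernel, fillKernel, sumAndFreeKernel and runPersistenceDemo are made up for this illustration, and it assumes a device with compute capability 2.0 or higher.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical device-global slot that carries the pointer between launches.
__device__ int* g_buffer = NULL;

__global__ void allocKernel(int n)
{
    // A single thread allocates from the device heap; the result may be NULL.
    if (blockIdx.x == 0 && threadIdx.x == 0) {
        g_buffer = (int*)malloc(n * sizeof(int));
    }
}

__global__ void fillKernel(int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    // Threads of a later launch may use memory allocated by another thread.
    if (g_buffer != NULL && i < n) {
        g_buffer[i] = i * i;
    }
}

__global__ void sumAndFreeKernel(int n, int* result)
{
    if (blockIdx.x == 0 && threadIdx.x == 0) {
        if (g_buffer == NULL) {   // the in-kernel malloc() failed earlier
            *result = -1;
            return;
        }
        int sum = 0;
        for (int i = 0; i < n; ++i) {
            sum += g_buffer[i];
        }
        *result = sum;
        free(g_buffer);           // freed exactly once, by one thread
        g_buffer = NULL;
    }
}

// Returns the sum of i*i for i in [0, n), -1 if the device-side malloc()
// returned NULL, or -2 on a CUDA runtime error.
int runPersistenceDemo(int n)
{
    int* d_result = NULL;
    if (cudaMalloc(&d_result, sizeof(int)) != cudaSuccess) return -2;

    // Launches on the default stream run in order, so no explicit
    // synchronization is needed between the three kernels.
    allocKernel<<<1, 1>>>(n);
    fillKernel<<<(n + 255) / 256, 256>>>(n);
    sumAndFreeKernel<<<1, 1>>>(n, d_result);

    int h_result = -2;
    cudaError_t err = cudaMemcpy(&h_result, d_result, sizeof(int),
                                 cudaMemcpyDeviceToHost);
    cudaFree(d_result);
    return err == cudaSuccess ? h_result : -2;
}

int main()
{
    printf("result = %d\n", runPersistenceDemo(8));
    return 0;
}
```

Note that the pointer returned by the in-kernel malloc() is only handed around on the device here; it is not passed to cudaFree() or cudaMemcpy() on the host.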
According to http://developer.download.nvidia.com/compute/cuda/3_2_prod/toolkit/docs/CUDA_C_Programming_Guide.pdf you should be able to use malloc() and free() in a device function.
Page 122
B.15 Dynamic Global Memory Allocation
void* malloc(size_t size); void free(void* ptr);
allocate and free memory dynamically from a fixed-size heap in global memory.
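The example given in the manual, with the deprecated cudaThread* calls replaced by their cudaDevice* equivalents:
#include <stdio.h>
#include <stdlib.h>

__global__ void mallocTest()
{
    // Each thread allocates 123 bytes from the device heap.
    char* ptr = (char*)malloc(123);
    printf("Thread %d got pointer: %p\n", threadIdx.x, ptr);
    free(ptr);
}

int main()
{
    // Set a heap size of 128 megabytes. Note that this must
    // be done before any kernel is launched.
    // (The CUDA 3.2 manual uses cudaThreadSetLimit and cudaThreadSynchronize,
    // which were deprecated in CUDA 4.0 in favor of the calls below.)
    cudaDeviceSetLimit(cudaLimitMallocHeapSize, 128*1024*1024);
    mallocTest<<<1, 5>>>();
    cudaDeviceSynchronize();
    return 0;
}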
You need the compiler parameter -arch=sm_20 (or a higher architecture) and a card with compute capability 2.0 or higher.
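If you are unsure whether your card qualifies, something like the following host-side sketch can check it at run time and read back the heap size the driver actually applied. The helper names deviceSupportsInKernelMalloc and setAndReadHeapLimit are made up for this example, and it assumes device 0 is the one you intend to use.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Returns 1 if device 0 reports compute capability 2.0 or higher,
// 0 if it does not, and -1 if the properties could not be queried.
int deviceSupportsInKernelMalloc()
{
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess) return -1;
    printf("Device 0: %s, compute capability %d.%d\n",
           prop.name, prop.major, prop.minor);
    return prop.major >= 2 ? 1 : 0;
}

// Requests a device heap size and returns the value the runtime reports
// afterwards, or 0 on error. The runtime may adjust the requested value,
// so the result is not guaranteed to equal the request.
size_t setAndReadHeapLimit(size_t requestedBytes)
{
    if (cudaDeviceSetLimit(cudaLimitMallocHeapSize, requestedBytes)
            != cudaSuccess) {
        return 0;
    }
    size_t actualBytes = 0;
    if (cudaDeviceGetLimit(&actualBytes, cudaLimitMallocHeapSize)
            != cudaSuccess) {
        return 0;
    }
    return actualBytes;
}

int main()
{
    if (deviceSupportsInKernelMalloc() != 1) {
        printf("In-kernel malloc() is not available on this device.\n");
        return 1;
    }
    // The limit has to be set before the first kernel that uses malloc().
    size_t heapBytes = setAndReadHeapLimit(16 * 1024 * 1024);
    printf("Device malloc heap size: %zu bytes\n", heapBytes);
    return 0;
}
```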