In linux systems, pthreads library provides us a function (posix_memalign) for cache alignment to prevent false sharing. And to choose a specific NUMA node of the arhitecture we can use libnuma library. What I want is something needing both two. I bind certain threads to some certain processors and I want allocate local data structures for each thread from the corresponding NUMA node in order to decrease delay in memory operations for the threads. How can I do this?
The numa_alloc_*() functions in libnuma allocate whole pages of memory, typically 4096 bytes. Cache lines are typically 64 bytes. Since 4096 is a multiple of 64, anything that comes back from numa_alloc_*() will already be memaligned at the cache level.
Beware the numa_alloc_*() functions however. It says on the man page that they are slower than a corresponding malloc(), which I'm sure is true, but the much bigger problem I've found is that simultaneous allocations from numa_alloc_*() running on lots of cores at once suffer massive contention problems. In my case replacing malloc() with numa_alloc_onnode() was a wash (everything I gained by using local memory was offset by increased allocation/free time); tcmalloc was faster than either. I was performing thousands of 12-16kb mallocs on 32 threads/cores at once. Timing experiments showed that it wasn't numa_alloc_onnode()'s single-threaded speed that was responsible for the large amount of time my process spent performing the allocations, which left lock/contention issues as the likely cause. The solution I've adopted is to numa_alloc_onnode() large chunks of memory once, and then distribute it to the threads running on each node as needed. I use gcc atomic builtins to allow each thread (I pin threads to cpus) to grab from the memory allocated on each node. You can cache-line-size align the distributions as they are made, if you want : I do. This approach beats the pants off of even tcmalloc (which is thread aware but not NUMA aware - at least the Debain Squeeze version doesn't seem to be). The downside of this approach is that you can't free the individual distributions (well, not without a lot more work, anyway), you can only free the whole underlying on-node allocations. However, if this is temporary on-node scratch space for a function call, or you can otherwise specify exactly when that memory is no longer needed, then this approach works very well. It helps if you can predict how much memory you need to allocate on each node too, obviously.
@nandu : I wont post the full source - it's long and in places tied to something else I do which makes it less than perfectly transparent. What I will post is a slightly shortened version of my new malloc() function to illustrate the core idea :
void *my_malloc(struct node_memory *nm,int node,long size)
{
long off,obytes;
// round up size to the nearest cache line size
// (optional, though some rounding is essential to avoid misalignment problems)
if ((obytes = (size % CACHE_LINE_SIZE)) > 0)
size += CACHE_LINE_SIZE - obytes;
// atomically increase the offset for the requested node by size
if (((off = __sync_fetch_and_add(&(nm->off[node]),size)) + size) > nm->bytes) {
fprintf(stderr,"Out of allocated memory on node %d\n",node);
return(NULL);
}
else
return((void *) (nm->ptr[node] + off));
}
where struct node_memory is
struct node_memory {
long bytes; // the number of bytes of memory allocated on each node
char **ptr; // ptr array of ptrs to the base of the memory on each node
long *off; // array of offsets from those bases (in bytes)
int nptrs; // the size of the ptr[] and off[] arrays
};
and nm->ptr[node] is set up using the libnuma function numa_alloc_onnode().
I usually store allowable node information in the structure too, so my_malloc() can check node requests are sensible without making function calls; I also check nm exists and that size is sensible. The function __sync_fetch_and_add() is a gcc builtin atomic function; if you're not compiling with gcc you'll need something else. I use atomics because in my limited experience they are much faster than mutexes in high thread/core count conditions (as on 4P NUMA machines).