I have observed that when Linux futexes are contended, the system spends A LOT of time in spinlocks. This is a problem not only when futexes are used directly, but also when calling malloc/free, rand, glib mutex functions, and other system/library calls that use futex internally. Is there ANY way of getting rid of this behavior?
I am using CentOS 6.3 with kernel 2.6.32-279.9.1.el6.x86_64. I also tried the latest stable kernel 3.6.6 downloaded directly from kernel.org.
Originally, the problem occurred on a 24-core server with 16 GB of RAM, running a process with 700 threads. The data collected with "perf record" shows that the spinlock is entered from the futex code, which is in turn reached via __lll_lock_wait_private and __lll_unlock_wake_private, and that it is eating up 50% of the CPU time. When I stopped the process with gdb, the backtraces showed that the calls to __lll_lock_wait_private and __lll_unlock_wake_private come from malloc and free.
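(Commands along the lines of "perf record -g -p <pid>" followed by "perf report", and attaching gdb and running "thread apply all bt", are enough to see this; the exact invocations are not important.)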
To reduce the problem to a minimal case, I wrote a simple program that shows it is indeed the futexes that cause the spinlock contention.
Start 8 threads, with each thread doing the following:
//...
static GMutex *lMethodMutex = g_mutex_new ();
while (true)
{
  static guint64 i = 0;
  g_mutex_lock (lMethodMutex);
  // Perform any operation in the user space that needs to be protected.
  // The operation itself is not important. It's the taking and releasing
  // of the mutex that matters.
  ++i;
  g_mutex_unlock (lMethodMutex);
}
//...
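In case it helps, the complete test program looks roughly like this (a sketch assuming the pre-2.32 GLib threading API that ships with CentOS 6; the scaffolding around the loop, such as the thread count macro and the worker function name, is just illustrative):

// Sketch of a complete reproducer using the pre-2.32 GLib threading API.
// Build with: g++ futex_test.cc $(pkg-config --cflags --libs gthread-2.0)
#include <glib.h>

#define N_THREADS 8

static GMutex *lMethodMutex;

static gpointer worker (gpointer data)
{
  static guint64 i = 0;

  while (true)
  {
    g_mutex_lock (lMethodMutex);
    // The protected operation itself does not matter;
    // only the lock/unlock traffic does.
    ++i;
    g_mutex_unlock (lMethodMutex);
  }
  return NULL;
}

int main ()
{
  GThread *threads[N_THREADS];

  g_thread_init (NULL);                  // required before GLib 2.32
  lMethodMutex = g_mutex_new ();

  for (int t = 0; t < N_THREADS; ++t)
    threads[t] = g_thread_create (worker, NULL, TRUE, NULL);
  for (int t = 0; t < N_THREADS; ++t)
    g_thread_join (threads[t]);          // never returns; the threads spin forever
  return 0;
}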
I am running this on an 8-core machine, with plenty of RAM.
Using "top", I observed that the machine is 10% idle, 10% in the user mode, and 90% in the system mode.
Using "perf top", I observed the following:
50.73% [kernel] [k] _spin_lock
11.13% [kernel] [k] hpet_msi_next_event
2.98% libpthread-2.12.so [.] pthread_mutex_lock
2.90% libpthread-2.12.so [.] pthread_mutex_unlock
1.94% libpthread-2.12.so [.] __lll_lock_wait
1.59% [kernel] [k] futex_wake
1.43% [kernel] [k] __audit_syscall_exit
1.38% [kernel] [k] copy_user_generic_string
1.35% [kernel] [k] system_call
1.07% [kernel] [k] schedule
0.99% [kernel] [k] hash_futex
I would expect this code to spend some time in the spinlock, since the futex code has to acquire the lock on the futex wait queue. I would also expect the code to spend some time in the kernel, since this snippet runs very little code in user space. However, 50% of the CPU time spent in the spinlock seems excessive, especially when that CPU time is needed for other useful work.
I've run into similar issues as well. My experience is that you may see a performance hit, or even deadlocks, when locking and unlocking a lot, depending on the libc version and a lot of other obscure things (e.g. calls to fork(), like here).
This guy solved his performance problems by switching to tcmalloc, which may be a good idea anyway depending on the use case. It could be worth a try for you as well.
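Trying it is cheap: besides linking with -ltcmalloc, you can usually just preload the library without recompiling, something like LD_PRELOAD=/usr/lib64/libtcmalloc_minimal.so.4 ./your_program (the exact path and soname depend on how your distro packages gperftools).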
In my case, I saw a reproducible deadlock when multiple threads did a lot of locking and unlocking. I was using a Debian 5.0 rootfs (embedded system) with a libc from 2010, and the issue was fixed by upgrading to Debian 6.0.