High system CPU usage when contending futex

Alex Fiddler · Nov 8, 2012 · Viewed 12.8k times

I have observed that when Linux futexes are contended, the system spends A LOT of time in spinlocks. I noticed this to be a problem not only when futexes are used directly, but also when calling malloc/free, rand, glib mutex calls, and other system/library calls that use futex internally. Is there ANY way of getting rid of this behavior?

I am using CentOS 6.3 with kernel 2.6.32-279.9.1.el6.x86_64. I also tried the latest stable kernel 3.6.6 downloaded directly from kernel.org.

Originally, the problem occurred on a 24-core server with 16GB RAM; the process has 700 threads. The data collected with "perf record" shows that the spinlock is reached from the futex code invoked by __lll_lock_wait_private and __lll_unlock_wake_private, and is eating up 50% of the CPU time. When I stopped the process with gdb, the backtraces showed that the calls to __lll_lock_wait_private and __lll_unlock_wake_private are made from malloc and free.

To narrow the problem down, I wrote a simple program that shows it is indeed the futexes that are causing the spinlock problem.

Start 8 threads, with each thread doing the following:

   //...
   // One mutex shared by all 8 threads; the contention is on this single lock.
   // (g_mutex_new() is the pre-GLib-2.32 API; the mutex is created only once.)
   static GMutex *lMethodMutex = g_mutex_new ();
   while (true)
   {
      static guint64 i = 0;
      g_mutex_lock (lMethodMutex);
      // Perform any operation in user space that needs to be protected.
      // The operation itself is not important.  It's the taking and releasing
      // of the mutex that matters.
      ++i;
      g_mutex_unlock (lMethodMutex);
   }
   //...
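
For reference, a complete, self-contained reproducer along these lines might look like the sketch below (it assumes the pre-GLib-2.32 threading API available on CentOS 6, i.e. g_thread_init()/g_thread_create(), and builds against the glib-2.0 and gthread-2.0 pkg-config modules; the worker function and variable names are illustrative):

   #include <glib.h>

   #define NUM_THREADS 8

   static GMutex  *lMethodMutex = NULL;   // shared by all threads
   static guint64  i            = 0;      // the "work" being protected

   static gpointer worker (gpointer data)
   {
      (void) data;
      while (TRUE)
      {
         g_mutex_lock (lMethodMutex);
         ++i;                             // any protected user-space operation
         g_mutex_unlock (lMethodMutex);
      }
      return NULL;
   }

   int main (void)
   {
      GThread *lThreads[NUM_THREADS];
      int      t;

      g_thread_init (NULL);               // required before using GMutex in GLib < 2.32
      lMethodMutex = g_mutex_new ();

      for (t = 0; t < NUM_THREADS; ++t)
         lThreads[t] = g_thread_create (worker, NULL, TRUE, NULL);
      for (t = 0; t < NUM_THREADS; ++t)
         g_thread_join (lThreads[t]);     // never returns; the workers loop forever
      return 0;
   }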

I am running this on an 8-core machine, with plenty of RAM.

Using "top", I observed that the machine is 10% idle, 10% in the user mode, and 90% in the system mode.

Using "perf top", I observed the following:

 50.73%  [kernel]                [k] _spin_lock
 11.13%  [kernel]                [k] hpet_msi_next_event
  2.98%  libpthread-2.12.so      [.] pthread_mutex_lock
  2.90%  libpthread-2.12.so      [.] pthread_mutex_unlock
  1.94%  libpthread-2.12.so      [.] __lll_lock_wait
  1.59%  [kernel]                [k] futex_wake
  1.43%  [kernel]                [k] __audit_syscall_exit
  1.38%  [kernel]                [k] copy_user_generic_string
  1.35%  [kernel]                [k] system_call
  1.07%  [kernel]                [k] schedule
  0.99%  [kernel]                [k] hash_futex

I would expect this code to spend some time in the spinlock, since the futex code has to acquire the spinlock that protects the futex wait queue. I would also expect the code to spend some time in the kernel, since in this snippet there is very little code running in user space. However, 50% of the CPU time spent in the spinlock seems excessive, especially when that CPU time is needed to do other useful work.
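
As a sanity check that it is the contention itself, rather than the mutex calls as such, that drives the kernel time: an uncontended futex-based mutex is acquired and released entirely in user space, without any futex() syscall. A variant of the worker where each thread locks only its own private mutex (a sketch; the names are illustrative and it assumes the same GLib setup as above) should therefore show almost no system time:

   // Comparison variant: every thread locks only its own mutex, so the
   // underlying futex is never contended and lock/unlock never enter the kernel.
   static gpointer uncontended_worker (gpointer data)
   {
      GMutex  *lPrivateMutex = g_mutex_new ();   // per-thread, never shared
      guint64  i             = 0;

      (void) data;
      while (TRUE)
      {
         g_mutex_lock (lPrivateMutex);
         ++i;
         g_mutex_unlock (lPrivateMutex);
      }
      return NULL;
   }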

Answer

Antti · Nov 8, 2012

I've run into similar issues as well. My experience is that you may see a performance hit or even deadlocks when locking and unlocking a lot, depending on the libc version and a lot of other obscure things (e.g. calls to fork() like here).

This guy solved his performance problems by switching to tcmalloc, which may be a good idea anyway depending on the use case. It could be worth a try for you as well.
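
Trying tcmalloc usually needs no code changes, since it is designed as a drop-in malloc replacement: either link the program with -ltcmalloc or preload the library with LD_PRELOAD, then compare the "perf top" output with and without it.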

In my case, I saw a reproducible deadlock when multiple threads were doing lots of locking and unlocking. I was using a Debian 5.0 rootfs (embedded system) with a libc from 2010, and the issue was fixed by upgrading to Debian 6.0.