I have a Python daemon running in production. Depending on the instance, it runs between 7 and 120 threads. Recently the smallest instance (7 threads) started to hang, while none of the other instances has ever shown this kind of problem. Attaching strace to the Python process shows that all threads are blocked in futex with FUTEX_WAIT_PRIVATE, so they are probably all waiting to acquire a lock.
How would you debug such a problem?
Note that this is a production system running from flash memory, so disk writes are constrained, too.
The observation was slightly incorrect. One thread wasn't calling futex; it was swapping while holding the GIL. Since the machine in question is low-end hardware, the swapping took a very long time and looked like a deadlock. The underlying problem is a memory leak. :-(
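
For anyone hitting something similar, here is a minimal sketch of how both symptoms could be inspected from inside the process without writing to disk. It assumes Python 3.4 or later on a Unix system, which may not match the daemon described above. The function names `install_stack_dump` and `top_allocation_growth` are made up for illustration; only `faulthandler` and `tracemalloc` come from the standard library.

```python
import faulthandler
import signal
import sys
import tracemalloc


def install_stack_dump(signum=signal.SIGUSR1, stream=sys.stderr):
    """Dump the traceback of every thread to `stream` when `signum` arrives.

    `stream` must be backed by a real file descriptor (stderr, a pipe, ...),
    so nothing has to be written to flash. Not available on Windows.
    """
    faulthandler.register(signum, file=stream, all_threads=True)


def top_allocation_growth(before, after, limit=5):
    """Return the source lines whose traced memory changed the most.

    `before` and `after` are tracemalloc snapshots; the result is a list of
    StatisticDiff objects, largest absolute size change first.
    """
    stats = after.compare_to(before, "lineno")
    return stats[:limit]
```

Calling `install_stack_dump()` once at startup and later sending `kill -USR1 <pid>` should print where each thread currently is, which would have shown the one thread that was not waiting on a lock. For the leak, call `tracemalloc.start()` early, take one snapshot with `tracemalloc.take_snapshot()`, take another some time later, and pass both to `top_allocation_growth`. Note that `tracemalloc` itself costs memory and CPU, so on a small machine it is probably better enabled only temporarily.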