My system is a CentOS 6.3
(running Kernel version 2.6.32-279.el6.x86_64
).
I have a loadable kernel module which is a driver that manages a PCIe card.
If I manually insert the driver using insmod
while the OS is up and running, the driver loads successfully and is operational.
However, if I try to install the driver using rpm and then reboot the system, during startup the OS gets stuck spitting out the following "soft lockup" message for ALL the CPU cores, except for one core that is in "soft lockup" in one of the threads created by my driver.
BUG: soft lockup - CPU#X stuck for 67s! [migration/8:36]
.......(same above message for all cores except one)
BUG: soft lockup - CPU#10 stuck for 67s! [mydriver_thread/8:36]
(one core is locked up in one of the threads in my driver).
I searched the net quite a bit for info on this kernel msg / bug, and there are quite a bit of posts about it, none on what causes it or how to debug. Any help with the following questions would really be appreciated:
I am not able to log into the system, I think it's because all the cores are in a "soft lockup" state, and hence cannot trigger a kernel dump from shell prompt. I enabled SysRq, and tried to trigger a kernel dump with SysRq key combo, but no luck. It seems the system is not responding to keyboard (not even responding to CapsLock button). Any suggestions on how I can trigger a kernel dump in this circumstance?
I can imagine the possibly of my driver thread causing "soft lockup". But how can the "migration" thread (a kernel thread) be in a "soft lockup" just because of my driver?
From browsing the net, the "migration" thread is used to move tasks from one cpu to another. Can someone please help me understand what this thread exact does? And how it can be affected by other threads, if at all.
I had a very similar problem on my desktop. It would soft lockup very frequently - about once a day or so.
It turns out it was because I was running on an Intel Haswell. It seems that the Haswell/Broadwell series of Intel processors have a bug which can cause system instability. This bug was fixed in a microcode update.
Check if CentOS offers an intel-microcode package, and install it. Make sure you configure grub to load it as the initial ramdisk before it loads initramfs.
Personally, I upgraded my microcode by booting into Windows and running a BIOS Update. You can check if the micrcode was actually updated by comparing the output of grep 'microcode' /proc/cpuinfo
before and after the update.