On two occasions, the server went down when I finished training a model on 4x 1080 Ti GPUs. Why did the server crash?
I checked the syslog and found errors related to the NVIDIA driver or the GPUs.
Syslogs (and the nvidia-bug-report.log):
[the second crash]
Sep 6 21:11:41 gpu-8-server-intesight kernel: [31429.221258] NVRM: RmInitAdapter failed! (0x30:0xffff:682)
Sep 6 21:11:41 gpu-8-server-intesight kernel: [31429.221337] NVRM: rm_init_adapter failed for device bearing minor number 0
Sep 6 21:13:54 gpu-8-server-intesight kernel: [31562.154256] NVRM: RmInitAdapter failed! (0x30:0xffff:682)
Sep 6 21:13:54 gpu-8-server-intesight kernel: [31562.154306] NVRM: rm_init_adapter failed for device bearing minor number 1
[the first crash]
Sep 6 02:48:40 gpu-8-server-intesight kernel: [557998.990374] NVRM: GPU at PCI:0000:04:00: GPU-bc54db68-a3cb-54e9-7287-b95c69e41cf1
Sep 6 02:48:40 gpu-8-server-intesight kernel: [557998.990375] NVRM: GPU Board Serial Number:
Sep 6 02:48:40 gpu-8-server-intesight kernel: [557998.990376] NVRM: Xid (PCI:0000:04:00): 79, GPU has fallen off the bus.
Sep 6 02:48:40 gpu-8-server-intesight kernel: [557998.990377] NVRM: GPU at 0000:04:00.0 has fallen off the bus.
Sep 6 02:48:40 gpu-8-server-intesight kernel: [557998.990377] NVRM: GPU is on Board .
Sep 6 02:48:40 gpu-8-server-intesight kernel: [557998.990655] NVRM: A GPU crash dump has been created. If possible, please run
Sep 6 02:48:40 gpu-8-server-intesight kernel: [557998.990655] NVRM: nvidia-bug-report.sh as root to collect this data before
Sep 6 02:48:40 gpu-8-server-intesight kernel: [557998.990655] NVRM: the NVIDIA kernel module is unloaded.
Sep 6 02:48:41 gpu-8-server-intesight kernel: [557999.884383] NVRM: GPU at 0000:04:00.0 has fallen off the bus.
Sep 6 02:48:41 gpu-8-server-intesight kernel: [557999.901942] NVRM: A GPU crash dump has been created. If possible, please run
Sep 6 02:48:41 gpu-8-server-intesight kernel: [557999.901942] NVRM: nvidia-bug-report.sh as root to collect this data before
Sep 6 02:48:41 gpu-8-server-intesight kernel: [557999.901942] NVRM: the NVIDIA kernel module is unloaded.
Sep 6 02:48:41 gpu-8-server-intesight kernel: [558000.356948] NVRM: RmInitAdapter failed! (0x30:0xffff:682)
Sep 6 02:48:41 gpu-8-server-intesight kernel: [558000.444379] NVRM: rm_init_adapter failed for device bearing minor number 0
Sep 6 02:48:45 gpu-8-server-intesight kernel: [558004.604173] NVRM: request_irq() failed (-22)
Sep 6 02:48:48 gpu-8-server-intesight kernel: [558007.497475] NVRM: RmInitAdapter failed! (0x23:0x56:468)
Sep 6 02:48:48 gpu-8-server-intesight kernel: [558007.497489] NVRM: rm_init_adapter failed for device bearing minor number 0
Sep 6 02:48:50 gpu-8-server-intesight kernel: [558008.878985] NVRM: request_irq() failed (-22)
Sep 6 02:48:53 gpu-8-server-intesight kernel: [558011.735642] NVRM: RmInitAdapter failed! (0x23:0x56:468)
Sep 6 02:48:53 gpu-8-server-intesight kernel: [558011.735658] NVRM: rm_init_adapter failed for device bearing minor number 0
Sep 6 02:48:54 gpu-8-server-intesight kernel: [558013.108772] NVRM: request_irq() failed (-22)
Sep 6 02:48:55 gpu-8-server-intesight kernel: [558013.757168] BUG: unable to handle kernel paging request at 0000000132081000
Sep 6 02:48:55 gpu-8-server-intesight kernel: [558013.757173] IP: [] kmem_cache_alloc+0x77/0x1f0
Sep 6 02:48:55 gpu-8-server-intesight kernel: [558013.757175] PGD 10357d8067 PUD 0
We have had this issue. From what I can tell, you have a very similar setup, with multiple GPUs and an X99 motherboard. We managed to mitigate the error by setting pcie_aspm=off in the kernel boot parameters. If you search for "aspm" in the nvidia bug report logs that you provided, you will notice the following:
[ 0.167842] ACPI FADT declares the system doesn't support PCIe ASPM, so disable it
[ 0.278085] acpi PNP0A03:03: FADT indicates ASPM is unsupported, using BIOS configuration
[ 0.282583] acpi PNP0A08:00: FADT indicates ASPM is unsupported, using BIOS configuration
[ 2.795337] r8169 0000:0a:00.0: can't disable ASPM; OS doesn't have ASPM control
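For anyone who wants to repeat that search on their own logs, here is a minimal Python sketch of the same case-insensitive "aspm" filter. The function name find_aspm_lines and the sample list are my own, made up for illustration; the sample lines are copied from the excerpts in this thread.

```python
import re

# Case-insensitive match, so "ASPM", "aspm" and "pcie_aspm" are all caught.
ASPM_PATTERN = re.compile(r"aspm", re.IGNORECASE)


def find_aspm_lines(log_lines):
    """Return the log lines that mention ASPM, without trailing whitespace."""
    return [line.rstrip() for line in log_lines if ASPM_PATTERN.search(line)]


# Sample lines copied from the log excerpts in this thread.
sample = [
    "[ 0.167842] ACPI FADT declares the system doesn't support PCIe ASPM, so disable it",
    "[ 2.795337] r8169 0000:0a:00.0: can't disable ASPM; OS doesn't have ASPM control",
    "[31429.221258] NVRM: RmInitAdapter failed! (0x30:0xffff:682)",
]

for line in find_aspm_lines(sample):
    print(line)
```

To run it against a real file, pass an open file object instead of the sample list, for example find_aspm_lines(open("nvidia-bug-report.log", errors="replace")). The file name here is an assumption; depending on how the report was generated, it may be gzip-compressed and need unpacking first.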
We still have some issues at the moment with our GPU server, but it's likely that this will help.
I originally found this idea on this thread
UPDATE: We still get the occasional RmInitAdapter message, but we no longer have any stability issues. For the record, we're now running NVIDIA's 387.34 driver with the following boot parameters:
pcie_aspm=off rcutree.rcu_idle_gp_delay=1
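After a reboot it is worth confirming that the parameters actually reached the kernel. The following is a rough sketch that reads /proc/cmdline (the standard Linux file holding the running kernel's command line) and reports which of the two parameters are missing. The helper names missing_boot_params and read_cmdline are hypothetical, chosen only for this example.

```python
# Parameters from the update above.
WANTED = ["pcie_aspm=off", "rcutree.rcu_idle_gp_delay=1"]


def missing_boot_params(cmdline, wanted):
    """Return the entries of `wanted` that are absent from a kernel command line."""
    present = set(cmdline.split())
    return [param for param in wanted if param not in present]


def read_cmdline(path="/proc/cmdline"):
    """Read the command line of the running kernel (Linux only)."""
    with open(path) as f:
        return f.read()


try:
    missing = missing_boot_params(read_cmdline(), WANTED)
    if missing:
        print("Missing boot parameters:", " ".join(missing))
    else:
        print("All expected boot parameters are set.")
except OSError:
    # /proc/cmdline does not exist on non-Linux systems.
    print("Could not read /proc/cmdline on this system.")
```

How the parameters are set depends on the bootloader. On Debian/Ubuntu-style systems with GRUB, they usually go into GRUB_CMDLINE_LINUX_DEFAULT in /etc/default/grub, followed by running update-grub and rebooting; treat that as an assumption about your distribution, not a description of our exact setup.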
As a side note, we also have a newer quad-GPU box based on an X299 motherboard, and we see similar issues on it.