Error in SLURM cluster - Detected 1 oom-kill event(s): how to improve running jobs

CafféSospeso · Sep 20, 2018 · Viewed 11.8k times

I'm working on a SLURM cluster, running several processes at the same time (on several input files), all using the same bash script.

At the end of the job, the process was killed and this is the error I obtained.

slurmstepd: error: Detected 1 oom-kill event(s) in step 1090990.batch cgroup.

My guess is that there is some issue with memory. But how can I find out more? Did I not request enough memory, or, as a user, was I requesting more than I have access to?

Any suggestion?

Answer

Kyle · Dec 30, 2018

Here OOM stands for "Out of Memory". When Linux runs low on memory, it will "oom-kill" a process to keep critical processes running. It looks like slurmstepd detected that your process was oom-killed. Oracle has a nice explanation of this mechanism.
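To find out how much memory your job actually used compared to what it requested, you can query Slurm's accounting database with `sacct`. A minimal sketch (the job ID is taken from your error message; the exact fields available depend on how accounting is configured on your cluster):

```shell
# Compare requested memory (ReqMem) with peak usage (MaxRSS)
# for job 1090990 and its steps.
sacct -j 1090990 --format=JobID,JobName,State,ReqMem,MaxRSS,Elapsed
```

If `MaxRSS` is at or near `ReqMem`, the job hit its memory limit and was oom-killed. On clusters that install the `seff` utility, `seff 1090990` gives a similar efficiency summary.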

If you had requested more memory than you were allowed, the job would never have been scheduled onto a node and computation would not have started. Since your job ran and was then killed, it looks like you simply need to request more memory.
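You can raise the request with the `--mem` (memory per node) or `--mem-per-cpu` option in your submission script. A minimal sketch, where the job name, time limit, and the `process_input.sh` script are hypothetical placeholders for your own:

```shell
#!/bin/bash
#SBATCH --job-name=myjob        # hypothetical job name
#SBATCH --mem=8G                # request 8 GB of memory per node
#SBATCH --time=01:00:00         # adjust to your job's runtime

# Hypothetical processing script and input file
./process_input.sh input1.txt
```

Start from the `MaxRSS` value reported by `sacct` and add some headroom; your cluster's per-user or per-partition limits still cap what you can request.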