I'm working on a SLURM cluster, where I was running several processes at the same time (on several input files) using the same bash script.
At the end of the job, the process was killed and this is the error I obtained.
slurmstepd: error: Detected 1 oom-kill event(s) in step 1090990.batch cgroup.
My guess is that there is some issue with memory, but how can I find out more? Did I not request enough memory, or, as a user, was I requesting more than I have access to?
Any suggestions?
Here, OOM stands for "Out of Memory". When Linux runs low on memory, it "oom-kills" a process to keep critical processes running. It looks like slurmstepd detected that your process was oom-killed. Oracle has a nice explanation of this mechanism.
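You can check how much memory the job actually used versus what it requested through SLURM's accounting tools (assuming job accounting is enabled on your cluster; the job ID 1090990 below is taken from your error message, so substitute your own):

    # Show requested vs. peak memory for the job and its steps
    sacct -j 1090990 --format=JobID,State,ReqMem,MaxRSS,Elapsed

    # If the seff utility is installed on your cluster, it prints a short
    # memory/CPU efficiency summary for a finished job
    seff 1090990

If MaxRSS is at or close to ReqMem (MaxRSS is sampled, so it can slightly underreport the true peak), the job almost certainly ran out of its memory allocation.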
If you had requested more memory than you are allowed, the job would never have been allocated to a node and the computation would not have started, so this is not a permissions problem. It looks like your job simply needs a larger memory request.
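As a rough sketch (the job name, script name, and memory values below are placeholders, not your actual setup), you can raise the request in your batch script with one of SLURM's memory directives:

    #!/bin/bash
    #SBATCH --job-name=myjob          # placeholder name
    #SBATCH --mem=16G                 # total memory per node; increase until the job no longer gets oom-killed
    ##SBATCH --mem-per-cpu=4G         # alternative: memory per allocated CPU (use this OR --mem, not both)

    ./my_analysis.sh input_file       # placeholder for your existing bash script

Note that --mem and --mem-per-cpu are mutually exclusive, and many sites cap how much you can request per node, so check your cluster's documentation or the output of scontrol show partition for the limits.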