How can I get detailed job run info from SLURM (e.g. like that produced for "standard output" by LSF)?

Christopher Bottoms · Apr 28, 2015

When using bsub with LSF, the -o option wrote a lot of detail to the output file, such as when the job started and ended and how much memory and CPU time it used. With SLURM, all I get is the same standard output that I'd get from running the script without a scheduler at all.

For example, given this Perl 6 script:

warn  "standard error stream";
say  "standard output stream";

Submitted thus:

sbatch -o test.o%j -e test.e%j -J test_warn --wrap 'perl6 test.p6'

Resulted in the file test.o34380:

Testing standard output

and the file test.e34380:

Testing standard Error  in block <unit> at test.p6:2


With LSF, I'd get all kinds of details in the standard output file, something like:

Sender: LSF System <lsfadmin@my_node>
Subject: Job 347511: <test> Done

Job <test> was submitted from host <my_cluster> by user <username> in cluster <my_cluster_act>.
Job was executed on host(s) <my_node>, in queue <normal>, as user <username> in cluster <my_cluster_act>.
</home/username> was used as the home directory.
</path/to/working/directory> was used as the working directory.
Started at Mon Mar 16 13:10:23 2015
Results reported at Mon Mar 16 13:10:29 2015

Your job looked like:

------------------------------------------------------------
# LSBATCH: User input
perl6 test.p6

------------------------------------------------------------

Successfully completed.

Resource usage summary:

    CPU time   :    0.19 sec.
    Max Memory :    0.10 MB
    Max Swap   :    0.10 MB

    Max Processes  :         2
    Max Threads    :         3

The output (if any) follows:

standard output stream

PS:

Read file <test.e_347511> for stderr output of this job.

Update:

Passing one or more -v flags to sbatch gives more preliminary information at submission time, but doesn't change what ends up in the standard output file.

Answer

Christopher Bottoms · Apr 29, 2015

For recent jobs, try

sacct -l

Look under the "Job Accounting Fields" section of the documentation for descriptions of each of the three dozen or so columns in the output.

To get just the job ID, maximum RAM used, maximum virtual memory size, start time, end time, CPU time in seconds, and the list of nodes on which the jobs ran, use the --format option. By default sacct only reports jobs run the same day (see the --starttime and --endtime options for jobs from other days):

sacct --format=jobid,MaxRSS,MaxVMSize,start,end,CPUTimeRAW,NodeList

This will give you output like:

       JobID  MaxRSS  MaxVMSize               Start                 End CPUTimeRAW NodeList
------------ ------- ---------- ------------------- ------------------- ---------- --------
36511                           2015-04-29T11:34:37 2015-04-29T11:34:37          0  c50b-20
36511.batch     660K    181988K 2015-04-29T11:34:37 2015-04-29T11:34:37          0  c50b-20
36514                           2015-04-29T12:18:46 2015-04-29T12:18:46          0  c50b-20
36514.batch     656K    181988K 2015-04-29T12:18:46 2015-04-29T12:18:46          0  c50b-20
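
If you want something that reads more like the LSF report, one option is to ask sacct for pipe-delimited output (-P/--parsable2 together with -n/--noheader) and reformat it yourself. Here is a minimal sketch, assuming the same seven --format fields in the same order as above. The function name lsf_style_summary is made up for illustration, and it only looks at the .batch rows, since those are the ones carrying MaxRSS in the sample output:

```shell
# Hypothetical helper (not part of SLURM): reads pipe-delimited sacct output
# on stdin and prints an LSF-style summary for every ".batch" step.
# Assumed field order: JobID|MaxRSS|MaxVMSize|Start|End|CPUTimeRAW|NodeList
lsf_style_summary() {
    awk -F'|' '
        $1 ~ /\.batch$/ {
            job = $1
            sub(/\.batch$/, "", job)          # strip the step suffix
            printf "Job %s on %s\n", job, $7
            printf "  Started at : %s\n", $4
            printf "  Ended at   : %s\n", $5
            printf "  CPU time   : %s sec.\n", $6
            printf "  Max Memory : %s\n", $2
            printf "  Max VM size: %s\n", $3
        }'
}

# Intended use on a cluster (needs sacct, so it is left commented out):
#   sacct -P -n --format=jobid,MaxRSS,MaxVMSize,start,end,CPUTimeRAW,NodeList | lsf_style_summary
```

Jobs that launch extra steps with srun will have further rows (one per step) that this sketch simply ignores.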


Use --state COMPLETED for checking previously completed jobs. When checking a state other than RUNNING, you have to give a start or end time.

sacct --starttime 08/01/15 --state COMPLETED --format=jobid,MaxRSS,MaxVMSize,start,end,CPUTimeRAW,NodeList,ReqCPUS,ReqMem,Elapsed,Timelimit
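
When comparing MaxRSS against ReqMem it can help to put the values on one scale, since sacct prints them with a unit suffix (660K in the sample output above). A rough sketch of a converter follows. It assumes the suffixes are K, M, G or T and that each step is a factor of 1024; mem_to_mib is a made-up name, not a SLURM tool:

```shell
# Hypothetical helper: convert a value such as 660K, 12M or 1.5G to MiB.
# Assumption: suffixes are K/M/G/T with 1024-based multipliers.
mem_to_mib() {
    awk -v v="$1" 'BEGIN {
        unit = toupper(substr(v, length(v)))   # last character is the suffix
        num  = v + 0                           # leading numeric part
        if      (unit == "K") mib = num / 1024
        else if (unit == "M") mib = num
        else if (unit == "G") mib = num * 1024
        else if (unit == "T") mib = num * 1024 * 1024
        else { print "unrecognised unit in: " v > "/dev/stderr"; exit 1 }
        printf "%.2f\n", mib
    }'
}

# Example: mem_to_mib 181988K
```

Values without a recognised suffix are rejected rather than guessed at.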

You can also get the job's working directory using scontrol:

scontrol show job 36514

Which will give you output like:

JobId=36514 JobName=sbatch
UserId=username(123456) GroupId=my_group(678)
......
WorkDir=/path/to/work/dir

However, by default, scontrol can only access that information for about five minutes after the job finishes, after which it is purged from memory.
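
Since scontrol show job prints whitespace-separated Key=Value pairs, a single field such as WorkDir can be pulled out with a small filter. This is only a sketch: scontrol_field is a hypothetical name, and it assumes the value itself contains no spaces (a working directory with spaces in its path would be cut short):

```shell
# Hypothetical helper: print the value of one key from
# `scontrol show job` output read on stdin.
# Assumption: the value contains no spaces.
scontrol_field() {
    tr ' \t' '\n\n' |
        awk -v k="$1" -F= '$1 == k { sub(/^[^=]*=/, ""); print; exit }'
}

# Intended use on a cluster (needs scontrol, so it is left commented out):
#   scontrol show job 36514 | scontrol_field WorkDir
```

Because of the purge window mentioned above, this has to run while the job is still known to the controller.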