What does the --ntasks or -n option do in SLURM?

Charlie Parker · Aug 28, 2016 · Viewed 20.4k times

I was using SLURM on a computing cluster and came across the --ntasks (or -n) option. I have obviously read the documentation for it (http://slurm.schedmd.com/sbatch.html):

sbatch does not launch tasks, it requests an allocation of resources and submits a batch script. This option advises the Slurm controller that job steps run within the allocation will launch a maximum of number tasks and to provide for sufficient resources. The default is one task per node, but note that the --cpus-per-task option will change this default.

The specific part I do not understand is:

run within the allocation will launch a maximum of number tasks and to provide for sufficient resources.

I have a few questions:

  1. My first question is what the word "task" means and how it differs from the word "job" in the SLURM context. I usually think of a job as running the bash script under sbatch, as in sbatch my_batch_job.sh. I am not sure what "task" means.
  2. If I equate the word "task" with "job", then I would have expected the same bash script to run multiple times, according to the argument to -n, --ntasks=<number>. However, I tested this on the cluster: I ran echo hello with --ntasks=9 and expected sbatch to echo hello 9 times to stdout (which is collected in slurm-job_id.out). To my surprise, there was a single execution of my echo hello script. Then what does this option even do? It seems to do nothing, or at least I can't see what it is supposed to be doing.

I do know the -a, --array=<indexes> option exists for multiple jobs. That is a different topic. I simply want to know what --ntasks is supposed to do, ideally with an example so that I can test it out on the cluster.

Answer

Alexis Lucattini · Dec 13, 2018

The --ntasks parameter is useful if you have commands that you want to run in parallel within the same batch script. This may be two separate commands separated by an & or two commands used in a bash pipe (|).

For example, using the default of --ntasks=1:

#!/bin/bash

#SBATCH --ntasks=1

# Launch two job steps in the background, then wait for both to finish.
srun sleep 10 &
srun sleep 12 &
wait

This will produce the warning:

Job step creation temporarily disabled, retrying

Because the number of tasks defaults to one, the second job step cannot start until the first has finished. This job will finish in around 22 seconds. To break this down:

sacct -j 515058 --format=JobID,Start,End,Elapsed,NCPUS

        JobID               Start                 End    Elapsed      NCPUS
------------ ------------------- ------------------- ---------- ----------
515058       2018-12-13T20:51:44 2018-12-13T20:52:06   00:00:22          1
515058.batch 2018-12-13T20:51:44 2018-12-13T20:52:06   00:00:22          1
515058.0     2018-12-13T20:51:44 2018-12-13T20:51:56   00:00:12          1
515058.1     2018-12-13T20:51:56 2018-12-13T20:52:06   00:00:10          1

Here step 0 started and finished (in 12 seconds), followed by step 1 (in 10 seconds), for a total elapsed time of 22 seconds.

To run both of these commands simultaneously:

#!/bin/bash

#SBATCH --ntasks=2

# Each step takes one of the two allocated tasks, so both run at once.
srun --ntasks=1 sleep 10 &
srun --ntasks=1 sleep 12 &
wait

Running the same sacct command as above:

sacct -j 515064 --format=JobID,Start,End,Elapsed,NCPUS

        JobID               Start                 End    Elapsed      NCPUS
------------ ------------------- ------------------- ---------- ----------
515064       2018-12-13T21:34:08 2018-12-13T21:34:20   00:00:12          2
515064.batch 2018-12-13T21:34:08 2018-12-13T21:34:20   00:00:12          2
515064.0     2018-12-13T21:34:08 2018-12-13T21:34:20   00:00:12          1
515064.1     2018-12-13T21:34:08 2018-12-13T21:34:18   00:00:10          1

Here the whole job takes 12 seconds. There is no risk of job steps waiting for resources, because the number of tasks has been specified in the batch script and the job therefore has the resources to run this many commands at once.

Each job step inherits the parameters specified for the batch script. This is why --ntasks=1 needs to be specified for each srun command; otherwise each step uses --ntasks=2, and so the second command will not run until the first has finished.
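This inheritance also explains the single hello in the question: a plain command in the batch script runs once, while srun launches one copy per task. The following is a minimal sketch, not a tested recipe for any particular cluster. The function name say_hello and the check on SLURM_JOB_ID (added so the script can also be tried outside Slurm) are my own assumptions for illustration.

```shell
#!/bin/bash
#SBATCH --ntasks=3

# say_hello is a hypothetical helper name used only for this sketch.
say_hello() {
    # A plain command in the batch script runs once, whatever --ntasks says.
    echo "hello from the batch script"

    if [ -n "${SLURM_JOB_ID:-}" ]; then
        # Inside a job: srun without --ntasks inherits the batch value,
        # so it launches one copy of the command per task (3 here).
        srun bash -c 'echo "hello from task $SLURM_PROCID of $SLURM_NTASKS"'
    else
        # Outside Slurm there is no allocation to launch tasks in.
        echo "not inside a Slurm job, skipping srun"
    fi
}

say_hello
```

Submitted with sbatch, this should write one line from the batch script and then one line per task to the slurm-job_id.out file. The order of the per-task lines is not guaranteed.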

Another caveat of the steps inheriting the batch parameters arises if --export=NONE is specified as a batch parameter. In this case --export=ALL should be specified for each srun command; otherwise environment variables set within the sbatch script are not inherited by the srun command.

Additional notes:
When using bash pipes, it may be necessary to specify --nodes=1 to prevent the commands on either side of the pipe from running on separate nodes.
When using & to run commands simultaneously, the wait is vital. Without it, the batch script would exit as soon as it had launched the background commands, and Slurm would end the job and terminate any steps still running.
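The effect of wait can be tried without a cluster. The sketch below is a stand-in, not a Slurm script: it uses plain background subshells in place of srun steps (an assumption made so that it runs on any machine with bash), and the function name run_pair is only illustrative.

```shell
#!/bin/bash

# run_pair is a hypothetical helper name used only for this sketch.
run_pair() {
    # Two background commands standing in for two srun job steps.
    (sleep 1; echo "short command done") &
    (sleep 2; echo "long command done") &

    # Block until every background command started above has exited.
    wait
    echo "script exiting"
}

run_pair
```

With the wait in place, "script exiting" is printed only after both background commands have finished. Removing it lets the function return immediately, which in a batch script means reaching the end of the script while the steps are still running.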