Slurm

Slurm is the workload manager we use here at I2S. It is also used at many other universities, national labs, and private companies, so it is worth taking the time to understand. It is responsible for allocating resources so jobs can be executed, as well as managing the queue and ensuring that resources remain balanced among users.

Some useful Slurm commands can be found below, as well as under Helpful Commands.

Whether you use Open OnDemand or run Slurm commands directly, you are still communicating with the Slurm scheduler. OnDemand runs "on top" of Slurm and provides a GUI for some of Slurm's functionality.

srun

srun is used to submit a job to the cluster. When using srun, you are able to specify the requirements of the job such as node count, processor count, and specific hardware features (more info on the cluster hardware here).

srun can also be used to quickly get a shell on a compute node. To do that from one of the front nodes, you can run the following command:

srun -p intel -N 1 -n 1 -c 4 --mem 8G -t 120 --pty /bin/bash

Explanation

Options:

  • -p intel: Selects the intel partition
  • -N 1: Tells Slurm we want 1 node allocated to this job. This defaults to 1 if omitted.
  • -n 1: Sets the number of tasks to 1 (how many instances of the command will run). This defaults to 1 if omitted.
  • -c 4: Sets the number of CPUs per task to 4. This defaults to 1 if omitted.
  • --mem 8G: Amount of memory per node; in this case, 8 gibibytes. This defaults to 2G per CPU if omitted.
  • -t 120: Max time of the job; in this case, 120 minutes. You can also specify it as hours:minutes:seconds (24:00:00 = 24 hours). The default time limit depends on the partition.
  • --pty: Tells Slurm we want the task to behave like a terminal. This is often used when running a shell.
  • /bin/bash: The last option in an srun invocation is the program that srun will execute on the requested node. In this case, bash is specified to start an interactive shell session.

The intel partition is the default partition, which has a default time limit of 24 hours. In cases where you only need one node from the intel partition and aren't using MPI, the command can be stripped down significantly.

srun -c 4 --mem 8G -t 120 --pty /bin/bash

Alternatively, you can use our slogin wrapper, which invokes srun with predetermined arguments (and a 24 hour time limit):

[o936k099@front1 ~]$ slogin
srun -p intel -N 1 -n 1 -c 8 --mem 32G --pty /bin/bash
[o936k099@n036 ~]$

sbatch

sbatch is used to submit jobs to the cluster using a script file. Below is an example job submission script:

#!/bin/bash
#SBATCH -p intel
#SBATCH -N 1
#SBATCH -n 1
#SBATCH -c 1
#SBATCH --mem=1GB
#SBATCH -t 00:20:00 
#SBATCH -J test_job
#SBATCH -o slurm-%j.out

echo "Job ${SLURM_JOB_ID} ran on ${HOSTNAME}"

Example output:

[o936k099@front1 ~]$ sbatch test_job.sh
[o936k099@front1 ~]$ cat slurm-47491.out
Job 47491 ran on n036
[o936k099@front1 ~]$

For the most part, the arguments used for srun can also be used inside an sbatch script. This script requests one node with one core and 1GB of memory. -J specifies the job name that appears in the job queue, while -o specifies the log file name for the job. %j in the output file name is replaced with the Slurm job id when the scheduler processes the script. The SLURM_JOB_ID variable used in the example output is an environment variable set by the Slurm scheduler for each job.

To run this example script, copy its contents into a file in your home directory (test_job.sh, for example) and run the command sbatch test_job.sh. The job output log will be saved in the directory you submitted the job from, and it should contain output similar to the example above.
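
If the job does not finish right away, one quick way to check whether it is still queued or running is squeue, filtered to your own jobs:

squeue -u $USER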

sbatch job scripts can run programs directly, as shown above, but srun can also be used within a job submission script. Doing so allows fine-grained control over the resources given to each parallel task. An example is shown below:

#!/bin/bash
#SBATCH -p intel
#SBATCH -N 1
#SBATCH -n 2
#SBATCH -c 1
#SBATCH --mem=2GB
#SBATCH -t 00:20:00 
#SBATCH -J test_job
#SBATCH -o slurm-%j.out

srun -n 1 --mem=1G echo "Task 1 ran" &
srun -n 1 --mem=1G echo "Task 2 ran" &

wait

When the sbatch script is submitted, both srun invocations will run at the same time, splitting the resources requested at the top of the script file. This method is useful for launching a small number of related jobs at once from the same script, but does not scale well with a large number of jobs. The Job Arrays section below goes into more depth on running large numbers of parallel jobs on the cluster.

When using srun within a job submission script, you need to specify what portion of the job's resources each srun invocation is allocated. If an srun invocation requests more resources than the #SBATCH parameters make available, some steps may wait to run, or attempt to share resources with steps that are already running. In the example above, two tasks and 2GB of memory are requested. Each srun command below the resource request then specifies how much memory and how many tasks are allocated to that step.

The sbatch options shown in these example scripts are just the tip of the iceberg in terms of what is available. For the full listing of sbatch parameters, see the official Slurm sbatch documentation.

sbatch Job Arrays

There are two general approaches to submitting a large number of cluster jobs at once. The first is to submit jobs to the scheduler by calling srun in a loop on the command line. The preferable, and more powerful, approach uses job arrays to submit large blocks of jobs all at once with the sbatch command.

The --array parameter for sbatch lets the scheduler queue up hundreds to thousands of jobs with the same resource request. This method is much less taxing on the cluster scheduler than a submission loop, and it simplifies the process of submitting a large number of jobs at once. These arrays usually consist of the same program fed different parameters dictated by the job array indices.

An example job array script is shown below:

#!/bin/bash
#SBATCH -p intel
#SBATCH -N 1
#SBATCH -n 1
#SBATCH -c 1
#SBATCH --mem=1G
#SBATCH -t 00:20:00
#SBATCH -J test_job
#SBATCH -o logs/%A_%a.out
#SBATCH --array=1-4

echo Job ${SLURM_ARRAY_TASK_ID} used $(awk "NR == ${SLURM_ARRAY_TASK_ID} {print \$0}" ${SLURM_SUBMIT_DIR}/parameters)

Parameters file:

line 1 parameters
line 2 parameters
line 3 parameters
line 4 parameters

Example output:

[o936k099@front1 ~]$ sbatch array_test.sh
[o936k099@front1 ~]$ cd logs/
[o936k099@front1 logs]$ ls
49219_1.out  49219_2.out  49219_3.out  49219_4.out
[o936k099@front1 logs]$ cat *
Job 1 used line 1 parameters
Job 2 used line 2 parameters
Job 3 used line 3 parameters
Job 4 used line 4 parameters
[o936k099@front1 logs]$

In this example, the %A and %a symbols in the job log file path are replaced by the scheduler with the job array id and job array index respectively for each job in the array. The --array option specifies the creation of a job array which consists of four identical jobs with indices ranging from 1 to 4. Each job in the array is created with the same resource request at the top of the file, and runs the same bash command at the bottom of the script file. The echo command prints out the SLURM_ARRAY_TASK_ID (or job array index) environment variable of each job, along with one line from a file called "parameters". The awk command within the echo selects the line in the parameters file with the line number that matches the job array index value. This technique can be used to feed in specific parameters to different jobs within a job array.
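
One caveat: the logs directory referenced by the -o path must already exist before the array runs, since Slurm will not create it for you. As an aside, sed offers an equivalent (and arguably more readable) way to select that line by number; this is an illustrative alternative, not part of the original script:

sed -n "${SLURM_ARRAY_TASK_ID}p" ${SLURM_SUBMIT_DIR}/parameters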

Another way of generating program parameters for job arrays is through arithmetic. For example, if you wanted each job to loop over a range of values determined by its index, your job script might include something like this:

MAX=$(echo "${SLURM_ARRAY_TASK_ID} * 1000" | bc)
MIN=$(echo "(${SLURM_ARRAY_TASK_ID} - 1) * 1000" | bc)

for (( i=$MIN; i<$MAX; i++ )); do
  # Perform calculations...
done
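
For completeness, here is a minimal sketch of how that arithmetic could sit inside a full array script. The #SBATCH header mirrors the earlier array example, and the echo inside the loop is a placeholder for whatever work each index actually requires:

#!/bin/bash
#SBATCH -p intel
#SBATCH -N 1
#SBATCH -n 1
#SBATCH -c 1
#SBATCH --mem=1G
#SBATCH -t 00:20:00
#SBATCH -J range_job
#SBATCH -o logs/%A_%a.out
#SBATCH --array=1-4

# Each array task covers the half-open range [MIN, MAX) of 1000 indices.
MAX=$(echo "${SLURM_ARRAY_TASK_ID} * 1000" | bc)
MIN=$(echo "(${SLURM_ARRAY_TASK_ID} - 1) * 1000" | bc)

for (( i=MIN; i<MAX; i++ )); do
    # Placeholder: replace with the real work for index $i.
    echo "Processing index ${i}"
done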

GPU Jobs

To run GPU jobs, you need to select the GPU partition and request the appropriate GPU resources. Select the partition with --partition=gpu or -p gpu, and use the --gres (Generic RESource) option to specify your GPU requirements.

Available GRES devices on the I2S cluster
  • a100: GPU partition
  • l40s: MMICC partition
  • v100: GPU partition
  • p100: GPU partition
  • titanrtx: GPU partition
  • titanxp: GPU partition

For example, to request 2 NVIDIA A100 GPUs, you would use --gres=gpu:a100:2 in your sbatch script or srun command. The format follows the pattern --gres=gpu:type:count, where you specify the GPU model and number of units needed. Make sure to also request sufficient CPU cores and memory to support your GPU workload.

srun GRES request in Slurm
  • srun -p gpu --gres=gpu:a100:1 -c 48 --mem 256G -t 8:00:00 --pty /bin/bash: Requests a GPU node with 1 NVIDIA A100, 48 cores, and 256G of RAM with an 8 hour time limit.
  • srun -p gpu --gres=gpu:v100:2 -c 24 --mem 64G -t 24:00:00 --pty /bin/bash: Requests a GPU node with 2 NVIDIA V100s, 24 cores, and 64G of RAM with a 24 hour time limit.
  • srun -p gpu --gres=gpu:titanxp:2 -c 6 --mem 32G -t 24:00:00 --pty /bin/bash: Requests a GPU node with 2 NVIDIA Titan Xp's, 6 cores, and 32G of RAM with a 24 hour time limit.

For longer running jobs, we recommend using sbatch instead of srun to avoid problems with SSH disconnects. You can add the -p and --gres arguments inside your sbatch script like this:

#!/bin/bash

#SBATCH -c 1
#SBATCH --mem=1G
#SBATCH -t 00:20:00 
#SBATCH -J test_job
#SBATCH -o slurm-%j.out

#SBATCH -p gpu
#SBATCH --gres="gpu:titanxp:2"

echo "Job ${SLURM_JOB_ID} ran on ${HOSTNAME}"
nvidia-smi
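
As a quick sanity check (not part of the original example), most Slurm installations expose the GPUs allocated via --gres through the CUDA_VISIBLE_DEVICES environment variable, so you can confirm the allocation from inside the job script:

echo "Allocated GPU devices: ${CUDA_VISIBLE_DEVICES}"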

Constraints

Slurm job constraints allow you to specify precisely what hardware a job should run on. You can request CPU architectures and instruction sets, as well as networking type, node manufacturer, and memory. Hardware constraints are specified with the -C option:

srun -C "intel&ib" --pty /bin/bash

In this example, the & symbol between the two constraints requires that both be satisfied for the job to run. The | symbol can be used instead to require that either one constraint or the other be satisfied. Additionally, square brackets can be used to group constraints together. Here is an example combining all three:

#SBATCH -C "[intel&ib]|[amd&eth_10g]"
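
As an illustrative sketch (not taken from a production script), a constraint combines with the usual interactive-session options; the feature names here come from the list below:

srun -C "intel16|intel20" -c 4 --mem 8G -t 60 --pty /bin/bash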

Available Constraints

Instruction sets
  • sse3
  • sse4_1
  • sse4_2
  • sse4a
  • avx
CPU Manufacturer/Cores
  • intel: Intel CPU
  • amd: AMD CPU
  • intel16: Node with 16 Intel Cores
  • intel20: Node with 20 Intel Cores
Networking
  • ib: Any IB
  • ib_qdr: QDR InfiniBand (40Gbps)
  • ib_fdr: FDR InfiniBand (56Gbps)
  • eth_1g: 1G Ethernet
  • eth_10g: 10G Ethernet
  • noib: No InfiniBand
Manufacturer/CPU Manufacturer/Cores/Memory
  • del_int_16_64: Dell - Intel - 16 Cores - 64G
  • del_int_16_256: Dell - Intel - 16 Cores - 256G
  • del_int_20_256: Dell - Intel - 20 Cores - 256G
  • del_int_16_512: Dell - Intel - 16 Cores - 512G
  • del_int_20_128: Dell - Intel - 20 Cores - 128G

scontrol

scontrol allows us to see configuration info regarding the cluster, a specific node, or a specific job. You will most often use scontrol to modify existing jobs and see detailed information about a given job.

scontrol Usage
  • scontrol hold <jobid>: Places a pending job on hold so the scheduler will not start it.
  • scontrol release <jobid>: Releases a held job so it can be scheduled again.
  • scontrol show job <jobid>: Shows detailed information about the job.
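
scontrol update can also be used to modify attributes of a job you own. For example, to lower a pending job's time limit (47491 is just an illustrative job id):

scontrol update JobId=47491 TimeLimit=01:00:00

Note that users can generally only reduce limits on their own jobs; raising a time limit requires an administrator.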