Cluster Documentation

Introduction

The Advanced Computing Facility (ACF) is located at the Information and Telecommunication Technology Center (ITTC) in Nichols Hall, and provides a 20-fold increase in power to support a diverse range of research. The facility houses high performance computing (HPC) resources and, thanks to a $4.6 million renovation grant from the NIH, has the capability of supporting over 24,000 processing cores. A unique feature of the ACF is a sophisticated computer-rack cooling system that shuttles heat from computing equipment into the Nichols Hall boiler room, resulting in an expected 15% reduction in building natural gas use. Additionally, when outdoor temperatures drop below 45 degrees, a "dry-cooler" will kick in, slashing electricity consumption by allowing cooling compressors to be powered down.

The ITTC Research Cluster is located in the ACF, and provides HPC resources to members of the center. The cluster uses the Slurm workload manager, which is an open source, fault-tolerant, and highly scalable cluster management and job scheduling system for Linux clusters. The cluster is composed of a variety of hardware types, with core counts ranging from 8 to 20 cores per node. In addition, there is specialized hardware including Nvidia graphics cards for GPU computing, Infiniband for low latency/high throughput parallel computing, and large memory systems with up to 512 GB of RAM.

Getting Help

If you have any questions about the ITTC Research Cluster, feel free to email clusterhelp@ittc.ku.edu for assistance.

Job Submission Guide

PBS/Torque and Slurm

A translation for common PBS/Torque commands to Slurm commands can be found here. This provides a quick guide for those who are familiar with PBS/Torque, but new to the Slurm scheduler.
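
For quick reference, a few of the most common equivalents are listed below; this is only a small subset of the full translation table.

qsub script.sh     ->  sbatch script.sh
qstat              ->  squeue
qdel <jobid>       ->  scancel <jobid>
qsub -I            ->  srun --pty /bin/bash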

Submitting Jobs

To submit jobs to the cluster, you can either write a script and submit it using sbatch:

[username@login1 ~]$ sbatch script.sh

Or, you can submit jobs interactively from the command line using srun:

[username@login1 ~]$ srun echo Hello World!

Job scripts use parameters (denoted by #SBATCH) in the script file to request job resources, while interactive jobs request resources with command-line parameters. When no resources are requested, a default set is automatically allocated for the job.

This default resource set includes:
  • The job's name is set to the script file name, or, if the job was started with srun, to the first command (in the example above, the name would be 'echo').
  • The job is scheduled in the default intel queue.
  • The job is allocated 1 core on 1 node with 2GB of memory.
  • The job is allocated 1 day to run.
  • The job redirects stdout and stderr to the same output file if the job is submitted with sbatch. If srun is used, then both will be printed to the screen.
  • The job's output file name takes the form "slurm-jobid.out", and is created in the same directory as the job script.
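
For example, a minimal script that relies entirely on these defaults could look like the following sketch (the hostname command is just an illustration):

#!/bin/bash
# No #SBATCH directives: the job receives the defaults listed above
# (intel partition, 1 core on 1 node, 2GB of memory, 1-day time limit).
hostname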

srun

srun can be used to run any single task on a cluster node, but it is most useful for launching interactive GUI or bash sessions. Here is an srun example run on login1:

[username@login1 ~]$ srun -p intel -N 1 -n 1 -c 4 --mem 4G --pty /bin/bash
[username@n097 ~]$ 

The options used in this example are all detailed below:
-p
Specifies a partition, or queue to create the job in. The current cluster partitions available are intel, amd, bigm, and gpu. For more information on the cluster queues, see the partitions section below.
-N
This sets the number of requested nodes for the interactive session.
-n
Specifies the number of tasks or processes to run on each allocated node.
-c
Sets the number of requested cpus per task.
--mem
This specifies the requested memory per node. Memory amounts can be given in Kilobytes (K), Megabytes (M), and Gigabytes (G).
--pty
This option puts the srun session in pseudo-terminal mode. It is recommended to use this option if you are running an interactive shell session.
/bin/bash
The last option in an srun invocation is the program that srun will execute on the requested node. In this case, bash is specified to start an interactive shell session.

srun is used to submit both interactive and non-interactive jobs. When it is run directly on the command line as shown above, an interactive session is started on a cluster node. When it is used in a job submission script, it starts a non-interactive session.

sbatch

sbatch is used to submit jobs to the cluster using a script file. Below is an example job submission script:

#!/bin/bash
#SBATCH -p intel
#SBATCH -N 1
#SBATCH -n 1
#SBATCH -c 1
#SBATCH --mem=1GB
#SBATCH -t 00:20:00 
#SBATCH -J test_job
#SBATCH -o slurm-%j.out
 
echo "Job ${SLURM_JOB_ID} ran on ${HOSTNAME}"

Example output:

[username@login1 ~]$ sbatch test_job.sh
[username@login1 ~]$ cat slurm-47491.out
Job 47491 ran on n097
[username@login1 ~]$

This script requests one node with one core, and 1GB of memory. -J is used to specify the job name that appears in the job queue, while -o specifies the log file name for the job. %j in the job output file name is replaced with the Slurm job id when the scheduler processes the script. The variable SLURM_JOB_ID used in the example output is an environment variable set by the Slurm scheduler for each job.

To run this example script, copy its contents into a file in your home directory (test_job.sh, for example). Log in to either login1.ittc.ku.edu or login2.ittc.ku.edu with your ITTC credentials, and run the command sbatch test_job.sh. The job output log will be saved in the same directory as the job submission script, and should contain output similar to the example above.

sbatch job scripts can run programs directly, as shown above, but it is also possible to use srun within job submission scripts to run programs. Using srun in a job script allows fine-grained resource control over parallel tasks run within the job. An example is shown below:

#!/bin/bash
#SBATCH -p intel
#SBATCH -N 1
#SBATCH -n 2
#SBATCH -c 1
#SBATCH --mem=2GB
#SBATCH -t 00:20:00 
#SBATCH -J test_job
#SBATCH -o slurm-%j.out

srun -n 1 --mem=1G echo "Task 1 ran" &
srun -n 1 --mem=1G echo "Task 2 ran" &

wait

When the sbatch script is submitted, both srun invocations will run at the same time, splitting the resources requested at the top of the script file. This method is useful for launching a small number of related jobs at once from the same script, but does not scale well with a large number of jobs. The Job Array section below goes into more depth on running large numbers of parallel jobs on the cluster.

When using srun within a job submission script, you need to specify what portion of the resources each srun invocation is allocated. If more resources are requested by srun than are made available by the #SBATCH parameters, then some jobs may wait to run, or attempt to share resources with already running jobs. In the example above, two tasks and 2GB of memory are requested. In the srun commands below the resource request, we specify how much memory and how many tasks are allocated to each job.

The sbatch options shown in these example scripts are just the tip of the iceberg in terms of what is available. For the full listing of sbatch parameters, see the official Slurm sbatch documentation.

Here is a brief list of other common options that may be useful:
-C
Specifies a node constraint. This can be used to specify CPU architecture and instruction set.
-D
Specifies the path to the log file destination directory. This can be an absolute path, or a relative path from the job submission script directory.
--gres
Used to request GPU resources. See this example for more information on running GPU jobs.
--cores-per-socket
Sets the requested number of cores per cpu socket.
--mem-per-cpu
This specifies the memory allocated to each cpu in the interactive session. It has the same memory specification syntax as --mem.
--mail-type
Sets when the user is to be emailed job notifications. NONE, BEGIN, END, FAIL, REQUEUE, TIME_LIMIT, TIME_LIMIT_90, TIME_LIMIT_80, and TIME_LIMIT_50 are all valid options.
--mail-user
Specifies the user account to email when job notification emails are sent.
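
As a sketch, several of these options can be combined in one job script; the email address and the logs directory below are placeholders, and the logs directory must already exist:

#!/bin/bash
#SBATCH -p intel
#SBATCH -n 1
#SBATCH --mem=2G
#SBATCH -t 01:00:00
#SBATCH -C "avx"
#SBATCH -D logs
#SBATCH --mail-type=END,FAIL
#SBATCH --mail-user=username@ku.edu

echo "Job ${SLURM_JOB_ID} ran on ${HOSTNAME}"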

Job Arrays

There are two general approaches to submitting a large number of cluster jobs at once. The first is to submit jobs to the scheduler using srun in a loop on the command line. The preferable, and more powerful, approach uses job arrays to submit large blocks of jobs all at once with the sbatch command.

The --array parameter for sbatch allows the scheduler to queue up hundreds to thousands of jobs with the same resource requests. This method is much less taxing on the cluster scheduler, and simplifies the process of submitting a large number of jobs all at once. These arrays usually consist of the same program fed different parameters dictated by the job array indices.

An example job array script is shown below:

#!/bin/bash
#SBATCH -p intel
#SBATCH -N 1
#SBATCH -n 1
#SBATCH -c 1
#SBATCH --mem=1G
#SBATCH -t 00:20:00
#SBATCH -J test_job
#SBATCH -o logs/%A_%a.out
#SBATCH --array=1-4

echo Job ${SLURM_ARRAY_TASK_ID} used $(awk "NR == ${SLURM_ARRAY_TASK_ID} {print \$0}" ${SLURM_SUBMIT_DIR}/parameters)

Example output:

[username@login1 ~]$ sbatch array_test.sh
[username@login1 ~]$ cd logs/
[username@login1 logs]$ ls
49219_1.out  49219_2.out  49219_3.out  49219_4.out
[username@login1 logs]$ cat *
Job 1 used line 1 parameters
Job 2 used line 2 parameters
Job 3 used line 3 parameters
Job 4 used line 4 parameters
[username@login1 logs]$

Parameters file:

line 1 parameters
line 2 parameters
line 3 parameters
line 4 parameters

In this example, the %A and %a symbols in the job log file path are replaced by the scheduler with the job array id and job array index, respectively, for each job in the array. The --array option specifies the creation of a job array which consists of four identical jobs with indices ranging from 1 to 4. Each job in the array is created with the same resource request at the top of the file, and runs the same bash command at the bottom of the script file. The echo command prints out the SLURM_ARRAY_TASK_ID (or job array index) environment variable of each job, along with one line from a file called "parameters". The awk command within the echo selects the line in the parameters file whose line number matches the job array index value. This technique can be used to feed specific parameters to different jobs within a job array.

Another way of generating program parameters for job arrays is through arithmetic. For example, if you wanted to define minimum and maximum values for a job to loop through based on its index value, your job script might include something like this:

# Each array task works on its own block of 1000 values.
MAX=$(echo "${SLURM_ARRAY_TASK_ID} * 1000" | bc)
MIN=$(echo "(${SLURM_ARRAY_TASK_ID} - 1) * 1000" | bc)

for (( i=$MIN; i<$MAX; i++ )); do
  # Perform calculations...
done

Cluster Partitions

Cluster partitions, or queues, are sets of nodes in the cluster grouped by their features. Currently, there are four partitions in the ITTC cluster: intel, amd, bigm, and gpu. The intel and amd partitions are made up of nodes that contain exclusively Intel and AMD CPUs, respectively. The bigm queue is made up of nodes with 256GB to 500GB of RAM, and the gpu partition contains nodes with Nvidia GPU co-processors. Partitions can be specified in a job script with the -p option:

#SBATCH -p intel

They can also be specified in interactive sessions:

srun -p intel -N 1 -n 1 --pty /bin/bash

Partitions allow for high-level constraints on job hardware, but lack fine-grained control over things like cpu and gpu architecture.

Job Constraints

Job constraints allow precise specification of the hardware a job should run on. CPU architectures and instruction sets can be requested, as well as the networking type, node manufacturer, and memory. Specifying hardware constraints is done with the -C option:

#SBATCH -C "intel"

Multiple constraints can also be specified at once:

srun -C "intel&ib" --pty /bin/bash

In this example, the & symbol between the two constraints specifies that both should be fulfilled for the job to run. The | symbol can also be used to specify that either one or the other constraint can be fulfilled. Additionally, square-brackets can be used to group together constraints. Here is an example combining all three:

#SBATCH -C "[intel&ib]|[amd&eth_10g]"

Available constraints:

    Instruction Set
    • sse3
    • sse4_1
    • sse4_2
    • sse4a
    • avx
    CPU Brand/Cores
    • intel
    • amd
    • intel8
    • amd8
    • intel12
    • intel16
    • intel20
    Networking
    • ib
    • ib_ddr
    • ib_qdr
    • noib
    • eth_10g
    Manufacturer/CPU Brand/Cores/Memory
    • del_int_8_16
    • del_int_8_24
    • del_int_12_24
    • asu_int_12_32
    • sup_int_12_32
    • asu_int_12_128
    • del_int_16_64
    • del_int_16_256
    • del_int_20_256
    • del_int_16_512
    • del_int_20_128
    • del_amd_8_16
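
For example, a batch script can combine a partition with constraints from the list above. Below is a minimal sketch; my_program is a placeholder for your own executable:

#!/bin/bash
#SBATCH -p intel
#SBATCH -C "intel16&ib"
#SBATCH -N 1
#SBATCH -n 1
#SBATCH --mem=4G
#SBATCH -t 01:00:00

# Runs on a 16-core Intel node with an Infiniband interconnect.
./my_program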

GPU Jobs

Instead of using hardware constraints, GPUs are specified with Generic Resource (gres) requests. Below is an example of an interactive GPU job request:

srun -p gpu --gres="gpu:k20:2" --pty /bin/bash

This request specifies two Nvidia K20 GPUs in the GPU queue for the interactive session, along with the default job resources. The --gres option allows specification of the GPU model and count through a colon-delimited list. Below is a job script example:

#SBATCH -p gpu
#SBATCH --gres="gpu:k40:1"

The GPU partition must be specified when requesting GPUs, otherwise the scheduler will reject the job. Whenever a job is started on a GPU node, the environment variable CUDA_VISIBLE_DEVICES is set to contain a comma-delimited list of the GPUs allocated to the current job. Information about these GPUs can be viewed by running nvidia-smi.

Here is example output from the srun example above:

[username@login1 ~]$ srun -p gpu --gres="gpu:k20:2" --pty /bin/bash
[username@g002 ~]$ echo $CUDA_VISIBLE_DEVICES
1,2
[username@g002 ~]$ nvidia-smi
Fri Jan 20 16:23:01 2017       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 367.48                 Driver Version: 367.48                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K20m          Off  | 0000:02:00.0     Off |                    0 |
| N/A   30C    P0    47W / 225W |      0MiB /  4742MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla K20m          Off  | 0000:03:00.0     Off |                    0 |
| N/A   29C    P0    47W / 225W |      0MiB /  4742MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla K20m          Off  | 0000:83:00.0     Off |                    0 |
| N/A   28C    P0    48W / 225W |      0MiB /  4742MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Tesla K20m          Off  | 0000:84:00.0     Off |                    0 |
| N/A   28C    P0    51W / 225W |      0MiB /  4742MiB |    100%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
[username@g002 ~]$ 

Currently, there are seven different GPU models available in the cluster:

    Gpu Models:
    • k20
    • k40
    • k80
    • titanxp
    • p100
    • v100s
    • titanrtx
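
For reference, below is a minimal batch script sketch that requests one of the models listed above; my_cuda_program is a placeholder for your own GPU executable:

#!/bin/bash
#SBATCH -p gpu
#SBATCH --gres="gpu:k80:1"
#SBATCH -n 1
#SBATCH --mem=8G
#SBATCH -t 02:00:00
#SBATCH -J gpu_example
#SBATCH -o slurm-%j.out

# CUDA_VISIBLE_DEVICES lists the GPUs allocated to this job.
echo "Allocated GPUs: ${CUDA_VISIBLE_DEVICES}"
./my_cuda_program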

MPI Jobs

If you want to use multiple processors with MPI, you need to request multiple tasks with the -n option. You will also need to load a version of MPI (e.g., OpenMPI). An example MPI job script is shown below:

#!/bin/bash

#SBATCH -p intel
#SBATCH -n 4
#SBATCH --mem=1GB
#SBATCH -t 00:05:00
#SBATCH -J mpi_example
#SBATCH -o slurm-%j.out

module load OpenMPI
mpirun $HOME/helloworld

The script above will launch four tasks of helloworld. Below is the example output:

Hello world from processor n097.local, rank 0 out of 4 processors
Hello world from processor n097.local, rank 1 out of 4 processors
Hello world from processor n097.local, rank 2 out of 4 processors
Hello world from processor n097.local, rank 3 out of 4 processors

GUI Access

X11 forwarding

Access to a GUI running on the cluster may be accomplished with X11 forwarding. Data from the remote application is sent over ssh to an X server running locally. Each additional ssh connection between the local machine and the cluster must be started with X11 forwarding enabled. To request an interactive shell with X11 forwarding, you can use the "--x11" option. The following steps assume that the local machine has an X server running.

  1. Login via ssh to login1 or login2. Make sure your local ssh client has X11 forwarding enabled. If you are using ssh on the command line, add the "-X" flag to your ssh command.
  2. Start an interactive session with X11 forwarding. Be sure to request the number of cores, amount of memory, and walltime needed to complete your job. Syntax:
    srun --x11 -N 1 -n 2 --mem=4096mb -t 8:00:00 --pty /bin/bash
  3. After starting an interactive session with X11 forwarding, you can now launch graphical programs from the terminal.

RDP

Remote Desktop Protocol (RDP) provides a user with a graphical interface to connect to another computer over a network connection. You will need an RDP client installed and will need to be connected to the KU Anywhere VPN. You can RDP to either login1.ittc.ku.edu or login2.ittc.ku.edu.


General Cluster Information

Software Environment

All cluster nodes run CentOS version 7 with GCC version 4.8.5. Cluster applications are installed as modules under /nfs/apps/7/arch/generic.

Environment Modules

Cluster software is made available through environment modules. A list of available modules can be viewed by running:

module avail

Modules shown in the list can be loaded with the following command:

module load module_name

In order to persist loaded modules between interactive sessions, you need to add module load commands for the applications you want loaded to your ~/.bash_profile file if you are using bash, or ~/.cshrc if you are using tcsh or csh.

To view all loaded modules in your current shell session, use the module list command. To unload all currently loaded modules, use the module purge command. For more information on the module command and its options, see the module documentation.
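
As an illustration, a typical module workflow might look like the following; OpenMPI is used here only because it also appears in the MPI example above, so substitute whatever module your job needs:

[username@login1 ~]$ module avail            # list available modules
[username@login1 ~]$ module load OpenMPI     # load a module for this session
[username@login1 ~]$ module list             # show currently loaded modules
[username@login1 ~]$ module purge            # unload all loaded modules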

Filesystems

Below is a list of filesystems available on the cluster:

/users
Stores private home directories. Avoid running cluster jobs out of this directory. Default quota: 5GB.
/work
Shared group storage. Default quota: 1TB.
/scratch
Private working storage to run cluster jobs. Default quota: 1TB.
/tmp
Local storage on cluster nodes. No default quota.

Debugging

The cluster has a number of tools at your disposal for debugging submitted Slurm jobs. The most basic debugging information available is from the log files generated by running your job, which contain the STDERR and STDOUT output from the job. Log files are located within the submit directory with the filename slurm-<job id>.out, such as slurm-49321.out.

You can retrieve detailed job information using the command scontrol show jobid -dd <jobid>. Likewise, if you want to view detailed job information while the job is running, add the --output option to srun in your job batch file. For an unbuffered stream of STDOUT, which is quite useful for debugging, add the -u or --unbuffered option to srun in your job batch file.
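
As a sketch, those srun options can be used in a job batch file like this; my_program and the step output file name are placeholders:

#!/bin/bash
#SBATCH -p intel
#SBATCH -n 1
#SBATCH -t 00:10:00
#SBATCH -J debug_example

# Write this step's output to its own file (%j is replaced with the job id)
# and disable buffering so messages appear as soon as they are printed.
srun --output=step-%j.out -u ./my_program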


Helpful Commands

The Slurm scheduler has a number of utilities for finding information on the status of your jobs. Below are listed a few of the most useful commands and options for quickly finding this information.

Useful Slurm commands:
sacct
Lists information on finished and currently running jobs, including job status and exit codes.
sacct -u <username>
Lists information on currently running and recently finished jobs for the specified user.
sacct -S <start-date> -s <state>
Lists all jobs in the specified state since the given start date or time.
scancel -u <username> -t <state>
Cancels all of the jobs for the specific user that are in the specified state.
scontrol hold <jobid>
Holds the specified job by placing it in a 'HOLD' state so that it will not start.
scontrol release <jobid>
Releases the specified job from the 'HOLD' state so it can be scheduled to run.
scontrol show job <jobid>
Shows detailed queue and resource allocation information for the specified job.
sinfo
Displays information on all of the cluster partitions, including the nodes available in them.
sinfo -T
Shows information on cluster node reservations, including reservation period, name, and reserved nodes.
squeue
Displays the short-form information for all currently running and queued jobs.
squeue -u <username> -l
Lists the long-form information about currently running jobs for a specific user.
squeue -u <username> -t <state>
Lists information about a specific user's jobs that are in the specified state.
sview
If X11 forwarding is enabled, this command launches a graphical interface for viewing cluster information.

Citing the Cluster

If you would like to cite the ITTC research cluster in your work, feel free to use or adapt the following citation:

The authors wish to acknowledge Wesley Mason, Michael Hulet and the rest of
the Information and Telecommunication Technology Center (ITTC) staff at The
University of Kansas for their support with our high performance computing.

Cluster Hardware

Visit the Cluster Hardware page for a complete listing of all of the nodes in the cluster and their hardware configurations.