Cluster Documentation
Introduction
The Advanced Computing Facility (ACF) is located at the Information and Telecommunication Technology Center (ITTC) in Nichols Hall, and provides a 20-fold increase in computing power to support a diverse range of research. The facility houses high performance computing (HPC) resources and, thanks to a $4.6 million renovation grant from the NIH, has the capability of supporting over 24,000 processing cores. A unique feature of the ACF is a sophisticated computer-rack cooling system that shuttles heat from computing equipment into the Nichols Hall boiler room, resulting in an expected 15% reduction in building natural gas use. Additionally, when outdoor temperatures drop below 45 degrees Fahrenheit, a "dry-cooler" takes over, cutting electricity consumption by allowing the cooling compressors to be powered down.
The ITTC Research Cluster is located in the ACF and provides HPC resources to members of the center. The cluster uses the Slurm workload manager, an open source, fault-tolerant, and highly scalable cluster management and job scheduling system for Linux clusters. The cluster is composed of a variety of hardware types, with core counts ranging from 8 to 20 per node. In addition, there is specialized hardware including Nvidia graphics cards for GPU computing, InfiniBand for low-latency/high-throughput parallel computing, and large-memory systems with up to 512 GB of RAM.
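The available node types can be inspected directly with Slurm's sinfo command. A minimal sketch (the output columns chosen here are illustrative):

    sinfo -N -o "%N %c %m %f %G"

Here %N prints the node name, %c the CPU count, %m the memory per node in megabytes, %f the node features usable with the -C/--constraint option, and %G any generic resources such as GPUs.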
Job Submission Guide
PBS/Torque and Slurm
A translation of common PBS/Torque commands to Slurm commands can be found here [1]. It provides a quick reference for those who are familiar with PBS/Torque but new to the Slurm scheduler, including the equivalents listed below.
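A few of the most common equivalents, standard across Slurm installations:

    qsub job.sh    ->  sbatch job.sh
    qstat          ->  squeue
    qstat -u user  ->  squeue -u user
    qdel <job_id>  ->  scancel <job_id>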
srun
srun can be used to run any single task on a cluster node, but it is most useful for launching interactive GUI or bash sessions. Here is an example srun invocation run from login1 (the node feature "intel", the memory size, and the email address are placeholders; substitute values appropriate for your job):
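    srun -C intel -J interactive --cores-per-socket=4 --mem=8G --mail-type=END --mail-user=user@ittc.ku.edu --pty /bin/bash

This requests an interactive bash session on a node with the "intel" feature, with 4 cores per socket and 8 GB of memory, and emails the given address when the job ends.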
The options used in this example are all detailed below:

-C, --constraint : Specifies a node constraint (feature), which can be used to select particular node types.
-J, --job-name : Specifies the job name that appears in the job queue.
--cores-per-socket : Sets the requested number of cores per CPU socket.
--mem : Specifies the desired memory per node. Memory can be given in kilobytes (K), megabytes (M), or gigabytes (G).
--mem-per-cpu : Specifies the memory allocated to each CPU in the interactive session. It uses the same memory syntax as --mem.
--mail-type : Sets when the user is emailed job notifications. Valid options are NONE, BEGIN, END, FAIL, REQUEUE, TIME_LIMIT, TIME_LIMIT_90, TIME_LIMIT_80, and TIME_LIMIT_50.
--mail-user : Specifies the email address to which job notifications are sent.
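Note that --mem and --mem-per-cpu are mutually exclusive; a request should use one or the other. Where a job's memory needs scale with its task count, --mem-per-cpu is often the more natural choice. A minimal sketch launching a hypothetical program ./my_app as 4 tasks with 2 GB per CPU:

    srun -n 4 --mem-per-cpu=2G ./my_app

With the default of one CPU per task, Slurm reserves 8 GB in total for this job.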