Cluster Documentation


Introduction

The Advanced Computing Facility (ACF) is located at the Information and Telecommunication Technology Center (ITTC) in Nichols Hall and provides a 20-fold increase in power capacity to support a diverse range of research. The facility houses high performance computing (HPC) resources and, thanks to a $4.6 million renovation grant from the NIH, can support over 24,000 processing cores. A unique feature of the ACF is a sophisticated computer-rack cooling system that shuttles heat from computing equipment into the Nichols Hall boiler room, resulting in an expected 15% reduction in building natural gas use. Additionally, when outdoor temperatures drop below 45 degrees Fahrenheit, a "dry-cooler" takes over, reducing electricity consumption by allowing the cooling compressors to be powered down.

The ITTC Research Cluster is located in the ACF and provides HPC resources to members of the center. The cluster uses the Slurm workload manager, an open-source, fault-tolerant, and highly scalable cluster management and job scheduling system for Linux clusters. The cluster is composed of a variety of hardware types, with core counts ranging from 8 to 20 cores per node. In addition, there is specialized hardware, including Nvidia graphics cards for GPU computing, InfiniBand for low-latency/high-throughput parallel computing, and large-memory systems with up to 512 GB of RAM.
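
To see which partitions and hardware features are currently available, the standard Slurm sinfo utility can be run from a login node. A minimal sketch (the exact columns shown here are one choice of format; output will vary with the cluster's configuration):

[username@login1 ~]$ sinfo -o "%P %D %c %m %f"

Here %P is the partition name, %D the node count, %c the CPUs per node, %m the memory per node, and %f any node features defined by the administrators.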

Job Submission Guide


PBS/Torque and Slurm

A translation of common PBS/Torque commands to Slurm commands can be found here. It provides a quick reference for users who are familiar with PBS/Torque but new to the Slurm scheduler.
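
For quick reference, a few of the most common equivalences are listed below; job script directives translate similarly (for example, #PBS -N becomes #SBATCH -J).

qsub job.sh        ->  sbatch job.sh
qstat              ->  squeue
qstat -u username  ->  squeue -u username
qdel <jobid>       ->  scancel <jobid>
qsub -I            ->  srun --pty /bin/bash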

srun

Srun can be used to run any single task on a cluster node, but it is most useful for launching interactive GUI or shell sessions. Here is an srun example run on login1:

[username@login1 ~]$ srun -p intel -N 1 -n 1 -c 4 --mem 2G --pty /bin/bash
[username@n097 ~]$ 
The options used in this example are detailed below:
-p
Specifies the partition (queue) in which to create the job. The cluster partitions currently available are intel, amd, bigm, and gpu.
-N
Sets the number of nodes requested for the interactive session.
-n
Specifies the number of tasks or processes to run on each allocated node.
-c
Sets the number of requested CPUs per task.
--mem
Specifies the requested memory per node. Memory amounts can be given in kilobytes (K), megabytes (M), or gigabytes (G).
--pty
Runs the srun session in pseudo-terminal mode. This option is recommended when running an interactive shell session.
/bin/bash
The last argument in an srun invocation is the program that srun will execute on the requested node. In this case, bash is specified to start an interactive shell session.
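
The same pattern works for the other partitions. For example, a large-memory interactive session on the bigm partition might be requested as follows (the core count and memory figure are illustrative, not site defaults):

[username@login1 ~]$ srun -p bigm -N 1 -n 1 -c 4 --mem 64G --pty /bin/bash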

Srun supports nearly all of the same options as sbatch, so the additional options listed under sbatch will also work on the command line with srun. Srun can be used within a job submission script, or on the command line of a login node to run single jobs; an example follows. When submitting non-interactive jobs, however, sbatch is the better command to use.
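
For instance, a single non-interactive command can be launched directly from a login node like this (the partition choice is illustrative):

[username@login1 ~]$ srun -p intel -N 1 -n 1 hostname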

sbatch

Sbatch is used to submit batch jobs to the cluster. Unlike srun, sbatch uses a job submission script to specify resource requests. Below is an example sbatch submission script:

#!/bin/bash -l
#SBATCH -p intel        # partition to run in (intel, amd, bigm, or gpu)
#SBATCH -N 4            # number of nodes requested
#SBATCH -t 00:20:00     # walltime limit (hh:mm:ss)
#SBATCH -J my_job       # job name shown in the queue
 
srun -n 32 ./mycode.exe                   # launch the MPI tasks across the allocated nodes
# or
srun -n 32 -m block:block ./mycode.exe    # use block distribution among sockets
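
Once the script is saved to a file (the name my_job.sh below is arbitrary), it is submitted from a login node with sbatch, which prints the ID assigned to the job:

[username@login1 ~]$ sbatch my_job.sh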


Other useful options
-C
Specifies a node constraint. This can be used to request GPU nodes; see the GPU job example for more information on running GPU jobs.
-J
Specifies the job name that appears in the job queue.
--cores-per-socket
Sets the requested number of cores per CPU socket.
--mem-per-cpu
Specifies the memory allocated to each CPU in the job. It uses the same memory specification syntax as --mem.
--mail-type
Sets when the user is emailed job notifications. NONE, BEGIN, END, FAIL, REQUEUE, TIME_LIMIT, TIME_LIMIT_90, TIME_LIMIT_80, and TIME_LIMIT_50 are all valid options.
--mail-user
Specifies the email address to which job notification emails are sent.
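
The sketch below combines several of these options in a single submission script; the resource figures, notification choices, and email address are illustrative placeholders rather than site defaults.

#!/bin/bash -l
#SBATCH -p intel                          # partition to run in
#SBATCH -N 1                              # single node
#SBATCH -t 01:00:00                       # one hour walltime
#SBATCH -J notify_demo                    # job name shown in the queue (-J)
#SBATCH --mem-per-cpu=2G                  # memory for each allocated CPU
#SBATCH --mail-type=END,FAIL              # email when the job ends or fails
#SBATCH --mail-user=username@example.com  # replace with your own email address
 
srun -n 1 ./mycode.exe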

GUI Access

X11 forwarding

Access to a GUI running on the cluster may be accomplished with X11 forwarding. Data from the remote application is sent over ssh to an X server running locally. Each additional ssh connection between the local machine and the cluster must be started with X11 forwarding enabled. To request an interactive shell with X11 forwarding, you can run "srun.x11". The following steps assume that the local machine has an X server running.

1. Login via ssh to login1 or login2. Make sure your local ssh client has X11 forwarding enabled. If you are using ssh on the command line, add the "-Y" flag to your ssh command.
2. Start an interactive session with X11 forwarding. Be sure to request the number of cores, amount of memory, and walltime needed to complete your job. Syntax:

srun.x11 -N 1 -n 2 --mem=4096M -t 8:00:00
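
Putting both steps together, a session might look like the following; the node name and the test application (xterm) are illustrative, and srun.x11 accepts the same resource options as srun:

[localhost ~]$ ssh -Y username@login1.ittc.ku.edu
[username@login1 ~]$ srun.x11 -N 1 -n 2 --mem=4096M -t 8:00:00
[username@n097 ~]$ xterm &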

NoMachine

NoMachine is a remote desktop application that is available for Linux, Windows, and macOS. NoMachine requires a connection to the KU or ITTC network; remote users will need to use the KU Anywhere VPN.

Here is a step-by-step guide to setting up NoMachine:

  1. First, you will need to install NoMachine on your computer; the client is available here
  2. After installing NoMachine, run the client and click 'Continue' on the start-up screen. From the options available, click 'Create a new custom Connection'. Change the Protocol setting from 'NX' to 'SSH' and click continue. In the 'Host' field, put login1.ittc.ku.edu or login2.ittc.ku.edu, and in the 'Port' field, put 22, then continue to the next screen.
    For the authentication method, select the 'Use the NoMachine login' radio button and continue. The next screen prompts for an alternative server key; hit continue without specifying one. Lastly, select the 'Don't use a proxy' radio button before continuing past the final page.
  3. After creating your new connection configuration, you will be prompted for a name to save the configuration under. After saving your connection, you should see it in the list of created connections on the main page. Double-click the one you just created to try connecting.
    If you configured the settings properly, you will be prompted to enter your ITTC credentials. If you are not prompted for your credentials, you may have entered information incorrectly during the connection creation process, or your computer may not be on the correct network.
    Make sure you are connected to the KU or ITTC network, using the KU Anywhere VPN if you are connecting to the cluster remotely. If you are unable to solve your connection issue, email cluster support for assistance in setting up your remote connection.
  4. Assuming you are able to connect, the first time you connect, you will be asked to 'verify the host authenticity'. Click 'yes' to continue with the connection. You will now be asked to select a desktop environment to use for the connection. The GNOME desktop is recommended.
    After selecting the environment, read through the NoMachine welcome screen and continue on to the desktop.

Application Support

Helpful Commands

Debugging

Profiling