Introduction to Lewis and Clark Clusters

This page describes how to manage resources and how to submit and monitor jobs in a high performance computing (HPC) environment. It is the topic of our "Intro to Linux & Lewis/Clark Cluster" training. Because basic Linux is a prerequisite for using the clusters, the first half of the training focuses on basic Linux commands and the second half focuses on applying them on the clusters.

To be notified about our upcoming trainings, please subscribe to our Announcement List.

Older Trainings:

  • "Intro to Basic Linux 2020 PDF"
  • "Intro to Basic Linux 2020 Video"
  • "Intro to Lewis and Clark Clusters 2020 PDF"
  • "Intro to Lewis and Clark Clusters 2020 Video"

Recommended External Trainings:

  • Software Carpentry workshop, "The Unix Shell"


RSS Clusters

Clark cluster

  • A small-scale cluster for teaching and learning
  • No registration needed; available to all MU members with their username and PawPrint
  • No cost for MU members

Lewis cluster

  • A large-scale cluster for requesting large amounts of resources
  • Great for parallel programming
  • GPU resources
  • No cost for MU members for general usage
  • An investment option is available to receive more resources (higher fairshare)

Partitions

Please see Partition Policy for more information.

The Clark and Lewis clusters are accessible via SSH:

  • Clark: ssh <username>@clark.rnet.missouri.edu
  • Lewis: ssh <username>@lewis.rnet.missouri.edu

See Getting Started to learn more about accessing RSS clusters.

Slurm

Slurm is an open source and highly scalable cluster management and job scheduling system for large and small Linux clusters. As a cluster workload manager, Slurm has three key functions.

  • First, it allocates access to resources (compute nodes) to users for some duration of time so they can perform work
  • Second, it provides a framework for starting, executing, and monitoring work (normally a parallel job) on the set of allocated nodes
  • Finally, it arbitrates contention for resources by managing a queue of pending work (from Slurm overview)

Moreover,

  • All RSS clusters use Slurm
  • All Slurm commands start with the letter "s" (a few common ones are sketched below)
  • Resource allocation depends on your fairshare, i.e., your priority in the queue
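As a quick orientation, the following is a minimal sketch of common "s" commands and how they map onto the three functions above; the partition name, script name, and job ID are placeholders:

sinfo -s                                  # see what resources exist and their current state
srun -p <partition-name> --pty /bin/bash  # allocate resources and work interactively
sbatch myjob.sh                           # start a batch job on allocated nodes
squeue -u $USER                           # watch your jobs move through the queue
sacct -X                                  # review your recent jobs
scancel <jobid>                           # cancel a pending or running job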

Login Node

All users connect to Clark and Lewis clusters through the login nodes.

[user@lewis4-r630-login-node675 ~]$
[user@clark-r630-login-node907 ~]$

All jobs must be run using Slurm submission tools to prevent them from running on the Lewis login node. Jobs found running on the login node will be terminated immediately, followed by a notification email to the user.


Cluster Information

Slurm is a resource management system and provides many tools for finding available resources in the cluster. The following is a set of Slurm commands for that purpose:

sinfo -s # summary of cluster resources (-s --summarize)
sinfo -p <partition-name> -o %n,%C,%m,%z # compute info of nodes in a partition (-o --format)
sinfo -p Gpu -o %n,%C,%m,%G # GPUs information in Gpu partition (-p --partition)
sjstat -c # show computing resources per node
scontrol show partition <partition-name> # partition information
scontrol show node <node-name> # node information
sacctmgr show qos format=name,maxwall,maxsubmit # show quality of services
./ncpu.py # show number of available CPUs and GPUs per node

For example, the following shows output for the sinfo -s and sjstat -c commands:

[user@clark-r630-login-node907 ~]$ sinfo -s 
PARTITION AVAIL  TIMELIMIT   NODES(A/I/O/T)  NODELIST
r630-hpc3    up 2-00:00:00          0/4/0/4  clark-r630-hpc3-node[908-911]
hpc3         up 2-00:00:00          0/4/0/4  clark-r630-hpc3-node[908-911]
General*     up    2:00:00          0/4/0/4  clark-r630-hpc3-node[908-911]

[user@clark-r630-login-node907 ~]$ sjstat -c
Scheduling pool data:
-------------------------------------------------------------
Pool        Memory  Cpus  Total Usable   Free  Other Traits 
-------------------------------------------------------------
r630-hpc3  122534Mb    24      4      4      4 
hpc3       122534Mb    24      4      4      4 
General*   122534Mb    24      4      4      4 

For instance, the sinfo -s output above shows that partition hpc3 has 4 idle (free) nodes and that users can run jobs on it for up to 2 days. The sjstat -c output shows that partition hpc3 has 4 nodes, each with 24 CPUs and about 122 GB of memory. In general, CPUS/NODES(A/I/O/T) counts CPUs/nodes in the form "allocated/idle/other/total" and S:C:T counts "sockets, cores, threads".

ncpu.py is a program developed by the RSS team that shows CPU and GPU availability in each partition. ncpu.py can be found at /group/training/hpc-intro/alias/.
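For example, assuming the script is executable from that shared location (otherwise it can be run with python3), it can be run in place or copied into your own space first:

/group/training/hpc-intro/alias/ncpu.py
# or copy it and run your own copy
cp /group/training/hpc-intro/alias/ncpu.py ~/ && ~/ncpu.py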

Users Information

Users can use Slurm to find more information about their accounts, fairshare, and quality of service (QOS), and several Unix commands to find their storage quotas:

sshare -U # show your fairshare and accounts (-U --Users)
sacctmgr show assoc user=$USER format=acc,user,share,qos,maxj # your QOS 
groups # show your groups
df -h /home/$USER # home storage quota (-h --human-readable)
lfs quota -hg $USER /storage/hpc # data/scratch storage quota (-g user/group)
lfs quota -hg <group-name> /storage/hpc # data/scratch group storage quota
./userq.py # show user’s fairshare, accounts, groups and QOS

Note that:

  • Resource allocation depends on your fairshare. If your fairshare is 0.000000, you have used the cluster more than your fair share and will be de-prioritized by the queuing software
  • Users have 5 GB in their home directory /home/$USER and 100 GB at /storage/hpc/data/$USER
  • Do not use the home directory for running jobs, storing data, or virtual environments
  • Clark users have 100 GB of home storage; the above quota commands do not apply to Clark
  • The RSS team reserves the right to delete anything in /scratch and /local/scratch at any time for any reason
  • There are no backups of any storage. The RSS team is not responsible for data integrity or data loss. You are responsible for your own data and data backups
  • Please review the Storage Policy for more information

userq.py is a program developed by the RSS team that shows a user's fairshare, accounts, groups, and QOS. userq.py can be found at /group/training/hpc-intro/alias/.

Job Submission

All jobs must be run using srun or sbatch to prevent them from running on the Lewis login node. In general, users can request resources and run tasks interactively, or create a batch file and submit their jobs. The following figure shows the job submission workflow:

[Figure: job submission workflow]

To run jobs interactively, we can use srun to request the required resources through Slurm and then work in an interactive session. For instance:

srun <slurm-options> <software-name/path>
srun --pty /bin/bash # requesting a pseudo terminal of bash shell to run jobs interactively
srun -p hpc3 --pty /bin/bash # requesting a pseudo terminal of bash shell on hpc3 partition 
srun -p Interactive --qos interactive --pty /bin/bash # requesting a p.t. of bash shell in Interactive Node on Lewis (-p --partition)
srun -p <partition-name> -n 4 --mem 16G --pty /bin/bash # req. 4 tasks and 16G memory (-n --ntasks)
srun -p Gpu --gres gpu:1 -N 1 --ntasks-per-node 8 --pty /bin/bash # req. 1 GPU and 1 node for running 8 tasks on Gpu partition (-N --nodes)
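Once a GPU session starts, a quick sanity check is to list the GPUs visible to the session (this assumes the NVIDIA tools are available on the Gpu nodes):

nvidia-smi   # inside the interactive GPU session, lists the GPU(s) allocated to you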

To submit jobs, we can create a batch file, which is a shell script (#!/bin/bash) containing Slurm options (#SBATCH) and the computational tasks, and use:

sbatch <batch-file> 

After job completion, the output is written to a file such as slurm-<jobid>.out.

Slurm has several options that help users specify their job requirements, such as the following; a combined example appears below:

-p --partition <partition-name>       --pty <software-name/path>
--mem <memory>                        --gres <general-resources>
-n --ntasks <number of tasks>         -t --time <days-hours:minutes>
-N --nodes <number-of-nodes>          -A --account <account>
-c --cpus-per-task <number-of-cpus>   -L --licenses <license>
-w --nodelist <list-of-node-names>    -J --job-name <jobname>
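For example, a minimal sketch of a batch-script header combining several of these options; the partition name and all values are placeholders to adjust for your own job:

#!/bin/bash
#SBATCH -p <partition-name>   # partition to run on
#SBATCH -J mytest             # job name
#SBATCH -N 1                  # number of nodes
#SBATCH -n 4                  # number of tasks
#SBATCH --mem 16G             # memory per node
#SBATCH -t 0-02:00            # time limit in days-hours:minutes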

Also, Slurm has several environment variables that contain details such as the job ID, job name, host name, and more. For instance:

$SLURM_JOB_ID
$SLURM_JOB_NAME
$SLURM_JOB_NODELIST
$SLURM_CPUS_ON_NODE
$SLURM_SUBMIT_HOST
$SLURM_SUBMIT_DIR

For example, let's consider the following Python code, called test.py:

#!/usr/bin/python3
import os

os.system("""
echo hostname: $(hostname)
echo number of processors: $(nproc)
echo date: $(date)
echo job id: $SLURM_JOB_ID
echo submit dir: $SLURM_SUBMIT_DIR
""")

print("Hello world”)

To run the above code interactively, we can use srun:

srun -p Interactive --qos interactive -n 4 --mem 8G --pty bash # in Lewis
# srun -p hpc3 -n 4 --mem 8G --pty bash # in Clark

python3 test.py

Or we can create a batch file, called jobpy.sh:

#!/bin/bash

#SBATCH -p hpc3
#SBATCH -n 4
#SBATCH --mem 8G

python3 test.py 

And use sbatch to submit the batch file:

sbatch jobpy.sh
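After submitting, a quick way to follow the job and read its result once it finishes (the job ID is a placeholder):

squeue -u $USER         # check whether the job is pending (PD) or running (R)
sacct -X                # list your recent jobs and their final states
cat slurm-<jobid>.out   # view the output after the job completes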

Job Arrays

If you plan to submit a large number of jobs at the same time in parallel, Slurm job arrays can be very helpful. Note that in most parallel programming methods we submit a single script, but in such a way that each iteration of the same code returns different results.

Job arrays use the following environment variables; a minimal sketch using them follows the list:

  • SLURM_ARRAY_JOB_ID keeps the first job ID of the array
  • SLURM_ARRAY_TASK_ID keeps the job array index value
  • SLURM_ARRAY_TASK_COUNT keeps the number of tasks in the job array
  • SLURM_ARRAY_TASK_MAX keeps the highest job array index value
  • SLURM_ARRAY_TASK_MIN keeps the lowest job array index value
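For instance, a minimal sketch of an array batch file that simply echoes these variables; the partition name is a placeholder:

#!/bin/bash
#SBATCH -p <partition-name>
#SBATCH --array 1-4

echo "array job id: $SLURM_ARRAY_JOB_ID"
echo "task id:      $SLURM_ARRAY_TASK_ID"     # differs for each task: 1, 2, 3, 4
echo "task count:   $SLURM_ARRAY_TASK_COUNT"

Each index in --array becomes an independent job with its own SLURM_ARRAY_TASK_ID, which is how the examples below make the same script produce different results.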

Example 1 - Python

The following shows a simple example of using a Slurm job array and Python to submit jobs in parallel. Each of these jobs gets a separate memory and CPU allocation so the tasks run in parallel. Let's create the following Python script, called array-test.py:

#!/usr/bin/python3

import os

## Computational function
def comp_func(i):
    host = os.popen("hostname").read()[:-1] # find host's name
    ps = os.popen("cat /proc/self/stat | awk '{print $39}'").read()[:-1] # find cpu's id
    return "Task ID: %s, Hostname: %s, CPU ID: %s" % (i, host, ps)

## Iterate to run the computational function
task_id = [int(os.getenv('SLURM_ARRAY_TASK_ID'))]
for t in task_id:
    print(comp_func(t))

The following is a batch file called array-job.sh:

#!/usr/bin/bash

#SBATCH --partition hpc3
#SBATCH --job-name array
#SBATCH --array 1-9
#SBATCH --output output-%A_%a.out

python3 ./array-test.py

You can submit the batch file by:

sbatch array-job.sh

The output is:

[user@clark-r630-login-node907 hpc-intro]$ cat output-*
Task ID: 1, Hostname: clark-r630-hpc3-node908, CPU ID: 0
Task ID: 2, Hostname: clark-r630-hpc3-node908, CPU ID: 2
Task ID: 3, Hostname: clark-r630-hpc3-node908, CPU ID: 4
Task ID: 4, Hostname: clark-r630-hpc3-node908, CPU ID: 6
Task ID: 5, Hostname: clark-r630-hpc3-node908, CPU ID: 8
Task ID: 6, Hostname: clark-r630-hpc3-node908, CPU ID: 10
Task ID: 7, Hostname: clark-r630-hpc3-node908, CPU ID: 12
Task ID: 8, Hostname: clark-r630-hpc3-node908, CPU ID: 14
Task ID: 9, Hostname: clark-r630-hpc3-node908, CPU ID: 16

Example 2 - R

Let's use job arrays to rerun the following R script (array-test.R) simultaneously with different seeds (starting point for random numbers):

#!/usr/bin/env Rscript

system("echo Date: $(date)", intern = TRUE)

args = commandArgs(trailingOnly = TRUE)
myseed = as.numeric(args[1])

set.seed(myseed)
print(runif(3))

The following is a batch file called array-job-r.sh:

#!/usr/bin/bash

#SBATCH --partition Lewis
#SBATCH --job-name r-seed
#SBATCH --array 1-12%4 # %4 will limit the number of simultaneously running tasks from this job array to 4
#SBATCH --output Rout-%A_%a.out

module load r
Rscript array-test.R ${SLURM_ARRAY_TASK_ID}

You can submit the batch file by:

sbatch array-job-r.sh

The output is:

[user@lewis4-r630-login-node675 test-array]$ cat Rout-*
[1] "Date: Tue Feb 16 18:21:15 CST 2021"
[1] 0.5074782 0.3067685 0.4269077

[1] "Date: Tue Feb 16 18:21:15 CST 2021"
[1] 0.2772497942 0.0005183129 0.5106083730

[1] "Date: Tue Feb 16 18:21:15 CST 2021"
[1] 0.06936092 0.81777520 0.94262173

[1] "Date: Tue Feb 16 18:21:15 CST 2021"
[1] 0.7103224 0.2461373 0.3896344

Monitoring Jobs

The following Slurm commands can be used to monitor jobs:

sacct -X # show your jobs in the last 24 hours (-X --allocations)
sacct -X -S <yyyy-mm-dd> # show your jobs since a date (-S --starttime)
sacct -X -S <yyyy-mm-dd> -E <yyyy-mm-dd> -s <R/PD/F/CA/CG/CD> # show running/pending/failed/cancelled/completing/completed jobs in a period of time (-s --state)
sacct -j <jobid> # show more details on selected jobs (-j --jobs)
squeue -u <username> # show a user's jobs (R/PD/CD) in the queue (-u --user)
squeue -u <username> --start # show estimation time to start pending jobs
scancel <jobid> # cancel jobs
./jobstat.py <day/week/month/year> # show info about running, pending and completed jobs of a user within a time period (default is week)

jobstat.py is a program developed by the RSS team that shows information about a user's running, pending, and completed jobs within a time period (the default is one week). jobstat.py can be found at /group/training/hpc-intro/alias/.

Monitor CPU and Memory

Completed jobs

We can use the following commands to find the efficiency of completed jobs:

sacct -j <jobid> -o User,Acc,AllocCPUS,Elaps,CPUTime,TotalCPU,AveDiskRead,AveDiskWrite,ReqMem,MaxRSS # info about CPU and virtual memory for completed jobs (-j --jobs)
seff <jobid> # show job CPU and memory efficiency 

For more information on which fields to include, see man sacct.

Example output:

[user@lewis4-r630-login-node675 ~]$ sacct -j 10785018 -o User,Acc,AllocCPUS,Elaps,CPUTime,TotalCPU,AveDiskRead,AveDiskWrite,ReqMem,MaxRSS
     User    Account  AllocCPUS    Elapsed    CPUTime   TotalCPU    AveDiskRead   AveDiskWrite     ReqMem     MaxRSS 
--------- ---------- ---------- ---------- ---------- ---------- -------------- -------------- ---------- ---------- 
     user    general         16   00:48:39   12:58:24  01:49.774         66.58M         44.75M       64Gn       216K 

[user@lewis4-r630-login-node675 ~]$ seff 10785018
Job ID: 10785018
Cluster: lewis4
User/Group: user/rcss
State: COMPLETED (exit code 0)
Nodes: 1
Cores per node: 16
CPU Utilized: 00:01:50
CPU Efficiency: 0.24% of 12:58:24 core-walltime
Memory Utilized: 3.38 MB (estimated maximum)
Memory Efficiency: 0.01% of 64.00 GB (64.00 GB/node)

Running jobs

Slurm also provides the sstat tool to monitor the efficiency of running jobs. For this to work, jobs submitted with sbatch must launch their tasks with the srun command inside the job file.

sstat <jobid> -o AveCPU,AveDiskRead,AveDiskWrite,MaxRSS # info about CPU and memory for running jobs (srun only)
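For example, a minimal sketch of a batch file whose task is launched through srun, so that a job step exists for sstat to query while it runs; the partition and script names are placeholders:

#!/bin/bash
#SBATCH -p <partition-name>
#SBATCH -n 4
#SBATCH --mem 8G

# wrapping the task in srun creates a job step that sstat can report on
srun python3 test.py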

Moreover, we can use the top command to see how much CPU and memory a running job is using. To do that, we attach to the node where our job is running and run top:

srun --jobid <jobid> --pty /bin/bash 

top -u $USER

For memory usage, the number of interest is RES. In the example below, each python3 process is using about 5.6 MB of memory and 0% of the requested CPUs.

  PID USER     PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
14278 rcss     20   0  124924   5612   2600 S   0.0  0.0   0:00.04 python3
14279 rcss     20   0  124924   5612   2600 S   0.0  0.0   0:00.03 python3

Modules

To use a software package, we need to load the corresponding module first. The following commands let us manage modules in our workflow:

module avail # list available modules
module show <module-name> # show module information
module list # list loaded modules
module load <module-name> # load a module
module unload <module-name> # unload a loaded module
module purge # unload all loaded modules

Never load modules on the login node. Doing so slows the login node for all users, and many modules do not work on the login node.

For example, to use R interactively, we first need to request resources with srun and then use module load R:

srun -p Interactive --qos interactive --mem 4G --pty /bin/bash
# srun -p hpc3 --mem 4G --pty bash # in Clark

module load R
R

If you are using licensed software (available on the cluster), make sure you request the license when requesting resources. For instance, to use MATLAB:

srun -p Interactive --qos interactive --mem 4G -L matlab --pty /bin/bash

module load matlab
matlab -nodisplay

Review our Software documentation for more details about running software on our clusters, and the Environment Modules User Guide for more details about modules.

What Is Next

More Resources