SLURM Overview

Slurm is a cluster management and job scheduling system, and all RSS clusters use it. This document gives an overview of how to run jobs, check job status, and make changes to submitted jobs. To learn more about specific flags or commands, please visit Slurm's website.

RSS offers a training session about Slurm. Please check our Training to learn more.

All jobs must be run using srun or sbatch to prevent them from running on the Lewis login node. Jobs found running on the login node will be terminated immediately, and a notification email will be sent to the user.

Interactive SLURM job

Interactive jobs typically run for only a few minutes. This is a basic example of an interactive job using srun with -n to request one CPU:

srun -n 1 hostname

An example of the command with the output:

[jgotberg@lewis4-r710-login-node223 ~]$ srun -n 1 hostname
lewis4-s2600tpr-hpc4-node249

We can also use srun to call programs interactively; for example, to start an interactive R session:

srun -n 1 --mem 4G --pty /bin/bash
module load R
R

To run an interactive program with a GUI, use X11. Verify that X11 forwarding is enabled or that your ssh session was started with the -Y flag.
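A minimal sketch of such a session, assuming your installation supports srun's --x11 option (older Slurm versions may provide X11 forwarding through a plugin instead); the login hostname is a placeholder:

ssh -Y username@<lewis login host>    # connect with X11 forwarding enabled
srun -n 1 --mem 4G --x11 --pty /bin/bash    # request X11 forwarding inside the job
module load R
R    # GUI windows, such as R plot windows, display on your local screen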

To learn more about module load, please read the Environment Modules User Guide.
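For quick reference, the most common module commands are shown below; module names vary by cluster:

module avail       # list the software modules available on the cluster
module load R      # add R to your environment
module list        # show the modules currently loaded
module unload R    # remove R from your environment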

Batch SLURM job

Batch jobs can run multiple tasks across multiple nodes and typically take a few hours to a few days to complete. Most of the time you will use an SBATCH file to launch your jobs. This example shows how to put our SLURM options in the file saving_the_world.sh and then submit the job to the queue. To learn more about the partitions available for use on Lewis and the specifics of each partition, please read our Partition Policy.

Listing of saving_the_world.sh:

#!/bin/bash

#SBATCH -p Lewis  # use the Lewis partition
#SBATCH -J saving_the_world  # give the job a custom name
#SBATCH -o results-%j.out  # give the job output a custom name
#SBATCH -t 0-02:00  # two hour time limit

#SBATCH -N 2  # number of nodes
#SBATCH -n 2  # number of cores (AKA tasks)

# Commands without srun run only once, on the first node of the allocation
echo "$(hostname), reporting for duty."

# Commands launched with srun run once per task in the allocation
srun echo "Let's save the world!"
srun hostname

Once the SBATCH file is ready to go, start the job with:

sbatch saving_the_world.sh

Output is found in the file results-<job id here>.out. Example below:

[jgotberg@lewis4-r710-login-node223 training]$ sbatch saving_the_world.sh
Submitted batch job 844
[jgotberg@lewis4-r710-login-node223 training]$ cat results-844.out
lewis4-s2600tpr-hpc4-node248, reporting for duty.
Let's save the world!
Let's save the world!
lewis4-s2600tpr-hpc4-node248
lewis4-s2600tpr-hpc4-node249
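
Note that the plain echo command ran only once, on the first node of the allocation, while each srun command produced output from both of the tasks requested with -n 2.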

Job Status

Using sacct to check the status of your jobs is good practice. The sacct command has many options for the information it can provide. To display the full list of options and their descriptions, view the manual page:

man sacct

To check the status of your jobs, use:

sacct

For instance, to find jobs that were submitted during February 2016:

sacct -S 2016-02-01 -E 2016-03-01

And to see custom fields applied to the February 2016 report:

sacct -S 2016-02-01 -E 2016-03-01 -o JobID,JobName,AllocCPUS,Partition,NNodes

Output:

       JobID    JobName  AllocCPUS  Partition   NNodes
------------ ---------- ---------- ---------- --------
188716             free          1    HighMem        1
188718             free          1    HighMem        1
188720             free          1    general        1

To see expanded values for a specific job ID with custom output formatting:

sacct -j 195939 -o User,Acc,AllocCPUS,Elaps,CPUTime,TotalCPU,AveDiskRead,AveDiskWrite,ReqMem,MaxRSS 

Output:

     User    Account  AllocCPUS    Elapsed    CPUTime   TotalCPU    AveDiskRead   AveDiskWrite     ReqMem     MaxRSS 
--------- ---------- ---------- ---------- ---------- ---------- -------------- -------------- ---------- ---------- 
    rcss    general         16   00:48:39   12:58:24  01:49.774         66.58M         44.75M       64Gn       216K 

Run sacct --helpformat (or sacct -e) to see all available fields. Available fields to choose from:

AllocCPUS         AllocGRES         AllocNodes        AllocTRES
Account           AssocID           AveCPU            AveCPUFreq
AveDiskRead       AveDiskWrite      AvePages          AveRSS
AveVMSize         BlockID           Cluster           Comment
ConsumedEnergy    ConsumedEnergyRaw CPUTime           CPUTimeRAW
DerivedExitCode   Elapsed           Eligible          End
ExitCode          GID               Group             JobID
JobIDRaw          JobName           Layout            MaxDiskRead
MaxDiskReadNode   MaxDiskReadTask   MaxDiskWrite      MaxDiskWriteNode
MaxDiskWriteTask  MaxPages          MaxPagesNode      MaxPagesTask
MaxRSS            MaxRSSNode        MaxRSSTask        MaxVMSize
MaxVMSizeNode     MaxVMSizeTask     MinCPU            MinCPUNode
MinCPUTask        NCPUS             NNodes            NodeList
NTasks            Priority          Partition         QOS
QOSRAW            ReqCPUFreq        ReqCPUFreqMin     ReqCPUFreqMax
ReqCPUFreqGov     ReqCPUS           ReqGRES           ReqMem
ReqNodes          ReqTRES           Reservation       ReservationId
Reserved          ResvCPU           ResvCPURAW        Start
State             Submit            Suspended         SystemCPU
Timelimit         TotalCPU          TRESAlloc         TRESReq
UID               User              UserCPU           WCKey
WCKeyID
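
For example, to check whether the batch job from earlier finished successfully and how long it ran, combine a few of these fields (844 is the job ID from the example above):

sacct -j 844 -o JobID,JobName,State,ExitCode,Elapsed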

Monitoring a running job

To view information about CPU usage, tasks, nodes, resident set size, and virtual memory for a running job, use the sstat command. The information sstat provides can help you tune future jobs that are similar to the current one.

In order for this to work, jobs submitted with SBATCH files must launch their commands with srun inside the job file.

#!/bin/bash
#SBATCH -n 1
#SBATCH -J monitor

echo $HOSTNAME
# Launch the main work with srun so sstat can report statistics for this step
srun dd if=/dev/urandom bs=4k count=40k | md5sum

Example output:

[user@lewis3]$ sbatch ./monitor.sh
Submitted batch job 214827
[user@lewis3]$ sstat 214827
   JobID  MaxVMSize  MaxVMSizeNode  MaxVMSizeTask  AveVMSize     MaxRSS ...
   ------------ ---------- -------------- -------------- ---------- ---------- ...
   214827.0        205104K        c13b-14              0    107940K       848K ...
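
Like sacct, sstat accepts -j and -o to report only the fields you care about, for example:

sstat -j 214827 -o JobID,AveCPU,AveRSS,MaxRSS,MaxVMSize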

Running Job Length

To view your job's status in the queue, whether it has started running, and how long it has been running, use the squeue command. Running squeue by itself provides a list of all jobs submitted to the queue.
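
An illustrative sketch of the default squeue columns, using placeholder values based on the earlier examples:

[jgotberg@lewis4-r710-login-node223 ~]$ squeue -u $USER
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
               845     Lewis saving_t jgotberg PD       0:00      2 (Priority)
               844     Lewis saving_t jgotberg  R       5:23      2 lewis4-s2600tpr-hpc4-node[248-249]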

Common reasons shown in the NODELIST(REASON) column:

Reason                                  Meaning
------                                  -------
Priority                                Job is waiting to run after higher priority jobs based on fairshare
lewis4-hardware-partition-nodenumber    Job is running on that particular node
QOSGrpNodeLimit                         Job's QOS has reached its node limit
PartitionTimeLimit                      Requested time is greater than time allowed
Resources                               Job is waiting for appropriate compute nodes to become available

Example:

squeue -u your_username_here

If your job needs to run more than two days, please contact us to be added to the long QOS. We will set up a consultation to review your job and verify that no further tuning can be done.
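
Once your account has been added, a minimal sketch of requesting the long QOS in an SBATCH file (the QOS name and the four-day limit below are examples; the actual values follow our Partition Policy):

#SBATCH -p Lewis
#SBATCH --qos long  # QOS name assumed here; use the one assigned to you
#SBATCH -t 4-00:00  # example four-day time limit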