Workload Manager (SLURM)

A workload manager is the piece of software that transforms a set of networked computers into an HPC cluster. The workload manager has two main responsibilities: resource management and job scheduling. In the case of SLURM, both tasks are handled by the same piece of software. It is the workload manager that makes an HPC cluster look like a supercomputer rather than a set of independent machines in a datacenter.

The resource manager has several subtasks. On one side, it keeps track of the resources present in the cluster and their availability at any given point in time. Individual computers can be added to or removed from the pool of resources, and their load is recorded periodically to determine whether more jobs can be executed on them. It also keeps records for accounting and profiling purposes.

The scheduler side of a workload manager takes care of the jobs submitted to the cluster. It processes the list of resources requested by each job and prioritizes execution according to criteria and constraints imposed on the job or the current state of the cluster.

In this section we cover in more detail the commands, variables, and directives used by SLURM to help users submit, monitor, and control jobs on the cluster. The configuration and administration of SLURM are out of scope for this section.

Understanding Partitions

The compute nodes of an HPC cluster are logically segmented into partitions. A partition is simply a list of compute nodes on which jobs submitted to it can execute. A compute node can belong to several partitions. A partition also includes rules that must be satisfied before a job is admitted, rules that declare how jobs will run on the cluster, and conditions on when and how jobs can start execution.

All jobs, whether batch or interactive, are always submitted to some partition. To see the list of partitions on the cluster, execute:

$ sinfo -s
PARTITION       AVAIL  TIMELIMIT   NODES(A/I/O/T) NODELIST
standby*           up    4:00:00      88/77/2/167 taicm[001-009],tarcl100,
                                              tarcs[100,200-206,300-304],tbdcx001,
                                              tbmcs[001-011,100-103],tbpcm200,tbpcs001,
                                              tcbcx100,tcdcx100,tcgcx300,tcocm[100-104],
                                              tcocs[001-064,100],tcocx[001-003],tcscm300,
                                              tjscl100,tjscm001,tmmcm[100-108],tngcm200,
                                              tpmcm[001-006],tsacs001,tsdcl[001-002],
                                              tsscl[001-002],ttmcm[100-101],tzecl[100-107],
                                              tzecs[100-115]
comm_small_day     up 1-00:00:00        64/0/1/65 tcocs[001-064,100]
comm_small_week    up 7-00:00:00        64/0/1/65 tcocs[001-064,100]
comm_med_day       up 1-00:00:00          4/1/0/5 tcocm[100-104]
comm_med_week      up 7-00:00:00          4/1/0/5 tcocm[100-104]
comm_xl_week       up 7-00:00:00          0/3/0/3 tcocx[001-003]
comm_gpu_inter     up    4:00:00         5/6/0/11 tbegq[200-202],tbmgq[001,100],tcogq[001-006]
comm_gpu_week      up 7-00:00:00          1/5/0/6 tcogq[001-006]
aei0001            up   infinite          2/6/1/9 taicm[001-009]
alromero           up   infinite        14/0/0/14 tarcl100,tarcs[100,200-206,300-304]
be_gpu             up   infinite          2/1/0/3 tbegq[200-202]
bvpopp             up   infinite          0/1/0/1 tbpcs001
cedumitrescu       up   infinite          0/1/0/1 tcdcx100
cfb0001            up   infinite          0/1/0/1 tcbcx100
cgriffin           up   infinite          1/0/0/1 tcgcx300
chemdept           up   infinite          0/4/0/4 tbmcs[100-103]
chemdept-gpu       up   infinite          1/0/0/1 tbmgq100
cs00048            up   infinite          0/1/0/1 tcscm300
jaspeir            up   infinite          0/2/0/2 tjscl100,tjscm001
jbmertz            up   infinite        3/14/0/17 tbmcs[001-011,100-103],tbmgq[001,100]
mamclaughlin       up   infinite          0/9/0/9 tmmcm[100-108]
ngarapat           up   infinite          0/1/0/1 tngcm200
pmm0026            up   infinite          0/6/0/6 tpmcm[001-006]
sbs0016            up   infinite          0/2/0/2 tsscl[001-002]
spdifazio          up   infinite          0/2/0/2 tsdcl[001-002]
tdmusho            up   infinite          0/2/0/2 ttmcm[100-101]
vyakkerman         up   infinite          0/1/0/1 tsacs001
zbetienne          up   infinite        1/23/0/24 tzecl[100-107],tzecs[100-115]
zbetienne_large    up   infinite          0/8/0/8 tzecl[100-107]
zbetienne_small    up   infinite        1/15/0/16 tzecs[100-115]

The first column is the name of the partition. The star (*) after standby indicates that it is the default partition, i.e., the one selected if no partition is specified during job submission, either via command-line arguments or via directives in the submission script. All partitions whose names start with comm_ are community partitions, meaning that anyone with a user account can submit jobs to them. Faculty can purchase compute nodes, which receive their own partition with no limit on how long jobs can run there. Community partitions use a name scheme that makes it easy to identify their specifications.

Community partition name scheme

Word             Meaning
_small           Compute nodes associated with the partition offer 96 GB of RAM
_med             Compute nodes associated with the partition offer 192 GB of RAM
_xl              Compute nodes associated with the partition offer 768 GB of RAM
_gpu             Compute nodes associated with the partition include GPU cards
_day             The partition has a maximum walltime of 1 day
_week            The partition has a maximum walltime of 1 week
comm_gpu_inter   Partition for interactive jobs with GPUs; the walltime is 4 hours

The second column is the availability of the partition. At this point all partitions are enabled, which is indicated with the word up.

The third column is the maximum amount of time a job submitted to that partition can run, also known as the walltime. Jobs submitted without an explicit time limit receive the walltime of the partition. Jobs that declare a walltime larger than the maximum allowed by the partition are rejected immediately after submission.

The fourth column summarizes the condition of the nodes associated with the partition. The format (A/I/O/T) stands for (allocated/idle/other/total). Allocated nodes are executing one or more jobs. Idle nodes are currently inactive but able to execute jobs. Other includes nodes that are not allowed to execute jobs, due to maintenance or some other condition. The sum of these three states is the total number of compute nodes associated with the partition.

The fifth column is the nodelist. This is a compacted listing of all the machines associated with each partition. The compact form is particularly useful on large HPC clusters with hundreds of compute nodes. Notice, for example, that the compute nodes tcocs[001-064,100] appear associated with standby, comm_small_day, and comm_small_week.
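If you ever need the expanded form of one of these compact lists, scontrol can do the conversion. A quick sketch (the list is quoted so the shell does not interpret the brackets):

trcis001:~$ scontrol show hostnames "tcocs[001-064,100]" | head -3
tcocs001
tcocs002
tcocs003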

Sockets, CPU cores, and Hyperthreading

On a desktop computer or laptop you will find a single processor, also called the Central Processing Unit (CPU). The CPU is the main chip responsible for most of the computation taking place on the machine. Unlike desktops and laptops, HPC compute nodes often contain two or even four CPU chips. Each CPU sits in what is called a socket; a dual-socket node is therefore a node with two CPU chips. Those CPUs are in general identical, and the operating system distributes the workload among them.

Modern CPUs are made of multiple cores. A CPU core is a completely functional processing unit, and several CPU cores are printed on a single chip. We call these CPUs multicore, and almost all CPUs today are multicore.

Some CPUs are capable of "logically dividing" each CPU core into two hardware threads, a technology called hyperthreading. Hardware threads are designed to hide memory latencies and feed the compute units fast enough to keep them busy all the time. Hyperthreading can be activated or deactivated depending on the cluster and its workload. Depending on the code running on the node, hyperthreading can benefit or harm performance.
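A quick way to check whether hyperthreading is active on the node where your code runs is to look at the output of lscpu (the numbers below are only illustrative):

$ lscpu | grep -E '^(Socket|Core|Thread)'
Thread(s) per core:    2
Core(s) per socket:    20
Socket(s):             2

If "Thread(s) per core" is greater than 1, hyperthreading is enabled on that node.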

Submitting batch jobs

A batch job is a job that has no expectation of running immediately and will not be operated interactively once it starts running. This is the kind of job an HPC cluster is primarily built for. You write in a text file the list of resources you need for the job and the list of steps to execute, in the form of a script, and the job is put into execution when resources become available. The text file is called a submission script and it plays two roles. On one side, it contains the script that will be executed on the compute node associated with the job. On the other side, it contains a set of lines starting with #SBATCH. Because those lines start with #, they are ignored by the shell when the script runs, but they are important for SLURM, which interprets them to compile the list of requirements and configurations associated with the job. Note that lines starting with #SBATCH do not expand shell or environment variables. These lines contain resource requests such as the number of compute nodes, the number of CPU cores, the memory requested, and the partition. They can also set the name of the job, specify when emails should be sent (when the job starts, ends, or fails), and declare where the output of the script will go. They can also include other configurations used before, during, and after the job enters execution.

Our first example is very simple. Consider a submission script for a job called PI. The job computes the value of pi using the arbitrary precision calculator bc. The command to be executed is:

echo $(echo "scale=65; 4*a(1)" | bc -l)

This is a simple execution that takes a fraction of a second on any modern computer. However, our purpose here is to use it to demonstrate how to submit a job that will be executed on a compute node. In practical cases the execution will require several hours or even days and need multiple CPU cores or multiple compute nodes. The submission script could be written like this:

#!/bin/bash

#SBATCH -J PI
#SBATCH -N 1
#SBATCH -c 1
#SBATCH -n 1
#SBATCH -p standby
#SBATCH -t 4:00:00

echo "The first 65 digits of PI are:"
echo $(echo "scale=65; 4*a(1)" | bc -l)

echo ""
echo "Job ID:              $SLURM_JOB_ID"
echo "Job Name:            $SLURM_JOB_NAME"
echo "Number of Nodes:     $SLURM_JOB_NUM_NODES"
echo "Number of CPU cores: $SLURM_CPUS_ON_NODE"
echo "Number of Tasks:     $SLURM_NTASKS"
echo "Partition:           $SLURM_JOB_PARTITION"

Assuming this text is saved in a file called runjob.pbs, submit the job using the command:

trcis001:~$ sbatch runjob.pbs

The job will most likely execute after a few seconds, and a file with a name such as slurm-<jobid>.out is created. For example, if the job ID were 122014, the output produced by the submission script would contain:

trcis001:~$ cat slurm-122014.out
The first 65 digits of PI are:
3.14159265358979323846264338327950288419716939937510582097494459228

Job ID:              122014
Job Name:            PI
Number of Nodes:     1
Number of CPU cores: 1
Number of Tasks:     1
Partition:           standby

Let us now examine the lines present in this first submission script. The first line is called a shebang. It indicates which interpreter will be used for the lines of the script. In this case we are saying that bash, a common shell interpreter, must be used.

The next 6 lines all start with #SBATCH:

#SBATCH -J PI
#SBATCH -N 1
#SBATCH -c 1
#SBATCH -n 1
#SBATCH -p standby
#SBATCH -t 4:00:00

These lines set, in order, the name of the job (-J), the number of compute nodes (-N), the number of CPU cores per task (-c), the number of tasks (-n), the partition selected for the job (-p), and the time limit for the job (-t). Except for the job name, which has no default, all the values here correspond to the defaults. The job name is optional, and the other five lines could be removed from the submission script; the job would then simply assume the default values, which in this simple case are all one: one node, one task, one CPU per task, and one CPU per node. The concepts of nodes, tasks, and CPUs per task or node are covered below.
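The same directives can also be written with their long-form options, which many users find easier to read; the following block is equivalent to the one above:

#SBATCH --job-name=PI
#SBATCH --nodes=1
#SBATCH --cpus-per-task=1
#SBATCH --ntasks=1
#SBATCH --partition=standby
#SBATCH --time=4:00:00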

The next 2 lines are the actual work that we want to execute on a compute node. It could be a complex numerical simulation, an optimization problem, or a genomic alignment; any computationally demanding operation goes here.

The final 6 lines demonstrate the use of some environment variables that are created when the job starts running on the compute node. Here we simply reveal the content of those variables and write them along with the output of the script. These variables can be used inside the script to adapt the execution to their values. More SLURM environment variables are described below.

The output is everything that the script, or the programs called by the script, would normally print to the screen. If the script were executed directly, the standard output would be the terminal window. In the case of a batch job, the output is redirected to a file. The normal output, also called standard output, is written by default to a file named slurm-<jobid>.out. Error messages, known as standard error, go by default to that same file, although they can be redirected to a separate file. In our case there are no errors, so only the normal output appears.
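If you want to control where the output goes, the files can be named explicitly with the -o/--output and -e/--error directives. In the filename patterns, %j is replaced by the job ID and %x by the job name; a minimal sketch:

#SBATCH -o %x-%j.out
#SBATCH -e %x-%j.err

With these two lines, standard output and standard error are written to separate files named after the job.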

All #SBATCH directives are optional; many of them have default values, and others simply remain unset if not declared. Consider, for example, the same submission script with all the #SBATCH lines removed:

#!/bin/bash

echo "The first 65 digits of PI are:"
echo $(echo "scale=65; 4*a(1)" | bc -l)

echo ""
echo "Job ID:              $SLURM_JOB_ID"
echo "Job Name:            $SLURM_JOB_NAME"
echo "Number of Nodes:     $SLURM_JOB_NUM_NODES"
echo "Number of CPU cores: $SLURM_CPUS_ON_NODE"
echo "Number of Tasks:     $SLURM_NTASKS"
echo "Partition:           $SLURM_JOB_PARTITION"

Producing a similar result in the output file:

trcis001:~$ cat slurm-122439.out
The first 65 digits of PI are:
3.14159265358979323846264338327950288419716939937510582097494459228

Job ID:              122439
Job Name:            runjob2.slurm
Number of Nodes:     1
Number of CPU cores: 1
Number of Tasks:
Partition:           standby

Notice that if the job has no name, the name of the submission script becomes its name. A single node is used and a single CPU core is assigned to the job. Not declaring a number of tasks produces a job with no value set for that variable.

We will explore more complex submission scripts and the meaning of their options later, but first let us see how to monitor submitted jobs and how to cancel them.

Monitoring jobs

Let us consider a variation of the submission script where we ask for many compute nodes:

#!/bin/bash

#SBATCH -J PI
#SBATCH -N 80
#SBATCH -c 40
#SBATCH -n 80
#SBATCH -p standby
#SBATCH -t 4:00:00

echo "The first 65 digits of PI are:"
echo $(echo "scale=65; 4*a(1)" | bc -l)

echo ""
echo "Job ID:              $SLURM_JOB_ID"
echo "Job Name:            $SLURM_JOB_NAME"
echo "Number of Nodes:     $SLURM_JOB_NUM_NODES"
echo "Number of CPU cores: $SLURM_CPUS_ON_NODE"
echo "Number of Tasks:     $SLURM_NTASKS"
echo "Partition:           $SLURM_JOB_PARTITION"

Assuming this submission script was saved in a file called runjob_80n.slurm, submit the job with the command:

trcis001:~$ sbatch runjob_80n.slurm
Submitted batch job 122837

This time there is no output in the form of a slurm-<jobid>.out file. Check the status of the jobs using the command squeue. The command alone returns a listing of all jobs running or queued on the cluster. To restrict the listing to jobs submitted by you, use:

trcis001:~$ squeue -u $USER
                         JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
                        122837   standby       PI gufranco PD       0:00     80 (Resources)
                        122895   standby       PI gufranco PD       0:00     40 (Priority)

Notice that we have two jobs in the queue. The ST column gives the state of the job; the status PD means the job is pending execution, so in the case above both jobs are in the pending state. The most common state codes are the following:

Job State Codes

Status   Meaning        Description
R        Running        Job currently has an allocation
PD       PenDing        Job is awaiting resource allocation
TO       TimedOut       Job terminated upon reaching its time limit
PR       PReempted      Job terminated due to preemption
S        Suspended      Execution has been suspended and CPUs have been released for other jobs
CD       CompleteD      Job has terminated all processes on all nodes with an exit code of zero
CA       CAncelled      Job was explicitly cancelled by the user or system administrator
F        Failed         Job terminated with non-zero exit code or other failure condition
NF       Node Failure   Job terminated due to failure of one or more allocated nodes

There are a few other state codes that appear less often. If you see a state that is not listed above, check the manual page for squeue under JOB STATE CODES.

The columns are for the most part self-explanatory. The final column briefly shows the reason why the job is not yet running. One of the jobs cannot run due to a lack of Resources, i.e., there are no 80 compute nodes available at this point. The other job is pending due to Priority, i.e., resources are available but they will be assigned to another job with higher priority.

Several messages can appear in the reason column. A common set of reason codes with explanations follows.

Job Reason Codes

Reason                  Description
Resources               The scheduler is unable to find sufficient idle resources to run your job
Priority                There are jobs with higher priority ahead of this job in the queue
QOSMaxCpuPerUserLimit   The CPU request exceeds the maximum each user is allowed to use
Licenses                The job is waiting for a license

More reason codes can be found in the SLURM documentation on Resource Limits.

Canceling a job

Let us assume we want to cancel job 122837, which we submitted above. A job can be canceled with the command scancel followed by the job ID of the job you want to cancel:

trcis001:~$ squeue -u $USER
                         JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
                        122837   standby       PI gufranco PD       0:00     80 (Resources)
trcis001:~$ scancel 122837
trcis001:~$ squeue -u $USER
                         JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)

Since canceling a job is an irreversible action, it is suggested to use the interactive option, which lets you double-check not only the job ID but also the job name and partition in case the job ID was mistyped:

trcis001:~$ scancel -i 123807
Cancel job_id=123807 name=PI partition=standby [y/n]? y
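scancel also accepts filters, which is convenient when several jobs must be canceled at once. A few sketches (double-check before running them, since every matching job is canceled):

trcis001:~$ scancel -u $USER             # cancel all of your jobs
trcis001:~$ scancel -u $USER -p standby  # cancel only your jobs in the standby partition
trcis001:~$ scancel --name=PI            # cancel your jobs named PI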

Estimation of the starting time for a job

SLURM tries to schedule all jobs as quickly as possible, subject to a number of dynamic constraints. These constraints can be cluster policies, available hardware, allocation priorities (contributors to the cluster get higher-priority allocations), and so on. Typically, jobs submitted to a day queue start running within a day or so. Week queues may require more waiting time, since the jobs there run longer. All of this can vary at any given point in time on the cluster.

The command squeue has arguments that show the scheduler's estimate of when a pending job will start running. This is just the scheduler's best estimate given current conditions, which on a cluster are constantly changing. The actual start time might be earlier or later, depending on factors such as the behavior of currently running jobs, the submission of new jobs, and hardware issues.

To see this, request that squeue show the %S field in the output format option. One particularly useful set of arguments in this situation is:

trcis001:~$ squeue -o "%.9i %.9P %.8j %.8u %.2t %.10M %.6D %S" -u $USER
        JOBID PARTITION     NAME     USER ST       TIME  NODES START_TIME
   123809   standby       PI gufranco PD       0:00     40 2023-03-29T22:11:53
   123808   standby       PI gufranco PD       0:00     80 2023-03-30T15:10:06

It makes sense that requesting 80 nodes takes more time than requesting 40. It could happen that some running jobs finish before their walltime and the job starts earlier, or that newer jobs enter a partition with higher priority than standby and the job gets delayed. Use this estimate as guidance rather than a commitment.
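If you do not want to remember the format string, the --start option of squeue prints the same estimated start times for pending jobs:

trcis001:~$ squeue --start -u $USER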

Detailed information about jobs

The information provided by squeue is sometimes not enough, and you may want to gather a more complete picture of the state of a particular job. The command scontrol provides a wealth of information about jobs, as well as about partitions and nodes. To get information about a job:

trcis001:~$ scontrol show job 123809
JobId=123809 JobName=PI
   UserId=gufranco(318130) GroupId=its-rc-thorny(1079001) MCS_label=N/A
   Priority=10675 Nice=0 Account=its-rc-admin QOS=normal
   JobState=PENDING Reason=Resources Dependency=(null)
   Requeue=0 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:00:00 TimeLimit=04:00:00 TimeMin=N/A
   SubmitTime=2023-03-29T19:05:08 EligibleTime=2023-03-29T19:05:08
   AccrueTime=2023-03-29T19:05:08
   StartTime=2023-03-29T22:11:53 EndTime=2023-03-30T02:11:53 Deadline=N/A
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2023-03-29T19:18:33 Scheduler=Backfill:*
   Partition=standby AllocNode:Sid=trcis001:28116
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList= SchedNodeList=tcscm300,tjscl100,tjscm001,tmmcm[100-108],tngcm200,tpmcm[001-006],tsacs001,tsdcl[001-002],tsscl[001-002],ttmcm[100-101],tzecl[100-107],tzecs[100-105]
   NumNodes=40-40 NumCPUs=800 NumTasks=40 CPUs/Task=20 ReqB:S:C:T=0:0:*:*
   TRES=cpu=800,mem=7717080M,node=40,billing=800
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=20 MinMemoryNode=0 MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=/gpfs20/users/gufranco/runjob_40n.slurm
   WorkDir=/gpfs20/users/gufranco
   StdErr=/gpfs20/users/gufranco/slurm-123809.out
   StdIn=/dev/null
   StdOut=/gpfs20/users/gufranco/slurm-123809.out
   Power=

Information about a partition uses a similar command:

trcis001:~$ scontrol show partition standby
PartitionName=standby
   AllowGroups=ALL AllowAccounts=ALL AllowQos=ALL
   AllocNodes=ALL Default=YES QoS=N/A
   DefaultTime=NONE DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO
   MaxNodes=UNLIMITED MaxTime=04:00:00 MinNodes=0 LLN=NO MaxCPUsPerNode=UNLIMITED
   NodeSets=compute
   Nodes=taicm[001-009],tarcl100,tarcs[100,200-206,300-304],tbdcx001,tbmcs[001-011,100-103],
     tbpcm200,tbpcs001,tcbcx100,tcdcx100,tcgcx300,tcocm[100-104],tcocs[001-064,100],
     tcocx[001-003],tcscm300,tjscl100,tjscm001,tmmcm[100-108],tngcm200,tpmcm[001-006],
     tsacs001,tsdcl[001-002],tsscl[001-002],ttmcm[100-101],tzecl[100-107],tzecs[100-115]
   PriorityJobFactor=1 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=NO
   OverTimeLimit=NONE PreemptMode=CANCEL
   State=UP TotalCPUs=6140 TotalNodes=167 SelectTypeParameters=NONE
   JobDefaults=(null)
   DefMemPerNode=UNLIMITED MaxMemPerNode=UNLIMITED
   TRES=cpu=6140,mem=27963188M,node=167,billing=6140

We can also ask for information about a compute node:

trcis001:~$ scontrol show node tbdcx001
NodeName=tbdcx001 Arch=x86_64 CoresPerSocket=20
   CPUAlloc=40 CPUEfctv=40 CPUTot=40 CPULoad=42.22
   AvailableFeatures=xl,compute,bio
   ActiveFeatures=xl,compute,bio
   Gres=(null)
   NodeAddr=tbdcx001 NodeHostName=tbdcx001 Version=22.05.6
   OS=Linux 3.10.0-1160.24.1.el7.x86_64 #1 SMP Thu Mar 25 21:21:56 UTC 2021
   RealMemory=773491 AllocMem=0 FreeMem=646603 Sockets=2 Boards=1
   State=ALLOCATED ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
   Partitions=standby
   BootTime=2023-02-23T22:45:08 SlurmdStartTime=2023-02-24T08:35:36
   LastBusyTime=2023-03-29T15:58:11
   CfgTRES=cpu=40,mem=773491M,billing=40
   AllocTRES=cpu=40
   CapWatts=n/a
   CurrentWatts=0 AveWatts=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s

Controlling the time limit of a job

If no time limit is declared in the submission script or on the sbatch command line, a job receives the time limit associated with the partition to which it was submitted. Declaring a time limit helps the scheduler decide whether your job can start running ahead of others with larger time limits. This is particularly important if your job uses one of the _week partitions and only needs 2 or 3 days. In that case the _day partitions are not a good fit, but by declaring a 3-day time limit the job could enter execution sooner than other jobs requesting an entire week.
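For example, a job that needs roughly three days could be submitted to one of the _week partitions while declaring a 3-day limit, giving the scheduler more room to start it early:

#SBATCH -p comm_small_week
#SBATCH -t 3-00:00:00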

To specify your estimated runtime, use the --time=TIME or -t TIME parameter of sbatch. The value TIME can be given in any of the following formats:

Acceptable Time Formats

Format    Description                             Example
M         M minutes                               -t 45 (45 minutes)
M:S       M minutes, S seconds                    -t 2:30 (two minutes and 30 seconds)
H:M:S     H hours, M minutes, S seconds           -t 1:30:00 (one hour and a half)
D-H       D days, H hours                         -t 3-12 (three days and a half)
D-H:M     D days, H hours, M minutes              -t 1-12:30 (1 day, 12 hours, and 30 minutes)
D-H:M:S   D days, H hours, M minutes, S seconds   -t 6-23:59:59 (one second less than a full week)

Specifying number of Nodes, tasks and CPU Cores

Computationally intensive HPC applications generally use a combination of distributed parallel computing, multithreading (SMP parallelism), and accelerators.

The dominant interface for distributed parallel computing is MPI. MPI uses independent processes, also called ranks or tasks. Those processes can run on separate computers, which is the solution programmers adopt when a problem is so big that a single computer cannot complete it in a reasonable amount of time.

Multithreading or SMP parallelism uses the multiple CPU cores present in all modern computers. Those cores can all see the entire RAM of the machine. SMP parallelism is implemented in codes that use OpenMP or OpenACC. In high-level languages like Python, the multiprocessing module can take advantage of multiple CPU cores on the same machine. Some linear algebra libraries, such as OpenBLAS and Intel MKL, also implement SMP parallelism.

Sometimes access to a certain portion of RAM is faster for one core than for another. We say that such systems have Non-Uniform Memory Access (NUMA). For example, a dual-socket machine may have a portion of RAM associated with each CPU. In some cases we can gain important speedups by concentrating the memory that a CPU uses into the RAM associated with it.
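You can inspect the NUMA layout of a node with standard Linux tools; a quick sketch (the exact numbers depend on the node):

$ lscpu | grep -i numa
NUMA node(s):          2
NUMA node0 CPU(s):     0-19
NUMA node1 CPU(s):     20-39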

To better understand how all this is controlled from SLURM, imagine that we have a code that will use a number of CPU cores distributed across several machines. Let N represent the number of MPI tasks and M the number of threads per task needed by the job. N MPI tasks can run on N compute nodes, but not necessarily; on the other hand, if one of those MPI tasks uses multiple CPU cores, all of those cores must be on the same machine. That is a necessary condition for multithreading parallelism.

Most jobs then fall into one of these categories (a minimal set of #SBATCH directives for each one is sketched after this list):

  1. Sequential/Serial: a single task using a single CPU core.

  2. Shared Memory Parallel: a single task using several CPU cores on the same node.

  3. Distributed Parallelism: several MPI tasks, each using a single CPU core, possibly spread across several nodes.

  4. Hybrid SMP+Distributed Parallelism: several MPI tasks, each using several CPU cores.

  5. Accelerator Based Parallelism: jobs that use GPUs or other accelerators in addition to CPU cores.
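The sketches below show only the resource-request lines relevant to each category, not complete scripts. The program names are placeholders, and the exact GPU syntax may differ on your cluster (some sites use --gres=gpu:1 instead of --gpus=1):

# 1. Sequential/Serial: one task, one CPU core
#SBATCH -N 1
#SBATCH -n 1
#SBATCH -c 1
./my_serial_program

# 2. Shared Memory Parallel: one task, several cores on one node
#SBATCH -N 1
#SBATCH -n 1
#SBATCH -c 8
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
./my_openmp_program

# 3. Distributed Parallelism: many single-core MPI tasks
#SBATCH -N 2
#SBATCH -n 16
#SBATCH -c 1
srun ./my_mpi_program

# 4. Hybrid SMP+Distributed Parallelism: several MPI tasks, several cores each
#SBATCH -N 2
#SBATCH -n 4
#SBATCH -c 10
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
srun ./my_hybrid_program

# 5. Accelerator Based Parallelism: one or more GPUs in addition to CPU cores
#SBATCH -N 1
#SBATCH -n 1
#SBATCH -c 8
#SBATCH --gpus=1
./my_gpu_program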

Translating PBS Scripts to Slurm Scripts

All our previous clusters (Mountaineer, Spruce Knob, and Thorny Flat) used Torque as the resource manager and Moab as the scheduler. In 2022 we transitioned to SLURM as the workload manager. For the most part the transition is transparent, since we installed a set of wrappers that translate the usual Torque and Moab commands on the fly into their SLURM equivalents. The same happens with submission scripts, where lines starting with #PBS are translated into the corresponding #SBATCH versions. That being said, it is good for our users, especially new users, to get used to SLURM commands and directives and to avoid the Torque commands as much as possible, since they could become deprecated at some point or the wrappers might not be installed on future HPC clusters.

The following tables list common commands and terms used with the TORQUE/PBS resource manager and scheduler, along with the corresponding commands and terms used under the SLURM workload manager. These tables can assist you in translating existing PBS scripts into proper SLURM scripts that can be interpreted directly. The same tables can also serve as a reference when writing new submission scripts directly in SLURM format.
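As an illustration, using the tables below, a simple PBS header and one possible SLURM translation could look like this (the script body and names are placeholders):

PBS/Torque version:

#!/bin/bash
#PBS -N myjob
#PBS -q standby
#PBS -l nodes=1:ppn=4,walltime=04:00:00
#PBS -m ae
#PBS -M user@mailserver.com

cd $PBS_O_WORKDIR
./my_program

SLURM version:

#!/bin/bash
#SBATCH -J myjob
#SBATCH -p standby
#SBATCH -N 1
#SBATCH --ntasks-per-node=4
#SBATCH -t 04:00:00
#SBATCH --mail-type=END,FAIL
#SBATCH --mail-user=user@mailserver.com

cd $SLURM_SUBMIT_DIR
./my_program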

User Commands

User Commands          PBS/Torque                      SLURM
Job submission         qsub [script_file]              sbatch [script_file]
Job deletion           qdel [job_id]                   scancel [job_id]
Job status (by job)    qstat [job_id]                  squeue -j [job_id]
Job status (by user)   qstat -u [user_name]            squeue -u [user_name]
Job hold               qhold [job_id]                  scontrol hold [job_id]
Job release            qrls [job_id]                   scontrol release [job_id]
Queue list             qstat -Q                        squeue
Node list              pbsnodes -l                     sinfo -N OR scontrol show nodes
Cluster status         qstat -a                        sinfo

Environment Variables

Environment                        PBS/Torque       SLURM
Job ID                             $PBS_JOBID       $SLURM_JOB_ID
Submit Directory                   $PBS_O_WORKDIR   $SLURM_SUBMIT_DIR
Submit Host                        $PBS_O_HOST      $SLURM_SUBMIT_HOST
Node List                          $PBS_NODEFILE    $SLURM_JOB_NODELIST
Job Name                           $PBS_JOBNAME     $SLURM_JOB_NAME
Number of nodes                    $PBS_NUM_NODES   $SLURM_JOB_NUM_NODES
Number of cores per node           $PBS_NUM_PPN     $SLURM_CPUS_ON_NODE
Unique index used for Job Arrays   $PBS_ARRAYID     $SLURM_ARRAY_TASK_ID

Job Directives or Specifications

Job Specification        PBS/Torque                      SLURM
Script directive         #PBS                            #SBATCH
Queue/Partition          -q [name]                       --partition=[name] OR -p [name]
Node Count               -l nodes=[count]                --nodes=[min[-max]] OR -N [min[-max]]
Total Task Count         -l ppn=[count]                  --ntasks=[count] OR -n [count]
Total Task Count         -l mppwidth=[PE_count]          --ntasks=[count] OR -n [count]
Wall Clock Limit         -l walltime=[hh:mm:ss]          --time=[days-hh:mm:ss] OR -t [min]
Standard Output File     -o [file_name]                  --output=[file_name] OR -o [file_name]
Standard Error File      -e [file_name]                  --error=[file_name] OR -e [file_name]
Write stderr -> stdout   -j oe (both to stdout)          (use -o without -e)
Write stdout -> stderr   -j eo (both to stderr)          (use -e without -o)
Copy Environment         -V                              --export=[ALL | NONE | variables]
Event Notification       -m abe                          --mail-type=[events]
Email Address            -M [address]                    --mail-user=[address]
Job Name                 -N [name]                       --job-name=[name] OR -J [name]
Job Restart              -r [y | n]                      --requeue OR --no-requeue
Resource Sharing         -l naccesspolicy=singlejob      --exclusive OR --shared
Memory Size (per node)   -l mem=[MB]                     --mem=[mem][M | G | T]
Memory Size (per CPU)    -l mem=[MB]                     --mem-per-cpu=[mem][M | G | T]
Accounts to charge       -A OR -W group_list=[account]   --account=[account] OR -A
Tasks Per Node           -l mppnppn [PEs_per_node]       --ntasks-per-node=[count]
CPUs Per Task            N/A                             --cpus-per-task=[count]
Job Dependency           -d [job_id]                     --depend=[state:job_id]
Quality of Service       -l qos=[name]                   --qos=[normal | high]
Job Arrays               -t [array_spec]                 --array=[array_spec]
Generic Resources        -l other=[resource_spec]        --gres=[resource_spec]
Job Enqueue Time         -a "YYYY-MM-DD HH:MM:SS"        --begin=YYYY-MM-DD[THH:MM[:SS]]

OLD DOCUMENTATION

Resource Specification

The #PBS -l option is used to specify resources such as the number of CPUs, the number of nodes, and the walltime of the job. The most commonly specified resources on the Mountaineer cluster are:

nodes      Number of nodes needed
walltime   Maximum limit for walltime, given in the format hh:mm:ss
ppn        Processors per node
procs      Number of processors requested
pvmem      Maximum amount of memory used by any single process in the job
vmem       Maximum amount of memory used by all concurrent processes in the job

Note: procs is used when you do not require each CPU to be on the same node.

For example, the PBS directive

#PBS -l nodes=1:ppn=6,walltime=06:00:00

This specifies that the job needs 6 processors located on a single node, with a maximum run time of 6 hours. Notice that there are no spaces around the commas or equal signs. Alternatively, if nodes=1 had not been specified (procs=6 instead), the scheduler would just grab the first 6 available processors regardless of which nodes they reside on (which only works if your program supports distributed computing). In general, unless you are running jobs using MPI libraries (mpirun) or POSIX threads, you will most likely specify a single processor for your job (procs=1). Note: per-node resource requests are given with the nodes directive and separated with a colon, on the same line of your script.

Requesting Memory Specifications

Requesting memory for jobs is done with the attributes vmem or pvmem through the PBS -l directive (resource specification). The PBS man pages also mention two other memory-related attributes: mem and pmem. However, these two attributes measure different job resources than virtual memory and are therefore not reliable for limiting memory (RAM) use the way we commonly think of it. In other words, do not use the attributes mem and pmem; they most likely do not do what you think they do.

vmem and pvmem place resource limits on the amount of RAM a job can access. This is important to ensure that two large-memory jobs do not end up on the same node, exceeding the node's memory and causing a node crash (which kills all jobs on the node). If you do not specify memory limits, Moab assumes a uniform distribution of memory across all jobs on the node. For example, on a compute node with 16 processors and 64 GB of RAM it assumes roughly 4 GB of RAM per processor. However, if a job using 62 GB of RAM and only 8 cores is running on a compute node, without memory limits Moab will place 8 more single-processor jobs on that node even though there is clearly not enough memory left for them. This will crash the node.

Therefore, if you anticipate that your jobs will use more than an average of 3 GB per processor, we recommend that you specify memory limits using pvmem or vmem. On Spruce community nodes and on Mountaineer we enforce this by making pvmem=3gb the system default. On these systems, a job that needs more than 3 GB per process will fail unless it explicitly requests more memory. This is important: on community nodes, if you specify a job with 5 cores and vmem=25gb, the job will still fail if it exceeds 15 GB, because pvmem=3gb is assigned to each job by default (i.e., vmem does not override pvmem settings). To make your PBS scripts portable across community nodes and private nodes, we recommend using only pvmem to specify the memory limits of jobs. The pvmem attribute specifies the maximum amount of virtual memory used by any single process in the job. Therefore, if you want a job that uses 6 processors and needs about 36 GB of RAM, you would specify the following resource directive line:

#PBS -l nodes=1:ppn=6,pvmem=6gb

pvmem=6gb with 6 processors specifies 6*6 = 36Gb of total memory for the job.

Requesting Certain Node Types

There might be times when you want to request a node with a particular feature or processor. The following allows you to accomplish this; replace feature_name with one of the features in the table below.

#PBS -l feature=feature_name

Note that you can also request that a particular feature be excluded by negating it:

#PBS -l feature='!feature_name'

Available Features

Feature     Description
smb         Sandy Bridge based processor nodes
ivy         Ivy Bridge based processor nodes
haswell     Haswell based processor nodes
broadwell   Broadwell based processor nodes
avx         Processors with AVX extension
avx2        Processors with AVX2 extension
f16c        Processors with f16c extension
adx         Processors with adx extension
large       Nodes with 512 GB of memory

E-mail options

The #PBS -m and #PBS -M options are used to specify when and to whom the scheduler will send e-mails. The -m option consists of either the single character “n”, or one or more of the characters “a”, “b”, and “e”.

n   No mail will be sent
a   Mail is sent when the job is aborted by the batch system
b   Mail is sent when the job begins execution
e   Mail is sent when the job ends

Note: If the -m option is not specified, mail will be sent if the job is aborted.

The #PBS -M option specifies the e-mail address(es) to send mail to. For example, with the PBS directives

#PBS -m ae
#PBS -M user@mailserver.com

the scheduler will send an e-mail to user@mailserver.com if the job is aborted or when the job completes. To specify more than one e-mail address with the -M option, separate the addresses with commas and no spaces.

To Receive no e-mails even on aborts

Even with the 'n' option of the -m directive, the system will still send an e-mail if the job is cancelled or aborts. To allow users to circumvent this behavior, we have set up an alias e-mail address that can be used to bounce these e-mails. To receive absolutely no e-mails from the system, no matter what happens before, during, or after the execution of your job, use the noemail@hpc.wvu.edu address together with the 'n' option:

#PBS -m n
#PBS -M noemail@hpc.wvu.edu

Output file specification

By default, the standard output and standard error of the job are placed in files named jobname.ojobid and jobname.ejobid, respectively. These files are written to the directory in which the qsub command was executed, where jobname is the name given with the -N option and jobid is assigned at run time by the system. The #PBS -e and #PBS -o options specify which files should be written for the standard error and standard output streams, respectively.

-e   pathname for standard error stream output
-o   pathname for standard output stream output

For example, with the PBS directives

#PBS -e /scratch/username/examplejob.error
#PBS -o /scratch/username/examplejob.output

the scheduler will write the files /scratch/username/examplejob.error and /scratch/username/examplejob.output for the standard error and standard output streams, respectively.

Note: Use full pathnames for your home directory and scratch directory

Requesting Array jobs

By using the directive #PBS -t, you can request that a single script be repeated a number of times. This is useful if you want a single parameter to range over a set of numbers. For instance, if I wanted a series of commands to run with a single variable ranging over 10-20, I could use the following directives in my shell script:

#PBS -N demographic_${PBS_ARRAYID}
#PBS -l nodes=1:ppn=2
#PBS -t 10-20

mkdir output_${PBS_ARRAYID}/
cd output_${PBS_ARRAYID}/
$SCRATCH/demographic_model.py -input_parameter ${PBS_ARRAYID} -procs 2 -output_file demographic_output.txt

The above script would launch eleven jobs, one per value in the range 10-20. Each job would have the name demographic_${PBS_ARRAYID}, so the first job would be named demographic_10, the second demographic_11, and so forth. Each job would run on a single node with 2 processors (specified as #PBS -l nodes=1:ppn=2). Furthermore, each job would make a directory named output_${PBS_ARRAYID} (first job output_10, second job output_11, and so forth), cd into that directory, and execute the python script demographic_model.py from my scratch directory. Notice that one of the input parameters changes for each job through the PBS environment variable PBS_ARRAYID. Array requests are very useful in scientific environments when you need to modify a parameter and see the output for a range of values. Note: this is a theoretical example, since no walltime or queue was specified for the job.

The number range for an array request does not have to be sequential. You can also give a comma-separated list of numbers, as in:

#PBS -t 10,15,20,25

Furthermore, you can specify that only a certain number of jobs be queued at one time, for cases where you have a large number of jobs and need to share a queue with other users:

#PBS -t 1-200%10

The above directive will only launch ten jobs to the queue at a time until all 200 job requests have been executed.

Interactive Jobs

Interactive jobs give a user an interactive terminal on a compute node. This allows the user to "interact" directly with a compute node instead of running in batch or scripted mode. Interactive jobs are very useful when debugging, as they allow you to walk step by step through your submission script to find errors or problems. Interactive jobs are also useful when you need to use a graphical program on the cluster.

To run an interactive job use the following command followed by any necessary PBS variables/flags. If you don’t specify any flags, you will be given an interactive job in the default queue for the cluster.

qsub -I

Do note that interactive jobs are only allowed on certain queues. All condo owner queues allow interactive jobs, as do queues such as 'standby' and 'debug'. If you find you need an interactive queue on a community resource for a particular task or project, please contact the Research Computing Help Desk for assistance.

Graphical Interface Jobs

Sometimes it might be useful or required to run a graphical program on the cluster. Non-compute-intensive processes for visualization purposes can be run on the login node. These could include gnuplot, R, and Matlab, assuming they have low overhead. However, if you know your program will consume a lot of resources, it is best to run an interactive job.

To execute a graphical application on a compute node, you first need to review Using X Windows applications to properly set up your X (i.e. display) environment. To launch a graphical job on a compute node, execute the following along with any necessary flags/PBS environment variables.

$> qsub -I -X

Once you are given access to an interactive terminal, you can run the proper executable to launch your graphical (i.e. X Window) program. For example:

$> module load statistics/matlab
$> matlab &

Checking the Status of Jobs

The status of a job currently submitted to the queue can be checked using the checkjob command. checkjob displays detailed job state information and diagnostic output for a specified job. Detailed information is available for queued, blocked, active, and recently completed jobs. Users can use checkjob to view the status of their own jobs.

Examples:

$> checkjob -v <jobid>

where <jobid> is the job ID given at submission time.

The output of checkjob looks like this

job 1653450 (RM job '1653450.srih0001.hpc.wvu.edu')

AName: IVY
State: Completed
Completion Code: 0  Time: Fri May 19 15:30:21
Creds:  user:username  group:groupname  class:debug  qos:member
WallTime:   00:00:16 of 00:01:00
SubmitTime: Fri May 19 15:29:58
  (Time Queued  Total: 00:00:07  Eligible: 00:00:07)

Deadline:  3:59:49  (Fri May 19 19:30:58)
TemplateSets:  DEFAULT
Total Requested Tasks: 1

Req[0]  TaskCount: 1  Partition: torque
Opsys: ---  Arch: ---  Features: ivy
GMetric[energy_used]  Current: 0.00  Min: 0.00  Max: 0.00  Avg: 0.00 Total: 0.00
NodeAccess: SINGLEJOB
TasksPerNode: 1
Allocated Nodes:
[sgpc0001.hpc.wvu.edu:1]


SystemID:   Moab
SystemJID:  1653450
Notification Events: JobEnd,JobFail
Task Distribution: sgpc0001.hpc.wvu.edu
UMask:          0000
OutputFile:     srih0001.hpc.wvu.edu:/gpfs/home/username/IVY.o1653450
ErrorFile:      srih0001.hpc.wvu.edu:/gpfs/home/username/IVY.e1653450
StartCount:     1
Execution Partition:  torque
SrcRM:          torque  DstRM: torque  DstRMJID: 1653450.srih0001.hpc.wvu.edu
Submit Args:    runjob_ivy.pbs
Flags:          RESTARTABLE
Attr:           checkpoint
StartPriority:  1000
PE:             1.00

Sometimes your job gets a job ID but stays in the queue without running; in that case you can check the reasons with checkjob. For example, consider this submission script where we ask for too much memory for a serial job.

The submission script looks like:

#!/bin/sh

#PBS -N TEST
#PBS -l nodes=1:ppn=1,vmem=200g
#PBS -l walltime=00:01:00
#PBS -m ae
#PBS -q groupname
#PBS -n

cd $PBS_O_WORKDIR

date

The job is accepted by Torque but will sit in the queue for a long time. We then execute checkjob to learn the reasons why it is not running:

$> checkjob -v 1653589

job 1653589 (RM job '1653589.srih0001.hpc.wvu.edu')

AName: TEST
State: Idle
Creds:  user:username  group:groupname  class:groupname  qos:member
WallTime:   00:00:00 of 00:01:00
BecameEligible: Fri May 19 15:52:14
SubmitTime: Fri May 19 15:51:52
  (Time Queued  Total: 00:01:06  Eligible: 00:00:53)

Deadline:  3:59:54  (Fri May 19 19:52:52)
TemplateSets:  DEFAULT
Total Requested Tasks: 1

Req[0]  TaskCount: 1  Partition: ALL
Memory >= 0  Disk >= 0  Swap >= 3072M
Dedicated Resources Per Task: PROCS: 1  SWAP: 200G
NodeAccess: SINGLEJOB
TasksPerNode: 1
Reserved Nodes:  (3:09:16:24 -> 3:09:17:24  Duration: 00:01:00)
[sarc3001.hpc.wvu.edu:1]


SystemID:   Moab
SystemJID:  1653589
Notification Events: JobEnd,JobFail

UMask:          0000
OutputFile:     srih0001.hpc.wvu.edu:/gpfs/home/username/TEST.o1653589
ErrorFile:      srih0001.hpc.wvu.edu:/gpfs/home/username/TEST.e1653589
Partition List: torque
SrcRM:          torque  DstRM: torque  DstRMJID: 1653589.srih0001.hpc.wvu.edu
Submit Args:    runjob_badmem.pbs
Flags:          RESTARTABLE
Attr:           checkpoint
StartPriority:  2000
PE:             37.34
Reservation '1653589' (3:09:16:24 -> 3:09:17:24  Duration: 00:01:00)
Node Availability for Partition torque --------

srig0001.hpc.wvu.edu     rejected: Swap
szec2001.hpc.wvu.edu     rejected: State (Busy)
szec2002.hpc.wvu.edu     rejected: State (Busy)
szec2003.hpc.wvu.edu     rejected: State (Busy)
...
sbmc0017.hpc.wvu.edu     rejected: State (Busy)
sbmc0018.hpc.wvu.edu     rejected: State (Busy)
sbmg0001.hpc.wvu.edu     rejected: Swap
sric0001.hpc.wvu.edu     rejected: Swap
sric0002.hpc.wvu.edu     rejected: Swap
ssmc0006.hpc.wvu.edu     rejected: Swap
sgsc2001.hpc.wvu.edu     rejected: Class
sgsg2001.hpc.wvu.edu     rejected: Swap
sric0022.hpc.wvu.edu     rejected: Class
sric0025.hpc.wvu.edu     rejected: State (Busy)
sbmc0019.hpc.wvu.edu     rejected: State (Busy)
sbmc0020.hpc.wvu.edu     rejected: Swap
sbmc0021.hpc.wvu.edu     rejected: State (Busy)
sbmc0022.hpc.wvu.edu     rejected: State (Busy)
sric0024.hpc.wvu.edu     rejected: Swap
sllc0001.hpc.wvu.edu     rejected: Swap
...
sspc3006.hpc.wvu.edu     rejected: Swap
sspc3007.hpc.wvu.edu     rejected: Swap
sspc3008.hpc.wvu.edu     rejected: Swap
sspc3009.hpc.wvu.edu     rejected: State (Running)
sspc3010.hpc.wvu.edu     rejected: Swap
NOTE:  job req cannot run in partition torque (available procs do not meet requirements : 0 of 1 procs found)
idle procs: 623  feasible procs:   0

Node Rejection Summary: [Class: 2][State: 110][Swap: 53]

The "Swap" reason is memory related, while the "State" reason is CPU related. The queue system searched the 623 idle cores and could not find a single machine with 200 GB available to launch the job.

Another important tool to monitor jobs and their state is showq.

You can get the eligible jobs and their priorities with

showq -i -u <username>

For example

$ showq -i -u username

eligible jobs----------------------
JOBID                 PRIORITY  XFACTOR  Q  USERNAME    GROUP  PROCS     WCLIMIT     CLASS      SYSTEMQUEUETIME

1579829*                 14108      1.7 me   username groupname     16 14:00:00:00  groupname   Tue May  9 12:09:46
1595467*                 10599      1.6 me   username groupname      4 14:00:00:00  groupname   Thu May 11 22:39:11
1595464*                 10599      1.6 me   username groupname      4 14:00:00:00  groupname   Thu May 11 22:39:11
1595468*                 10599      1.6 me   username groupname      4 14:00:00:00  groupname   Thu May 11 22:39:11
1595466*                 10599      1.6 me   username groupname      4 14:00:00:00  groupname   Thu May 11 22:39:11
1595463*                 10599      1.6 me   username groupname      4 14:00:00:00  groupname   Thu May 11 22:39:10
1595465*                 10599      1.6 me   username groupname      4 14:00:00:00  groupname   Thu May 11 22:39:11
1595462*                 10599      1.6 me   username groupname      4 14:00:00:00  groupname   Thu May 11 22:39:10
1618053*                  6423      1.3 me   username groupname      2 14:00:00:00  groupname   Sun May 14 20:15:33
1618385*                  6363      1.3 me   username groupname      4 14:00:00:00  groupname   Sun May 14 21:14:58
1618386*                  6363      1.3 me   username groupname      4 14:00:00:00  groupname   Sun May 14 21:14:58
1618387*                  6363      1.3 me   username groupname      4 14:00:00:00  groupname   Sun May 14 21:14:59
1618388*                  6363      1.3 me   username groupname      4 14:00:00:00  groupname   Sun May 14 21:14:59
1630355*                  3967      1.2 me   username groupname      4 14:00:00:00  groupname   Tue May 16 13:11:17
1630507*                  3903      1.2 me   username groupname      4 14:00:00:00  groupname   Tue May 16 14:15:09
1630546*                  3884      1.2 me   username groupname     16 14:00:00:00  groupname   Tue May 16 14:34:33
1630494*                     1      1.4 co   username groupname     16  7:00:00:00 comm_larg   Tue May 16 14:08:50
1630349*                     1      1.4 co   username groupname     16  7:00:00:00 comm_larg   Tue May 16 13:10:08

18 eligible jobs

Total jobs:  18

Those are jobs that accrue priority as they wait in the queue. Some jobs can become blocked, meaning that they are not gaining priority but will eventually become eligible later.

$ showq -b -u username

blocked jobs-----------------------
JOBID              USERNAME    GROUP      STATE PROCS     WCLIMIT            QUEUETIME

1623738             username groupname       Idle    16  7:00:00:00  Mon May 15 13:49:50
1623747             username groupname       Idle    16  7:00:00:00  Mon May 15 13:51:21
1623757             username groupname       Idle    16  7:00:00:00  Mon May 15 13:52:57
1652487             username groupname       Idle    16     4:00:00  Fri May 19 12:24:44
1646112             username groupname       Idle     4     4:00:00  Thu May 18 15:20:54
1646096             username groupname       Idle     4     4:00:00  Thu May 18 15:17:55
1630495             username groupname       Idle     4  7:00:00:00  Tue May 16 14:10:13
1630501             username groupname       Idle    16  7:00:00:00  Tue May 16 14:11:17
1623766             username groupname       Idle    16  7:00:00:00  Mon May 15 13:55:24
1623746             username groupname       Idle    16  7:00:00:00  Mon May 15 13:50:50
1623749             username groupname       Idle    16  7:00:00:00  Mon May 15 13:51:48
1623751             username groupname       Idle    16  7:00:00:00  Mon May 15 13:52:25
1646143             username groupname       Idle    16  7:00:00:00  Thu May 18 15:26:36
1623759             username groupname       Idle    16  7:00:00:00  Mon May 15 13:53:51
1623758             username groupname       Idle    16  7:00:00:00  Mon May 15 13:53:29
1623760             username groupname       Idle    16  7:00:00:00  Mon May 15 13:54:53
1623740             username groupname       Idle    16  7:00:00:00  Mon May 15 13:50:23
1623731             username groupname       Idle    16  7:00:00:00  Mon May 15 13:49:08
1630569             username groupname       Idle    16  7:00:00:00  Tue May 16 14:48:03
1623739             username groupname       Idle    16  7:00:00:00  Mon May 15 13:49:53
1623732             username groupname       Idle    16  7:00:00:00  Mon May 15 13:49:10

21 blocked jobs

Total jobs:  21

Finally, you can see the jobs that are currently running, with the time remaining until they hit their walltime:

$ showq -r -u username

active jobs------------------------
JOBID               S  PAR  EFFIC  XFACTOR  Q  USERNAME    GROUP            MHOST PROCS   REMAINING            STARTTIME

1599005             R  tor  24.99      1.0 co   username groupname sric0011.hpc.wvu    16    00:24:38  Fri May 12 17:01:10
1599006             R  tor  24.99      1.0 co   username groupname sric0020.hpc.wvu    16    00:51:08  Fri May 12 17:27:40
1599007             R  tor  24.98      1.0 co   username groupname sric0021.hpc.wvu    16     1:03:41  Fri May 12 17:40:13
1599008             R  tor  24.99      1.0 co   username groupname sric0023.hpc.wvu    16     1:04:45  Fri May 12 17:41:17
1599009             R  tor  24.99      1.1 co   username groupname sric0032.hpc.wvu    16     4:42:25  Fri May 12 21:18:57
1599010             R  tor  24.99      1.1 co   username groupname sric0026.hpc.wvu    16     4:42:25  Fri May 12 21:18:57
1599011             R  tor  24.99      1.1 co   username groupname sric0017.hpc.wvu    16    10:10:42  Sat May 13 02:47:14
1546851             R  tor  99.73      2.6 co   username groupname sric0025.hpc.wvu    16  2:13:45:30  Mon May 15 06:22:02
1570354             R  tor  87.78      1.0 me   username groupname sarc3001.hpc.wvu    16  3:08:32:50  Tue May  9 01:09:22
1595446             R  tor  98.27      1.0 me   username groupname sarc2001.hpc.wvu     4  6:06:02:54  Thu May 11 22:39:26
1595448             R  tor  99.98      1.0 me   username groupname sarc2001.hpc.wvu     4  6:07:29:35  Fri May 12 00:06:07
1595449             R  tor  99.99      1.0 me   username groupname sarc0001.hpc.wvu     4  6:08:21:37  Fri May 12 00:58:09
1595453             R  tor  99.99      1.0 me   username groupname sarc0002.hpc.wvu     4  6:08:49:41  Fri May 12 01:26:13
1618813             R  tor  24.77      1.7 co   username groupname sric0037.hpc.wvu    16  6:20:47:59  Fri May 19 13:24:31
1618812             R  tor  24.77      1.7 co   username groupname sric0051.hpc.wvu    16  6:20:47:59  Fri May 19 13:24:31
1618814             R  tor  24.78      1.7 co   username groupname sric0036.hpc.wvu    16  6:20:47:59  Fri May 19 13:24:31
1618815             R  tor  24.84      1.7 co   username groupname sric0030.hpc.wvu    16  6:20:54:14  Fri May 19 13:30:46
1595460             R  tor  99.97      1.1 me   username groupname sarc0006.hpc.wvu     4  8:06:16:50  Sat May 13 22:53:22
1595461             R  tor  99.97      1.2 me   username groupname sarc0009.hpc.wvu     4  8:13:36:38  Sun May 14 06:13:10

19 active jobs         232 of 3112 processors in use by local jobs (7.46%)
                        155 of 165 nodes active      (93.94%)

Total jobs:  19

Canceling/Removing a Job

Jobs can be cancelled or removed using the canceljob command. Users can only remove jobs they submitted to the scheduler.

  $> canceljob <jobid>

where <jobid> is the job ID given at submission time.

canceljob is now deprecated, and Moab offers an alternative way to cancel jobs. For example, if you want to cancel all jobs whose ID starts with 1693, you can use the following command. As a user you can only cancel jobs that you own, so you do not need to worry about canceling jobs from other users this way.

$> mjobctl -c "x:1693.*"

Adding Prologue and Epilogue scripts to a Job

It is possible to declare scripts that run before and after the execution of the main submission script. Their main advantage is to keep a record of the conditions under which a given job runs. Here we present a simple example of how to declare a prologue and an epilogue.

Add these lines to your submission script:

#PBS -l prologue=/absolute/path/to/prologue.sh
#PBS -l epilogue=/absolute/path/to/epilogue.sh

The best way of working with these scripts is to place them in your home folder and use them in all your submission scripts. They should collect information that you can later use for debugging or profiling purposes.

Example of Prologue

prologue.sh

#!/bin/sh

echo ""
echo "Prologue Args:"
echo "Job ID: $1"
echo "User ID: $2"
echo "Group ID: $3"
echo ""

env | sort
hostname
date

exit 0

Example of Epilogue

epilogue.sh

#!/bin/sh

echo ""
echo "Epilogue Args:"
echo "Job ID: $1"
echo "User ID: $2"
echo "Group ID: $3"
echo "Job Name: $4"
echo "Session ID: $5"
echo "Resource List: $6"
echo "Resources Used: $7"
echo "Queue Name: $8"
echo "Account String: $9"
echo ""

env | sort
hostname
date

exit 0

Both prologue and epilogue scripts must be made executable. Use

chmod +x prologue.sh epilogue.sh

to change their permissions.

Samples of Job Submission scripts

Below are bash scripts that can be modified and submitted with the qsub command. For details about the different parts of the scripts, please visit the Running Jobs page. These scripts can be created with any text editor (e.g. vi, emacs, etc.).

Script for running a non-array batch queue

The script below has PBS directives to set up commonly used options such as the job name, the resources needed, the e-mail address to notify upon job completion or abnormal termination, and the queue to run on.

#!/bin/sh

#This is an example script for executing generic jobs with
# the use of the command 'qsub <name of this script>'


#These commands set up the Grid Environment for your job.  Words surrounded by brackets ('<','>') should be changed
#Any of the PBS directives can be commented out by placing another pound sign in front
#example
##PBS -N name
#The above line will be skipped by qsub because of the two consecutive # signs

# Specify job name
#PBS -N <name>

# Specify the resources need for the job
# Walltime is specified as hh:mm:ss (hours:minutes:seconds)
#PBS -l nodes=<number_of_nodes>:ppn=<number_of_processors_per_node>,walltime=<time_needed_by_job>


# Specify when Moab should send e-mails. With 'ae' below, the user will
# receive an e-mail for any errors with the job and/or upon completion
# If you don't want e-mails just comment out these next two PBS lines
#PBS -m ae

# Specify the e-mail address to receive above mentioned e-mails
#PBS -M <email_address>

# Specify the queue to execute the task in. Current options can be found by executing the command qstat -q at the terminal
#PBS -q <queue_name>

# Enter your command below with arguments just as if you were going to execute it on the command line
# It is generally good practice to issue a 'cd' command into the directory that contains the files
# you want to use or use full path names

Script for running an array batch queue

The script is the same as above, but adds #PBS -t to execute array request job submissions.

#!/bin/sh

#This is an example script for executing generic jobs with
# the use of the command 'qsub <name of this script>'


#These commands set up the Grid Environment for your job.  Words surrounded by brackets ('<','>') should be changed
#Any of the PBS directives can be commented out by placing another pound sign in front
#example
##PBS -N name
#The above line will be skipped by qsub because of the two consecutive # signs

# Specify job name, use ${PBS_ARRAYID} to ensure names and output/error files have different names
#PBS -N <name>_${PBS_ARRAYID}

# Specify the range for the PBS_ARRAYID environment variable
# <num_range> can be a continuous range like 1-200 or 5-20
# or <num_range> can be a comma separated list of numbers like 5,15,20,55
# You can also specify the maximum number of jobs queued at one time with the percent sign,
# so a <num_range> specified as 5-45%8 would launch jobs for the whole range 5-45, but only queue 8 at a time
# until all jobs are completed.
# Furthermore, you can mix and match continuous ranges and lists like 1-10,15,25-40%10
#PBS -t <num_range>

# Specify the resources need for the job
# Walltime is specified as hh:mm:ss (hours:minutes:seconds)
#PBS -l nodes=<number_of_nodes>:ppn=<number_of_processors_per_node>,walltime=<time_needed_by_job>


# Specify when Moab should send e-mails. With 'ae' below, the user will
# receive an e-mail for any errors with the job and/or upon completion
# If you don't want e-mails just comment out these next two PBS lines
#PBS -m ae

# Specify the e-mail address to receive above mentioned e-mails
#PBS -M <email_address>

# Specify the queue to execute the task in. Current options can be found by executing the command qstat -q at the terminal
#PBS -q <queue_name>

# Enter your command below with arguments just as if you were going to execute it on the command line
# It is generally good practice to issue a 'cd' command into the directory that contains the files
# you want to use or use full path names
# Any parameter or filename that needs the current value of the array index should use ${PBS_ARRAYID}