Workload Manager (SLURM)¶
A workload manager is a piece of software that transforms a set of networked computers into an HPC cluster. The workload manager has two main responsibilities: resource manager and scheduler. In the case of SLURM, both tasks are assumed by the same piece of software. It is the workload manager that makes an HPC cluster look like a supercomputer rather than a set of independent machines in a datacenter.
The resource manager has several subtasks associated with it. On one side, it keeps track of the resources present in the cluster and their availability at a given point in time. Individual computers can be added to or removed from the pool of resources. Their load is recorded periodically to determine the feasibility of executing more jobs. It also keeps records for accounting or profiling purposes.
The scheduler side of a workload manager takes care of the jobs submitted to the cluster. It processes the list of resources requested for each job and prioritizes execution according to criteria or constraints imposed on the job or the current state of the cluster.
In this section we will cover in more detail the commands, variables, and directives used by SLURM to help users submit, monitor, and control jobs on the cluster. The configuration and administration of SLURM are out of the scope of this section.
Understanding Partitions¶
The compute nodes on an HPC cluster are logically segmented into partitions. A partition is just a list of compute nodes where jobs submitted to it can execute. A compute node can belong to several partitions. A partition also includes rules that must be followed before a job can be admitted, rules that declare how jobs will run on the cluster, and conditions on when and how jobs can start execution.
All jobs, whether batch or interactive, are always submitted to some partition. To see the list of partitions on the cluster, execute:
$ sinfo -s
PARTITION AVAIL TIMELIMIT NODES(A/I/O/T) NODELIST
standby* up 4:00:00 88/77/2/167 taicm[001-009],tarcl100,
tarcs[100,200-206,300-304],tbdcx001,
tbmcs[001-011,100-103],tbpcm200,tbpcs001,
tcbcx100,tcdcx100,tcgcx300,tcocm[100-104],
tcocs[001-064,100],tcocx[001-003],tcscm300,
tjscl100,tjscm001,tmmcm[100-108],tngcm200,
tpmcm[001-006],tsacs001,tsdcl[001-002],
tsscl[001-002],ttmcm[100-101],tzecl[100-107],
tzecs[100-115]
comm_small_day up 1-00:00:00 64/0/1/65 tcocs[001-064,100]
comm_small_week up 7-00:00:00 64/0/1/65 tcocs[001-064,100]
comm_med_day up 1-00:00:00 4/1/0/5 tcocm[100-104]
comm_med_week up 7-00:00:00 4/1/0/5 tcocm[100-104]
comm_xl_week up 7-00:00:00 0/3/0/3 tcocx[001-003]
comm_gpu_inter up 4:00:00 5/6/0/11 tbegq[200-202],tbmgq[001,100],tcogq[001-006]
comm_gpu_week up 7-00:00:00 1/5/0/6 tcogq[001-006]
aei0001 up infinite 2/6/1/9 taicm[001-009]
alromero up infinite 14/0/0/14 tarcl100,tarcs[100,200-206,300-304]
be_gpu up infinite 2/1/0/3 tbegq[200-202]
bvpopp up infinite 0/1/0/1 tbpcs001
cedumitrescu up infinite 0/1/0/1 tcdcx100
cfb0001 up infinite 0/1/0/1 tcbcx100
cgriffin up infinite 1/0/0/1 tcgcx300
chemdept up infinite 0/4/0/4 tbmcs[100-103]
chemdept-gpu up infinite 1/0/0/1 tbmgq100
cs00048 up infinite 0/1/0/1 tcscm300
jaspeir up infinite 0/2/0/2 tjscl100,tjscm001
jbmertz up infinite 3/14/0/17 tbmcs[001-011,100-103],tbmgq[001,100]
mamclaughlin up infinite 0/9/0/9 tmmcm[100-108]
ngarapat up infinite 0/1/0/1 tngcm200
pmm0026 up infinite 0/6/0/6 tpmcm[001-006]
sbs0016 up infinite 0/2/0/2 tsscl[001-002]
spdifazio up infinite 0/2/0/2 tsdcl[001-002]
tdmusho up infinite 0/2/0/2 ttmcm[100-101]
vyakkerman up infinite 0/1/0/1 tsacs001
zbetienne up infinite 1/23/0/24 tzecl[100-107],tzecs[100-115]
zbetienne_large up infinite 0/8/0/8 tzecl[100-107]
zbetienne_small up infinite 1/15/0/16 tzecs[100-115]
The first column is the name of the partition. The star (*) after standby indicates that it is the default partition: the one selected if no partition is specified during job submission, either via command-line arguments or directives in the submission script. All the partitions starting with comm_ are community partitions, meaning that anyone with a user account can submit jobs to them. Faculty can purchase compute nodes that receive their own partition, with no limits on the amount of time jobs can run on it.
Community queues use a naming scheme that makes it easy to identify their specifications.
| Word | Meaning |
| --- | --- |
| `small` | Compute nodes associated to the partition offer 96GB of RAM |
| `med` | Compute nodes associated to the partition offer 192GB of RAM |
| `xl` | Compute nodes associated to the partition offer 768GB of RAM |
| `gpu` | Compute nodes associated to the partition include GPU cards |
| `day` | The partition has a walltime of maximum 1 day |
| `week` | The partition has a walltime of maximum 1 week |
| `inter` | Partition for interactive jobs with GPUs. The walltime is 4 hours |
The second column is the availability of the partition. At this point all partitions are enabled, which is indicated with the word up.
The third column is the maximum amount of time a job submitted to that partition can run, also known as the walltime. Jobs submitted without an explicit time limit will receive the walltime of the partition. Jobs that declare a walltime larger than the maximum allowed by the partition will be rejected immediately after submission.
The fourth column provides a summary of the condition of the nodes associated with the partition. The format is (A/I/O/T), meaning (allocated/idle/other/total). Allocated nodes are executing one or more jobs. Idle nodes are currently inactive but available to execute jobs. Other includes nodes that are not allowed to execute jobs, due to maintenance or some other condition. The sum of these three states is the total number of compute nodes associated with the partition.
The fifth column is the nodelist. This is a compacted listing of all the machines associated with each partition.
The compact form is particularly useful in large HPC clusters with hundreds of compute nodes.
Notice for example that compute nodes tcocs[001-064,100] appear as nodes associated to standby, comm_small_day, and comm_small_week.
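The sinfo command can also restrict its report to a single partition, which is handy on a system with this many partitions. For example, using standard sinfo options (-p to select a partition, -s for a summary, -N for one line per node, -l for long format):
$ sinfo -s -p standby
$ sinfo -N -l -p comm_gpu_week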
Sockets, CPU cores, and Hyperthreading¶
On a desktop computer or laptop, you will find a single processor, also called a Central Processing Unit (CPU). The CPU is the main chip responsible for most computational calculations taking place on the machine. Unlike desktop computers and laptops, HPC compute nodes often contain two or four CPU chips. Each CPU is located in what is called a socket. A dual-socket node is then a node with two CPU chips. Those CPUs are in general identical and the Operating System will distribute the workload among them.
Modern CPUs are made of multiple cores. A CPU core is a completely functional processing unit, and several CPU cores are printed on a single chip. We call these CPUs multicore, and almost all CPUs today are multicore.
Some CPUs are capable of “logically dividing” each CPU core into two hardware threads, a technology called Hyperthreading. Hardware threads are designed to hide the latencies of the memory and feed the compute units fast enough to keep them busy all the time. Hyperthreading can be activated or deactivated depending on the cluster or its workload. Depending on the code running on the node, hyperthreading can benefit or harm performance.
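A quick way to inspect this topology on a node, assuming the standard Linux lscpu utility is available, is:
$ lscpu | grep -E 'Thread|Core|Socket'
If Thread(s) per core is reported as 2, hyperthreading is active on that node; a value of 1 means each core exposes a single hardware thread.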
Submitting batch jobs¶
A batch job is a job that has no expectation of running immediately and will not be operated interactively once it starts running.
This is the kind of job an HPC cluster is preferentially built for.
You simply write in a text file the list of resources you need for the job and the list of steps to execute in the form of a script, and the job will be put into execution when the resources become available.
The text file is called a submission script and it has two roles.
On one side, it contains the script that will be put into execution on the compute node(s) associated with the job.
On the other side, it has a set of lines starting with #SBATCH.
Those lines start with #, meaning that they are ignored by the shell when running the script.
They are important for SLURM, which interprets them to compile a list of requirements and configurations associated with the job.
The lines starting with #SBATCH will not interpret shell variables or environment variables.
These lines contain resource requests such as the number of compute nodes, the number of CPU cores, the memory requested, and the partition.
They could also contain the name of the job, specifications for sending emails when the job starts, ends, or fails, and where the output of the script will go.
These lines could also include other configurations that will be used before, during, and after the job enters execution.
Our first example will be very simple. Consider a submission script for a job called PI. The job will compute the value of pi using the arbitrary precision calculator bc. The command to be executed will be:
echo $(echo "scale=65; 4*a(1)" | bc -l)
This is a simple execution that takes a fraction of a second on any modern computer. However, our purpose here is to use it to demonstrate how to submit a job that will be executed on a compute node. In practical cases the execution will require several hours or even days and need multiple CPU cores or multiple compute nodes. The submission script could be written like this:
#!/bin/bash
#SBATCH -J PI
#SBATCH -N 1
#SBATCH -c 1
#SBATCH -n 1
#SBATCH -p standby
#SBATCH -t 4:00:00
echo "The first 65 digits of PI are:"
echo $(echo "scale=65; 4*a(1)" | bc -l)
echo ""
echo "Job ID: $SLURM_JOB_ID"
echo "Job Name: $SLURM_JOB_NAME"
echo "Number of Nodes: $SLURM_JOB_NUM_NODES"
echo "Number of CPU cores: $SLURM_CPUS_ON_NODE"
echo "Number of Tasks: $SLURM_NTASKS"
echo "Partition: $SLURM_JOB_PARTITION"
Assuming that this text is written in a file called runjob.pbs, submit the job using the command:
trcis001:~$ sbatch runjob.pbs
The job will most likely execute after a few seconds. A file with a name such as slurm-<jobid>.out is created.
For example, if the job ID were 122014, the output produced by the submission script would contain:
trcis001:~$ cat slurm-122014.out
The first 65 digits of PI are:
3.14159265358979323846264338327950288419716939937510582097494459228
Job ID: 122014
Job Name: PI
Number of Nodes: 1
Number of CPU cores: 1
Number of Tasks: 1
Partition: standby
Let us now understand the lines present in this first submission script.
The first line is called a shebang.
It is used to indicate which interpreter will be used for the lines in the script.
In this case we are saying that bash, which is a common shell interpreter, must be used for the script.
The next 6 lines all start with #SBATCH:
#SBATCH -J PI
#SBATCH -N 1
#SBATCH -c 1
#SBATCH -n 1
#SBATCH -p standby
#SBATCH -t 4:00:00
These lines, in order, set the name of the job (-J), the number of compute nodes (-N), the number of CPU cores per task (-c), the number of tasks (-n), the partition selected for the job (-p), and the time limit for the job (-t). Except for the job name, which is undefined by default, all the values here correspond to the defaults. The job name is optional, and the other 5 lines could be removed from the submission script; the job would then assume the default values. For this simple case they all turn out to be one: one node, one task, one CPU per task, and one CPU per node. The concepts of nodes, tasks, and CPUs per task or node will be covered below.
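The same requests can also be passed directly on the command line; options given to sbatch on the command line take precedence over the #SBATCH directives inside the script. The directives above are therefore equivalent to submitting with:
trcis001:~$ sbatch -J PI -N 1 -c 1 -n 1 -p standby -t 4:00:00 runjob.pbs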
The next 2 lines are the actual execution that we want to take place on a compute node. It could be a complex numerical simulation, an optimization problem, or a genomic alignment. Any computationally demanding operation goes here.
In the final 6 lines we demonstrate the use of some environment variables that are created when the job starts running on the compute node. In this case we are revealing the content of those variables and writing them along with the output of the script. These variables can be used in the script to change the execution according to their values. More SLURM environment variables will be described below.
The output is everything that the script, or the programs called by the script, writes to the screen.
If the script were executed directly, the standard output would be the terminal window.
In the case of a batch script, the output is directed to a file instead.
The normal output, also called standard output, is written to a file that by default is named slurm-<jobid>.out.
The error output, text that is kept apart from the normal output, is called standard error; by default it goes to the same slurm-<jobid>.out file unless a separate error file is requested.
In our case the job produces no errors, so the file only contains the normal output.
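If you prefer different file names or separate files for the two streams, the -o and -e options of sbatch control where standard output and standard error are written; the %j pattern in a file name is replaced with the job ID. For example:
#SBATCH -o pi-%j.out
#SBATCH -e pi-%j.err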
All the #SBATCH directives are optional and there are default values for many of them, or no value at all if not declared.
Consider for example the same submission script removing all the lines starting with #SBATCH:
#!/bin/bash
echo "The first 65 digits of PI are:"
echo $(echo "scale=65; 4*a(1)" | bc -l)
echo ""
echo "Job ID: $SLURM_JOB_ID"
echo "Job Name: $SLURM_JOB_NAME"
echo "Number of Nodes: $SLURM_JOB_NUM_NODES"
echo "Number of CPU cores: $SLURM_CPUS_ON_NODE"
echo "Number of Tasks: $SLURM_NTASKS"
echo "Partition: $SLURM_JOB_PARTITION"
Producing a similar result in the output file:
trcis001:~$ cat slurm-122439.out
The first 65 digits of PI are:
3.14159265358979323846264338327950288419716939937510582097494459228
Job ID: 122439
Job Name: runjob2.slurm
Number of Nodes: 1
Number of CPU cores: 1
Number of Tasks:
Partition: standby
Notice that if the job has no name, the name of the submission script becomes the job name. A single node is used and a single CPU core is assigned to the job. Not declaring a number of tasks creates a job with no value for that variable.
We will explore more complex submission scripts and the meaning of their options, but first let's explore how to monitor submitted jobs and how to cancel them.
Monitoring jobs¶
Let's consider a variation of the submission script where we ask for many compute nodes:
#!/bin/bash
#SBATCH -J PI
#SBATCH -N 80
#SBATCH -c 40
#SBATCH -n 80
#SBATCH -p standby
#SBATCH -t 4:00:00
echo "The first 65 digits of PI are:"
echo $(echo "scale=65; 4*a(1)" | bc -l)
echo ""
echo "Job ID: $SLURM_JOB_ID"
echo "Job Name: $SLURM_JOB_NAME"
echo "Number of Nodes: $SLURM_JOB_NUM_NODES"
echo "Number of CPU cores: $SLURM_CPUS_ON_NODE"
echo "Number of Tasks: $SLURM_NTASKS"
echo "Partition: $SLURM_JOB_PARTITION"
Assuming this submission script was written in a file called runjob_80n.slurm, submit the job with the command:
trcis001:~$ sbatch runjob_80n.slurm
Submitted batch job 122837
This time there is no output in the form of a file slurm-<jobid>.out, as the job has not started running yet.
Check the status of the jobs using the command squeue.
The command alone will return a listing of all the jobs running or queued in the cluster.
To restrict the listing to jobs submitted by you, use:
trcis001:~$ squeue -u $USER
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
122837 standby PI gufranco PD 0:00 80 (Resources)
122895 standby PI gufranco PD 0:00 40 (Priority)
Notice that we have two jobs in the queue. The ST column gives the state of each job; the status PD means the job is pending execution. In the case above, both jobs are in the pending state. The most common state codes are:
| Status | Meaning | Description |
| --- | --- | --- |
| R | Running | Job currently has an allocation |
| PD | PenDing | Job is awaiting resource allocation |
| TO | TimedOut | Job terminated upon reaching its time limit |
| PR | PReempted | Job terminated due to preemption |
| S | Suspended | Execution has been suspended and CPUs have been released for other jobs |
| CD | CompleteD | Job has terminated all processes on all nodes with an exit code of zero |
| CA | CAncelled | Job was explicitly cancelled by the user or system administrator |
| F | Failed | Job terminated with non-zero exit code or other failure condition |
| NF | Node Failure | Job terminated due to failure of one or more allocated nodes |
There are a few other state codes that appear less often. If you see a state that is not listed above, check the manual page for squeue under JOB STATE CODES.
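squeue can also filter the listing by state with the -t (or --states) option, which is convenient when you only want to see, for example, your pending jobs:
trcis001:~$ squeue -u $USER -t PENDING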
The columns are for the most part self-explanatory. The final column briefly shows the reason why the job is not yet running. One of the jobs could not run due to a lack of Resources, i.e. there are no 80 compute nodes available at this point. The other job is pending due to Priority, i.e. there are resources available but they will be assigned to another job with higher priority.
Several messages can appear in the Reason column. A common set of reason messages with explanations follows.
| Reason | Description |
| --- | --- |
| Resources | Scheduler is unable to find sufficient idle resources to run your job |
| Priority | There are jobs with higher priority ahead of this job in the queue |
| QOSMaxCpuPerUserLimit | The CPU request exceeds the maximum each user is allowed to use |
| Licenses | The job is waiting for a license |
There are more reason codes; they can be found in the SLURM documentation on Resource Limits.
Canceling a job¶
Let's assume we want to cancel the job 122837 that we submitted above.
The job can be canceled with the command scancel, followed by the job ID of the job you want to cancel:
trcis001:~$ squeue -u $USER
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
122837 standby PI gufranco PD 0:00 80 (Resources)
trcis001:~$ scancel 122837
trcis001:~$ squeue -u $USER
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
As canceling a job is an irreversible action, it is suggested to use the interactive version, which allows you to double-check not only the job ID but also the job name and partition, in case of a mistyped job ID:
trcis001:~$ scancel -i 123807
Cancel job_id=123807 name=PI partition=standby [y/n]? y
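scancel can also select jobs by their attributes instead of by job ID. For example, all of your pending jobs in the standby partition could be removed at once; use this with care, as it is equally irreversible:
trcis001:~$ scancel -u $USER -p standby -t PENDING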
Estimation of the starting time for a job¶
SLURM tries to schedule all jobs as quickly as possible, subject to a number of dynamic constraints. These constraints could be cluster policies, available hardware, allocation priorities (contributors to the cluster get higher priority allocations), etc. Typically, jobs submitted to a day queue could start running within a day or so. Week queues could demand more waiting time, as jobs there run for longer. All this can vary at any given point in time on the cluster.
The command squeue has some arguments that can show you the scheduler's estimate of when a pending job will start running.
It is just the scheduler's best estimate, given current conditions, on a cluster whose state is constantly changing.
The actual time a job starts might be earlier or later than that, depending on factors such as the behavior of currently running jobs, the submission of new jobs, and hardware issues.
To see this, you need to request that squeue show the %S field in the output format option. One particularly good set of arguments for this situation is:
trcis001:~$ squeue -o "%.9i %.9P %.8j %.8u %.2t %.10M %.6D %S" -u $USER
JOBID PARTITION NAME USER ST TIME NODES START_TIME
123809 standby PI gufranco PD 0:00 40 2023-03-29T22:11:53
123808 standby PI gufranco PD 0:00 80 2023-03-30T15:10:06
It makes sense that requesting 80 nodes will take more time than requesting 40. It could happen that some of the running jobs finish before their walltime and the job enters execution earlier, or that newer jobs enter a partition with higher priority than standby and the job gets delayed. Use this estimate as guidance rather than a commitment.
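As an alternative to a custom format string, squeue also provides a --start option that lists pending jobs together with their estimated start times:
trcis001:~$ squeue --start -u $USER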
Detailed information about jobs¶
The information provided by the command squeue is sometimes not enough, and you would like to gather a more complete picture of the state of a particular job.
The command scontrol provides a wealth of information about jobs, as well as partitions and nodes.
Information about a job:
trcis001:~$ scontrol show job 123809
JobId=123809 JobName=PI
UserId=gufranco(318130) GroupId=its-rc-thorny(1079001) MCS_label=N/A
Priority=10675 Nice=0 Account=its-rc-admin QOS=normal
JobState=PENDING Reason=Resources Dependency=(null)
Requeue=0 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
RunTime=00:00:00 TimeLimit=04:00:00 TimeMin=N/A
SubmitTime=2023-03-29T19:05:08 EligibleTime=2023-03-29T19:05:08
AccrueTime=2023-03-29T19:05:08
StartTime=2023-03-29T22:11:53 EndTime=2023-03-30T02:11:53 Deadline=N/A
SuspendTime=None SecsPreSuspend=0 LastSchedEval=2023-03-29T19:18:33 Scheduler=Backfill:*
Partition=standby AllocNode:Sid=trcis001:28116
ReqNodeList=(null) ExcNodeList=(null)
NodeList= SchedNodeList=tcscm300,tjscl100,tjscm001,tmmcm[100-108],tngcm200,tpmcm[001-006],tsacs001,tsdcl[001-002],tsscl[001-002],ttmcm[100-101],tzecl[100-107],tzecs[100-105]
NumNodes=40-40 NumCPUs=800 NumTasks=40 CPUs/Task=20 ReqB:S:C:T=0:0:*:*
TRES=cpu=800,mem=7717080M,node=40,billing=800
Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
MinCPUsNode=20 MinMemoryNode=0 MinTmpDiskNode=0
Features=(null) DelayBoot=00:00:00
OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
Command=/gpfs20/users/gufranco/runjob_40n.slurm
WorkDir=/gpfs20/users/gufranco
StdErr=/gpfs20/users/gufranco/slurm-123809.out
StdIn=/dev/null
StdOut=/gpfs20/users/gufranco/slurm-123809.out
Power=
Information about a partition uses a similar command:
trcis001:~$ scontrol show partition standby
PartitionName=standby
AllowGroups=ALL AllowAccounts=ALL AllowQos=ALL
AllocNodes=ALL Default=YES QoS=N/A
DefaultTime=NONE DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO
MaxNodes=UNLIMITED MaxTime=04:00:00 MinNodes=0 LLN=NO MaxCPUsPerNode=UNLIMITED
NodeSets=compute
Nodes=taicm[001-009],tarcl100,tarcs[100,200-206,300-304],tbdcx001,tbmcs[001-011,100-103],
tbpcm200,tbpcs001,tcbcx100,tcdcx100,tcgcx300,tcocm[100-104],tcocs[001-064,100],
tcocx[001-003],tcscm300,tjscl100,tjscm001,tmmcm[100-108],tngcm200,tpmcm[001-006],
tsacs001,tsdcl[001-002],tsscl[001-002],ttmcm[100-101],tzecl[100-107],tzecs[100-115]
PriorityJobFactor=1 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=NO
OverTimeLimit=NONE PreemptMode=CANCEL
State=UP TotalCPUs=6140 TotalNodes=167 SelectTypeParameters=NONE
JobDefaults=(null)
DefMemPerNode=UNLIMITED MaxMemPerNode=UNLIMITED
TRES=cpu=6140,mem=27963188M,node=167,billing=6140
We can also ask for information about a compute node:
trcis001:~$ scontrol show node tbdcx001
NodeName=tbdcx001 Arch=x86_64 CoresPerSocket=20
CPUAlloc=40 CPUEfctv=40 CPUTot=40 CPULoad=42.22
AvailableFeatures=xl,compute,bio
ActiveFeatures=xl,compute,bio
Gres=(null)
NodeAddr=tbdcx001 NodeHostName=tbdcx001 Version=22.05.6
OS=Linux 3.10.0-1160.24.1.el7.x86_64 #1 SMP Thu Mar 25 21:21:56 UTC 2021
RealMemory=773491 AllocMem=0 FreeMem=646603 Sockets=2 Boards=1
State=ALLOCATED ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
Partitions=standby
BootTime=2023-02-23T22:45:08 SlurmdStartTime=2023-02-24T08:35:36
LastBusyTime=2023-03-29T15:58:11
CfgTRES=cpu=40,mem=773491M,billing=40
AllocTRES=cpu=40
CapWatts=n/a
CurrentWatts=0 AveWatts=0
ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
Controlling the time limit for a job¶
If no time limit is declared in the submission script or on the sbatch command line, a job receives the time limit associated with the partition where it was submitted.
Declaring a time limit helps the scheduler decide whether your job can start running ahead of others with larger time limits.
This is particularly important if your job uses one of the _week partitions and you only need 2 or 3 days.
In those cases, the _day partitions are not a good fit, but you can declare a 3-day time limit and the job could enter execution sooner than other jobs asking for an entire week.
To specify your estimated runtime, use the --time=TIME or -t TIME parameter to sbatch. The value TIME can be in any of the following formats:
| Format | Description | Example |
| --- | --- | --- |
| M | M minutes | -t 45 (45 minutes) |
| M:S | M minutes, S seconds | -t 2:30 (two minutes and 30 seconds) |
| H:M:S | H hours, M minutes, S seconds | -t 1:30:00 (one hour and a half) |
| D-H | D days, H hours | -t 3-12 (three days and a half) |
| D-H:M | D days, H hours, M minutes | -t 1-12:30 (1 day, 12 hours and 30 minutes) |
| D-H:M:S | D days, H hours, M minutes, S seconds | -t 6-23:59:59 (one second less than a full week) |
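For example, a job that needs roughly three days is better submitted to a _week partition with an explicit three-day limit rather than the default seven days; this gives the scheduler more room to start it early:
#SBATCH -p comm_small_week
#SBATCH -t 3-00:00:00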
Specifying number of Nodes, tasks and CPU Cores¶
Computationally intensive HPC applications generally use a combination of distributed parallel computing, multithreading or SMP parallelism, and accelerators.
The dominant interface for distributed parallel computing is MPI. MPI uses independent processes, also called ranks or tasks. Those processes can potentially run on separate computers, and that is the solution programmers implement when problems are so big that a single computer cannot complete the task in a reasonable amount of time.
Multithreading or SMP parallelism uses the multiple CPU cores present in all modern-day computers.
Those cores can all see the entire RAM of the machine.
SMP parallelism is implemented in codes that use OpenMP or OpenACC.
In high-level languages like Python, the multiprocessing module can take advantage of multiple CPU cores on the same machine.
Some linear algebra libraries also implement SMP parallelism, such as OpenBLAS and Intel MKL.
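For OpenMP codes and many threaded libraries, the number of threads is usually controlled through the OMP_NUM_THREADS environment variable. A common pattern in a submission script is to derive it from the number of CPU cores requested with -c; note that SLURM_CPUS_PER_TASK is only defined when --cpus-per-task (-c) was given, hence the fallback to 1. The executable name below is just a placeholder:
export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK:-1}
./my_threaded_program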
Sometimes, access to a certain portion of RAM is faster for one core than for another. We say that those systems have Non-Uniform Memory Access (NUMA). For example, a dual-socket machine could have a portion of the RAM associated with each CPU. In some cases we can gain important speedups by concentrating the memory a CPU uses in the RAM associated with it.
To better understand how all this can be controlled from SLURM, imagine that we have a code that will use a number of CPU cores distributed across several machines. Consider that N represents the number of MPI tasks and M the number of threads needed by the job. The N MPI tasks can run on N compute nodes, but not necessarily. On the other hand, if one of those MPI tasks uses multiple CPU cores, all those cores must be on the same machine. That is a necessary condition for multithreading parallelism.
Most jobs then fall into one of these categories (a sketch of typical resource requests for each follows this list):
Sequential/Serial: a single task running on a single CPU core.
Shared Memory Parallel: a single task using several CPU cores on one compute node.
Distributed Parallelism: several tasks (e.g. MPI ranks), each on its own CPU core, possibly spread across several compute nodes.
Hybrid SMP+Distributed Parallelism: several tasks, each using several CPU cores, with every task confined to a single compute node.
Accelerator Based Parallelism: jobs that use one or more GPUs or other accelerators in addition to CPU cores.
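A minimal sketch of the resource requests that typically match each category follows. The executable names are placeholders, the numbers are only illustrative, and the GPU request assumes the generic resource is named gpu, as is typical:
# Sequential/Serial: one task on one CPU core
#SBATCH -N 1 -n 1 -c 1
# Shared Memory Parallel: one task using 20 cores on a single node
#SBATCH -N 1 -n 1 -c 20
# Distributed Parallelism (MPI): 80 tasks, one core each, spread over 2 nodes
#SBATCH -N 2 -n 80 -c 1
# launched with: srun ./my_mpi_program
# Hybrid SMP+Distributed: 8 MPI tasks, each with 10 cores
#SBATCH -N 2 -n 8 -c 10
# launched with: export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK:-1}; srun ./my_hybrid_program
# Accelerator Based: one task plus one GPU on a GPU partition
#SBATCH -p comm_gpu_week -n 1 --gres=gpu:1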
Translating PBS Scripts to Slurm Scripts¶
All our previous clusters, Mountaineer, Spruce Knob, and Thorny Flat, used Torque as Resource Manager and Moab as Scheduler. In 2022 we transitioned to SLURM as workload manager.
For the most part the transition is transparent, as we installed a set of wrappers that translate on the fly the usual commands from Torque and Moab into the equivalents for SLURM.
The same happens with submission scripts, where the lines with #PBS get translated into the corresponding #SBATCH versions.
That being said, it is good for our users, especially new users, to get used to SLURM commands and directives and to avoid, as much as possible, the Torque commands, which could become deprecated at some point in the future or whose wrappers might not be installed on future HPC clusters.
The following tables contain common commands and terms used with the TORQUE/PBS resource manager and scheduler, and the corresponding commands and terms used under the SLURM workload manager. These tables can assist you in translating existing PBS scripts into proper SLURM scripts that can be interpreted directly. The same tables can be used as a reference when writing new submission scripts directly in SLURM format.
| User Commands | PBS/Torque | SLURM |
| --- | --- | --- |
| Job submission | `qsub <script>` | `sbatch <script>` |
| Job deletion | `qdel <jobid>` | `scancel <jobid>` |
| Job status (by job) | `qstat <jobid>` | `squeue -j <jobid>` |
| Job status (by user) | `qstat -u <user>` | `squeue -u <user>` |
| Job hold | `qhold <jobid>` | `scontrol hold <jobid>` |
| Job release | `qrls <jobid>` | `scontrol release <jobid>` |
| Queue list | `qstat -Q` | `squeue` |
| Node list | `pbsnodes -l` | `sinfo -N` or `scontrol show nodes` |
| Cluster status | `qstat -a` | `sinfo` |
| Environment | PBS/Torque | SLURM |
| --- | --- | --- |
| Job ID | `$PBS_JOBID` | `$SLURM_JOB_ID` |
| Submit Directory | `$PBS_O_WORKDIR` | `$SLURM_SUBMIT_DIR` |
| Submit Host | `$PBS_O_HOST` | `$SLURM_SUBMIT_HOST` |
| Node List | `$PBS_NODEFILE` | `$SLURM_JOB_NODELIST` |
| Job Name | `$PBS_JOBNAME` | `$SLURM_JOB_NAME` |
| Number of nodes | `$PBS_NUM_NODES` | `$SLURM_JOB_NUM_NODES` |
| Number of cores per node | `$PBS_NUM_PPN` | `$SLURM_CPUS_ON_NODE` |
| Unique index used for Job Arrays | `$PBS_ARRAYID` | `$SLURM_ARRAY_TASK_ID` |
| Job Specification | PBS/Torque | SLURM |
| --- | --- | --- |
| Script directive | `#PBS` | `#SBATCH` |
| Queue/Partition | `-q <queue>` | `-p <partition>` |
| Node Count | `-l nodes=<count>` | `-N <count>` |
| Total Task Count | `-l ppn=<count>` | `-n <count>` |
| Total Task Count | `-l procs=<count>` | `--ntasks=<count>` |
| Wall Clock Limit | `-l walltime=<hh:mm:ss>` | `-t <days-hh:mm:ss>` |
| Standard Output File | `-o <file>` | `-o <file>` or `--output=<file>` |
| Standard Error File | `-e <file>` | `-e <file>` or `--error=<file>` |
| Write stderr -> stdout | `-j oe` | (use -o without -e) |
| Write stdout -> stderr | `-j eo` | (use -e without -o) |
| Copy Environment | `-V` | `--export=ALL` |
| Event Notification | `-m <a,b,e>` | `--mail-type=<BEGIN,END,FAIL,ALL>` |
| Email Address | `-M <address>` | `--mail-user=<address>` |
| Job Name | `-N <name>` | `-J <name>` or `--job-name=<name>` |
| Job Restart | `-r <y/n>` | `--requeue` or `--no-requeue` |
| Resource Sharing | `-l naccesspolicy=singlejob` | `--exclusive` or `--oversubscribe` |
| Memory Size | `-l mem=<MB>` | `--mem=<MB>` |
| Memory Size | `-l pmem=<MB>` (per processor) | `--mem-per-cpu=<MB>` |
| Accounts to charge | `-A <account>` | `-A <account>` or `--account=<account>` |
| Tasks Per Node | `-l nodes=<count>:ppn=<count>` | `--ntasks-per-node=<count>` |
| CPUs Per Task | N/A | `--cpus-per-task=<count>` |
| Job Dependency | `-W depend=<state:jobid>` | `--dependency=<state:jobid>` |
| Quality of Service | `-l qos=<name>` | `--qos=<name>` |
| Job Arrays | `-t <array_spec>` | `--array=<array_spec>` |
| Generic Resources | `-l other=<resource>` | `--gres=<resource_spec>` |
| Job Enqueue Time | `-a <YYYYMMDDhhmm>` | `--begin=YYYY-MM-DD[THH:MM[:SS]]` |
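As an illustration of how the table entries combine, here is a small PBS script and a hand-translated SLURM equivalent. The program name myprogram is a placeholder, and the options map one-to-one following the tables above:
#!/bin/bash
#PBS -N myjob
#PBS -q standby
#PBS -l nodes=1:ppn=4,walltime=04:00:00
#PBS -m ae
#PBS -M user@mailserver.com
cd $PBS_O_WORKDIR
./myprogram
The SLURM version of the same script would be:
#!/bin/bash
#SBATCH -J myjob
#SBATCH -p standby
#SBATCH -N 1
#SBATCH --ntasks-per-node=4
#SBATCH -t 04:00:00
#SBATCH --mail-type=END,FAIL
#SBATCH --mail-user=user@mailserver.com
cd $SLURM_SUBMIT_DIR
./myprogram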
OLD DOCUMENTATION¶
Resource Specification¶
The #PBS -l option is used to specify resources such as the number of CPUs, nodes, and length of walltime for the job. The most common resources specified for the Mountaineer cluster are:
| Resource | Description |
| --- | --- |
| nodes | Number of nodes needed |
| walltime | Maximum limit for walltime given in the format hh:mm:ss |
| ppn | Processors per node |
| procs | Number of processors requested |
| pvmem | Maximum amount of memory used by any single process in the job |
| vmem | Maximum amount of memory used by all concurrent processes in the job |
Note: procs is used when you do not require each CPU to be on the same node.
For example, the PBS directive
#PBS -l nodes=1:ppn=6,walltime=06:00:00
Specifies that the job will need 6 processors located on a single node with a maximum run time of 6 hours. Notice there is no space between commas or equal signs. Alternatively, if nodes=1 had not been specified (procs=6 instead), the scheduler would just grab the first 6 processors available regardless of what nodes they reside on (which will only work if your program supports distributed computing). In general, unless you are running jobs using MPI libraries (mpirun) or posix threads, you will most likely only specify a single processor for your job (procs=1). Note: Resources specified per node are given with the nodes directive and separated with a :, on the same line in your script.
Requesting Memory Specifications¶
Requesting memory specifications for jobs is done with the attributes vmem or pvmem through the PBS -l directive (resource specification). The man pages of pbs will specify two other memory related attributes: mem and pmem. However, these two attributes measure different job resources than virtual memory and therefore are not stable for use the way we commonly think of memory (use of RAM). In other words, do not use the attributes mem and pmem - they most likely do not do what you think they do.

vmem and pvmem will put resource limits on the amount of RAM a job can access. This is important to ensure two large memory jobs do not end up on the same node, exceeding the node's memory limits and causing a node crash (which will kill all jobs on the node). If you do not specify memory limits, Moab will assume a uniform distribution of memory across all jobs on the node. For example, a 16 processor/64Gb of RAM compute node will assume roughly 4Gb of RAM per processor. However, if a job using 62 Gb of RAM and only 8 cores is running on a compute node, without memory limits Moab will place 8 more processor jobs on that node when clearly there is not enough memory for any remaining jobs. This will crash the node.

Therefore, we recommend that if you anticipate your jobs are going to use more than an average of 3Gb per processor, you specify memory limits for your job using pvmem or vmem. On Spruce community nodes and Mountaineer we enforce this by making the system default pvmem=3gb. On these systems, not specifying memory above 3Gb will cause your job to fail. This is important, because on community nodes if you specify a job with 5 cores and vmem=25Gb, the job will still fail if it exceeds 15Gb because pvmem=3gb is assigned to each job by default (i.e. vmem does not override pvmem settings). To make your PBS scripts portable across community nodes and private nodes, we recommend that you only use pvmem to specify memory limits of jobs.

The pvmem attribute specifies the maximum amount of virtual memory used by any single process in the job. Therefore, if you want a job that uses 6 processors and needs 35 Gb of RAM you would specify the following resource directive line:
#PBS -l nodes=1:ppn=6,pvmem=6gb
pvmem=6gb with 6 processors specifies 6*6 = 36Gb of total memory for the job.
Requesting Certain Node Types¶
There might be times when you want to request a node with a particular feature or processor. The following will allow you to accomplish this task. Replace 'feature_name' with one of the features in the table below.
#PBS -l feature=feature_name
Note, you can also request that a particular feature NOT be used by doing the following:
#PBS -l feature='!feature_name'
Available Features¶
| Feature | Description |
| --- | --- |
| smb | Sandy Bridge Based Processor Nodes |
| ivy | Ivy Bridge Based Processor Nodes |
| haswell | Haswell Based Processor Nodes |
| broadwell | Broadwell Based Processor Nodes |
| avx | Processors with AVX Extension |
| avx2 | Processors with AVX2 Extension |
| f16c | Processors with f16c Extension |
| adx | Processors with adx Extension |
| large | Nodes with 512 GB of memory |
E-mail options¶
The #PBS -m and #PBS -M options are used to specify when and to whom the scheduler will send e-mails. The -m option consists of either the single character “n”, or one or more of the characters “a”, “b”, and “e”.
| Option | Description |
| --- | --- |
| n | No mail will be sent |
| a | Mail is sent when the job is aborted by the batch system |
| b | Mail is sent when the job begins execution |
| e | Mail is sent when the job ends |
Note: If the -m option is not specified, mail will be sent if the job is aborted.
The shellscript option #PBS -M specifies the e-mail addresses to send mail to. For example, the PBS directive
#PBS -m ae
#PBS -M user@mailserver.com
The scheduler will send an e-mail to user@mailserver.com if the job is aborted, or when the job is completed. To specify more than one e-mail address with the -M option, each address should be separated with a comma without any spaces.
To Receive no e-mails even on aborts¶
Even with the ‘n’ option of ‘-m’ directive, the system will still send an e-mail if the job is cancelled or aborts. To provide the ability for our users to circumvent this response, we have set-up an alias e-mail address that can be used to bounce these e-mails. To receive absolutely no e-mails from the system, no matter what happens before, during and after execution of your job, use the noemail@hpc.wvu.edu address with the ‘n’ option:
#PBS -m n
#PBS -M noemail@hpc.wvu.edu
Output file specification¶
By default, the standard output and standard error of the job will be placed in files named jobname.ojobid and jobname.ejobid, respectively. These files will be written to the directory in which the qsub command was executed. Here jobname is specified using the -N option and jobid is given at run time by the system. The #PBS -e and #PBS -o options are used to specify which files should be written for the standard error and standard output streams, respectively.
| Option | Description |
| --- | --- |
| -e | pathname for standard error stream output |
| -o | pathname for standard output stream output |
For example, the PBS directives
#PBS -e /scratch/username/examplejob.error
#PBS -o /scratch/username/examplejob.output
The scheduler will write the files /scratch/username/examplejob.error and /scratch/username/examplejob.output for the standard error and standard output streams, respectively.
Note: Use full pathnames for your home directory and scratch directory
Requesting Array jobs¶
By using the directive #PBS -t, you can request a job to be repeated by a single script a number of times. This is useful if you have data where you want a single parameter to range over a set of numbers. For instance, if I wanted a series of commands to be run, with a single variable in the command ranging over 10-20, I could use the following directives in my shell script:
#PBS -N demographic_${PBS_ARRAYID}
#PBS -l nodes=1:ppn=2
#PBS -t 10-20
mkdir output_${PBS_ARRAYID}/
cd output_${PBS_ARRAYID}/
$SCRATCH/demographic_model.py -input_parameter ${PBS_ARRAYID} -procs 2 -output_file demographic_output.txt
The above script would launch eleven jobs (array IDs 10 through 20). Each job would have the name demographic_${PBS_ARRAYID}; so the first job would be named demographic_10, the second job would be named demographic_11, and so forth. Each job would run on a single node with 2 processors (specified as #PBS -l nodes=1:ppn=2). Further, each job would make a directory named output_${PBS_ARRAYID} (first job output_10, second job output_11, and so forth), cd into that directory, and execute the python script demographic_model.py from my scratch directory. Notice that one of the input parameters changes for each job using the PBS environment variable PBS_ARRAYID. Array requests are very useful in scientific environments when you need to modify a parameter and see the output for a range of values. Note: this is a theoretical example, since I never specified a walltime or a queue for this job.
The number range for array request does not have to be sequential. You can also list a comma separated list of numbers as
#PBS -t 10,15,20,25
Further, you can also specify that only a certain number of jobs are queued at one time in cases where you have a large number of jobs and need to share a queue with another user
#PBS -t 1-200%10
The above directive will only launch ten jobs to the queue at a time until all 200 job requests have been executed.
Interactive Jobs¶
Interactive jobs allow a user to be given an interactive terminal on a compute node. This allows the user to "interact" directly with a compute node instead of running in a batch or scripted mode. Interactive jobs are very useful when debugging jobs, as they allow you to walk step-by-step through your submit script to find errors or problems. Interactive jobs are also useful when you need to use a graphical program on the cluster.
To run an interactive job use the following command followed by any necessary PBS variables/flags. If you don’t specify any flags, you will be given an interactive job in the default queue for the cluster.
qsub -I
Do note, interactive jobs are only allowed on certain queues. All condo owner queues are allowed to have interactive jobs as well as queues such as ‘standby’ and ‘debug’. If you find you need an interactive queue on a community resource for a particular task or project, please contact Research Computing Help Desk for assistance.
Graphical Interface Jobs¶
Sometimes it might be useful or required to run a graphical program on the cluster. Non-compute-intensive processes for visualization purposes can be run on the login node. These processes could include gnuplot, R, and Matlab, assuming they have low overhead. However, if you know your program will consume a lot of resources, it is best to run an interactive job.
To execute a graphical application on a compute node, you need to first review Using X Windows applications to properly setup your X (i.e. display) environment. To launch a graphical job on a compute node, you will need to execute the following along with any necessary flags/pbs environment variables.
$> qsub -I -X
Once you are given access to an interactive terminal, you can run the proper executable to launch your graphical (i.e. X Window) program. For example:
$> module load statistics/matlab
$> matlab &
Checking the Status of Jobs¶
The status of a job currently submitted to the queue can be checked using the checkjob command. checkjob displays detailed job state information and diagnostic output for a specified job. Detailed information is available for queued, blocked, active, and recently completed jobs. Users can use checkjob to view the status of their own jobs.
Examples:
$> checkjob -v <jobid>
where <jobid> is the jobid given at submission time.
The output of checkjob looks like this
job 1653450 (RM job '1653450.srih0001.hpc.wvu.edu')
AName: IVY
State: Completed
Completion Code: 0 Time: Fri May 19 15:30:21
Creds: user:username group:groupname class:debug qos:member
WallTime: 00:00:16 of 00:01:00
SubmitTime: Fri May 19 15:29:58
(Time Queued Total: 00:00:07 Eligible: 00:00:07)
Deadline: 3:59:49 (Fri May 19 19:30:58)
TemplateSets: DEFAULT
Total Requested Tasks: 1
Req[0] TaskCount: 1 Partition: torque
Opsys: --- Arch: --- Features: ivy
GMetric[energy_used] Current: 0.00 Min: 0.00 Max: 0.00 Avg: 0.00 Total: 0.00
NodeAccess: SINGLEJOB
TasksPerNode: 1
Allocated Nodes:
[sgpc0001.hpc.wvu.edu:1]
SystemID: Moab
SystemJID: 1653450
Notification Events: JobEnd,JobFail
Task Distribution: sgpc0001.hpc.wvu.edu
UMask: 0000
OutputFile: srih0001.hpc.wvu.edu:/gpfs/home/username/IVY.o1653450
ErrorFile: srih0001.hpc.wvu.edu:/gpfs/home/username/IVY.e1653450
StartCount: 1
Execution Partition: torque
SrcRM: torque DstRM: torque DstRMJID: 1653450.srih0001.hpc.wvu.edu
Submit Args: runjob_ivy.pbs
Flags: RESTARTABLE
Attr: checkpoint
StartPriority: 1000
PE: 1.00
Sometimes your job is rejected and you still get a jobid. In that case you can check the reasons with checkjob. For example, consider this submission script where we ask for too much memory for a serial job.
The submission script looks like:
#!/bin/sh
#PBS -N TEST
#PBS -l nodes=1:ppn=1,vmem=200g
#PBS -l walltime=00:01:00
#PBS -m ae
#PBS -q groupname
#PBS -n
cd $PBS_O_WORKDIR
date
The job is accepted by Torque but we will see it in the queue for a long time. Now we execute checkjob to learn the reasons why it is not running:
$> checkjob -v 1653589
job 1653589 (RM job '1653589.srih0001.hpc.wvu.edu')
AName: TEST
State: Idle
Creds: user:username group:groupname class:groupname qos:member
WallTime: 00:00:00 of 00:01:00
BecameEligible: Fri May 19 15:52:14
SubmitTime: Fri May 19 15:51:52
(Time Queued Total: 00:01:06 Eligible: 00:00:53)
Deadline: 3:59:54 (Fri May 19 19:52:52)
TemplateSets: DEFAULT
Total Requested Tasks: 1
Req[0] TaskCount: 1 Partition: ALL
Memory >= 0 Disk >= 0 Swap >= 3072M
Dedicated Resources Per Task: PROCS: 1 SWAP: 200G
NodeAccess: SINGLEJOB
TasksPerNode: 1
Reserved Nodes: (3:09:16:24 -> 3:09:17:24 Duration: 00:01:00)
[sarc3001.hpc.wvu.edu:1]
SystemID: Moab
SystemJID: 1653589
Notification Events: JobEnd,JobFail
UMask: 0000
OutputFile: srih0001.hpc.wvu.edu:/gpfs/home/username/TEST.o1653589
ErrorFile: srih0001.hpc.wvu.edu:/gpfs/home/username/TEST.e1653589
Partition List: torque
SrcRM: torque DstRM: torque DstRMJID: 1653589.srih0001.hpc.wvu.edu
Submit Args: runjob_badmem.pbs
Flags: RESTARTABLE
Attr: checkpoint
StartPriority: 2000
PE: 37.34
Reservation '1653589' (3:09:16:24 -> 3:09:17:24 Duration: 00:01:00)
Node Availability for Partition torque --------
srig0001.hpc.wvu.edu rejected: Swap
szec2001.hpc.wvu.edu rejected: State (Busy)
szec2002.hpc.wvu.edu rejected: State (Busy)
szec2003.hpc.wvu.edu rejected: State (Busy)
...
sbmc0017.hpc.wvu.edu rejected: State (Busy)
sbmc0018.hpc.wvu.edu rejected: State (Busy)
sbmg0001.hpc.wvu.edu rejected: Swap
sric0001.hpc.wvu.edu rejected: Swap
sric0002.hpc.wvu.edu rejected: Swap
ssmc0006.hpc.wvu.edu rejected: Swap
sgsc2001.hpc.wvu.edu rejected: Class
sgsg2001.hpc.wvu.edu rejected: Swap
sric0022.hpc.wvu.edu rejected: Class
sric0025.hpc.wvu.edu rejected: State (Busy)
sbmc0019.hpc.wvu.edu rejected: State (Busy)
sbmc0020.hpc.wvu.edu rejected: Swap
sbmc0021.hpc.wvu.edu rejected: State (Busy)
sbmc0022.hpc.wvu.edu rejected: State (Busy)
sric0024.hpc.wvu.edu rejected: Swap
sllc0001.hpc.wvu.edu rejected: Swap
...
sspc3006.hpc.wvu.edu rejected: Swap
sspc3007.hpc.wvu.edu rejected: Swap
sspc3008.hpc.wvu.edu rejected: Swap
sspc3009.hpc.wvu.edu rejected: State (Running)
sspc3010.hpc.wvu.edu rejected: Swap
NOTE: job req cannot run in partition torque (available procs do not meet requirements : 0 of 1 procs found)
idle procs: 623 feasible procs: 0
Node Rejection Summary: [Class: 2][State: 110][Swap: 53]
The “Swap” reason is memory related. The “State” reason is CPU related. The queue system searched among 623 idle cores and could not find a single machine with 200GB of memory available to launch the job.
Another important tool to monitor jobs and their state is showq.
You can get the eligible jobs and their priorities with
showq -i -u <username>
For example
$ showq -i -u username
eligible jobs----------------------
JOBID PRIORITY XFACTOR Q USERNAME GROUP PROCS WCLIMIT CLASS SYSTEMQUEUETIME
1579829* 14108 1.7 me username groupname 16 14:00:00:00 groupname Tue May 9 12:09:46
1595467* 10599 1.6 me username groupname 4 14:00:00:00 groupname Thu May 11 22:39:11
1595464* 10599 1.6 me username groupname 4 14:00:00:00 groupname Thu May 11 22:39:11
1595468* 10599 1.6 me username groupname 4 14:00:00:00 groupname Thu May 11 22:39:11
1595466* 10599 1.6 me username groupname 4 14:00:00:00 groupname Thu May 11 22:39:11
1595463* 10599 1.6 me username groupname 4 14:00:00:00 groupname Thu May 11 22:39:10
1595465* 10599 1.6 me username groupname 4 14:00:00:00 groupname Thu May 11 22:39:11
1595462* 10599 1.6 me username groupname 4 14:00:00:00 groupname Thu May 11 22:39:10
1618053* 6423 1.3 me username groupname 2 14:00:00:00 groupname Sun May 14 20:15:33
1618385* 6363 1.3 me username groupname 4 14:00:00:00 groupname Sun May 14 21:14:58
1618386* 6363 1.3 me username groupname 4 14:00:00:00 groupname Sun May 14 21:14:58
1618387* 6363 1.3 me username groupname 4 14:00:00:00 groupname Sun May 14 21:14:59
1618388* 6363 1.3 me username groupname 4 14:00:00:00 groupname Sun May 14 21:14:59
1630355* 3967 1.2 me username groupname 4 14:00:00:00 groupname Tue May 16 13:11:17
1630507* 3903 1.2 me username groupname 4 14:00:00:00 groupname Tue May 16 14:15:09
1630546* 3884 1.2 me username groupname 16 14:00:00:00 groupname Tue May 16 14:34:33
1630494* 1 1.4 co username groupname 16 7:00:00:00 comm_larg Tue May 16 14:08:50
1630349* 1 1.4 co username groupname 16 7:00:00:00 comm_larg Tue May 16 13:10:08
18 eligible jobs
Total jobs: 18
Those are jobs that accrue priority as time passes while they wait in the queue. Some jobs could become blocked, meaning that they are not gaining priority but will eventually become eligible later in time.
$ showq -b -u username
blocked jobs-----------------------
JOBID USERNAME GROUP STATE PROCS WCLIMIT QUEUETIME
1623738 username groupname Idle 16 7:00:00:00 Mon May 15 13:49:50
1623747 username groupname Idle 16 7:00:00:00 Mon May 15 13:51:21
1623757 username groupname Idle 16 7:00:00:00 Mon May 15 13:52:57
1652487 username groupname Idle 16 4:00:00 Fri May 19 12:24:44
1646112 username groupname Idle 4 4:00:00 Thu May 18 15:20:54
1646096 username groupname Idle 4 4:00:00 Thu May 18 15:17:55
1630495 username groupname Idle 4 7:00:00:00 Tue May 16 14:10:13
1630501 username groupname Idle 16 7:00:00:00 Tue May 16 14:11:17
1623766 username groupname Idle 16 7:00:00:00 Mon May 15 13:55:24
1623746 username groupname Idle 16 7:00:00:00 Mon May 15 13:50:50
1623749 username groupname Idle 16 7:00:00:00 Mon May 15 13:51:48
1623751 username groupname Idle 16 7:00:00:00 Mon May 15 13:52:25
1646143 username groupname Idle 16 7:00:00:00 Thu May 18 15:26:36
1623759 username groupname Idle 16 7:00:00:00 Mon May 15 13:53:51
1623758 username groupname Idle 16 7:00:00:00 Mon May 15 13:53:29
1623760 username groupname Idle 16 7:00:00:00 Mon May 15 13:54:53
1623740 username groupname Idle 16 7:00:00:00 Mon May 15 13:50:23
1623731 username groupname Idle 16 7:00:00:00 Mon May 15 13:49:08
1630569 username groupname Idle 16 7:00:00:00 Tue May 16 14:48:03
1623739 username groupname Idle 16 7:00:00:00 Mon May 15 13:49:53
1623732 username groupname Idle 16 7:00:00:00 Mon May 15 13:49:10
21 blocked jobs
Total jobs: 21
Finally, you can see the jobs that are currently running, with the time remaining until they hit their walltime:
$ showq -r -u username
active jobs------------------------
JOBID S PAR EFFIC XFACTOR Q USERNAME GROUP MHOST PROCS REMAINING STARTTIME
1599005 R tor 24.99 1.0 co username groupname sric0011.hpc.wvu 16 00:24:38 Fri May 12 17:01:10
1599006 R tor 24.99 1.0 co username groupname sric0020.hpc.wvu 16 00:51:08 Fri May 12 17:27:40
1599007 R tor 24.98 1.0 co username groupname sric0021.hpc.wvu 16 1:03:41 Fri May 12 17:40:13
1599008 R tor 24.99 1.0 co username groupname sric0023.hpc.wvu 16 1:04:45 Fri May 12 17:41:17
1599009 R tor 24.99 1.1 co username groupname sric0032.hpc.wvu 16 4:42:25 Fri May 12 21:18:57
1599010 R tor 24.99 1.1 co username groupname sric0026.hpc.wvu 16 4:42:25 Fri May 12 21:18:57
1599011 R tor 24.99 1.1 co username groupname sric0017.hpc.wvu 16 10:10:42 Sat May 13 02:47:14
1546851 R tor 99.73 2.6 co username groupname sric0025.hpc.wvu 16 2:13:45:30 Mon May 15 06:22:02
1570354 R tor 87.78 1.0 me username groupname sarc3001.hpc.wvu 16 3:08:32:50 Tue May 9 01:09:22
1595446 R tor 98.27 1.0 me username groupname sarc2001.hpc.wvu 4 6:06:02:54 Thu May 11 22:39:26
1595448 R tor 99.98 1.0 me username groupname sarc2001.hpc.wvu 4 6:07:29:35 Fri May 12 00:06:07
1595449 R tor 99.99 1.0 me username groupname sarc0001.hpc.wvu 4 6:08:21:37 Fri May 12 00:58:09
1595453 R tor 99.99 1.0 me username groupname sarc0002.hpc.wvu 4 6:08:49:41 Fri May 12 01:26:13
1618813 R tor 24.77 1.7 co username groupname sric0037.hpc.wvu 16 6:20:47:59 Fri May 19 13:24:31
1618812 R tor 24.77 1.7 co username groupname sric0051.hpc.wvu 16 6:20:47:59 Fri May 19 13:24:31
1618814 R tor 24.78 1.7 co username groupname sric0036.hpc.wvu 16 6:20:47:59 Fri May 19 13:24:31
1618815 R tor 24.84 1.7 co username groupname sric0030.hpc.wvu 16 6:20:54:14 Fri May 19 13:30:46
1595460 R tor 99.97 1.1 me username groupname sarc0006.hpc.wvu 4 8:06:16:50 Sat May 13 22:53:22
1595461 R tor 99.97 1.2 me username groupname sarc0009.hpc.wvu 4 8:13:36:38 Sun May 14 06:13:10
19 active jobs 232 of 3112 processors in use by local jobs (7.46%)
155 of 165 nodes active (93.94%)
Total jobs: 19
Canceling/Removing a Job¶
Jobs can be cancelled or removed using the canceljob command. Users can only remove jobs they submitted to the scheduler.
$> canceljob <jobid>
where <jobid> is the jobid given at submission time.
canceljob is now deprecated, and Moab offers an alternative way to cancel jobs. For example, if you want to cancel jobs whose IDs start with 1693, you can use the command below. As a user you can only cancel jobs that you own, so do not worry about canceling jobs from other users by doing this.
$> mjobctl -c "x:1693.*"
Adding Prologue and Epilogue scripts to a Job¶
It is possible to declare scripts that run before and after the execution of the main submission script. The main advantage of these is keeping a record of the conditions under which a given job runs. Here we present a simple example of how to declare a prologue and an epilogue.
Add these lines to your submission script:
#PBS -l prologue=/absolute/path/to/prologue.sh
#PBS -l epilogue=/absolute/path/to/epilogue.sh
The best way of working with these scripts is to add them to your home folder and use them in all your submission scripts. They should collect information that you can use later for debugging or profiling purposes.
Example of Prologue¶
prologue.sh
#!/bin/sh
echo ""
echo "Prologue Args:"
echo "Job ID: $1"
echo "User ID: $2"
echo "Group ID: $3"
echo ""
env | sort
hostname
date
exit 0
Example of Epilogue¶
epilogue.sh
#!/bin/sh
echo ""
echo "Epilogue Args:"
echo "Job ID: $1"
echo "User ID: $2"
echo "Group ID: $3"
echo "Job Name: $4"
echo "Session ID: $5"
echo "Resource List: $6"
echo "Resources Used: $7"
echo "Queue Name: $8"
echo "Account String: $9"
echo ""
env | sort
hostname
date
exit 0
Both the prologue and epilogue scripts must be made executable; use
chmod +x prologue.sh epilogue.sh
to change their permissions.
Samples of Job Submission scripts¶
Below are bash scripts that can be modified and submitted to the qsub command for job submission. For details about the different parts of the scripts please visit the Running Jobs page. These scripts can be copied into a text file using any number of text editors (e.g. vi, emacs, etc.).
Script for running a non-array batch queue¶
The script below has PBS directives to set up commonly used variables such as the job name, the resources needed, the e-mail address for notification upon job completion and abnormal termination, and the queue to run on.
#!/bin/sh
#This is an example script for executing generic jobs with
# the use of the command 'qsub <name of this script>'
#These commands set up the Grid Environment for your job. Words surrounded by brackets ('<','>') should be changed
#Any of the PBS directives can be commented out by placing another pound sign in front
#example
##PBS -N name
#The above line will be skipped by qsub because of the two consecutive # signs
# Specify job name
#PBS -N <name>
# Specify the resources need for the job
# Walltime is specified as hh:mm:ss (hours:minutes:seconds)
#PBS -l nodes=<number_of_nodes>:ppn=<number_of_processors_per_node>,walltime=<time_needed_by_job>
# Specify when Moab should send e-mails 'ae' below user will
# receive e-mail for any errors with the job and/or upon completion
# If you don't want e-mails just comment out these next two PBS lines
#PBS -m ae
# Specify the e-mail address to receive above mentioned e-mails
#PBS -M <email_address>
# Specify the queue to execute the task in. Current options can be found by executing the command qstat -q at the terminal
#PBS -q <queue_name>
# Enter your command below with arguments just as if you were going to execute it on the command line
# It is generally good practice to issue a 'cd' command into the directory that contains the files
# you want to use or use full path names
Script for running an array batch queue¶
Script is the same as above, but adds PBS -t to execute array request job submissions.
#!/bin/sh
#This is an example script for executing generic jobs with
# the use of the command 'qsub <name of this script>'
#These commands set up the Grid Environment for your job. Words surrounded by brackets ('<','>') should be changed
#Any of the PBS directives can be commented out by placing another pound sign in front
#example
##PBS -N name
#The above line will be skipped by qsub because of the two consecutive # signs
# Specify job name, use ${PBS_ARRAYID} to ensure names and output/error files have different names
#PBS -N <name>_${PBS_ARRAYID}
# Specify the range for the PBS_ARRAYID environment variable
# <num_range> can be a continuous range like 1-200 or 5-20
# or <num_range> can be a comma separated list of numbers like 5,15,20,55
# You can also specify the maximum number of jobs queued at one time with the percent sign
# so a <num_range> specified as 5-45%8 would launch forty-one jobs with IDs in the range 5-45, but only queue 8 at a time until
# all jobs are completed.
# Further, you can mix and match continuous ranges and lists like 1-10,15,25-40%10
#PBS -t <num_range>
# Specify the resources need for the job
# Walltime is specified as hh:mm:ss (hours:minutes:seconds)
#PBS -l nodes=<number_of_nodes>:ppn=<number_of_processors_per_node>,walltime=<time_needed_by_job>
# Specify when Moab should send e-mails 'ae' below user will
# receive e-mail for any errors with the job and/or upon completion
# If you don't want e-mails just comment out these next two PBS lines
#PBS -m ae
# Specify the e-mail address to receive above mentioned e-mails
#PBS -M <email_address>
# Specify the queue to execute the task in. Current options can be found by executing the command qstat -q at the terminal
#PBS -q <queue_name>
# Enter your command below with arguments just as if you were going to execute it on the command line
# It is generally good practice to issue a 'cd' command into the directory that contains the files
# you want to use or use full path names
# Any parameter or filename that needs to use the current job number of the array number range use ${PBS_ARRAYID}