Workload Manager (SLURM)

The workload manager is the software tool that makes a computer cluster appear and work like a single entity rather than an aggregate of computers on a network. Our clusters use SLURM as their workload manager, and users must be familiar with a few commands to use an HPC cluster effectively.

We will describe the different commands throughout this section, but if you are eager to see the list, here is a table with the commands you will use most often:

User Commands

Command    Purpose
sbatch     Submit a batch script for later execution.
scancel    Signal a job to be removed from the queue or its execution stopped.
squeue     View information about jobs currently in the queue or in execution.
sinfo      View information about nodes and partitions.
scontrol   View and modify the configuration and state of jobs, nodes, and partitions.
sacct      View accounting information, including data about previous jobs.
salloc     Obtain an interactive job allocation.
srun       Execute an application, including its allocation if needed.

As you can see, the commands above deal with three main concepts: jobs, nodes, and partitions. To be able to work effectively with SLURM, you must be familiar with these three concepts.

SLURM Concepts

To use SLURM effectively, we need to understand the concepts of compute node, partition, and job. SLURM provides a set of commands to submit, cancel, and monitor jobs. These jobs will execute on compute nodes that are organized logically in partitions. Let us elaborate on these three concepts before submitting our first job.

Nodes

A High-Performance Computing cluster (HPC cluster) is made of a collection of computers. The term used for each computer is “node”. These nodes are linked through a fast network such as Gigabit Ethernet or Infiniband. In the case of Thorny Flat, we use the Omni-Path Architecture (OPA) as the fast network fabric. In the case of Dolly Sods, the fabric is Infiniband.

The nodes in an HPC cluster are configured to serve different purposes. A typical organization of a cluster divides the nodes into four categories:

  • Management nodes: These computers run management services, databases, monitoring tools, reporting applications, provisioning tools, and other related services system administrators use. Regular users have no direct access to these nodes.

  • Login nodes: These computers are the machines where users log in through SSH to submit jobs and check results. Users must never use login nodes for any intense computations, as many people are served by these machines, and they should be kept at a reasonable load to properly serve the users currently connected.

  • Storage nodes: These computers host user files in possibly multiple filesystems. We use dedicated storage systems running distributed filesystems such as GPFS.

  • Compute nodes: These are the computers where jobs run and where the computations actually take place. As most nodes on an HPC cluster are compute nodes, we refer to them simply as “nodes”. We will make the distinction when needed.

Partitions

In SLURM, a partition is a set of compute nodes grouped logically based on the hardware’s physical properties or job-scheduling policies. The term “partition” can be misleading as the same compute node can belong to several partitions. In other resource managers, partitions are called “queues” which is an alternative term also used in HPC. In general, compute nodes are grouped based on the features shared by the nodes, such as the presence of GPUs or the amount of memory (RAM) installed. Another reason to create partitions is to manage jobs that run under different job-scheduling policies, such as priority or the amount of time a job is allowed to run on a partition.
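You can check which partitions a given compute node belongs to with scontrol. For example, for one of the nodes that appears in the partition listing later in this section (the exact membership shown is illustrative):

$> scontrol show node tcocs001 | grep Partitions
   Partitions=standby,comm_small_day,comm_small_week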

Jobs

A Job is the central structure of a workload manager like SLURM. A Job is made of one or more sequential steps, each step consisting of one or multiple parallel tasks that could be dispatched to multiple CPU cores on a single node or to several nodes on the cluster. At a given time, many jobs exist on an HPC cluster. A cluster like Thorny Flat or Dolly Sods usually has thousands of jobs running or in the queue.

Once a job is submitted, it stays pending until resources are allocated to it, in the form of a fraction of a compute node, a full node, or several nodes. Jobs run on compute nodes until they complete the tasks they are supposed to run or reach the time limit allowed for them. Once the job is finished, accounting information is stored in databases, so the information is preserved even after the job is no longer listed in the queue.
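For instance, once a job has finished and left the queue, sacct can still retrieve its accounting record. A minimal query, with a hypothetical job ID and a useful subset of fields, looks like this:

$> sacct -j 3410 --format=JobID,JobName,Partition,State,Elapsed,ExitCode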

The roles of a workload manager

A workload manager like SLURM serves two main roles: Resource Manager and Job Scheduler.

Resource Manager

The resource manager’s role is to collect information about all the computers in the cluster, their characteristics, and their current state. A human equivalent for a resource manager is a mix of an accountant and a manager. The resource manager role is mainly responsible for gluing an HPC cluster together so that it appears to users as a single entity rather than a pile of computers. A resource manager provides the tools that make executing tasks on an HPC cluster, with several nodes and nodes with several cores, as simple as using an individual computer.

To better understand what a job is, consider first a simple command that returns the name of the computer where it is executed. For example, on the login node execute:

$> hostname
trcis001.hpc.wvu.edu

If we want to execute the same command on three machines, we can use the SLURM command srun and execute:

$> srun -N3 hostname
srun: job 3410 queued and waiting for resources
srun: job 3410 has been allocated resources
tcocm102.hpc.wvu.edu
tcocm101.hpc.wvu.edu
tcocm100.hpc.wvu.edu

The command has been executed on 3 machines; the cluster is used as a single entity, and we are not concerned with exactly which machines ran the command, as long as it executes on 3 different nodes. The whole purpose of using an HPC cluster is to have many computers at your disposal without worrying about exactly which machine or machines carry out the actual execution.

When the amount of resources requested by all the jobs from all the users exceeds the resources available, we need a system to prioritize the execution of the different jobs. That is the role of a Job Scheduler.

Job Scheduler

The algorithms behind the prioritization of jobs can become fairly sophisticated. The resources available on a cluster change constantly, and jobs are submitted continually. The job scheduler has several objective functions, including maximal utilization of the cluster but also fairness among users, preventing one user from monopolizing the cluster.
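On clusters that use SLURM's multifactor priority plugin (a site-configuration assumption), you can inspect how pending jobs are being ranked with the sprio command:

$> sprio -u $USER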

SLURM is a workload manager that takes both roles in its architecture. From the user’s point of view, all that you need to know is a handful of SLURM commands. The SLURM commands that you will learn in this section will allow you to:

  • Submit jobs to the cluster, both interactive and non-interactive jobs.

  • Monitor the list of jobs running on the system.

  • Learn the status and extra information for a particular job.

  • Cancel jobs that have been submitted, whether they are running or waiting in the queue.

  • List the partitions on the cluster and their state.
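A minimal session touching each of these tasks might look like the following (the job ID 3410 is hypothetical):

$> squeue -u $USER           # monitor your jobs in the queue
$> scontrol show job 3410    # status and extra information for one job
$> scancel 3410              # cancel a running or pending job
$> sinfo -s                  # summarized list of partitions and their state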

Gathering cluster information

The sinfo command on SLURM can be used to get an overview of the resources offered by the cluster. By default, sinfo lists the partitions that are available.

On WVU clusters, partitions with the prefix “comm” are community resources; any HPC user can submit jobs to those partitions. The community partitions are differentiated by the amount of RAM (small, medium [med], large, and extra large [xl]) and by the wall-time policy of the partition (day or week). There are also two community partitions with GPU nodes: one for interactive jobs (comm_gpu_inter) and another for non-interactive jobs running for up to a week (comm_gpu_week). The default partition is marked with a star (*) and is called standby. Most compute nodes belong to this partition, and jobs can run on it for up to 4 hours. The standby partition should be used preferentially unless you are certain that 4 hours is not enough time to complete the job.

The command sinfo will list all the partitions and the state of the nodes in each of them. A more summarized version can be obtained with the argument -s:

$> sinfo -s
PARTITION       AVAIL  TIMELIMIT   NODES(A/I/O/T) NODELIST
standby*           up    4:00:00      82/82/3/167 taicm[001-009],tarcl100,tarcs[100,200-206,300-304],tbdcx001,tbmcs[001-011,100-103],tbpcm200,tbpcs001,tcbcx100,tcdcx100,tcgcx300,tcocm[100-104],tcocs[001-064,100],tcocx[001-003],tcscm300,tjscl100,tjscm001,tmmcm[100-108],tngcm200,tpmcm[001-006],tsacs001,tsdcl[001-002],tsscl[001-002],ttmcm[100-101],tzecl[100-107],tzecs[100-115]
comm_small_day     up 1-00:00:00        57/8/0/65 tcocs[001-064,100]
comm_small_week    up 7-00:00:00        57/8/0/65 tcocs[001-064,100]
comm_med_day       up 1-00:00:00          1/4/0/5 tcocm[100-104]
comm_med_week      up 7-00:00:00          1/4/0/5 tcocm[100-104]
comm_xl_week       up 7-00:00:00          2/1/0/3 tcocx[001-003]
comm_gpu_inter     up    4:00:00         8/3/0/11 tbegq[200-202],tbmgq[001,100],tcogq[001-006]
comm_gpu_week      up 7-00:00:00          6/0/0/6 tcogq[001-006]
aei0001            up   infinite          0/8/1/9 taicm[001-009]
alromero           up   infinite        10/4/0/14 tarcl100,tarcs[100,200-206,300-304]
be_gpu             up   infinite          1/2/0/3 tbegq[200-202]
bvpopp             up   infinite          0/1/0/1 tbpcs001
cedumitrescu       up   infinite          0/0/1/1 tcdcx100
cfb0001            up   infinite          0/1/0/1 tcbcx100
cgriffin           up   infinite          1/0/0/1 tcgcx300
chemdept           up   infinite          0/4/0/4 tbmcs[100-103]
chemdept-gpu       up   infinite          1/0/0/1 tbmgq100
cs00048            up   infinite          0/1/0/1 tcscm300
jaspeir            up   infinite          0/2/0/2 tjscl100,tjscm001
jbmertz            up   infinite        11/6/0/17 tbmcs[001-011,100-103],tbmgq[001,100]
mamclaughlin       up   infinite          0/9/0/9 tmmcm[100-108]
ngarapat           up   infinite          0/1/0/1 tngcm200
pmm0026            up   infinite          0/6/0/6 tpmcm[001-006]
sbs0016            up   infinite          0/2/0/2 tsscl[001-002]
spdifazio          up   infinite          0/2/0/2 tsdcl[001-002]
tdmusho            up   infinite          0/2/0/2 ttmcm[100-101]
vyakkerman         up   infinite          1/0/0/1 tsacs001
zbetienne          up   infinite        0/24/0/24 tzecl[100-107],tzecs[100-115]
zbetienne_large    up   infinite          0/8/0/8 tzecl[100-107]
zbetienne_small    up   infinite        0/16/0/16 tzecs[100-115]
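You can also restrict the output to a single partition with the -p argument, for example:

$> sinfo -p comm_gpu_week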

In the NODES(A/I/O/T) column, the four numbers count the nodes in each partition that are allocated, idle, other (e.g., down or drained), and the total. Now you know the partitions on the cluster, and based on your knowledge of your job, you can decide to which partition to submit it. Next, we will learn about the kinds of jobs that can be submitted and how to submit them.

Job Submission

The main purpose of using an HPC cluster is the execution of jobs, in particular jobs that, due to their characteristics, are impractical to execute on a normal desktop or laptop computer. Such is the case for jobs that could take several hours or use a significant amount of resources, like multiple CPU cores or memory.

As we learned above, an HPC cluster has a variety of computers serving particular purposes. Computationally intense calculations must only take place on compute nodes. Login nodes, the computers you first reach when connecting to the cluster, should be spared from any intense workload: these machines serve several other users, and a heavy task will slow them down and prevent others from effectively executing even the simplest commands. Short post-processing tasks are acceptable on login nodes. As a rule of thumb, if a task takes more than one core or lasts more than a few minutes, it should run on a compute node instead of a login node.

There are two kinds of jobs that can be executed on an HPC cluster: interactive and non-interactive jobs. Interactive jobs are those where you receive resources to use in real time, very similar to the way you use your own computer. Interactive jobs are a good solution when you want to learn the steps needed to achieve the results you need. Later on, you can write those steps in the form of scripts and let the computer execute them in your absence.

Non-interactive jobs are the solution for jobs that take hours to execute or for running several jobs on the cluster. In a non-interactive job you prepare a script, a recipe telling the computer, step by step, how to get the results that will allow you to make decisions later on or produce the final results for that stage of your research.
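As a minimal sketch of such a script (the job name, file name, and resource values are illustrative, and the partition policies are those described earlier):

#!/bin/bash
#SBATCH --job-name=hostname_test    # a name to identify the job in the queue
#SBATCH --partition=standby         # default partition, jobs up to 4 hours
#SBATCH --time=00:10:00             # wall time requested (10 minutes)
#SBATCH --nodes=1                   # number of compute nodes
#SBATCH --ntasks=1                  # number of tasks
#SBATCH --cpus-per-task=1           # CPU cores per task

hostname

If saved as runjob.slurm (a hypothetical file name), the script would be submitted with sbatch runjob.slurm, and by default its output would be captured in a file named slurm-<jobid>.out in the submission directory.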

Whether you run interactive or non-interactive jobs, SLURM, as the workload manager, will decide on which machines (compute nodes) the jobs will run and will give you the tools to monitor the status of the jobs submitted. It is time to learn the basics of submitting interactive and non-interactive jobs.

A very simple way of launching an interactive job is using the command srun:

trcis001:~$ srun --pty bash
srun: job 22432 queued and waiting for resources
srun: job 22432 has been allocated resources
tzecs115:~$

Notice that srun is actually performing a double function. On one side, it creates a new job (in the case above, the job with ID 22432); on the other, it opens a remote terminal session on the machine assigned to the job. In the example above, the job requests default values for all parameters. The partition is set to standby, which offers a wall time of 4 hours. Not selecting any number of nodes or cores will automatically assign a single core on a single machine.

If you need more resources, perhaps a different partition or a larger number of cores, add extra arguments to the command line:

trcis001:~$ srun -p standby -t 40:00 -c 4 --pty bash

In the example above, we are explicitly selecting standby as the partition, 40 minutes of wall time, and 4 cores on a single compute node. The last argument in the srun command line must be the command to be executed, in this case a bash session, once logged into the assigned compute node.
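Once inside the interactive session, you can confirm the allocation by checking some of the environment variables SLURM exports (the node name in the prompt is hypothetical):

tcocs001:~$ echo $SLURM_CPUS_PER_TASK
4
tcocs001:~$ echo $SLURM_JOB_PARTITION
standby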

The following is an example of a request for an interactive job asking for 1 GPU and 8 CPU cores for 2 hours:

trcis001:~$ srun -p comm_gpu_inter -G 1 -t 2:00:00 -c 8 --pty bash

You can verify the assigned GPU using the command nvidia-smi:

trcis001:~$ srun -p comm_gpu_inter -G 1 -t 2:00:00 -c 8 --pty bash
srun: job 22599 queued and waiting for resources
srun: job 22599 has been allocated resources
tbegq200:~$ nvidia-smi
Wed Jan 18 13:27:01 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.43.04    Driver Version: 515.43.04    CUDA Version: 11.7     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A100-PCI...  Off  | 00000000:3B:00.0 Off |                    0 |
| N/A   28C    P0    31W / 250W |      0MiB / 40960MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

The command above shows an NVIDIA A100 as the GPU assigned to us during the lifetime of the job.

In SLURM, an interactive job can also be launched with the command salloc:

trcis001:~$ salloc -N3
salloc: Pending job allocation 3506
salloc: job 3506 queued and waiting for resources
salloc: job 3506 has been allocated resources
salloc: Granted job allocation 3506
Loading git version 2.29.1 : dev/git/2.29.1
trcis001:~$

The command salloc will allocate resources (e.g., nodes or CPU cores), possibly with a set of constraints (e.g., number of processors per node or amount of memory per node), and spawn a shell in which the srun command is used to launch parallel tasks. Notice that salloc returns a shell on the same machine where the salloc command was executed.
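For example, inside the shell spawned by salloc above, srun dispatches a command to every allocated node (the node names shown are hypothetical), and exiting the shell releases the allocation:

trcis001:~$ srun hostname
tcocs001.hpc.wvu.edu
tcocs002.hpc.wvu.edu
tcocs003.hpc.wvu.edu
trcis001:~$ exit
salloc: Relinquishing job allocation 3506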