Workload Manager (SLURM)¶
The workload manager is the software tool that makes a computer cluster appear and work like a single entity rather than an aggregate of computers on a network. Our clusters use SLURM as their workload manager, and users must be familiar with a few commands to use an HPC cluster effectively.
We will describe the different commands throughout Workload Manager (SLURM), but if you are eager to see the list, here is a table with the commands you will use most often:
Command | Purpose
---|---
sbatch | Submit a batch script for later execution.
scancel | Signal a job to be removed from the queue or its execution stopped.
squeue | View information about jobs currently in the queue or in execution.
sinfo | View information about nodes and partitions.
scontrol | View and modify the configuration and state of jobs, nodes, and partitions.
sacct | View accounting information, including data about previous jobs.
salloc | Obtain an interactive job allocation.
srun | Execute an application, including its allocation if needed.
As you can see, the commands above deal with three main concepts: jobs, nodes, and partitions. To be able to work effectively with SLURM, you must be familiar with these three concepts.
SLURM Concepts¶
To use SLURM effectively, we need to understand the concepts of compute node, partition, and job. SLURM provides a set of commands to submit, cancel, and monitor jobs. These jobs will execute on compute nodes that are organized logically in partitions. Let us elaborate on these three concepts before submitting our first job.
Nodes¶
A High-Performance Computing cluster (HPC cluster) is made of a collection of computers. The term used for each computer is “node”. These nodes are linked through a fast network such as Gigabit Ethernet or Infiniband. In the case of Thorny Flat, we use the Omni-Path Architecture (OPA) as the fast network fabric. In the case of Dolly Sods, the fabric is Infiniband.
The nodes in an HPC cluster are configured to serve different purposes. A typical organization of a cluster divides the nodes into four categories:
Management nodes: These computers run management services, databases, monitoring tools, reporting applications, provisioning tools, and other related services system administrators use. Regular users have no direct access to these nodes.
Login nodes: These computers are the machines where users log in through SSH to submit jobs and check results. Users must never use login nodes for any intense computations, as many people are served by these machines, and they should be kept at a reasonable load to properly serve the users currently connected.
Storage nodes: These computers host user files in possibly multiple filesystems. We use dedicated storage systems running distributed filesystems such as GPFS.
Compute nodes: These are the computers where jobs run and where the computations actually take place. As most nodes on an HPC cluster are compute nodes, we refer to them simply as “nodes”. We will make the distinction when needed.
Partitions¶
In SLURM, a partition is a set of compute nodes grouped logically based on the hardware’s physical properties or job-scheduling policies. The term “partition” can be misleading as the same compute node can belong to several partitions. In other resource managers, partitions are called “queues” which is an alternative term also used in HPC. In general, compute nodes are grouped based on the features shared by the nodes, such as the presence of GPUs or the amount of memory (RAM) installed. Another reason to create partitions is to manage jobs that run under different job-scheduling policies, such as priority or the amount of time a job is allowed to run on a partition.
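To inspect the limits and node list associated with a particular partition, you can use scontrol. A minimal example, using standby, the default partition on our clusters:

$> scontrol show partition standby

The output includes, among other fields, the time limit (MaxTime), the list of nodes (Nodes), and the current state of the partition.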
Jobs¶
A Job is the central structure of a workload manager like SLURM. A Job is made of one or more sequential steps, each step consisting of one or multiple parallel tasks that could be dispatched to multiple CPU cores on a single node or to several nodes on the cluster. At a given time, many jobs exist on an HPC cluster. A cluster like Thorny Flat or Dolly Sods usually has thousands of jobs running or in the queue.
Once a job is submitted, it stays pending until resources are allocated to it, in the form of a fraction of a node, a whole node, or several compute nodes. Jobs run on compute nodes until they complete the tasks they are supposed to run or until the time limit allowed for them is reached. Once a job finishes, accounting information is stored in databases, so the information is preserved even after the job is no longer listed in the queue.
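For example, once a job has finished and no longer appears in the queue, its record can still be retrieved with sacct. A minimal sketch, where <jobid> stands for the numeric ID of the job you want to inspect:

$> sacct -j <jobid> --format=JobID,JobName,Partition,Elapsed,State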
The roles of a workload manager¶
A workload manager like SLURM serves two main roles: Resource Manager and Job Scheduler.
Resource Manager¶
The resource manager’s role is to collect information about all the computers in the cluster, their characteristics, and their current state. A human equivalent for a resource manager is a mix of an accountant and a manager. The resource manager is largely responsible for making an HPC cluster appear to users as a single entity rather than a pile of computers. It provides tools that make executing tasks on an HPC cluster, with several nodes and several cores per node, as simple as using an individual computer.
To better understand what a job is, consider first a simple command that returns the name of the computer where it is executed. For example, on the login node execute:
$> hostname
trcis001.hpc.wvu.edu
If we want to execute the same command on three machines, we can use the SLURM command srun and execute:
$> srun -N3 hostname
srun: job 3410 queued and waiting for resources
srun: job 3410 has been allocated resources
tcocm102.hpc.wvu.edu
tcocm101.hpc.wvu.edu
tcocm100.hpc.wvu.edu
The command has been executed on 3 machines; the cluster is used as a single entity, and we are not interested in exactly which machines ran the command, as long as it executed on 3 different nodes. The whole purpose of using an HPC cluster is to have many computers available to run our work without being concerned about exactly which machine or machines the actual execution takes place on.
When the amount of resources requested by all the jobs from all the users exceeds the resources available, we need a system to prioritize the execution of the different jobs. That is the role of the Job Scheduler.
Job Scheduler¶
The algorithms behind the prioritization of jobs can become fairly sophisticated. The resources available on a cluster change constantly, and jobs are submitted continuously. The job scheduler has several objective functions, including maximal utilization of the cluster but also fairness among users, preventing one user from monopolizing the cluster.
SLURM is a workload manager that takes on both roles in its architecture. From the user’s point of view, all that you need to know is a handful of SLURM commands. The SLURM commands that you will learn in this section will allow you to do the following (a quick command reference follows the list):
Submit jobs to the cluster, both interactive and non-interactive jobs.
Monitor the list of jobs running on the system
Learn the status and extra information for a particular job
Cancel jobs that have been submitted and are either running or waiting in the queue
List the partitions on the cluster and their state
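As a quick reference, and assuming a hypothetical batch script named job.slurm and a hypothetical job ID 12345, those operations map to commands such as:

$> sbatch job.slurm          # submit a non-interactive (batch) job
$> salloc -N1                # request an interactive allocation
$> squeue -u $USER           # list your jobs currently in the queue
$> scontrol show job 12345   # detailed status of one particular job
$> scancel 12345             # cancel a running or pending job
$> sinfo                     # list the partitions and their state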
Gathering cluster information¶
The sinfo command on SLURM can be used to get an overview of the resources offered by the cluster. By default, sinfo lists the partitions that are available.
On WVU clusters, partitions with the prefix “comm” are community resources; any HPC user can submit jobs to those partitions. The community partitions are differentiated by the amount of RAM on their nodes (small, medium [med], large, and extra large [xl]) and by the wall-time policy of the partition (day or week). There are also two community partitions with GPU nodes, one for interactive jobs (comm_gpu_inter) and another for non-interactive jobs running for up to a week (comm_gpu_week). The default partition is marked with a star (*) and is called standby. Most compute nodes belong to this partition, and jobs can run on it for up to 4 hours. The standby partition should be used preferentially unless you are certain that 4 hours is not enough time to complete the job.
The command sinfo will list all the partitions and the state of the nodes for each of them. A more summarized version can be obtained with the argument -s
$> sinfo -s
PARTITION AVAIL TIMELIMIT NODES(A/I/O/T) NODELIST
standby* up 4:00:00 82/82/3/167 taicm[001-009],tarcl100,tarcs[100,200-206,300-304],tbdcx001,tbmcs[001-011,100-103],tbpcm200,tbpcs001,tcbcx100,tcdcx100,tcgcx300,tcocm[100-104],tcocs[001-064,100],tcocx[001-003],tcscm300,tjscl100,tjscm001,tmmcm[100-108],tngcm200,tpmcm[001-006],tsacs001,tsdcl[001-002],tsscl[001-002],ttmcm[100-101],tzecl[100-107],tzecs[100-115]
comm_small_day up 1-00:00:00 57/8/0/65 tcocs[001-064,100]
comm_small_week up 7-00:00:00 57/8/0/65 tcocs[001-064,100]
comm_med_day up 1-00:00:00 1/4/0/5 tcocm[100-104]
comm_med_week up 7-00:00:00 1/4/0/5 tcocm[100-104]
comm_xl_week up 7-00:00:00 2/1/0/3 tcocx[001-003]
comm_gpu_inter up 4:00:00 8/3/0/11 tbegq[200-202],tbmgq[001,100],tcogq[001-006]
comm_gpu_week up 7-00:00:00 6/0/0/6 tcogq[001-006]
aei0001 up infinite 0/8/1/9 taicm[001-009]
alromero up infinite 10/4/0/14 tarcl100,tarcs[100,200-206,300-304]
be_gpu up infinite 1/2/0/3 tbegq[200-202]
bvpopp up infinite 0/1/0/1 tbpcs001
cedumitrescu up infinite 0/0/1/1 tcdcx100
cfb0001 up infinite 0/1/0/1 tcbcx100
cgriffin up infinite 1/0/0/1 tcgcx300
chemdept up infinite 0/4/0/4 tbmcs[100-103]
chemdept-gpu up infinite 1/0/0/1 tbmgq100
cs00048 up infinite 0/1/0/1 tcscm300
jaspeir up infinite 0/2/0/2 tjscl100,tjscm001
jbmertz up infinite 11/6/0/17 tbmcs[001-011,100-103],tbmgq[001,100]
mamclaughlin up infinite 0/9/0/9 tmmcm[100-108]
ngarapat up infinite 0/1/0/1 tngcm200
pmm0026 up infinite 0/6/0/6 tpmcm[001-006]
sbs0016 up infinite 0/2/0/2 tsscl[001-002]
spdifazio up infinite 0/2/0/2 tsdcl[001-002]
tdmusho up infinite 0/2/0/2 ttmcm[100-101]
vyakkerman up infinite 1/0/0/1 tsacs001
zbetienne up infinite 0/24/0/24 tzecl[100-107],tzecs[100-115]
zbetienne_large up infinite 0/8/0/8 tzecl[100-107]
zbetienne_small up infinite 0/16/0/16 tzecs[100-115]
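If you are only interested in a subset of partitions, you can restrict the output with the -p flag followed by a comma-separated list of partition names, for example:

$> sinfo -p comm_gpu_inter,comm_gpu_week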
Now you know the partitions on the cluster and, based on your knowledge of the job, you can decide to which partition to submit your job. Next we will learn about the kinds of jobs that can be submitted and how to submit them.
Job Submission¶
The main purpose of using an HPC cluster is the execution of jobs, in particular jobs that, due to their characteristics, are impractical to execute on a normal desktop computer or laptop. Such is the case for jobs that could take several hours or use a significant amount of resources, such as multiple CPU cores or memory.
As we learned above, an HPC cluster has a variety of computers with particular purposes. Computationally intensive calculations must only take place on compute nodes. Login nodes, the computers you first reach when connecting to the cluster, should be spared from any intense workload, as these computers serve several other users; running heavy workloads on them will slow the machine down and prevent others from effectively executing even the simplest commands. Short post-processing tasks are acceptable on login nodes. As a rule of thumb, if a task takes more than one core or lasts more than a few minutes, it should run on a compute node instead of a login node.
There are two kinds of jobs that can be executed on an HPC cluster: interactive and non-interactive jobs. Interactive jobs are those where you receive resources to use in real time, very similar to the way you use your own computer. Interactive jobs are a good solution when you want to learn the steps needed to achieve the results you need. Later on, you can write those steps in the form of scripts and let the computer execute them in your absence.
Non-interactive jobs are the solution for jobs that take hours to execute or when you need to run several jobs on the cluster. In non-interactive jobs you prepare a script, a recipe, indicating to the computer, step by step, how to get the results that will allow you to make decisions later on or that produce the final results for that stage of your research.
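A minimal sketch of such a script is shown below, assuming a hypothetical executable called my_program and using the standby partition; the script would be submitted with sbatch, for example sbatch my_first_job.sh:

#!/bin/bash
#SBATCH --job-name=my_first_job
#SBATCH --partition=standby
#SBATCH --time=04:00:00
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=4

# The commands below run on the compute node assigned to the job
cd $SLURM_SUBMIT_DIR
srun ./my_program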
Regardless of whether you run interactive or non-interactive jobs, SLURM, as the workload manager, will decide on which machines (compute nodes) the jobs will run and will give you the tools to monitor the status of the jobs submitted. It is time to learn the basics of submitting interactive and non-interactive jobs.
A very simple way of launching an interactive job is using the command srun:
trcis001:~$ srun --pty bash
srun: job 22432 queued and waiting for resources
srun: job 22432 has been allocated resources
tzecs115:~$
Notice that srun is actually taking on a double function. On one hand, it creates a new job (in the case above, the job with ID=22432); on the other, it opens a remote terminal session on the machine assigned to the job. In the example above, the job is requesting default values for all parameters. The partition is set to standby, which offers a walltime of 4 hours. Not selecting any number of nodes or cores will automatically assign a single core on a single machine.
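Once inside the interactive session, you can confirm what was allocated by inspecting the environment variables that SLURM defines for every job, for example:

tzecs115:~$ echo $SLURM_JOB_ID
22432
tzecs115:~$ echo $SLURM_CPUS_ON_NODE
1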
If you need more resources, perhaps a different partition or a larger number of cores, add extra arguments to the command line:
trcis001:~$ srun -p standby -t 40:00 -c 4 --pty bash
In the example above, we are explicitly selecting standby as the partition, 40 minutes of walltime, and 4 cores on a single compute node. The last argument in the srun command line must be the command to be executed, in this case a bash session once logged into the assigned compute node.
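Other requests follow the same pattern. For instance, if the default memory is not enough, you could also request a specific amount with --mem (the 8G value here is just an illustration):

trcis001:~$ srun -p standby -t 40:00 -c 4 --mem=8G --pty bash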
The following is an example of a request for an interactive job asking for 1 GPU and 8 CPU cores for 2 hours:
trcis001:~$ srun -p comm_gpu_inter -G 1 -t 2:00:00 -c 8 --pty bash
You can verify the assigned GPU using the command nvidia-smi:
trcis001:~$ srun -p comm_gpu_inter -G 1 -t 2:00:00 -c 8 --pty bash
srun: job 22599 queued and waiting for resources
srun: job 22599 has been allocated resources
tbegq200:~$ nvidia-smi
Wed Jan 18 13:27:01 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.43.04 Driver Version: 515.43.04 CUDA Version: 11.7 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA A100-PCI... Off | 00000000:3B:00.0 Off | 0 |
| N/A 28C P0 31W / 250W | 0MiB / 40960MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
The command above shows an NVIDIA A100 as the GPU assigned to us during the lifetime of the job.
In SLURM, an interactive job can also be launched with the command salloc:
trcis001:~$ salloc -N3
salloc: Pending job allocation 3506
salloc: job 3506 queued and waiting for resources
salloc: job 3506 has been allocated resources
salloc: Granted job allocation 3506
Loading git version 2.29.1 : dev/git/2.29.1
trcis001:~$
The command salloc will allocate resources (e.g. nodes or CPU cores), possibly with a set of constraints (e.g. number of processors per node or amount of memory per node). salloc will allocate the resources and spawn a shell in which the srun command is used to launch parallel tasks. Notice that salloc will return a shell on the same machine where the command salloc was executed.
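For example, inside the allocation granted above, srun will dispatch tasks to the three granted nodes, and exiting the shell releases the resources. A minimal sketch:

trcis001:~$ srun hostname    # prints one hostname per allocated node (three in this case)
trcis001:~$ exit             # ends the shell and releases job allocation 3506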