Thorny Flat Cluster

Thorny Flat is WVU's latest HPC cluster. It was deployed in February 2019 and funded in large part by NSF Major Research Instrumentation (MRI) Grant Award #1726534. The cluster provides 4,208 cores spread over 108 nodes connected by a shared Intel Omnipath 100 Gbps interconnect. The system is a heterogeneous cluster with several different node types. Each year, new hardware is added to the cluster in what is known as a new phase; five phases are planned, and the cluster is currently in phase 0.

Overview

(This text can be used in proposals to grant funding agencies.)

Thorny Flat is a general-purpose High-Performance Computing (HPC) cluster. It serves the HPC needs of West Virginia University (WVU) and other higher education institutions in West Virginia. It is hosted at the Pittsburgh Supercomputing Center and was built thanks to NSF Major Research Instrumentation (MRI) Grant Award #1726534.

Thorny Flat is a cluster of 108 compute nodes plus 4 management nodes. 101 nodes have a dual-socket motherboard with Intel(R) Xeon(R) Gold 6138 Processors, for a total of 40 cores per node. The remaining 7 nodes have a dual-socket motherboard with Intel(R) Xeon(R) Gold 6126 Processors, for a total of 24 cores per node. The total number of cores is 4,208.

The 7 nodes with 24 cores also have 3 NVIDIA(R) Quadro P6000 24GB PCIe GPUs each, for a total of 21 GPU cards on the cluster. Memory on compute nodes ranges from 96GB to 768GB. The machines are interconnected using Intel(R) Omnipath(R) 100 Gbps with a blocking ratio of 5:1.

Thorny Flat scored 115 TeraFLOPS using just the 101 CPU-only compute nodes, as measured with the HPL (Linpack) benchmark.

Thorny Flat uses Torque and Moab for resource management and job scheduling. It has a variety of compilers, numerical libraries, and scientific software specifically compiled and optimized for the hardware architecture.

Resource manager and system scheduler

  • Torque v. 6.1.x
  • Moab Cluster Manager v. 9.1.x
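
Jobs are handed to Torque/Moab as batch scripts submitted with qsub. Below is a minimal sketch of a Torque batch script; the job name, resource request, and executable are placeholders and should be adapted to the actual workload and queue.

#!/bin/bash
# Minimal Torque batch script (sketch): all names and resource values are placeholders.
#PBS -N example_job
#PBS -l nodes=1:ppn=40
#PBS -l walltime=01:00:00
#PBS -j oe

# Run from the directory where qsub was invoked.
cd $PBS_O_WORKDIR

# Placeholder for the actual executable or commands.
./my_program

The script is submitted with qsub, and its state can be checked with qstat:

qsub example_job.pbs
qstat -u $USER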

Total Compute Resources

  • 4,208 Cores
  • 101 CPU-only Compute Nodes
  • 7 GPU Compute Nodes
  • 21 GPUs
  • 4 Management Nodes

Shared Interconnect

  • Intel Omnipath 100 Gbps
  • 5:1 Blocking

Hardware

Phase 0/1 Hardware

Each node type is listed below with its hardware description and its community, condo, and total node counts.

Small Memory (Community: 64, Condo: 13, Total: 77)
  • 2 x 6138 Intel Procs (40 Cores Total)
  • 96GB memory
  • 240GB SSD
  • 100 Gb Omnipath
  • 5 yr warranty

Medium Memory (Community: 0, Condo: 16, Total: 16)
  • 2 x 6138 Intel Procs (40 Cores Total)
  • 192GB memory
  • 240GB SSD
  • 100 Gb Omnipath
  • 5 yr warranty

Large Memory (Community: 0, Condo: 4, Total: 4)
  • 2 x 6138 Intel Procs (40 Cores Total)
  • 384GB memory
  • 240GB SSD
  • 100 Gb Omnipath
  • 5 yr warranty

XL Memory (Community: 3, Condo: 1, Total: 4)
  • 2 x 6138 Intel Procs (40 Cores Total)
  • 768GB memory
  • 240GB SSD
  • 100 Gb Omnipath
  • 5 yr warranty

GPU (Community: 6, Condo: 1, Total: 7)
  • 2 x 6126 Intel Procs (24 Cores Total)
  • 3 x NVIDIA Quadro P6000 24GB PCIe GPUs
  • 96GB memory
  • 240GB SSD
  • 100 Gb Omnipath
  • 5 yr warranty

Queues

The current state and limits of the queues can be found using the qstat -q command.

server: trcis002.hpc.wvu.edu

Queue            Memory CPU Time Walltime Node  Run Que Lm  State
---------------- ------ -------- -------- ----  --- --- --  -----
standby            --      --    04:00:00   --    0   0 --   E R
comm_small_week    --      --    168:00:0   --    0   0 --   E R
comm_small_day     --      --    24:00:00   --    0   0 --   E R
comm_gpu_week      --      --    168:00:0   --    0   0 --   E R
comm_xl_week       --      --    168:00:0   --    0   0 --   E R
                                           ----- -----
                                                  0     0

There are three main queue types: research team queues, the standby queue, and community node queues.

Research Team Queues

Research teams that have bought their own compute nodes have private queues that link all of their compute nodes together. Only users given permission by the research team's buyer (usually the lab's PI) may submit jobs directly to these queues. While these are private queues, unused resources/compute nodes from these queues are made available to the standby queue (see below). However, per the system-wide policies, a research team's compute nodes must be available to that team's users within 4 hours of job submission. By default, these queues use first-come, first-served scheduling; individual research teams can request different settings for their respective queues by contacting the RC HPC team.
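
As a sketch, an authorized member of a research team submits directly to the team's private queue with the -q option of qsub; the queue name below is hypothetical, and job_script.pbs stands for any batch script:

qsub -q my_lab_queue job_script.pbs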

Standby Queue

The standby queue is for using resources from research team queues that are not currently in use. Priority on the standby queue is set by fair-share queuing: user priority is assigned based on a combination of job size and how much of the system's resources the user has consumed during the given week, with higher priority assigned to larger jobs and/or users who have consumed fewer resources that week. Further, the standby queue has a 4-hour walltime limit.
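
For example, a job can be directed to the standby queue from the command line while keeping the request within the 4-hour limit (a sketch; job_script.pbs stands for any batch script, such as the one sketched earlier):

qsub -q standby -l walltime=04:00:00 job_script.pbs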

Community Node Queues

Thorny Flat has several queues whose names start with 'comm'. These queues are linked to the 73 compute/GPU nodes bought using NSF funding sources and, as such, are open for statewide higher education use; hardware/resource information can be found on the Thorny Flat Systems page. The community queues are separated by node type (i.e., small memory, extra-large memory, and GPU) and can be used by all users. Currently, these nodes are regulated by fair-share queuing: user priority is assigned based on a combination of job size and how much of the system's resources the user has consumed during the given week, with higher priority assigned to larger jobs and/or users who have consumed fewer resources that week. Further, all community queues have a one-week walltime limit, except for comm_small_day, which allows jobs up to 24 hours and has access to a larger pool of resources than comm_small_week. These restrictions are set to prevent a single user from occupying a large share of the community resources for an excessively long time.
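
For example, a job that needs up to a full day on one 40-core community node could target comm_small_day (a sketch; job_script.pbs stands for any batch script):

qsub -q comm_small_day -l nodes=1:ppn=40,walltime=24:00:00 job_script.pbs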