Spruce Knob (2014-2023)¶

Spruce Knob was a 3,376 core cluster spread over 176 nodes using a shared FDR Infiniband Network. The system was a heterogeneous cluster with processors from Sandy Bridge, Ivy Bridge, Haswell and Broadwell generations. Over the years 2014 to 2018 additional nodes were added to the cluster called phases. Thhe cluster reached five phases. The cluster was decomissioned in 2023 after serving for almost a decade.

Overview¶

Resource manager and system scheduler¶

Torque v. 6.1.x
Moab Cluster Manager v. 9.1.x

Total Compute Resources¶

3,376 Cores
176 Compute Nodes
4 Management Nodes
18 GPUs

Shared Interconnect¶

Mellanox SX6512 Infiniband Switch
216 ports all non-blocking
FDR (56 Gb/s) speed
All Fibre Connections

Hardware¶

Spruce Knob was a heterogeneous cluster with compute nodes with a variety of Intel Microarchitectures The table below shows the differences between the 4 architectures:

Microarchitectures present on *Spruce Knob*¶
Microarchitecture	Launched	Extensions
Sandy Bridge	2012	MMX, SSE, SSE2, SSE3, SSSE3, SSE4, SSE4.1, SSE4.2, AVX, VT-x, VT-d, AES-NI, CLMUL, TXT
Ivy Bridge	2013	MMX, SSE, SSE2, SSE3, SSSE3, SSE4, SSE4.1, SSE4.2, AVX, F16C, AES-NI, CLMUL, RDRAND, TXT, VT-x, VT-d
Haswell	2014	MMX, SSE, SSE2, SSE3, SSSE3, SSE4, SSE4.1, SSE4.2, AVX, AVX2, FMA3, VT-x, VT-d, AES-NI, CLMUL, RDRAND, TXT
Broadwell	2016	MMX, SSE, SSE2, SSE3, SSSE3, SSE4, SSE4.1, SSE4.2, AVX, AVX2, TSX, FMA3, AES-NI, CLMUL, RDRAND, TXT, VT-x, VT-d

The Advanced Vector Extensions (AVX, also known as Sandy Bridge New Extensions) are extensions to the x86 instruction set architecture for microprocessors from Intel and Advanced Micro Devices (AMD) proposed by Intel in March 2008 and first supported by Intel with the Sandy Bridge processor shipping in Q1 2011. AVX provides new features, new instructions and a new coding scheme.

AVX2 (also known as Haswell New Instructions) expands most integer commands to 256 bits and introduces fused multiply–accumulate operations (FMA). They were first supported by Intel with the Haswell processor, which shipped in 2013.

The AVX and AVX2 are important for Scientific Computing because AVX uses sixteen YMM registers to perform a Single Instruction on Multiple pieces of Data. Each YMM register can hold and do simultaneous operations (math) on eight 32-bit single-precision floating point numbers or four 64-bit double-precision floating point numbers.

Notice that all nodes on Spruce support AVX extensions but not all nodes support AVX2.

Phase 0/1 Hardware¶

Intel(R) Xeon(R) CPU E5-4620

Intel(R) Xeon(R) CPU E5-2650 v2

Node Type	Description	Compute Nodes
Sandy Bridge EP Q2’12 32 nm	4 X 8 Intel(R) Xeon(R) CPU E5-4620 0 @ 2.20GHz 64GB memory 1TB HDD HPE Proliant DL560 Gen8	14
Ivy Bridge EP Q3’13 22 nm	2 X 8 Intel(R) Xeon(R) CPU E5-2650 v2 @ 2.60GHz 32GB memory 1TB HDD HPE ProLiant SL230s Gen 8	1
Ivy Bridge EP Q3’13 22 nm	2 X 8 Intel(R) Xeon(R) CPU E5-2650 v2 @ 2.60GHz 64GB memory 1TB HDD HPE ProLiant SL230s Gen 8	95
Ivy Bridge EP Q3’13 22 nm	2 X 8 Intel(R) Xeon(R) CPU E5-2650 v2 @ 2.60GHz 256GB memory 1TB HDD HPE ProLiant SL230s Gen 8	1
Ivy Bridge EP Q3’13 22 nm	2 X 8 Intel(R) Xeon(R) CPU E5-2650 v2 @ 2.60GHz 512GB memory 1TB HDD HPE ProLiant SL230s Gen 8	3

Phase 2 Hardware¶

Intel(R) Xeon(R) CPU E5-2650 v3

Node Type	Description	Compute Nodes
Haswell Q3’14 22 nm	2 X 10 Intel(R) Xeon(R) CPU E5-2650 v3 @ 2.30GHz 64GB memory 1TB HDD	14
Haswell Q3’14 22nm	2 X 10 Intel(R) Xeon(R) CPU E5-2650 v3 @ 2.30GHz 128GB memory 1TB HDD	3
Haswell Q3’14 22 nm	2 X 10 Intel(R) Xeon(R) CPU E5-2650 v3 @ 2.30GHz 512GB memory 1TB HDD	1

Phase 3 Hardware¶

Intel(R) Xeon(R) CPU E5-2650 v3

Node Type	Description	Compute Nodes
Haswell Q3’14 22 nm	2 X 10 Intel(R) Xeon(R) CPU E5-2650 v3 @ 2.30GHz 64GB memory 1TB HDD	1
Haswell Q3’14 22 nm	2 X 10 Intel(R) Xeon(R) CPU E5-2650 v3 @ 2.30GHz 128GB memory 1TB HDD	10
Haswell Q3’14 22 nm	2 X 10 Intel(R) Xeon(R) CPU E5-2650 v3 @ 2.30GHz 512GB memory 1TB HDD	1

Phase 4 Hardware¶

Intel(R) Xeon(R) CPU E5-2650 v4

Node Type	Description	Compute Nodes
Broadwell Q1’16 14 nm	2 X 12 Intel(R) Xeon(R) CPU E5-2650 v4 @ 2.20GHz 64GB memory 1TB HDD	3
Broadwell Q1’16 14 nm	2 X 12 Intel(R) Xeon(R) CPU E5-2650 v4 @ 2.20GHz 128GB memory 1TB HDD	7
Broadwell Q1’16 14 nm	2 X 12 Intel(R) Xeon(R) CPU E5-2650 v4 @ 2.20GHz 512GB memory 1TB HDD	1

Queues¶

Queue	Walltime
debug	15:00
standby	4:00:00
comm_mmem_week	168:00:00
comm_256g_mem	168:00:00
comm_mmem_day	24:00:00
comm_gpu	168:00:00
comm_smp	168:00:00
comm_large_mem	168:00:00

Research Team Queues¶

Research teams that have bought their own compute nodes have private queues that link all their compute nodes together. Only users given permission from the research team’s buyer (Usually the labs PI) will have permission to directly submit jobs to these queues. While these are private queues - unused resources/compute nodes from these queues will be available to the standby queue (see below). However, per the system-wide policies, all research team’s compute nodes must be available to the research team’s users within 4 hours of job submission. By default, these queues are regulated by first come, first serve queuing. However, individual research teams can ask for different settings for their respective queue, and should contact the RC HPC team with these requests.

Standby Queue¶

The standy queue is for using resources from research teams queues that are not currently being used. Priority on the standby queue is set by fair share queuing. This means that user priority is assigned based on a combination of the size of the job and how much system resources the user have used during the given week, with higher priority assigned to larger jobs and/or user jobs that have used fewer system resources in the week. Further, the standby queue has a 4 hour wall time.

Community Node Queues¶

Spruce Knob has several queues that start with the word ‘comm’. These queues are linked to the 51 compute nodes bought using NSF funding sources, and as such is open for Statewide Academic use. These queues are separated by node type (i.e. large memory, gpu, smp) and can be used by all users. Currently, these nodes are regulated by fair share queuing. This means that user priority is assigned based on a combination of the size of the job and how much system resources the user have used during the given week, with higher priority assigned to larger jobs and/or user jobs that have used less system resources in the week. Further, all community queues have a 24 hour wall time, except for the week long medium memory queue (comm_mmem_week). comm_mmem_week allows jobs up to a week (168 hours); however, this queue class also limits the maximum number of nodes to 11, and a single user can not exceed 80 CPUs total within this queue. These restrictions are set to prevent a single user occupying a large number of the community resources for an excessively long time.