Linear Algebra

Linear Algebra is at the core of many of the algorithms used in High Performance Computing. Operations like the addition of two vectors or the product of two matrices have been highly optimized for the kinds of computers used in HPC today.

The term array is used generically to denote vectors, matrices, and even higher-dimensional arrays of numbers.

There are basically two classes of Linear Algebra algorithms: Dense Linear Algebra deals with operations over arrays where most of the values are non-zero, while Sparse Linear Algebra works with arrays where most of the values are zero. The algorithms on the two sides are radically different, starting with the way the arrays are stored in memory.
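
To make the storage difference concrete, here is a minimal C sketch (illustrative only, not taken from any particular library) comparing dense storage with the common Compressed Sparse Row (CSR) format:

#include <stdio.h>

/* The same 4x4 matrix, stored densely and in CSR form.
 * Only 5 of the 16 entries are non-zero:
 *   [ 5 0 0 0 ]
 *   [ 0 8 0 0 ]
 *   [ 0 0 3 0 ]
 *   [ 0 6 0 4 ]
 */
int main(void) {
    /* Dense storage: all 16 values, zeros included (row-major). */
    double dense[4][4] = {{5,0,0,0},{0,8,0,0},{0,0,3,0},{0,6,0,4}};

    /* CSR storage: only the non-zeros plus two index arrays. */
    double val[]     = {5, 8, 3, 6, 4};  /* non-zero values            */
    int    col_idx[] = {0, 1, 2, 1, 3};  /* column of each value       */
    int    row_ptr[] = {0, 1, 2, 3, 5};  /* where each row starts/ends */

    /* Print the last row from both representations. */
    for (int j = 0; j < 4; j++)
        printf("%g ", dense[3][j]);
    printf("\n");
    for (int k = row_ptr[3]; k < row_ptr[4]; k++)
        printf("A[3][%d] = %g\n", col_idx[k], val[k]);
    return 0;
}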

In the case of dense linear algebra there are well-established algorithms, and the de facto standard for many scientific applications is BLAS.

BLAS (Basic Linear Algebra Subprograms) are routines that provide standard building blocks for performing basic vector and matrix operations. The routines are organized in three levels, illustrated by the sketch after this list:

  • Level 1 BLAS perform scalar, vector and vector-vector operations.

  • Level 2 BLAS perform matrix-vector operations.

  • Level 3 BLAS perform matrix-matrix operations.
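
The sketch below calls one routine from each level through the C interface (CBLAS). The compile line is an assumption for a system where OpenBLAS is available, e.g. gcc blas_levels.c -lopenblas:

#include <stdio.h>
#include <cblas.h>

int main(void) {
    double x[3] = {1, 2, 3}, y[3] = {4, 5, 6};
    double A[9] = {1, 0, 0,   /* 3x3 identity, row-major */
                   0, 1, 0,
                   0, 0, 1};
    double B[9] = {1, 2, 3, 4, 5, 6, 7, 8, 9};
    double C[9] = {0};

    /* Level 1 (vector-vector): y = 2*x + y */
    cblas_daxpy(3, 2.0, x, 1, y, 1);

    /* Level 2 (matrix-vector): y = A*x */
    cblas_dgemv(CblasRowMajor, CblasNoTrans, 3, 3, 1.0, A, 3, x, 1, 0.0, y, 1);

    /* Level 3 (matrix-matrix): C = A*B */
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                3, 3, 3, 1.0, A, 3, B, 3, 0.0, C, 3);

    printf("C[0][0] = %g\n", C[0]);  /* 1, since A is the identity */
    return 0;
}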

Because the BLAS are efficient, portable, and widely available, they are commonly used as the foundation for more complex linear algebra functionality like LAPACK.

LAPACK is a set of routines for solving systems of simultaneous linear equations, least-squares solutions of linear systems of equations, eigenvalue problems, and singular value problems. The associated matrix factorizations (LU, Cholesky, QR, SVD, Schur, generalized Schur) are also provided, as are related computations such as reordering of the Schur factorizations and estimating condition numbers.
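
As an example of what LAPACK provides, the sketch below solves a small linear system with the LU-based driver dgesv through LAPACKE, the C interface to LAPACK (linking assumptions as above; OpenBLAS usually bundles LAPACKE, so gcc solve.c -lopenblas often suffices):

#include <stdio.h>
#include <lapacke.h>

int main(void) {
    /* Solve A x = b for a 2x2 system (row-major storage). */
    double A[4] = {3, 1,
                   1, 2};
    double b[2] = {9, 8};   /* overwritten with the solution x */
    lapack_int ipiv[2];     /* pivot indices from the LU factorization */

    lapack_int info = LAPACKE_dgesv(LAPACK_ROW_MAJOR, 2, 1, A, 2, ipiv, b, 1);
    if (info != 0) {
        fprintf(stderr, "dgesv failed, info = %d\n", (int)info);
        return 1;
    }
    printf("x = (%g, %g)\n", b[0], b[1]);  /* expected: (2, 3) */
    return 0;
}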

There are several implementations of BLAS and LAPACK. To mention three examples: ATLAS, which automatically generates self-optimized BLAS routines; OpenBLAS, a fork of GotoBLAS with hand-crafted optimizations for specific processor types; and Intel MKL, with optimized math routines. MKL's core math functions include BLAS, LAPACK, ScaLAPACK, sparse solvers, fast Fourier transforms, and vector math; the routines in MKL are hand-optimized specifically for Intel processors.

The reference BLAS and LAPACK are available on most Linux distributions, named libblas.{a,so} and liblapack.{a,so}.

However, the reference versions of BLAS and LAPACK are typically the last packages that you want to use for Dense Linear Algebra. In general, OpenBLAS and MKL outperform the reference BLAS and LAPACK: they provide multithreading and they are optimized for modern multicore processors with vector extensions like AVX.

On Spruce and Thorny we have at least two much better alternatives to the reference BLAS and LAPACK: Intel MKL and OpenBLAS.

Intel MKL can be used even if you compile with gcc or gfortran. On Thorny, OpenBLAS was compiled enabling all the AVX512 extensions available on Thorny's CPUs.

OpenBLAS and MKL perform very similarly, with MKL being a bit faster but also less numerically stable.

ATLAS is also an implementation of BLAS; it is part of the system and several other packages depend on it. ATLAS can be found here:

/usr/lib64/atlas/libsatlas.so.3
/usr/lib64/atlas/libsatlas.so.3.10

I do not recommend it for serious Linear Algebra usage: it performs worse than the two options above.

Notice that the standard compiler on Red Hat Enterprise Linux (RHEL) is GCC 4.x. Thorny runs RHEL 7.6 and the compiler is GCC 4.8. Spruce Knob uses RHEL 6.10 and the compiler is GCC 4.4.

GCC 4.x cannot generate code optimized for AVX512; that is the main reason to use newer compilers like GCC 8.x when compiling scientific code.

Compiling Reference BLAS and LAPACK

Now we will show how to compile the reference BLAS and LAPACK. The instructions were created for LAPACK 3.8.0.

Download the code

The Reference BLAS and LAPACK are hosted on Netlib: http://www.netlib.org/lapack/ The sources for LAPACK also contain the reference BLAS, so all that we need to download is the package for LAPACK:

wget http://www.netlib.org/lapack/lapack-3.8.0.tar.gz

Uncompress and untar the file with:

tar -zxvf lapack-3.8.0.tar.gz

That will generate a folder called lapack-3.8.0 with all the sources inside.

Patching the sources to support shared libraries and optimization for AVX512

As distributed, the sources only allow building static libraries when using the Makefile. There is also an option to use cmake (that route is shown later in this document for ScaLAPACK).

When using make, several files need to be changed. The patch file can be downloaded from here: lapack.patch. The content of the file is this:

diff -ruN lapack-3.8.0/BLAS/SRC/Makefile lapack-3.8.0_new/BLAS/SRC/Makefile
--- lapack-3.8.0/BLAS/SRC/Makefile    2017-11-12 23:15:54.000000000 -0500
+++ lapack-3.8.0_new/BLAS/SRC/Makefile        2019-11-12 13:23:07.985953000 -0500
@@ -55,7 +55,10 @@
 #
 #######################################################################

-all: $(BLASLIB)
+all: $(BLASLIB) libblas.so
+
+libblas.so: $(ALLOBJ)
+     gfortran -shared -Wl,-soname,$@ -o $@ $(ALLOBJ)

 #---------------------------------------------------------
 #  Comment out the next 6 definitions if you already have
diff -ruN lapack-3.8.0/Makefile lapack-3.8.0_new/Makefile
--- lapack-3.8.0/Makefile     2017-11-12 23:15:54.000000000 -0500
+++ lapack-3.8.0_new/Makefile 2019-11-12 13:11:37.379427000 -0500
@@ -147,3 +147,9 @@
      $(MAKE) -C BLAS cleantest
      $(MAKE) -C CBLAS cleantest
      $(MAKE) -C TESTING cleantest
+
+install: all
+     cp BLAS/SRC/libblas.so $(PREFIX)/lib
+     cp SRC/liblapack.so $(PREFIX)/lib
+     cp *.a $(PREFIX)/lib
+
diff -ruN lapack-3.8.0/make.inc lapack-3.8.0_new/make.inc
--- lapack-3.8.0/make.inc     1969-12-31 19:00:00.000000000 -0500
+++ lapack-3.8.0_new/make.inc 2019-11-12 13:30:49.341408000 -0500
@@ -0,0 +1,88 @@
+####################################################################
+#  LAPACK make include file.                                       #
+#  LAPACK, Version 3.8.0                                           #
+#  November 2017                                                   #
+####################################################################
+
+SHELL = /bin/sh
+
+#INSTALLATION PATH
+PREFIX=/shared/software/libs/blas_lapack/3.8_gcc82
+
+#  CC is the C compiler, normally invoked with options CFLAGS.
+#
+CC     = gcc
+CFLAGS = -O2 -fPIC -g -pipe -march=skylake-avx512
+
+#  Modify the FORTRAN and OPTS definitions to refer to the compiler
+#  and desired compiler options for your machine.  NOOPT refers to
+#  the compiler options desired when NO OPTIMIZATION is selected.
+#
+#  Note: During a regular execution, LAPACK might create NaN and Inf
+#  and handle these quantities appropriately. As a consequence, one
+#  should not compile LAPACK with flags such as -ffpe-trap=overflow.
+#
+FORTRAN = gfortran
+OPTS    = -O2 -frecursive -fPIC -g -pipe -march=skylake-avx512
+DRVOPTS = $(OPTS)
+NOOPT   = -O0 -frecursive -fPIC -g -pipe -march=skylake-avx512
+
+#  Define LOADER and LOADOPTS to refer to the loader and desired
+#  load options for your machine.
+#
+LOADER   = gfortran
+LOADOPTS =
+
+#  The archiver and the flag(s) to use when building an archive
+#  (library).  If your system has no ranlib, set RANLIB = echo.
+#
+ARCH      = ar
+ARCHFLAGS = cr
+RANLIB    = ranlib
+
+#  Timer for the SECOND and DSECND routines
+#
+#  Default:  SECOND and DSECND will use a call to the
+#  EXTERNAL FUNCTION ETIME
+#TIMER = EXT_ETIME
+#  For RS6K:  SECOND and DSECND will use a call to the
+#  EXTERNAL FUNCTION ETIME_
+#TIMER = EXT_ETIME_
+#  For gfortran compiler:  SECOND and DSECND will use a call to the
+#  INTERNAL FUNCTION ETIME
+TIMER = INT_ETIME
+#  If your Fortran compiler does not provide etime (like Nag Fortran
+#  Compiler, etc...) SECOND and DSECND will use a call to the
+#  INTERNAL FUNCTION CPU_TIME
+#TIMER = INT_CPU_TIME
+#  If none of these work, you can use the NONE value.
+#  In that case, SECOND and DSECND will always return 0.
+#TIMER = NONE
+
+#  Uncomment the following line to include deprecated routines in
+#  the LAPACK library.
+#
+#BUILD_DEPRECATED = Yes
+
+#  LAPACKE has the interface to some routines from tmglib.
+#  If LAPACKE_WITH_TMG is defined, add those routines to LAPACKE.
+#
+#LAPACKE_WITH_TMG = Yes
+
+#  Location of the extended-precision BLAS (XBLAS) Fortran library
+#  used for building and testing extended-precision routines.  The
+#  relevant routines will be compiled and XBLAS will be linked only
+#  if USEXBLAS is defined.
+#
+#USEXBLAS = Yes
+#XBLASLIB = -lxblas
+
+#  The location of the libraries to which you will link.  (The
+#  machine-specific, optimized BLAS library should be used whenever
+#  possible.)
+#
+BLASLIB      = ../../librefblas.a
+CBLASLIB     = ../../libcblas.a
+LAPACKLIB    = liblapack.a
+TMGLIB       = libtmglib.a
+LAPACKELIB   = liblapacke.a
diff -ruN lapack-3.8.0/SRC/Makefile lapack-3.8.0_new/SRC/Makefile
--- lapack-3.8.0/SRC/Makefile 2017-11-12 23:15:54.000000000 -0500
+++ lapack-3.8.0_new/SRC/Makefile     2019-11-12 13:26:04.437198491 -0500
@@ -512,7 +512,10 @@
 DEPRECATED = $(DEPRECSRC)
 endif

-all: ../$(LAPACKLIB)
+all: ../$(LAPACKLIB)  liblapack.so
+
+liblapack.so: $(ALLOBJ)
+     gfortran -shared -Wl,-soname,$@ -o $@ $(ALLOBJ)

 ../$(LAPACKLIB): $(ALLOBJ) $(ALLXOBJ) $(DEPRECATED)
      $(ARCH) $(ARCHFLAGS) $@ $^

You can download the patch to the folder lapack-3.8.0 and execute:

patch -s -p1 < lapack.patch

Another way is to edit the files one by one.

For the file lapack-3.8.0/BLAS/SRC/Makefile you need to add a new target "libblas.so" and add it to the "all" target:

all: $(BLASLIB) libblas.so

libblas.so: $(ALLOBJ)
	gfortran -shared -Wl,-soname,$@ -o $@ $(ALLOBJ)

The file lapack-3.8.0/Makefile should now contain an install target:

install: all
	cp BLAS/SRC/libblas.so $(PREFIX)/lib
	cp SRC/liblapack.so $(PREFIX)/lib
	cp *.a $(PREFIX)/lib
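
With the patch applied, building and installing follows the usual sequence. Note that the cp commands in the install target assume that $(PREFIX)/lib already exists (PREFIX is set in make.inc), so create it first:

cd lapack-3.8.0
mkdir -p /shared/software/libs/blas_lapack/3.8_gcc82/lib
make
make install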

Installing NumPy and SciPy using OpenBLAS

In the particular case of Python, BLAS/LAPACK are very important for two packages, NumPy and SciPy; much of the scientific code written in Python relies on them.

The various environment modules for Python include NumPy and SciPy, and those were specifically compiled to use OpenBLAS for the linear algebra routines.

The instructions below show how you can do the same for a clean virtual environment where you want to install NumPy and SciPy.

Assume that you want to use PyPy3 from its environment module:

$ module purge
$ module load lang/python/pypy3.6-7.1.1-portable

The first thing to do is to also load GCC 8.2 and OpenBLAS 0.3.7:

$ module load lang/gcc/8.2.0 libs/openblas/0.3.7_gcc82

We start by creating the virtual environment with virtualenv. Notice that PyPy comes with its own version of virtualenv called virtualenv-pypy:

$ virtualenv-pypy pypy3.6

Activate the environment, create a build folder inside and change to that folder:

$ source pypy3.6/bin/activate
$ mkdir pypy3.6/build
$ cd pypy3.6/build

Now we want to install NumPy and SciPy. Do not try to use pip install ...; it will bring generic versions that use the reference BLAS and LAPACK.

Download and extract the packages manually:

$ wget https://github.com/numpy/numpy/releases/download/v1.17.3/numpy-1.17.3.tar.gz
$ wget https://github.com/scipy/scipy/releases/download/v1.3.1/scipy-1.3.1.tar.gz
$ tar -zxvf numpy-1.17.3.tar.gz
$ tar -zxvf scipy-1.3.1.tar.gz

First configure NumPy. Go to the folder numpy-1.17.3 and create a file site.cfg with this content:

[ALL]
library_dirs = /shared/software/libs/openblas/0.3.7_gcc82/lib
include_dirs = /shared/software/libs/openblas/0.3.7_gcc82/include

[openblas]
libraries = openblas
library_dirs = /shared/software/libs/openblas/0.3.7_gcc82/lib
include_dirs = /shared/software/libs/openblas/0.3.7_gcc82/include
runtime_library_dirs = /shared/software/libs/openblas/0.3.7_gcc82/lib

You can build and install NumPy with pip; from inside the numpy-1.17.3 folder, just execute:

$ pip3 install .

It takes a while, and you end up with NumPy installed and the installation tracked by pip. The configuration and installation of SciPy is very similar:

$ cd ../scipy-1.3.1/
$ cp ../numpy-1.17.3/site.cfg .
$ pip3 install .

With NumPy and SciPy installed, we can add any other packages that you plan to use in the virtual environment. As an example, we will install ipython and pytest:

$ pip3 install pytest ipython

(pypy3.6) 12:31:11-gufranco@trcis001:~/scratch/pypy3.6$ ipython
Python 3.6.1 (784b254d669919c872a505b807db8462b6140973, Apr 16 2019, 18:18:28)
Type 'copyright', 'credits' or 'license' for more information
IPython 7.9.0 -- An enhanced Interactive Python. Type '?' for help.

In [1]: import numpy as np

In [2]: np.test()
NumPy version 1.17.3
NumPy relaxed strides checking option: True
................................................. [  0%]
.................................................
...ssssssss...................................... [100%]
10090 passed, 116 skipped, 173 deselected, 18 xfailed, 4 xpassed in 274.53s (0:04:34)

The configuration flags used by NumPy and SciPy can be obtained with the commands below:

(pypy3.6) 13:21:20-gufranco@trcis001:~/scratch/pypy3.6$ ipython
Python 3.6.1 (784b254d669919c872a505b807db8462b6140973, Apr 16 2019, 18:18:28)
Type 'copyright', 'credits' or 'license' for more information
IPython 7.9.0 -- An enhanced Interactive Python. Type '?' for help.

In [1]: import scipy

In [2]: scipy.__config__.show()
lapack_mkl_info:
  NOT AVAILABLE
openblas_lapack_info:
    libraries = ['openblas', 'openblas']
    library_dirs = ['/shared/software/libs/openblas/0.3.7_gcc82/lib']
    language = c
    define_macros = [('HAVE_CBLAS', None)]
    runtime_library_dirs = ['/shared/software/libs/openblas/0.3.7_gcc82/lib']
lapack_opt_info:
    libraries = ['openblas', 'openblas']
    library_dirs = ['/shared/software/libs/openblas/0.3.7_gcc82/lib']
    language = c
    define_macros = [('HAVE_CBLAS', None)]
    runtime_library_dirs = ['/shared/software/libs/openblas/0.3.7_gcc82/lib']
blas_mkl_info:
  NOT AVAILABLE
blis_info:
  NOT AVAILABLE
openblas_info:
    libraries = ['openblas', 'openblas']
    library_dirs = ['/shared/software/libs/openblas/0.3.7_gcc82/lib']
    language = c
    define_macros = [('HAVE_CBLAS', None)]
    runtime_library_dirs = ['/shared/software/libs/openblas/0.3.7_gcc82/lib']
blas_opt_info:
    libraries = ['openblas', 'openblas']
    library_dirs = ['/shared/software/libs/openblas/0.3.7_gcc82/lib']
    language = c
    define_macros = [('HAVE_CBLAS', None)]
    runtime_library_dirs = ['/shared/software/libs/openblas/0.3.7_gcc82/lib']

Multithreading with OpenBLAS

Another reason to prefer OpenBLAS or MKL over the reference BLAS and LAPACK is that the optimized versions provide multithreading, which allows several cores that share the same memory space to work on the arrays concurrently.

OpenBLAS was compiled with multithreading enabled, and the number of threads can be controlled with several environment variables:

export OPENBLAS_NUM_THREADS=4
export GOTO_NUM_THREADS=4
export OMP_NUM_THREADS=4

The priorities are OPENBLAS_NUM_THREADS > GOTO_NUM_THREADS > OMP_NUM_THREADS.
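
The thread count can also be set at run time. OpenBLAS's cblas.h declares openblas_set_num_threads() and openblas_get_num_threads() for this purpose; a minimal C sketch:

#include <stdio.h>
#include <cblas.h>

int main(void) {
    /* Query the default, which honors OPENBLAS_NUM_THREADS if set. */
    printf("default threads: %d\n", openblas_get_num_threads());

    /* Override the thread count for subsequent BLAS calls. */
    openblas_set_num_threads(4);
    printf("now using:       %d\n", openblas_get_num_threads());
    return 0;
}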

See more about this here:

https://github.com/xianyi/OpenBLAS

In the case of MKL there are two threading layers, -lmkl_sequential and -lmkl_intel_thread. Selecting the right set of libraries for MKL is complex, but I always use the link advisor to help me:

https://software.intel.com/en-us/articles/intel-mkl-link-line-advisor
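
As an illustration only (assumed typical advisor output; the exact line depends on your compiler, MPI, and integer size, so verify with the advisor), sequential and threaded link lines for gfortran look like:

gfortran prog.f90 -lmkl_gf_lp64 -lmkl_sequential -lmkl_core -lpthread -lm -ldl
gfortran prog.f90 -lmkl_gf_lp64 -lmkl_gnu_thread -lmkl_core -lgomp -lpthread -lm -ldl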

The only case I can think of where someone would want to use the reference BLAS and LAPACK is to check a numerical issue seen with the optimized versions; in production, the reference versions should never be used.

ScaLAPACK 2.2.0 on Spruce with GCC 10.3.0

Modules:

$> module load lang/gcc/10.3.0 parallel/openmpi/3.1.6_gcc103 libs/openblas/0.3.19_gcc103 dev/cmake/3.23.0

Download the software:

$> wget https://github.com/Reference-ScaLAPACK/scalapack/archive/refs/tags/v2.2.0.tar.gz -O scalapack-2.2.0.tar.gz
$> tar -zxvf scalapack-2.2.0.tar.gz
$> cd scalapack-2.2.0

Create a build directory inside the sources, change into it, and use cmake:

$> mkdir build
$> cd build
$> cmake -DCMAKE_Fortran_COMPILER=mpif90 -DCMAKE_C_COMPILER=mpicc \
-DCMAKE_Fortran_FLAGS=-fallow-argument-mismatch \
-DCMAKE_INSTALL_PREFIX=/shared/software/libs/scalapack/2.2.0_gcc103_ompi316 \
-DCMAKE_POSITION_INDEPENDENT_CODE=ON ..

Run make, followed by make install to install into the chosen prefix:

$> make
$> make install

ELPA 2021.11 on Spruce with GCC 10.3

Modules:

$> module load lang/gcc/10.3.0 libs/openblas/0.3.19_gcc103 libs/scalapack/2.2.0_gcc103_ompi407 \
  dev/cmake/3.23.0 parallel/openmpi/4.0.7_gcc103

Configure from a build directory inside the extracted sources (the ../configure path assumes this layout):

$> ../configure --prefix=/shared/software/libs/elpa/2021.11_gcc103_ompi407 \
LDFLAGS="-L${MD_OPENBLAS}/lib -L${MD_SCALAPACK}/lib" \
--disable-sse --disable-avx --disable-avx2 --disable-avx512 CFLAGS=-fPIC

ELPA 2021.11 on Thorny with Intel 2022

Modules:

$> module load lang/gcc/9.3.0 parallel/cuda/11.7 compiler mpi mkl

Configure:

$> export SCALAPACK="-L/gpfs20/shared/software/intel/oneapi/mkl/2022.1.0/lib/intel64 -lmkl_scalapack_lp64 -lmkl_intel_lp64 -lmkl_intel_thread -lmkl_core -lmkl_blacs_intelmpi_lp64 -liomp5 -lpthread -lm -ldl"
$> ../configure --enable-avx512 --enable-nvidia-gpu CC=mpiicc CXX=mpiicpc FC=mpiifort SCALAPACK_LDFLAGS="$SCALAPACK" --with-NVIDIA-GPU-compute-capability=sm_61 --prefix=/shared/software/libs/elpa/2021.11.002_intel22_impi22