Parallel Programming: OpenACC

The fastest supercomputers today include, alongside the usual CPUs, extra devices specialized to perform certain computational tasks much faster than a CPU can. These devices are called accelerators, and the most prominent of them today are Graphics Processing Units (GPUs).

Accelerators typically have their own memory, so programming these devices requires being aware of memory management and data movement. The memory on an accelerator is usually smaller than the RAM available on a compute node, and transferring data from RAM to the accelerator's internal memory is typically about an order of magnitude slower than transfers between RAM and the CPU.

As GPUs are the most common accelerators today, and they are the devices installed on our clusters, the rest of this section concentrates on them. We will present two solutions for programming GPUs. The first is OpenACC, a directive-based standard for parallel computing that targets GPUs from several vendors as well as other accelerators such as Intel's Xeon Phi (discontinued in 2020), field-programmable gate arrays (FPGAs), and more exotic devices such as digital signal processors (DSPs). We will show how to use OpenACC with the NVIDIA HPC SDK.

For the specific case of NVIDIA GPUs, CUDA is the native solution for parallel computing. CUDA is a platform and application programming interface (API) created by NVIDIA for its GPUs; the API has evolved over the years and several libraries have been built on top of it to take advantage of these devices. Our clusters have NVIDIA GPUs, and we will present how to compile and run basic programs written in CUDA.

OpenACC

OpenACC is a directive-based standard that allows developers to take advantage of various kinds of accelerators, such as GPUs from NVIDIA. Being a directive-based approach, it is similar to OpenMP, which we saw in a previous section. All that is needed is to add some directives to the original source code so that an OpenACC-aware compiler can use them to parallelize certain portions of the code. The compiler will generate parallel multithreaded code, similar to OpenMP, but with the addition of data movement in and out of the device, since most accelerators have their own memory and data must be transferred before the accelerator can process it. Compilers that are not OpenACC capable will simply ignore the directives and compile serial code as usual.
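
As a first taste of what such a directive looks like, the sketch below adds OpenACC to a simple vector addition. This file is not part of this section's examples; the array size and names are made up for illustration. The parallel loop directive marks the loop for the device, and the copyin/copyout clauses describe the data movement between the host RAM and the accelerator memory.

#include <stdio.h>

#define N 1000000

int main()
{
    static double a[N], b[N], c[N];

    for (int i = 0; i < N; i++) {
        a[i] = i;
        b[i] = 2.0 * i;
    }

    /* copyin: send a and b to the device; copyout: bring c back to the host */
    #pragma acc parallel loop copyin(a[0:N], b[0:N]) copyout(c[0:N])
    for (int i = 0; i < N; i++) {
        c[i] = a[i] + b[i];
    }

    printf("c[N-1] = %f\n", c[N - 1]);
    return 0;
}

Compiled with an OpenACC-aware compiler and the corresponding flag (for example nvc -acc, as shown below), the loop is offloaded; compiled without it, the pragma is ignored and the loop runs serially on the CPU.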

Compilers with OpenACC support

The best set of compilers for OpenACC is the NVIDIA HPC SDK. There is also support in the GNU Compiler Collection (GCC), starting with GCC 8, which supports OpenACC 2.0, up to GCC 10, which supports OpenACC 2.6. For this section we will use the NVIDIA HPC SDK compilers. To load them, execute:

module load lang/nvidia/nvhpc

Now that you have loaded the module, the three compilers that you can use for OpenACC are nvfortran, nvc and nvc++, for Fortran, C and C++ respectively.
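
Before jumping into a full example, a quick way to verify that the compiler and the OpenACC runtime can actually see a GPU is to query the runtime API. The short program below is a minimal sketch using acc_get_num_devices from openacc.h; the file name used to compile it is just an example.

#include <stdio.h>
#include <openacc.h>

int main()
{
    /* Ask the OpenACC runtime how many NVIDIA devices are available */
    int ngpus = acc_get_num_devices(acc_device_nvidia);
    printf("Number of NVIDIA GPUs visible to OpenACC: %d\n", ngpus);
    return 0;
}

Compile and run it, for example, with:

$> nvc -acc check_gpus.c
$> ./a.out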

Submit a job interactively

You do not need to compile the code on a compute node with GPUs, but for learning purposes it is good to work directly on an interactive node with a GPU, so you can test executions and recompile as needed.

The command to request one GPU card for an interactive job is:

$> qsub -I -l nodes=1:ppn=8:gpus=1 -q comm_gpu_inter

Now, let's start with a very simple example. Consider the following code:

#include <stdio.h>
#include <limits.h>
#include <math.h>

// LONG_MAX is 9223372036854775807
#define N 1000000000

int main()
{
    double pi = 0.0;
    long int i, nmax;

    nmax = (long int)(LONG_MAX / N);

    /* The loop below is a candidate for offloading to the accelerator */
    #pragma acc kernels
    for (i = 0; i < nmax; i++)
    {
        double t = (double)((i + 0.05) / nmax);
        pi += 4.0 / (1.0 + t * t);
    }

    printf("PI using %ld terms in the series is equal to %20.15le\n", nmax, pi / nmax);
    printf("Difference with reference value %e\n", fabs(M_PI - pi / nmax));

    return 0;
}

Notice that we are using a line with the pragma:

#pragma acc kernels

That line instructs an OpenACC-capable compiler that the block of code after the pragma, i.e., the for loop, can be parallelized. Let's first compile the code without requesting OpenACC:

$> nvc pi_acc_d.c

After the code is compiled, the resulting executable is by default the file a.out. We can execute this file and time the execution with:

$> time ./a.out
PI using 9223372036 terms in the series is equal to 3.141592653687312e+00
Difference with reference value 9.751933e-11

real    0m24.704s
user    0m24.643s
sys     0m0.003s

On the current machine, it takes more than 20 seconds to compute the whole series. Now let's compile again, passing the arguments that ask nvc to act on the OpenACC pragma in the sources:

$> nvc pi_acc_d.c -acc -Minfo=all

With the addition of -Minfo=all we get some extra information about which portions of the code will be parallelized and how. Repeat the execution with the new binary:

$> time ./a.out
PI using 9223372036 terms in the series is equal to 3.141592653687371e+00
Difference with reference value 9.757795e-11

real    0m1.922s
user    0m0.399s
sys     0m1.459s

You will observe that it takes less than 2 seconds, a very significant improvement over the serial version.
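
With the kernels directive, the compiler decides on its own how to parallelize the block and has to detect that pi is a reduction variable. An equivalent, more explicit way to write the same loop, shown here as a sketch that would replace the pragma and loop in the program above, is to use the parallel loop directive and state the sum reduction on pi directly:

/* Explicit alternative to "acc kernels": the reduction on pi is stated by hand */
#pragma acc parallel loop reduction(+:pi)
for (i = 0; i < nmax; i++)
{
    double t = (double)((i + 0.05) / nmax);
    pi += 4.0 / (1.0 + t * t);
}

Both versions are compiled the same way with nvc -acc; spelling out the reduction simply avoids relying on the compiler's analysis of the loop.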

Example from the XSEDE workshop on OpenACC

Now, we will show how to compile and execute the code proposed during the XSEDE workshop on OpenACC. To compile the serial Fortran version use:

$> nvfortran laplace_serial.f90

Execute the code with a limit of 4000 iterations:

$> echo 4000 | time ./a.out

The final lines from the execution are:

---------- Iteration number:          3300  ---------------
( 995, 995): 97.66  ( 996, 996): 98.24  ( 997, 997): 98.75  ( 998, 998): 99.19  ( 999, 999): 99.56  (1000,1000): 99.87
Max error at iteration          3372  was    9.9953310357534519E-003
Total time was     17.64035      seconds.
17.59user 0.00system 0:17.64elapsed 99%CPU (0avgtext+0avgdata 18484maxresident)k
0inputs+0outputs (0major+5181minor)pagefaults 0swaps

Now, let's compile the solution proposed during the workshop:

$> nvfortran laplace_acc.f90 -acc -Minfo=all
serial:
     44, Generating copy(temperature_last(:,:)) [if not already present]
         Generating create(temperature(:,:)) [if not already present]
     48, Loop is parallelizable
     49, Loop is parallelizable
         Generating Tesla code
         48, !$acc loop gang, vector(4) ! blockidx%y threadidx%y
         49, !$acc loop gang, vector(32) ! blockidx%x threadidx%x
     59, Generating implicit copy(dt) [if not already present]
     60, Loop is parallelizable
     61, Loop is parallelizable
         Generating Tesla code
         60, !$acc loop gang, vector(4) ! blockidx%y threadidx%y
             Generating implicit reduction(max:dt)
         61, !$acc loop gang, vector(32) ! blockidx%x threadidx%x
     70, Generating update self(temperature(:,:))
initialize:
     98, Memory zero idiom, array assignment replaced by call to pgf90_mzero8
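
The workshop solution itself is written in Fortran, but the -Minfo output above already reveals the directive pattern it uses: a data region that keeps temperature and temperature_last on the GPU for the whole iteration loop, two parallelizable loop nests (the second one with an implicit max reduction on dt), and an update self directive that brings the temperatures back to the host only when they are printed. The code below is a minimal C sketch of that same pattern, not the workshop code itself: the grid size, boundary values and printing are illustrative.

#include <stdio.h>
#include <math.h>

#define ROWS            1000
#define COLUMNS         1000
#define MAX_TEMP_ERROR  0.01

double temperature[ROWS+2][COLUMNS+2];
double temperature_last[ROWS+2][COLUMNS+2];

int main()
{
    int max_iterations = 4000;
    int iteration = 1;
    double dt = 100.0;

    /* Interior starts at zero; right and bottom boundaries ramp up to 100 */
    for (int i = 0; i <= ROWS + 1; i++)
        for (int j = 0; j <= COLUMNS + 1; j++)
            temperature_last[i][j] = 0.0;
    for (int i = 0; i <= ROWS + 1; i++)
        temperature_last[i][COLUMNS + 1] = (100.0 / ROWS) * i;
    for (int j = 0; j <= COLUMNS + 1; j++)
        temperature_last[ROWS + 1][j] = (100.0 / COLUMNS) * j;

    /* Keep both arrays on the device for the whole iteration loop */
    #pragma acc data copy(temperature_last), create(temperature)
    while (dt > MAX_TEMP_ERROR && iteration <= max_iterations) {

        /* Stencil update: each point becomes the average of its four neighbors */
        #pragma acc kernels
        for (int i = 1; i <= ROWS; i++)
            for (int j = 1; j <= COLUMNS; j++)
                temperature[i][j] = 0.25 * (temperature_last[i+1][j] + temperature_last[i-1][j]
                                          + temperature_last[i][j+1] + temperature_last[i][j-1]);

        dt = 0.0;
        /* Copy the new grid back and track the largest change (max reduction on dt) */
        #pragma acc kernels
        for (int i = 1; i <= ROWS; i++)
            for (int j = 1; j <= COLUMNS; j++) {
                dt = fmax(fabs(temperature[i][j] - temperature_last[i][j]), dt);
                temperature_last[i][j] = temperature[i][j];
            }

        if (iteration % 100 == 0) {
            /* Bring the data back to the host only when we want to print it */
            #pragma acc update self(temperature)
            printf("Iteration %4d, dt = %f, T(%d,%d) = %f\n",
                   iteration, dt, ROWS, COLUMNS, temperature[ROWS][COLUMNS]);
        }
        iteration++;
    }

    printf("Stopped at iteration %d with dt = %f\n", iteration - 1, dt);
    return 0;
}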

And test the new binary the same way, with a limit of 4000 iterations:

---------- Iteration number:          3300  ---------------
( 995, 995): 97.66  ( 996, 996): 98.24  ( 997, 997): 98.75  ( 998, 998): 99.19  ( 999, 999): 99.56  (1000,1000): 99.87
Max error at iteration          3372  was    9.9953310357534519E-003
Total time was     1.402100      seconds.
0.30user 1.32system 0:01.68elapsed 96%CPU (0avgtext+0avgdata 208792maxresident)k
0inputs+0outputs (0major+24657minor)pagefaults 0swaps

There is a significant improvement in the execution time of the Laplace solver when running on a compute node with one GPU card.