LAMMPS¶
Large-scale Atomic/Molecular Massively Parallel Simulator (LAMMPS) is a molecular dynamics code from Sandia National Laboratories. LAMMPS uses hybrid parallelization techniques: it can combine multicore parallelism (OpenMP), the Message Passing Interface (MPI) for distributed-memory communication, and accelerators such as GPUs. The code is free and open-source software, distributed under the terms of the GNU General Public License.
In molecular dynamics the forces that act on particles have a limited range. For computational efficiency, LAMMPS uses neighbor lists (Verlet lists) to keep track of nearby particles. The lists are optimized for systems with particles that repel at short distances, so that the local density of particles never grows too large.
Another optimization strategy for large systems is domain decomposition. LAMMPS uses spatial-decomposition techniques to partition the simulation domain into small 3D sub-domains, each of which is assigned to a different process. Processes communicate and store ghost-atom information for atoms that border their sub-domain. LAMMPS is most efficient (in a parallel-computing sense) for systems whose particles fill a 3D rectangular box with approximately uniform density.
In this tutorial we will demonstrate how to run LAMMPS with a hybrid parallelization scheme, combining MPI, OpenMP, and GPU acceleration for a simple benchmark case.
Creating the input file¶
Select a good location for executing the simulation, for example $SCRATCH/LAMMPS:
$> mkdir $SCRATCH/LAMMPS
$> cd $SCRATCH/LAMMPS
Download the example input file from the LAMMPS web server:
$> wget https://lammps.sandia.gov/inputs/in.lj.txt
The example is very simple; alternatively, you can just copy and paste the input file below:
# 3d Lennard-Jones melt
variable x index 1
variable y index 1
variable z index 1
variable xx equal 20*$x
variable yy equal 20*$y
variable zz equal 20*$z
units lj
atom_style atomic
lattice fcc 0.8442
region box block 0 ${xx} 0 ${yy} 0 ${zz}
create_box 1 box
create_atoms 1 box
mass 1 1.0
velocity all create 1.44 87287 loop geom
pair_style lj/cut 2.5
pair_coeff 1 1 1.0 1.0 2.5
neighbor 0.3 bin
neigh_modify delay 0 every 20 check no
fix 1 all nve
run 100
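The variables x, y, and z default to 1, giving a box of 20x20x20 lattice cells. They can be overridden on the command line with the -var option, which is how we will scale the problem up later in this tutorial. A minimal sketch of the mechanism (for illustration only; it assumes a LAMMPS executable called lmp is available, whereas in practice we will invoke the binary shipped inside the container as shown further below):
# Override the box-scaling variables at run time: 20*8 x 20*4 x 20*8 = 160 x 80 x 160
# lattice cells. The fcc lattice has 4 atoms per unit cell, so this creates
# 160*80*160*4 = 8,192,000 atoms.
lmp -var x 8 -var y 4 -var z 8 -in in.lj.txt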
Requesting an interactive job¶
For the purpose of this example we will use an interactive job; further down we will show how to achieve the same result via a batch job. We start by requesting an entire GPU compute node. On Thorny Flat, GPU nodes have 3 GPU cards and 24 CPU cores. That is what we are requesting here for a job with a wall time of 2 hours:
$> srun -p comm_gpu_inter -G 3 -t 2:00:00 -c 24 --pty bash
After a few seconds we are redirected to one of the GPU compute nodes. We can check which GPU cards were assigned to us with the command nvidia-smi:
tcogq003:~/scratch/LAMMPS$ nvidia-smi
Wed May 3 18:37:41 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 530.30.02 Driver Version: 530.30.02 CUDA Version: 12.1 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 Quadro P6000 Off| 00000000:37:00.0 Off | Off |
| 26% 18C P8 8W / 250W| 0MiB / 24576MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 1 Quadro P6000 Off| 00000000:AF:00.0 Off | Off |
| 26% 19C P8 8W / 250W| 0MiB / 24576MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 2 Quadro P6000 Off| 00000000:D8:00.0 Off | Off |
| 26% 21C P8 8W / 250W| 0MiB / 24576MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| No running processes found |
+---------------------------------------------------------------------------------------+
We can see 3 GPU cards. They are NVIDIA Quadro P6000, which are our older cards. It is important to verify this because it determines which compute capabilities are available on the cards and, therefore, which build of LAMMPS we can use.
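If you prefer to see the compute capability as a number instead of inferring it from the card model, nvidia-smi can report it directly on reasonably recent drivers (the compute_cap query field is an assumption about your driver version; if it is not recognized, rely on the gencode table further below). For the Quadro P6000 it should report 6.1:
tcogq003:~/scratch/LAMMPS$ nvidia-smi --query-gpu=name,compute_cap --format=csv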
Our next step is to load Singularity, since the LAMMPS build we will use is packaged as a container image optimized for GPUs. Execute this command:
tcogq003:~/scratch/LAMMPS$ module load singularity
Now we are ready to enter the filesystem of the container. The container we will use is NGC_LAMMPS_patch_3Nov2022.sif, located in /shared/containers. Get a shell inside the container with the command:
tcogq003:~/scratch/LAMMPS$ singularity shell --nv /shared/containers/NGC_LAMMPS_patch_3Nov2022.sif
Remember to use the argument --nv to access the GPUs inside the container. It is good practice to double-check that this is the case:
Singularity> nvidia-smi
Wed May 3 18:48:19 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 530.30.02 Driver Version: 530.30.02 CUDA Version: 12.1 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 Quadro P6000 Off| 00000000:37:00.0 Off | Off |
| 26% 18C P8 8W / 250W| 0MiB / 24576MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 1 Quadro P6000 Off| 00000000:AF:00.0 Off | Off |
| 26% 19C P8 8W / 250W| 0MiB / 24576MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 2 Quadro P6000 Off| 00000000:D8:00.0 Off | Off |
| 26% 21C P8 8W / 250W| 0MiB / 24576MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| No running processes found |
+---------------------------------------------------------------------------------------+
We can now prepare the variables to execute LAMMPS.
The first variable is OMP_NUM_THREADS, which controls the number of OpenMP threads that each process can create. We have 3 GPU cards and 24 CPU cores. As we will see in a minute, a good distribution is to assign one GPU to each MPI process and use OpenMP threads to take advantage of the extra cores: with 3 MPI processes and 24 cores, that is 8 threads per process. Most of the processing takes place on the GPU, so this setting is not critical, but it is a good way to try to maximize the use of resources. Execute this command:
export OMP_NUM_THREADS=8
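If you prefer not to hard-code the value, a sketch of an alternative is to derive it from the Slurm allocation, assuming one MPI rank per GPU (SLURM_CPUS_PER_TASK and SLURM_GPUS are set by the srun request above; the fallback values are only a safety net):
# 24 CPU cores divided by 3 GPUs (one MPI rank per GPU) gives 8 threads per rank
export OMP_NUM_THREADS=$(( ${SLURM_CPUS_PER_TASK:-24} / ${SLURM_GPUS:-3} ))
echo $OMP_NUM_THREADS    # should print 8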
The next step depends on the hardware capabilities of the GPU we are using. The NVIDIA Quadro P6000 supports compute capability (gencode) sm_61.
The container offers several builds of LAMMPS:
Singularity> ls /usr/local/lammps/
sm60 sm70 sm75 sm80 sm86 sm90
The build you select must be consistent with the hardware: its gencode must not exceed the highest gencode supported by the card.
This table can guide you in the selection of the appropriate gencode:
| Fermi | Kepler | Maxwell | Pascal | Volta | Turing | Ampere | Ada (Lovelace) | Hopper |
|---|---|---|---|---|---|---|---|---|
| sm_20 | sm_30 | sm_50 | sm_60 | sm_70 | sm_75 | sm_80 | sm_89 | sm_90 |
| | sm_35 | sm_52 | sm_61 | sm_72 (Xavier) | | sm_86 | | sm_90a (Thor) |
| | sm_37 | sm_53 | sm_62 | | | sm_87 (Orin) | | |
What matters to us is that the Quadro P6000 (compute capability 6.1) cannot use any build higher than sm60, since the next build available in the container is sm70. We will set the variable LD_LIBRARY_PATH so that the LAMMPS libraries are found in the right location. Execute:
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/lammps/sm60/lib
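If you work with different GPU models, a small sketch like the following can pick the matching build automatically (this helper is hypothetical, not part of the container; it assumes the compute_cap query works on your driver and that the build directories are named sm<NN> as listed above):
# Pick the highest available build that does not exceed the GPU's compute capability
CC=$(nvidia-smi --query-gpu=compute_cap --format=csv,noheader | head -n1 | tr -d '.')   # e.g. 61
BUILD=sm60
for d in /usr/local/lammps/sm*; do
    n=${d##*/sm}
    [ "$n" -le "$CC" ] && BUILD=sm$n
done
echo $BUILD    # sm60 on a Quadro P6000
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/lammps/$BUILD/lib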
We are now ready to execute LAMMPS. The following command launches 3 MPI processes, one per GPU card, to run the input file above. The option -k on g 3 enables the KOKKOS package with 3 GPUs per node, -sf kk appends the /kk suffix to the styles in the input, -pk kokkos ... sets the package options, and the -var options override the x, y, and z variables to scale up the box:
Singularity> mpirun -n 3 /usr/local/lammps/sm60/bin/lmp -k on g 3 -sf kk -pk kokkos cuda/aware on neigh full comm device binsize 2.8 -var x 8 -var y 4 -var z 8 -in in.lj.txt
The code runs in just a few seconds. The output looks like this:
LAMMPS (3 Nov 2022)
KOKKOS mode is enabled (src/KOKKOS/kokkos.cpp:106)
will use up to 3 GPU(s) per node
Kokkos::Cuda::initialize WARNING: running kernels compiled for compute capability 6.0 on device with compute capability 6.1 , this will likely reduce potential performance.
Kokkos::Cuda::initialize WARNING: running kernels compiled for compute capability 6.0 on device with compute capability 6.1 , this will likely reduce potential performance.
using 1 OpenMP thread(s) per MPI task
Kokkos::Cuda::initialize WARNING: running kernels compiled for compute capability 6.0 on device with compute capability 6.1 , this will likely reduce potential performance.
Kokkos::Cuda::initialize WARNING: running kernels compiled for compute capability 6.0 on device with compute capability 6.1 , this will likely reduce potential performance.
Kokkos::Cuda::initialize WARNING: running kernels compiled for compute capability 6.0 on device with compute capability 6.1 , this will likely reduce potential performance.
Kokkos::Cuda::initialize WARNING: running kernels compiled for compute capability 6.0 on device with compute capability 6.1 , this will likely reduce potential performance.
Lattice spacing in x,y,z = 1.6795962 1.6795962 1.6795962
Created orthogonal box = (0 0 0) to (268.73539 134.3677 268.73539)
1 by 1 by 3 MPI processor grid
Created 8192000 atoms
using lattice units in orthogonal box = (0 0 0) to (268.73539 134.3677 268.73539)
create_atoms CPU = 0.964 seconds
Generated 0 of 0 mixed pair_coeff terms from geometric mixing rule
Neighbor list info ...
update: every = 20 steps, delay = 0 steps, check = no
max neighbors/atom: 2000, page size: 100000
master list distance cutoff = 2.8
ghost atom cutoff = 2.8
binsize = 2.8, bins = 96 48 96
1 neighbor lists, perpetual/occasional/extra = 1 0 0
(1) pair lj/cut/kk, perpetual
attributes: full, newton off, kokkos_device
pair build: full/bin/kk/device
stencil: full/bin/3d
bin: kk/device
Setting up Verlet run ...
Unit style : lj
Current step : 0
Time step : 0.005
Per MPI rank memory allocation (min/avg/max) = 417.8 | 419.8 | 423 Mbytes
Step Temp E_pair E_mol TotEng Press
0 1.44 -6.7733681 0 -4.6133683 -5.0196694
100 0.75927734 -5.761232 0 -4.6223161 0.19102612
Loop time of 3.78732 on 3 procs for 100 steps with 8192000 atoms
Performance: 11406.477 tau/day, 26.404 timesteps/s, 216.301 Matom-step/s
64.0% CPU use with 3 MPI tasks x 1 OpenMP threads
MPI task timing breakdown:
Section | min time | avg time | max time |%varavg| %total
---------------------------------------------------------------
Pair | 0.054429 | 0.054921 | 0.055782 | 0.3 | 1.45
Neigh | 0.57971 | 0.60578 | 0.63878 | 3.2 | 16.00
Comm | 0.51117 | 0.53519 | 0.55821 | 2.6 | 14.13
Output | 0.00041937 | 0.00451 | 0.0078808 | 4.6 | 0.12
Modify | 2.5216 | 2.5246 | 2.5287 | 0.2 | 66.66
Other | | 0.06236 | | | 1.65
Nlocal: 2.73067e+06 ave 2.73706e+06 max 2.7274e+06 min
Histogram: 2 0 0 0 0 0 0 0 0 1
Nghost: 361536 ave 368174 max 348360 min
Histogram: 1 0 0 0 0 0 0 0 0 2
Neighs: 0 ave 0 max 0 min
Histogram: 3 0 0 0 0 0 0 0 0 0
FullNghs: 2.05098e+08 ave 2.05575e+08 max 2.04853e+08 min
Histogram: 2 0 0 0 0 0 0 0 0 1
Total # of neighbors = 6.1529347e+08
Ave neighs/atom = 75.109066
Neighbor list builds = 5
Dangerous builds not checked
Total wall time: 0:00:08
Executing LAMMPS from a batch job¶
We can use everything we learned above to convert the execution into a submission script. Write a submission script that we will call runjob.slurm:
#!/bin/bash
#SBATCH --job-name=LAMMPS
#SBATCH -p comm_gpu_inter
#SBATCH -G 3
#SBATCH -t 2:00:00
#SBATCH -c 24

# Start from a clean environment and load Singularity
module purge
module load singularity

# Run from the directory where the job was submitted
cd $SLURM_SUBMIT_DIR
pwd

# Execute the helper script inside the container, with GPU access enabled (--nv)
singularity exec --nv /shared/containers/NGC_LAMMPS_patch_3Nov2022.sif ./run_lammps.sh
We also need another file that will set the variables and run LAMMPS inside the container. This file is called run_lammps.sh and its content is:
#!/bin/bash
# One MPI rank per GPU (3 GPUs), 8 OpenMP threads per rank, using the sm60 build
export OMP_NUM_THREADS=8
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/lammps/sm60/lib
mpirun -n 3 /usr/local/lammps/sm60/bin/lmp -k on g 3 -sf kk -pk kokkos cuda/aware on neigh full comm device binsize 2.8 -var x 8 -var y 4 -var z 8 -in in.lj.txt
We need to make this file executable using the command:
$> chmod +x run_lammps.sh
Now we can submit the job and wait for the result. Submit the job with the command:
$> sbatch runjob.slurm
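While the job is queued or running you can monitor its state with the standard Slurm command:
$> squeue -u $USER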
We will get two files: log.lammps, which is the traditional log from LAMMPS, and a file that looks like slurm-194660.out, which holds the output that would otherwise be shown on the screen, stored in a file named after the corresponding JobID.
The file looks similar to the output from our interactive execution:
tcogq003:~/scratch/LAMMPS$ cat slurm-194660.out
Removing gcc version 9.3.0 : lang/gcc/9.3.0
Removing git version 2.29.1 : dev/git/2.29.1
/gpfs20/scratch/gufranco/LAMMPS
LAMMPS (3 Nov 2022)
KOKKOS mode is enabled (src/KOKKOS/kokkos.cpp:106)
will use up to 3 GPU(s) per node
Kokkos::Cuda::initialize WARNING: running kernels compiled for compute capability 6.0 on device with compute capability 6.1 , this will likely reduce potential performance.
Kokkos::Cuda::initialize WARNING: running kernels compiled for compute capability 6.0 on device with compute capability 6.1 , this will likely reduce potential performance.
Kokkos::Cuda::initialize WARNING: running kernels compiled for compute capability 6.0 on device with compute capability 6.1 , this will likely reduce potential performance.
Kokkos::Cuda::initialize WARNING: running kernels compiled for compute capability 6.0 on device with compute capability 6.1 , this will likely reduce potential performance.
Kokkos::Cuda::initialize WARNING: running kernels compiled for compute capability 6.0 on device with compute capability 6.1 , this will likely reduce potential performance.
Kokkos::Cuda::initialize WARNING: running kernels compiled for compute capability 6.0 on device with compute capability 6.1 , this will likely reduce potential performance.
using 1 OpenMP thread(s) per MPI task
Lattice spacing in x,y,z = 1.6795962 1.6795962 1.6795962
Created orthogonal box = (0 0 0) to (268.73539 134.3677 268.73539)
1 by 1 by 3 MPI processor grid
Created 8192000 atoms
using lattice units in orthogonal box = (0 0 0) to (268.73539 134.3677 268.73539)
create_atoms CPU = 0.963 seconds
Generated 0 of 0 mixed pair_coeff terms from geometric mixing rule
Neighbor list info ...
update: every = 20 steps, delay = 0 steps, check = no
max neighbors/atom: 2000, page size: 100000
master list distance cutoff = 2.8
ghost atom cutoff = 2.8
binsize = 2.8, bins = 96 48 96
1 neighbor lists, perpetual/occasional/extra = 1 0 0
(1) pair lj/cut/kk, perpetual
attributes: full, newton off, kokkos_device
pair build: full/bin/kk/device
stencil: full/bin/3d
bin: kk/device
Setting up Verlet run ...
Unit style : lj
Current step : 0
Time step : 0.005
Per MPI rank memory allocation (min/avg/max) = 417.8 | 419.8 | 423 Mbytes
Step Temp E_pair E_mol TotEng Press
0 1.44 -6.7733681 0 -4.6133683 -5.0196694
100 0.75927734 -5.761232 0 -4.6223161 0.19102612
Loop time of 3.78997 on 3 procs for 100 steps with 8192000 atoms
Performance: 11398.509 tau/day, 26.385 timesteps/s, 216.149 Matom-step/s
63.1% CPU use with 3 MPI tasks x 1 OpenMP threads
MPI task timing breakdown:
Section | min time | avg time | max time |%varavg| %total
---------------------------------------------------------------
Pair | 0.054214 | 0.054977 | 0.056248 | 0.4 | 1.45
Neigh | 0.57713 | 0.60327 | 0.63866 | 3.3 | 15.92
Comm | 0.50607 | 0.54467 | 0.57153 | 3.8 | 14.37
Output | 0.00047028 | 0.012894 | 0.03579 | 14.3 | 0.34
Modify | 2.5112 | 2.5169 | 2.5282 | 0.5 | 66.41
Other | | 0.05729 | | | 1.51
Nlocal: 2.73067e+06 ave 2.73706e+06 max 2.7274e+06 min
Histogram: 2 0 0 0 0 0 0 0 0 1
Nghost: 361536 ave 368174 max 348360 min
Histogram: 1 0 0 0 0 0 0 0 0 2
Neighs: 0 ave 0 max 0 min
Histogram: 3 0 0 0 0 0 0 0 0 0
FullNghs: 2.05098e+08 ave 2.05575e+08 max 2.04853e+08 min
Histogram: 2 0 0 0 0 0 0 0 0 1
Total # of neighbors = 6.1529347e+08
Ave neighs/atom = 75.109066
Neighbor list builds = 5
Dangerous builds not checked
Total wall time: 0:00:08