LAMMPS
======

Large-scale Atomic/Molecular Massively Parallel Simulator (LAMMPS) is a molecular dynamics code from Sandia National Laboratories.
LAMMPS uses hybrid parallelization techniques: multicore parallelism with OpenMP, the Message Passing Interface (MPI) for distributed parallel communication, and accelerators such as GPUs.
The code is free and open-source software, distributed under the terms of the GNU General Public License.

In molecular dynamics the forces that act on particles have a limited range. For computational efficiency, LAMMPS uses neighbor lists (Verlet lists) to keep track of nearby particles. The lists are optimized for systems with particles that repel at short distances, so that the local density of particles never grows too large.

Another strategy for optimization in large systems is domain decomposition. LAMMPS uses spatial-decomposition techniques to partition the simulation domain into small 3D sub-domains, each of which is assigned to a different process. Processes communicate and store ghost-atom information for atoms that border their sub-domain. LAMMPS is most efficient (in a parallel computing sense) for systems whose particles fill a 3D rectangular box with approximately uniform density.

In this tutorial we will demonstrate how to use a hybrid combination for LAMMPS. We will combine MPI, OpenMP, and GPU acceleration for a simple benchmark case.

Creating the input file
------------------------

Select a good location for executing the simulation. For example use ``$SCRATCH/LAMMPS``::

   $> mkdir $SCRATCH/LAMMPS
   $> cd $SCRATCH/LAMMPS

Download the example from the LAMMPS webserver::

   $> wget https://lammps.sandia.gov/inputs/in.lj.txt

The example is very simple; alternatively, you can just copy and paste the input file below::

   # 3d Lennard-Jones melt

   variable     x index 1
   variable     y index 1
   variable     z index 1

   variable     xx equal 20*$x
   variable     yy equal 20*$y
   variable     zz equal 20*$z

   units        lj
   atom_style   atomic

   lattice      fcc 0.8442
   region       box block 0 ${xx} 0 ${yy} 0 ${zz}
   create_box   1 box
   create_atoms 1 box
   mass         1 1.0

   velocity     all create 1.44 87287 loop geom

   pair_style   lj/cut 2.5
   pair_coeff   1 1 1.0 1.0 2.5

   neighbor     0.3 bin
   neigh_modify delay 0 every 20 check no

   fix          1 all nve

   run          100
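As a quick sanity check on the size of the benchmark: the region spans ``20*x`` by ``20*y`` by ``20*z`` fcc lattice cells, and an fcc cell contains 4 atoms, so the atom count is easy to predict. The arithmetic below is only an illustration; the values 8, 4, 8 anticipate the ``-var`` overrides we will pass on the command line later::

   # atoms = 4 atoms per fcc cell * (20*x) * (20*y) * (20*z) cells
   $> echo $(( 4 * (20*1) * (20*1) * (20*1) ))   # default x=y=z=1            ->   32000 atoms
   $> echo $(( 4 * (20*8) * (20*4) * (20*8) ))   # -var x 8 -var y 4 -var z 8 -> 8192000 atoms

The second number matches the ``Created 8192000 atoms`` line in the output shown further below.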
Requesting an interactive job
-----------------------------

For the purpose of this example we will use an interactive job. Further below we will show how to achieve the same result via a batch job.

We start by requesting an entire GPU compute node. On Thorny Flat, GPU nodes have 3 GPU cards and 24 CPU cores. That is what we are requesting here, for a job with a wall time of 2 hours::

   $> srun -p comm_gpu_inter -G 3 -t 2:00:00 -c 24 --pty bash

After a few seconds we are redirected to one of the GPU compute nodes. We can check which GPU cards were assigned to us with the command ``nvidia-smi``::

   tcogq003:~/scratch/LAMMPS$ nvidia-smi
   Wed May  3 18:37:41 2023
   +---------------------------------------------------------------------------------------+
   | NVIDIA-SMI 530.30.02              Driver Version: 530.30.02    CUDA Version: 12.1     |
   |-----------------------------------------+----------------------+----------------------+
   | GPU  Name                  Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
   | Fan  Temp  Perf            Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
   |                                         |                      |               MIG M. |
   |=========================================+======================+======================|
   |   0  Quadro P6000                    Off| 00000000:37:00.0 Off |                  Off |
   | 26%   18C    P8                8W / 250W|      0MiB / 24576MiB |      0%      Default |
   |                                         |                      |                  N/A |
   +-----------------------------------------+----------------------+----------------------+
   |   1  Quadro P6000                    Off| 00000000:AF:00.0 Off |                  Off |
   | 26%   19C    P8                8W / 250W|      0MiB / 24576MiB |      0%      Default |
   |                                         |                      |                  N/A |
   +-----------------------------------------+----------------------+----------------------+
   |   2  Quadro P6000                    Off| 00000000:D8:00.0 Off |                  Off |
   | 26%   21C    P8                8W / 250W|      0MiB / 24576MiB |      0%      Default |
   |                                         |                      |                  N/A |
   +-----------------------------------------+----------------------+----------------------+

   +---------------------------------------------------------------------------------------+
   | Processes:                                                                             |
   |  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
   |        ID   ID                                                             Usage      |
   |=======================================================================================|
   |  No running processes found                                                            |
   +---------------------------------------------------------------------------------------+

We can see 3 GPU cards. They are **NVIDIA Quadro P6000**, which are our older cards. It is important to verify this because it determines which capabilities are available on the cards and the corresponding build of LAMMPS that we can use.

Our next step is to load Singularity, as the LAMMPS we will use is a container image optimized for GPUs. Execute this command::

   tcogq003:~/scratch/LAMMPS$ module load singularity

Now we are ready to enter the filesystem of the container. The container we will use is ``NGC_LAMMPS_patch_3Nov2022.sif``, located in ``/shared/containers``.

Get a shell inside the container with the command::

   tcogq003:~/scratch/LAMMPS$ singularity shell --nv /shared/containers/NGC_LAMMPS_patch_3Nov2022.sif

Remember to use the argument ``--nv`` to access the GPUs inside the container. It is a good practice to double-check that such is the case::

   Singularity> nvidia-smi
   Wed May  3 18:48:19 2023
   +---------------------------------------------------------------------------------------+
   | NVIDIA-SMI 530.30.02              Driver Version: 530.30.02    CUDA Version: 12.1     |
   |-----------------------------------------+----------------------+----------------------+
   | GPU  Name                  Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
   | Fan  Temp  Perf            Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
   |                                         |                      |               MIG M. |
   |=========================================+======================+======================|
   |   0  Quadro P6000                    Off| 00000000:37:00.0 Off |                  Off |
   | 26%   18C    P8                8W / 250W|      0MiB / 24576MiB |      0%      Default |
   |                                         |                      |                  N/A |
   +-----------------------------------------+----------------------+----------------------+
   |   1  Quadro P6000                    Off| 00000000:AF:00.0 Off |                  Off |
   | 26%   19C    P8                8W / 250W|      0MiB / 24576MiB |      0%      Default |
   |                                         |                      |                  N/A |
   +-----------------------------------------+----------------------+----------------------+
   |   2  Quadro P6000                    Off| 00000000:D8:00.0 Off |                  Off |
   | 26%   21C    P8                8W / 250W|      0MiB / 24576MiB |      0%      Default |
   |                                         |                      |                  N/A |
   +-----------------------------------------+----------------------+----------------------+

   +---------------------------------------------------------------------------------------+
   | Processes:                                                                             |
   |  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
   |        ID   ID                                                             Usage      |
   |=======================================================================================|
   |  No running processes found                                                            |
   +---------------------------------------------------------------------------------------+
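The compute capability of these cards is what will drive the choice of LAMMPS build in the next section. As an optional check, recent versions of ``nvidia-smi`` can report it directly; the ``compute_cap`` query field may not exist in older drivers, in which case simply rely on the table below. On these cards it should report 6.1, consistent with the Quadro P6000::

   Singularity> nvidia-smi --query-gpu=name,compute_cap --format=csv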
We can now prepare the variables to execute LAMMPS.

The first variable is ``OMP_NUM_THREADS``. This variable controls the number of OpenMP threads that a process can create. We have 3 GPU cards and 24 CPU cores. As we will see in a minute, a good distribution is to assign one GPU to each MPI process and use OpenMP threads to take advantage of the extra cores: with 3 MPI processes, 24 / 3 = 8 cores are left for each one. Most of the processing takes place on the GPU, so this setting is not critical, but it is a good way to maximize the use of resources. Execute this command::

   export OMP_NUM_THREADS=8

The next command depends on the hardware capabilities of the GPU we are using. The NVIDIA Quadro P6000 supports the GPU architecture (gencode) sm_61.

The container offers several builds of LAMMPS::

   Singularity> ls /usr/local/lammps/
   sm60  sm70  sm75  sm80  sm86  sm90

The selection of which build to use must be consistent with the highest gencode supported by the hardware. This table can guide you in selecting the appropriate gencode.

.. list-table:: CUDA gencodes per GPU architecture
   :widths: 11 11 11 11 11 11 11 11 12
   :header-rows: 1

   * - Fermi
     - Kepler
     - Maxwell
     - Pascal
     - Volta
     - Turing
     - Ampere
     - Ada (Lovelace)
     - Hopper
   * - sm_20
     - sm_30
     - sm_50
     - sm_60
     - sm_70
     - sm_75
     - sm_80
     - sm_89
     - sm_90
   * -
     - sm_35
     - sm_52
     - sm_61
     - sm_72 (Xavier)
     -
     - sm_86
     -
     - sm_90a (Thor)
   * -
     - sm_37
     - sm_53
     - sm_62
     -
     -
     - sm_87 (Orin)
     -
     -

What matters to us is that, among the builds offered by the container, the P6000 will not run a gencode other than ``sm60``. We will set the variable ``LD_LIBRARY_PATH`` so that the LAMMPS libraries are searched for in the right location. Execute::

   export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/lammps/sm60/lib

We are now ready to execute LAMMPS. The following command will use 3 MPI processes and the 3 GPU cards to run the input file above::

   Singularity> mpirun -n 3 /usr/local/lammps/sm60/bin/lmp -k on g 3 -sf kk -pk kokkos cuda/aware on neigh full comm device binsize 2.8 -var x 8 -var y 4 -var z 8 -in in.lj.txt
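The launch line is dense, so here is the same command again with the intent of each option noted, roughly, in comments. Nothing in the command itself changes; the comments only summarize the LAMMPS command-line and Kokkos package options::

   # -k on g 3  : turn the KOKKOS package on and use 3 GPUs per node
   # -sf kk     : apply the "kk" suffix, i.e. select the Kokkos variants of the styles
   # -pk kokkos cuda/aware on neigh full comm device binsize 2.8 :
   #              Kokkos package options (GPU-aware MPI, full neighbor lists,
   #              pack/unpack communication on the device, neighbor bin size 2.8)
   # -var x 8 -var y 4 -var z 8 : override the index variables of the input script
   # -in in.lj.txt              : the input file created above
   mpirun -n 3 /usr/local/lammps/sm60/bin/lmp \
       -k on g 3 -sf kk \
       -pk kokkos cuda/aware on neigh full comm device binsize 2.8 \
       -var x 8 -var y 4 -var z 8 -in in.lj.txt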
The code runs in just a few seconds. The output looks like this::

   LAMMPS (3 Nov 2022)
   KOKKOS mode is enabled (src/KOKKOS/kokkos.cpp:106)
     will use up to 3 GPU(s) per node
   Kokkos::Cuda::initialize WARNING: running kernels compiled for compute capability 6.0 on device with compute capability 6.1 , this will likely reduce potential performance.
   Kokkos::Cuda::initialize WARNING: running kernels compiled for compute capability 6.0 on device with compute capability 6.1 , this will likely reduce potential performance.
     using 1 OpenMP thread(s) per MPI task
   Kokkos::Cuda::initialize WARNING: running kernels compiled for compute capability 6.0 on device with compute capability 6.1 , this will likely reduce potential performance.
   Kokkos::Cuda::initialize WARNING: running kernels compiled for compute capability 6.0 on device with compute capability 6.1 , this will likely reduce potential performance.
   Kokkos::Cuda::initialize WARNING: running kernels compiled for compute capability 6.0 on device with compute capability 6.1 , this will likely reduce potential performance.
   Kokkos::Cuda::initialize WARNING: running kernels compiled for compute capability 6.0 on device with compute capability 6.1 , this will likely reduce potential performance.
   Lattice spacing in x,y,z = 1.6795962 1.6795962 1.6795962
   Created orthogonal box = (0 0 0) to (268.73539 134.3677 268.73539)
     1 by 1 by 3 MPI processor grid
   Created 8192000 atoms
     using lattice units in orthogonal box = (0 0 0) to (268.73539 134.3677 268.73539)
     create_atoms CPU = 0.964 seconds
   Generated 0 of 0 mixed pair_coeff terms from geometric mixing rule
   Neighbor list info ...
     update: every = 20 steps, delay = 0 steps, check = no
     max neighbors/atom: 2000, page size: 100000
     master list distance cutoff = 2.8
     ghost atom cutoff = 2.8
     binsize = 2.8, bins = 96 48 96
     1 neighbor lists, perpetual/occasional/extra = 1 0 0
     (1) pair lj/cut/kk, perpetual
         attributes: full, newton off, kokkos_device
         pair build: full/bin/kk/device
         stencil: full/bin/3d
         bin: kk/device
   Setting up Verlet run ...
     Unit style    : lj
     Current step  : 0
     Time step     : 0.005
   Per MPI rank memory allocation (min/avg/max) = 417.8 | 419.8 | 423 Mbytes
      Step          Temp          E_pair         E_mol          TotEng         Press
        0            1.44          -6.7733681     0             -4.6133683     -5.0196694
      100            0.75927734    -5.761232      0             -4.6223161      0.19102612
   Loop time of 3.78732 on 3 procs for 100 steps with 8192000 atoms

   Performance: 11406.477 tau/day, 26.404 timesteps/s, 216.301 Matom-step/s
   64.0% CPU use with 3 MPI tasks x 1 OpenMP threads

   MPI task timing breakdown:
   Section |  min time  |  avg time  |  max time  |%varavg| %total
   ---------------------------------------------------------------
   Pair    | 0.054429   | 0.054921   | 0.055782   |   0.3 |  1.45
   Neigh   | 0.57971    | 0.60578    | 0.63878    |   3.2 | 16.00
   Comm    | 0.51117    | 0.53519    | 0.55821    |   2.6 | 14.13
   Output  | 0.00041937 | 0.00451    | 0.0078808  |   4.6 |  0.12
   Modify  | 2.5216     | 2.5246     | 2.5287     |   0.2 | 66.66
   Other   |            | 0.06236    |            |       |  1.65

   Nlocal:    2.73067e+06 ave 2.73706e+06 max  2.7274e+06 min
   Histogram: 2 0 0 0 0 0 0 0 0 1
   Nghost:         361536 ave      368174 max      348360 min
   Histogram: 1 0 0 0 0 0 0 0 0 2
   Neighs:              0 ave           0 max           0 min
   Histogram: 3 0 0 0 0 0 0 0 0 0
   FullNghs:  2.05098e+08 ave 2.05575e+08 max 2.04853e+08 min
   Histogram: 2 0 0 0 0 0 0 0 0 1

   Total # of neighbors = 6.1529347e+08
   Ave neighs/atom = 75.109066
   Neighbor list builds = 5
   Dangerous builds not checked
   Total wall time: 0:00:08

Executing LAMMPS from a batch job
---------------------------------

We can use all that we learned above to convert the execution into a submission script.

Write a submission script that we will call here ``runjob.slurm``::

   #!/bin/bash

   #SBATCH --job-name=LAMMPS
   #SBATCH -p comm_gpu_inter
   #SBATCH -G 3
   #SBATCH -t 2:00:00
   #SBATCH -c 24

   module purge
   module load singularity

   cd $SLURM_SUBMIT_DIR
   pwd

   singularity exec --nv /shared/containers/NGC_LAMMPS_patch_3Nov2022.sif ./run_lammps.sh

We also need another file that will set the variables and run LAMMPS inside the container. The file is called ``run_lammps.sh`` and its content is::

   #!/bin/bash

   export OMP_NUM_THREADS=8
   export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/lammps/sm60/lib

   mpirun -n 3 /usr/local/lammps/sm60/bin/lmp -k on g 3 -sf kk -pk kokkos cuda/aware on neigh full comm device binsize 2.8 -var x 8 -var y 4 -var z 8 -in in.lj.txt

We need to make this file executable using the command::

   $> chmod +x run_lammps.sh

Now we can submit the job and wait for the result. Submit the job with the command::

   $> sbatch runjob.slurm
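While the job is in the queue or running, the standard Slurm tools can be used to follow it. These commands are optional; the job ID (194660 in this example) is the number printed by ``sbatch`` when the job is submitted::

   $> squeue -u $USER            # is the job pending (PD) or running (R)?
   $> tail -f slurm-194660.out   # follow the output file while the job runs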
We will get two files: ``log.lammps``, which is the traditional log from LAMMPS, and a file that looks like ``slurm-194660.out``, which contains the output from LAMMPS that, instead of being shown on the screen, is stored in a file named with the corresponding JobID.

The file looks similar to the output from our interactive execution::

   tcogq003:~/scratch/LAMMPS$ cat slurm-194660.out

   Removing gcc version 9.3.0 : lang/gcc/9.3.0
   Removing git version 2.29.1 : dev/git/2.29.1
   /gpfs20/scratch/gufranco/LAMMPS
   LAMMPS (3 Nov 2022)
   KOKKOS mode is enabled (src/KOKKOS/kokkos.cpp:106)
     will use up to 3 GPU(s) per node
   Kokkos::Cuda::initialize WARNING: running kernels compiled for compute capability 6.0 on device with compute capability 6.1 , this will likely reduce potential performance.
   Kokkos::Cuda::initialize WARNING: running kernels compiled for compute capability 6.0 on device with compute capability 6.1 , this will likely reduce potential performance.
   Kokkos::Cuda::initialize WARNING: running kernels compiled for compute capability 6.0 on device with compute capability 6.1 , this will likely reduce potential performance.
   Kokkos::Cuda::initialize WARNING: running kernels compiled for compute capability 6.0 on device with compute capability 6.1 , this will likely reduce potential performance.
   Kokkos::Cuda::initialize WARNING: running kernels compiled for compute capability 6.0 on device with compute capability 6.1 , this will likely reduce potential performance.
   Kokkos::Cuda::initialize WARNING: running kernels compiled for compute capability 6.0 on device with compute capability 6.1 , this will likely reduce potential performance.
     using 1 OpenMP thread(s) per MPI task
   Lattice spacing in x,y,z = 1.6795962 1.6795962 1.6795962
   Created orthogonal box = (0 0 0) to (268.73539 134.3677 268.73539)
     1 by 1 by 3 MPI processor grid
   Created 8192000 atoms
     using lattice units in orthogonal box = (0 0 0) to (268.73539 134.3677 268.73539)
     create_atoms CPU = 0.963 seconds
   Generated 0 of 0 mixed pair_coeff terms from geometric mixing rule
   Neighbor list info ...
     update: every = 20 steps, delay = 0 steps, check = no
     max neighbors/atom: 2000, page size: 100000
     master list distance cutoff = 2.8
     ghost atom cutoff = 2.8
     binsize = 2.8, bins = 96 48 96
     1 neighbor lists, perpetual/occasional/extra = 1 0 0
     (1) pair lj/cut/kk, perpetual
         attributes: full, newton off, kokkos_device
         pair build: full/bin/kk/device
         stencil: full/bin/3d
         bin: kk/device
   Setting up Verlet run ...
     Unit style    : lj
     Current step  : 0
     Time step     : 0.005
   Per MPI rank memory allocation (min/avg/max) = 417.8 | 419.8 | 423 Mbytes
      Step          Temp          E_pair         E_mol          TotEng         Press
        0            1.44          -6.7733681     0             -4.6133683     -5.0196694
      100            0.75927734    -5.761232      0             -4.6223161      0.19102612
   Loop time of 3.78997 on 3 procs for 100 steps with 8192000 atoms

   Performance: 11398.509 tau/day, 26.385 timesteps/s, 216.149 Matom-step/s
   63.1% CPU use with 3 MPI tasks x 1 OpenMP threads

   MPI task timing breakdown:
   Section |  min time  |  avg time  |  max time  |%varavg| %total
   ---------------------------------------------------------------
   Pair    | 0.054214   | 0.054977   | 0.056248   |   0.4 |  1.45
   Neigh   | 0.57713    | 0.60327    | 0.63866    |   3.3 | 15.92
   Comm    | 0.50607    | 0.54467    | 0.57153    |   3.8 | 14.37
   Output  | 0.00047028 | 0.012894   | 0.03579    |  14.3 |  0.34
   Modify  | 2.5112     | 2.5169     | 2.5282     |   0.5 | 66.41
   Other   |            | 0.05729    |            |       |  1.51

   Nlocal:    2.73067e+06 ave 2.73706e+06 max  2.7274e+06 min
   Histogram: 2 0 0 0 0 0 0 0 0 1
   Nghost:         361536 ave      368174 max      348360 min
   Histogram: 1 0 0 0 0 0 0 0 0 2
   Neighs:              0 ave           0 max           0 min
   Histogram: 3 0 0 0 0 0 0 0 0 0
   FullNghs:  2.05098e+08 ave 2.05575e+08 max 2.04853e+08 min
   Histogram: 2 0 0 0 0 0 0 0 0 1

   Total # of neighbors = 6.1529347e+08
   Ave neighs/atom = 75.109066
   Neighbor list builds = 5
   Dangerous builds not checked
   Total wall time: 0:00:08
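To wrap up, the headline figures can be pulled out of the batch output with a simple search; the same strings appear in the interactive run above, so the two executions are easy to compare::

   $> grep -E "Performance:|Total wall time" slurm-194660.out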