LAMMPS
======

Large-scale Atomic/Molecular Massively Parallel Simulator (LAMMPS) is a molecular dynamics code from Sandia National Laboratories.
LAMMPS uses hybrid parallelization techniques: multicore parallelism with OpenMP, the Message Passing Interface (MPI) for distributed parallel communication, and accelerators such as GPUs.
The code is free and open-source software, distributed under the terms of the GNU General Public License.

In molecular dynamics the forces that act on particles have a limited range. For computational efficiency, LAMMPS uses neighbor lists (Verlet lists) to keep track of nearby particles. The lists are optimized for systems with particles that repel at short distances, so that the local density of particles never grows too large.

Another strategy for optimization in large systems is domain decomposition. LAMMPS uses spatial-decomposition techniques to partition the simulation domain into small 3D sub-domains, each of which is assigned to a different process. Processes communicate and store ghost-atom information for atoms that border their sub-domain. LAMMPS is most efficient (in a parallel computing sense) for systems whose particles fill a 3D rectangular box with approximately uniform density.

In this tutorial we will demonstrate how to use a hybrid combination for LAMMPS. We will combine MPI, OpenMP, and GPU acceleration for a simple benchmark case.

Creating the input file
------------------------

Select a good location for executing the simulation. For example use ``$SCRATCH/LAMMPS``::

   $> mkdir $SCRATCH/LAMMPS
   $> cd $SCRATCH/LAMMPS

Download the example from the LAMMPS webserver::

   $> wget https://lammps.sandia.gov/inputs/in.lj.txt

The example is very simple; alternatively, you can just copy and paste the input file below::

   # 3d Lennard-Jones melt

   variable     x index 1
   variable     y index 1
   variable     z index 1

   variable     xx equal 20*$x
   variable     yy equal 20*$y
   variable     zz equal 20*$z

   units        lj
   atom_style   atomic

   lattice      fcc 0.8442
   region       box block 0 ${xx} 0 ${yy} 0 ${zz}
   create_box   1 box
   create_atoms 1 box
   mass         1 1.0

   velocity     all create 1.44 87287 loop geom

   pair_style   lj/cut 2.5
   pair_coeff   1 1 1.0 1.0 2.5

   neighbor     0.3 bin
   neigh_modify delay 0 every 20 check no

   fix          1 all nve

   run          100
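As a quick sanity check on the size of the benchmark: the region spans ``20*x`` by ``20*y`` by ``20*z`` fcc lattice cells, and an fcc cell contains 4 atoms, so the atom count is easy to predict. The arithmetic below is only an illustration; the values 8, 4, 8 anticipate the ``-var`` overrides we will pass on the command line later::

   # atoms = 4 atoms per fcc cell * (20*x) * (20*y) * (20*z) cells
   $> echo $(( 4 * (20*1) * (20*1) * (20*1) ))   # default x=y=z=1            ->   32000 atoms
   $> echo $(( 4 * (20*8) * (20*4) * (20*8) ))   # -var x 8 -var y 4 -var z 8 -> 8192000 atoms

The second number matches the ``Created 8192000 atoms`` line in the output shown further below.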
Requesting an interactive job
-----------------------------

For the purpose of this example we will use an interactive job. Further below we will show how to achieve the same result via a batch job.

We start by requesting an entire GPU compute node. On Thorny Flat, GPU nodes have 3 GPU cards and 24 CPU cores. That is what we are requesting here, for a job with a wall time of 2 hours::

   $> srun -p comm_gpu_inter -G 3 -t 2:00:00 -c 24 --pty bash

After a few seconds we are redirected to one of the GPU compute nodes. We can check which GPU cards were assigned to us with the command ``nvidia-smi``::

   tcogq003:~/scratch/LAMMPS$ nvidia-smi
   Wed May  3 18:37:41 2023
   +---------------------------------------------------------------------------------------+
   | NVIDIA-SMI 530.30.02              Driver Version: 530.30.02    CUDA Version: 12.1     |
   |-----------------------------------------+----------------------+----------------------+
   | GPU  Name                  Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
   | Fan  Temp  Perf            Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
   |                                         |                      |               MIG M. |
   |=========================================+======================+======================|
   |   0  Quadro P6000                    Off| 00000000:37:00.0 Off |                  Off |
   | 26%   18C    P8                8W / 250W|      0MiB / 24576MiB |      0%      Default |
   |                                         |                      |                  N/A |
   +-----------------------------------------+----------------------+----------------------+
   |   1  Quadro P6000                    Off| 00000000:AF:00.0 Off |                  Off |
   | 26%   19C    P8                8W / 250W|      0MiB / 24576MiB |      0%      Default |
   |                                         |                      |                  N/A |
   +-----------------------------------------+----------------------+----------------------+
   |   2  Quadro P6000                    Off| 00000000:D8:00.0 Off |                  Off |
   | 26%   21C    P8                8W / 250W|      0MiB / 24576MiB |      0%      Default |
   |                                         |                      |                  N/A |
   +-----------------------------------------+----------------------+----------------------+

   +---------------------------------------------------------------------------------------+
   | Processes:                                                                             |
   |  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
   |        ID   ID                                                             Usage      |
   |=======================================================================================|
   |  No running processes found                                                            |
   +---------------------------------------------------------------------------------------+

We can see 3 GPU cards. They are **NVIDIA Quadro P6000**, which are our older cards. It is important to verify this because it determines which capabilities are available on the cards and the corresponding build of LAMMPS that we can use.

Our next step is to load Singularity, as the LAMMPS we will use is a container image optimized for GPUs. Execute this command::

   tcogq003:~/scratch/LAMMPS$ module load singularity

Now we are ready to enter the filesystem of the container. The container we will use is ``NGC_LAMMPS_patch_3Nov2022.sif``, located in ``/shared/containers``.

Get a shell inside the container with the command::

   tcogq003:~/scratch/LAMMPS$ singularity shell --nv /shared/containers/NGC_LAMMPS_patch_3Nov2022.sif

Remember to use the argument ``--nv`` to access the GPUs inside the container. It is a good practice to double-check that such is the case::

   Singularity> nvidia-smi
   Wed May  3 18:48:19 2023
   +---------------------------------------------------------------------------------------+
   | NVIDIA-SMI 530.30.02              Driver Version: 530.30.02    CUDA Version: 12.1     |
   |-----------------------------------------+----------------------+----------------------+
   | GPU  Name                  Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
   | Fan  Temp  Perf            Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
   |                                         |                      |               MIG M. |
   |=========================================+======================+======================|
   |   0  Quadro P6000                    Off| 00000000:37:00.0 Off |                  Off |
   | 26%   18C    P8                8W / 250W|      0MiB / 24576MiB |      0%      Default |
   |                                         |                      |                  N/A |
   +-----------------------------------------+----------------------+----------------------+
   |   1  Quadro P6000                    Off| 00000000:AF:00.0 Off |                  Off |
   | 26%   19C    P8                8W / 250W|      0MiB / 24576MiB |      0%      Default |
   |                                         |                      |                  N/A |
   +-----------------------------------------+----------------------+----------------------+
   |   2  Quadro P6000                    Off| 00000000:D8:00.0 Off |                  Off |
   | 26%   21C    P8                8W / 250W|      0MiB / 24576MiB |      0%      Default |
   |                                         |                      |                  N/A |
   +-----------------------------------------+----------------------+----------------------+

   +---------------------------------------------------------------------------------------+
   | Processes:                                                                             |
   |  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
   |        ID   ID                                                             Usage      |
   |=======================================================================================|
   |  No running processes found                                                            |
   +---------------------------------------------------------------------------------------+
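The compute capability of these cards is what will drive the choice of LAMMPS build in the next section. As an optional check, recent versions of ``nvidia-smi`` can report it directly; the ``compute_cap`` query field may not exist in older drivers, in which case simply rely on the table below. On these cards it should report 6.1, consistent with the Quadro P6000::

   Singularity> nvidia-smi --query-gpu=name,compute_cap --format=csv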
We can now prepare the variables to execute LAMMPS.

The first variable is ``OMP_NUM_THREADS``. This variable controls the number of OpenMP threads that a process can create. We have 3 GPU cards and 24 CPU cores. As we will see in a minute, a good distribution is to assign one GPU to each MPI process and use OpenMP threads to take advantage of the extra cores: with 3 MPI processes, 24 / 3 = 8 cores are left for each one. Most of the processing takes place on the GPU, so this setting is not critical, but it is a good way to maximize the use of resources. Execute this command::

   export OMP_NUM_THREADS=8

The next command depends on the hardware capabilities of the GPU we are using. The NVIDIA Quadro P6000 supports the GPU architecture (gencode) sm_61.

The container offers several builds of LAMMPS::

   Singularity> ls /usr/local/lammps/
   sm60  sm70  sm75  sm80  sm86  sm90

The selection of which build to use must be consistent with the highest gencode supported by the hardware. This table can guide you in selecting the appropriate gencode.

.. list-table:: CUDA gencodes per GPU architecture
   :widths: 11 11 11 11 11 11 11 11 12
   :header-rows: 1

   * - Fermi
     - Kepler
     - Maxwell
     - Pascal
     - Volta
     - Turing
     - Ampere
     - Ada (Lovelace)
     - Hopper
   * - sm_20
     - sm_30
     - sm_50
     - sm_60
     - sm_70
     - sm_75
     - sm_80
     - sm_89
     - sm_90
   * -
     - sm_35
     - sm_52
     - sm_61
     - sm_72 (Xavier)
     -
     - sm_86
     -
     - sm_90a (Thor)
   * -
     - sm_37
     - sm_53
     - sm_62
     -
     -
     - sm_87 (Orin)
     -
     -

What matters to us is that, among the builds offered by the container, the P6000 will not run a gencode other than ``sm60``. We will set the variable ``LD_LIBRARY_PATH`` so that the LAMMPS libraries are searched for in the right location. Execute::

   export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/lammps/sm60/lib

We are now ready to execute LAMMPS. The following command will use 3 MPI processes and the 3 GPU cards to run the input file above::

   Singularity> mpirun -n 3 /usr/local/lammps/sm60/bin/lmp -k on g 3 -sf kk -pk kokkos cuda/aware on neigh full comm device binsize 2.8 -var x 8 -var y 4 -var z 8 -in in.lj.txt
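The launch line is dense, so here is the same command again with the intent of each option noted, roughly, in comments. Nothing in the command itself changes; the comments only summarize the LAMMPS command-line and Kokkos package options::

   # -k on g 3  : turn the KOKKOS package on and use 3 GPUs per node
   # -sf kk     : apply the "kk" suffix, i.e. select the Kokkos variants of the styles
   # -pk kokkos cuda/aware on neigh full comm device binsize 2.8 :
   #              Kokkos package options (GPU-aware MPI, full neighbor lists,
   #              pack/unpack communication on the device, neighbor bin size 2.8)
   # -var x 8 -var y 4 -var z 8 : override the index variables of the input script
   # -in in.lj.txt              : the input file created above
   mpirun -n 3 /usr/local/lammps/sm60/bin/lmp \
       -k on g 3 -sf kk \
       -pk kokkos cuda/aware on neigh full comm device binsize 2.8 \
       -var x 8 -var y 4 -var z 8 -in in.lj.txt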
The code runs in just a few seconds. The output looks like this::

   LAMMPS (3 Nov 2022)
   KOKKOS mode is enabled (src/KOKKOS/kokkos.cpp:106)
     will use up to 3 GPU(s) per node
   Kokkos::Cuda::initialize WARNING: running kernels compiled for compute capability 6.0 on device with compute capability 6.1 , this will likely reduce potential performance.
   Kokkos::Cuda::initialize WARNING: running kernels compiled for compute capability 6.0 on device with compute capability 6.1 , this will likely reduce potential performance.
     using 1 OpenMP thread(s) per MPI task
   Kokkos::Cuda::initialize WARNING: running kernels compiled for compute capability 6.0 on device with compute capability 6.1 , this will likely reduce potential performance.
   Kokkos::Cuda::initialize WARNING: running kernels compiled for compute capability 6.0 on device with compute capability 6.1 , this will likely reduce potential performance.
   Kokkos::Cuda::initialize WARNING: running kernels compiled for compute capability 6.0 on device with compute capability 6.1 , this will likely reduce potential performance.
   Kokkos::Cuda::initialize WARNING: running kernels compiled for compute capability 6.0 on device with compute capability 6.1 , this will likely reduce potential performance.
   Lattice spacing in x,y,z = 1.6795962 1.6795962 1.6795962
   Created orthogonal box = (0 0 0) to (268.73539 134.3677 268.73539)
     1 by 1 by 3 MPI processor grid
   Created 8192000 atoms
     using lattice units in orthogonal box = (0 0 0) to (268.73539 134.3677 268.73539)
     create_atoms CPU = 0.964 seconds
   Generated 0 of 0 mixed pair_coeff terms from geometric mixing rule
   Neighbor list info ...
     update: every = 20 steps, delay = 0 steps, check = no
     max neighbors/atom: 2000, page size: 100000
     master list distance cutoff = 2.8
     ghost atom cutoff = 2.8
     binsize = 2.8, bins = 96 48 96
     1 neighbor lists, perpetual/occasional/extra = 1 0 0
     (1) pair lj/cut/kk, perpetual
         attributes: full, newton off, kokkos_device
         pair build: full/bin/kk/device
         stencil: full/bin/3d
         bin: kk/device
   Setting up Verlet run ...
     Unit style    : lj
     Current step  : 0
     Time step     : 0.005
   Per MPI rank memory allocation (min/avg/max) = 417.8 | 419.8 | 423 Mbytes
      Step          Temp          E_pair         E_mol          TotEng         Press
        0            1.44          -6.7733681     0             -4.6133683     -5.0196694
      100            0.75927734    -5.761232      0             -4.6223161      0.19102612
   Loop time of 3.78732 on 3 procs for 100 steps with 8192000 atoms

   Performance: 11406.477 tau/day, 26.404 timesteps/s, 216.301 Matom-step/s
   64.0% CPU use with 3 MPI tasks x 1 OpenMP threads

   MPI task timing breakdown:
   Section |  min time  |  avg time  |  max time  |%varavg| %total
   ---------------------------------------------------------------
   Pair    | 0.054429   | 0.054921   | 0.055782   |   0.3 |  1.45
   Neigh   | 0.57971    | 0.60578    | 0.63878    |   3.2 | 16.00
   Comm    | 0.51117    | 0.53519    | 0.55821    |   2.6 | 14.13
   Output  | 0.00041937 | 0.00451    | 0.0078808  |   4.6 |  0.12
   Modify  | 2.5216     | 2.5246     | 2.5287     |   0.2 | 66.66
   Other   |            | 0.06236    |            |       |  1.65

   Nlocal:    2.73067e+06 ave 2.73706e+06 max  2.7274e+06 min
   Histogram: 2 0 0 0 0 0 0 0 0 1
   Nghost:         361536 ave      368174 max      348360 min
   Histogram: 1 0 0 0 0 0 0 0 0 2
   Neighs:              0 ave           0 max           0 min
   Histogram: 3 0 0 0 0 0 0 0 0 0
   FullNghs:  2.05098e+08 ave 2.05575e+08 max 2.04853e+08 min
   Histogram: 2 0 0 0 0 0 0 0 0 1

   Total # of neighbors = 6.1529347e+08
   Ave neighs/atom = 75.109066
   Neighbor list builds = 5
   Dangerous builds not checked
   Total wall time: 0:00:08

Executing LAMMPS from a batch job
---------------------------------

We can use all that we learned above to convert the execution into a submission script.

Write a submission script that we will call here ``runjob.slurm``::

   #!/bin/bash

   #SBATCH --job-name=LAMMPS
   #SBATCH -p comm_gpu_inter
   #SBATCH -G 3
   #SBATCH -t 2:00:00
   #SBATCH -c 24

   module purge
   module load singularity

   cd $SLURM_SUBMIT_DIR
   pwd

   singularity exec --nv /shared/containers/NGC_LAMMPS_patch_3Nov2022.sif ./run_lammps.sh

We also need another file that will set the variables and run LAMMPS inside the container. The file is called ``run_lammps.sh`` and its content is::

   #!/bin/bash

   export OMP_NUM_THREADS=8
   export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/lammps/sm60/lib

   mpirun -n 3 /usr/local/lammps/sm60/bin/lmp -k on g 3 -sf kk -pk kokkos cuda/aware on neigh full comm device binsize 2.8 -var x 8 -var y 4 -var z 8 -in in.lj.txt

We need to make this file executable using the command::

   $> chmod +x run_lammps.sh

Now we can submit the job and wait for the result. Submit the job with the command::

   $> sbatch runjob.slurm
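While the job is in the queue or running, the standard Slurm tools can be used to follow it. These commands are optional; the job ID (194660 in this example) is the number printed by ``sbatch`` when the job is submitted::

   $> squeue -u $USER            # is the job pending (PD) or running (R)?
   $> tail -f slurm-194660.out   # follow the output file while the job runs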
We will get two files: ``log.lammps``, which is the traditional log from LAMMPS, and a file that looks like ``slurm-194660.out``, which contains the output from LAMMPS that, instead of being shown on the screen, is stored in a file named with the corresponding JobID.

The file looks similar to the output from our interactive execution::

   tcogq003:~/scratch/LAMMPS$ cat slurm-194660.out

   Removing gcc version 9.3.0 : lang/gcc/9.3.0
   Removing git version 2.29.1 : dev/git/2.29.1
   /gpfs20/scratch/gufranco/LAMMPS
   LAMMPS (3 Nov 2022)
   KOKKOS mode is enabled (src/KOKKOS/kokkos.cpp:106)
     will use up to 3 GPU(s) per node
   Kokkos::Cuda::initialize WARNING: running kernels compiled for compute capability 6.0 on device with compute capability 6.1 , this will likely reduce potential performance.
   Kokkos::Cuda::initialize WARNING: running kernels compiled for compute capability 6.0 on device with compute capability 6.1 , this will likely reduce potential performance.
   Kokkos::Cuda::initialize WARNING: running kernels compiled for compute capability 6.0 on device with compute capability 6.1 , this will likely reduce potential performance.
   Kokkos::Cuda::initialize WARNING: running kernels compiled for compute capability 6.0 on device with compute capability 6.1 , this will likely reduce potential performance.
   Kokkos::Cuda::initialize WARNING: running kernels compiled for compute capability 6.0 on device with compute capability 6.1 , this will likely reduce potential performance.
   Kokkos::Cuda::initialize WARNING: running kernels compiled for compute capability 6.0 on device with compute capability 6.1 , this will likely reduce potential performance.
     using 1 OpenMP thread(s) per MPI task
   Lattice spacing in x,y,z = 1.6795962 1.6795962 1.6795962
   Created orthogonal box = (0 0 0) to (268.73539 134.3677 268.73539)
     1 by 1 by 3 MPI processor grid
   Created 8192000 atoms
     using lattice units in orthogonal box = (0 0 0) to (268.73539 134.3677 268.73539)
     create_atoms CPU = 0.963 seconds
   Generated 0 of 0 mixed pair_coeff terms from geometric mixing rule
   Neighbor list info ...
     update: every = 20 steps, delay = 0 steps, check = no
     max neighbors/atom: 2000, page size: 100000
     master list distance cutoff = 2.8
     ghost atom cutoff = 2.8
     binsize = 2.8, bins = 96 48 96
     1 neighbor lists, perpetual/occasional/extra = 1 0 0
     (1) pair lj/cut/kk, perpetual
         attributes: full, newton off, kokkos_device
         pair build: full/bin/kk/device
         stencil: full/bin/3d
         bin: kk/device
   Setting up Verlet run ...
     Unit style    : lj
     Current step  : 0
     Time step     : 0.005
   Per MPI rank memory allocation (min/avg/max) = 417.8 | 419.8 | 423 Mbytes
      Step          Temp          E_pair         E_mol          TotEng         Press
        0            1.44          -6.7733681     0             -4.6133683     -5.0196694
      100            0.75927734    -5.761232      0             -4.6223161      0.19102612
   Loop time of 3.78997 on 3 procs for 100 steps with 8192000 atoms

   Performance: 11398.509 tau/day, 26.385 timesteps/s, 216.149 Matom-step/s
   63.1% CPU use with 3 MPI tasks x 1 OpenMP threads

   MPI task timing breakdown:
   Section |  min time  |  avg time  |  max time  |%varavg| %total
   ---------------------------------------------------------------
   Pair    | 0.054214   | 0.054977   | 0.056248   |   0.4 |  1.45
   Neigh   | 0.57713    | 0.60327    | 0.63866    |   3.3 | 15.92
   Comm    | 0.50607    | 0.54467    | 0.57153    |   3.8 | 14.37
   Output  | 0.00047028 | 0.012894   | 0.03579    |  14.3 |  0.34
   Modify  | 2.5112     | 2.5169     | 2.5282     |   0.5 | 66.41
   Other   |            | 0.05729    |            |       |  1.51

   Nlocal:    2.73067e+06 ave 2.73706e+06 max  2.7274e+06 min
   Histogram: 2 0 0 0 0 0 0 0 0 1
   Nghost:         361536 ave      368174 max      348360 min
   Histogram: 1 0 0 0 0 0 0 0 0 2
   Neighs:              0 ave           0 max           0 min
   Histogram: 3 0 0 0 0 0 0 0 0 0
   FullNghs:  2.05098e+08 ave 2.05575e+08 max 2.04853e+08 min
   Histogram: 2 0 0 0 0 0 0 0 0 1

   Total # of neighbors = 6.1529347e+08
   Ave neighs/atom = 75.109066
   Neighbor list builds = 5
   Dangerous builds not checked
   Total wall time: 0:00:08
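To wrap up, the headline figures can be pulled out of the batch output with a simple search; the same strings appear in the interactive run above, so the two executions are easy to compare::

   $> grep -E "Performance:|Total wall time" slurm-194660.out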