NAMD

Nanoscale Molecular Dynamics (NAMD, formerly Not Another Molecular Dynamics Program) is computer software for molecular dynamics simulation, written using the Charm++ parallel programming model. It is noted for its parallel efficiency and is often used to simulate large systems (millions of atoms). It is developed collaboratively by the Theoretical and Computational Biophysics Group (TCB) and the Parallel Programming Laboratory (PPL) at the University of Illinois at Urbana–Champaign.

In this tutorial we will demonstrate the very basics of running NAMD on our clusters. NAMD supports several parallel computing paradigms, from multithreading on a single node to distributed-memory execution across many nodes, as well as GPU acceleration.

The purpose of this tutorial is not to teach the chemistry or the parameters best suited for a given problem; we will just show how to run NAMD jobs on our clusters. NAMD offers a user guide that explains how to prepare simulations from the chemical point of view:

https://www.ks.uiuc.edu/Research/namd/2.13/ug/

The tutorial will use a couple of examples from the set of benchmarks available for NAMD:

https://www.ks.uiuc.edu/Research/namd/utilities/apoa1.tar.gz

https://www.ks.uiuc.edu/Research/namd/utilities/stmv.tar.gz

Downloading the input files

Select a good location for downloading the inputs and executing the simulations. For example, use $SCRATCH/NAMD:

mkdir $SCRATCH/NAMD
cd $SCRATCH/NAMD

Download the two examples from the NAMD UIUC webserver:

wget https://www.ks.uiuc.edu/Research/namd/utilities/apoa1.tar.gz
wget https://www.ks.uiuc.edu/Research/namd/utilities/stmv.tar.gz

The examples are compressed; execute these commands to uncompress them in the current folder:

tar -zxvf apoa1.tar.gz
tar -zxvf stmv.tar.gz
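
You can verify the extraction with a quick listing; each folder should contain a NAMD configuration file (for example apoa1.namd) together with the structure and parameter files it references:

ls apoa1 stmv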

Preparing the submission script

We will use apoa1 to demonstrate how to create a submission script. ApoA1 has about 92,000 atoms and should run efficiently on a single node.

First, go to the recently created folder apoa1 and start creating a file runjob.slurm with your text editor of choice:

$> cd apoa1
$> vim runjob.slurm

The content of the submission script could be like this:

#!/bin/bash

#SBATCH --job-name=NAMD
#SBATCH -N 1
#SBATCH -n 2
#SBATCH -p standby

module purge
module load atomistic/namd/2.13_multicore

cd $SLURM_SUBMIT_DIR
namd2 +p2 +setcpuaffinity apoa1.namd

The first line is called a shebang and indicates that the script is written in bash, one of several shell interpreters available on Linux. For this simple example there is nothing specific to bash, so other interpreters (csh, ksh, dash) would also work, as would the system /bin/sh, which in the case of Spruce is a symbolic link to bash itself.

The next 4 lines start with #SBATCH. Those are comments as far as bash is concerned, but they are interpreted by sbatch, and they are important for requesting resources and configuring several aspects of non-interactive jobs.

The first directive sets the name of the job, “NAMD”, an arbitrary string used to identify it. The second and third request resources: one node (-N 1) and two tasks (-n 2), that is, 2 cores on a single node. The fourth selects the partition (queue) where the job will execute; “standby” is actually the default, so the line is not necessary here, but it is important to include it if you plan to run on a different partition. Note that SLURM writes standard output and standard error to the same file by default, so no extra directive is needed to merge them; NAMD almost never writes to standard error, so you will not be left with separate empty error files after each simulation.
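
If a job needs more resources, the same directives scale up. A minimal sketch of a larger request (the core count and explicit time limit below are illustrative placeholders, and partition names vary between clusters):

#SBATCH --job-name=NAMD
#SBATCH -N 1
#SBATCH -n 16
#SBATCH -t 04:00:00
#SBATCH -p standby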

The next section cleans the list of loaded modules and loads the corresponding module for NAMD. In this case we are using atomistic/namd/2.13_multicore, which is intended only for executions on a single node.
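
If you are not sure which NAMD builds are installed, you can list them before loading one (this assumes the atomistic/namd module prefix used above; names are site-specific):

module avail atomistic/namd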

The next line changes the working directory to the place where the submission script and inputs are located, which is also the directory from which we will submit the job.

Finally, the command for NAMD: namd2 +p2 +setcpuaffinity apoa1.namd. The +p2 option tells NAMD to use 2 cores, and +setcpuaffinity pins the threads so that both cores sit on the same socket, which in some cases improves the usage of the shared Level 3 cache. The input file for NAMD is apoa1.namd, and that completes the submission script.
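
To avoid keeping the core count in two places, you can derive it from the environment instead of hard-coding +p2; a small sketch using the standard SLURM_NTASKS variable, which matches the -n value requested above:

namd2 +p${SLURM_NTASKS} +setcpuaffinity apoa1.namd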

Submitting the job

Submit the job with the following command:

$> sbatch runjob.slurm

You should get a line indicating the JobID; I got this:

Submitted batch job 4731853

You can use the number to monitor the state of the job.

For example, executing:

$> squeue -u $USER

lists your queued and running jobs. Our clusters also provide Moab's showq command, which produced the summary below:

active jobs------------------------
JOBID              USERNAME      STATE PROCS   REMAINING            STARTTIME

4731853            <username>    Running     2     3:59:35  Fri Nov  1 13:14:12

1 active job             2 of 3360 processors in use by local jobs (0.06%)
                          1 of 175 nodes active      (0.57%)

The job is now running and has received 4 hours to complete, which is the walltime limit on the standby queue.

Both squeue and showq let you check whether your jobs are still waiting in the queue or already running.
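
For a single job you can also query the scheduler directly. A sketch using standard SLURM commands (the job ID is the one returned by sbatch, and sacct only works if accounting is enabled on the cluster):

scontrol show job 4731853
sacct -j 4731853 --format=JobID,State,Elapsed,MaxRSS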

This is a small job that completes in minutes. Once the job is complete you will see new files in the folder from which you submitted it.

The output file, named slurm-<JobID>.out by default, will contain all the standard output and error produced by NAMD during execution. For the job above, my file is slurm-4731853.out.

Read the file with care, paying attention to warnings and errors. For the sake of simplicity we will just show the last lines:

ENERGY:     499     20158.0682     19847.7298      5722.4089       178.6361        -337058.4824     23219.2403         0.0000         0.0000     45439.2653        -222493.1338       165.2956   -267932.3991   -222047.2061       165.2956          -2020.8690     -2376.3392    921491.4634     -2020.8690     -2376.3392

Info: Benchmark time: 2 CPUs 0.384412 s/step 4.44921 days/ns 542 MB memory
TIMING: 500  CPU: 195.068, 0.381542/step  Wall: 195.075, 0.381408/step, 0 hours remaining, 542.000000 MB of memory in use.
ETITLE:      TS           BOND          ANGLE          DIHED          IMPRP               ELECT            VDW       BOUNDARY           MISC        KINETIC               TOTAL           TEMP      POTENTIAL         TOTAL3        TEMPAVG            PRESSURE      GPRESSURE         VOLUME       PRESSAVG      GPRESSAVG

ENERGY:     500     20974.8941     19756.6569      5724.4523       179.8271        -337741.4189     23251.1007         0.0000         0.0000     45359.0789        -222495.4088       165.0039   -267854.4877   -222061.0906       165.0039          -3197.5170     -2425.4142    921491.4634     -3197.5170     -2425.4142

WRITING EXTENDED SYSTEM TO OUTPUT FILE AT STEP 500
WRITING COORDINATES TO OUTPUT FILE AT STEP 500
The last position output (seq=-2) takes 0.002 seconds, 542.000 MB of memory in use
WRITING VELOCITIES TO OUTPUT FILE AT STEP 500
The last velocity output (seq=-2) takes 0.002 seconds, 542.000 MB of memory in use
====================================================

WallClock: 197.820267  CPUTime: 196.267166  Memory: 542.000000 MB
[Partition 0][Node 0] End of program

This information is useful for adjusting your resource requests for additional, similar jobs. In particular, this job took a little over 3 minutes of wall time (about 198 seconds) and used roughly 540 MB of memory.
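
For long runs you rarely want to read the whole log; the performance summary can be extracted directly. A small sketch, assuming the default SLURM output naming used above:

grep "Benchmark time" slurm-4731853.out
grep "WallClock" slurm-4731853.out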

Running NAMD on GPUs

Now we will demonstrate the changes needed to run with GPUs. Go one folder up, where the tar.gz files and the apoa1 folder are located:

cd $SCRATCH/NAMD

Make a copy of the apoa1 folder; we will introduce a few changes to the submission script:

cp -r apoa1 apoa1-CUDA
cd apoa1-CUDA
rm -f slurm-*.out

Edit the submission script runjob.slurm like this:

#!/bin/bash

#SBATCH --job-name=NAMD
#SBATCH -N 1
#SBATCH -n 2
#SBATCH -p comm_gpu

# Depending on the cluster configuration you may also need an explicit
# GPU request here, for example: #SBATCH --gres=gpu:1

module purge
module load atomistic/namd/2.13_multicore-CUDA

nvidia-smi

cd $SLURM_SUBMIT_DIR

namd2 +p2 +setcpuaffinity apoa1.namd

We are now using comm_gpu as the partition, which will place the job on machines with GPUs, and we load the NAMD module built with GPU (CUDA) support.
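
By default NAMD uses every GPU it can see; the log below even prints "Did not find +devices i,j,k,... argument, using all". If you want to restrict the run to particular devices, pass +devices explicitly in the last line of the script, for example:

namd2 +p2 +setcpuaffinity +devices 0 apoa1.namd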

Submit the job as usual:

$> sbatch runjob.slurm

After completion you will see an output file slurm-<JobID>.out with content like this:

Fri Nov  1 13:36:10 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 396.26                 Driver Version: 396.26                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K20m          Off  | 00000000:08:00.0 Off |                    0 |
| N/A   27C    P0    49W / 225W |      0MiB /  4743MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla K20m          Off  | 00000000:24:00.0 Off |                    0 |
| N/A   41C    P0    49W / 225W |      0MiB /  4743MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla K20m          Off  | 00000000:27:00.0 Off |                    0 |
| N/A   31C    P0    50W / 225W |      0MiB /  4743MiB |     94%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
Charm++: standalone mode (not using charmrun)
Charm++> Running in Multicore mode:  2 threads
Charm++> Using recursive bisection (scheme 3) for topology aware partitions
Converse/Charm++ Commit ID: v6.8.2-0-g26d4bd8-namd-charm-6.8.2-build-2018-Jan-11-30463
Warning> Randomization of virtual memory (ASLR) is turned on in the kernel, thread migration may not work! Run 'echo 0 > /proc/sys/kernel/randomize_va_space' as root to disable it, or try running with '+isomalloc_sync'.
CharmLB> Load balancer assumes all CPUs are same.
Charm++> cpu affinity enabled.
Charm++> Running on 1 unique compute nodes (16-way SMP).
Charm++> cpu topology info is gathered in 0.001 seconds.
Info: Built with CUDA version 9010
Did not find +devices i,j,k,... argument, using all
Pe 1 physical rank 1 binding to CUDA device 1 on salg0001.hpc.wvu.edu: 'Tesla K20m'  Mem: 4743MB  Rev: 3.5  PCI: 0:24:0
Pe 0 physical rank 0 binding to CUDA device 0 on salg0001.hpc.wvu.edu: 'Tesla K20m'  Mem: 4743MB  Rev: 3.5  PCI: 0:8:0
Info: NAMD 2.13 for Linux-x86_64-multicore-CUDA
Info:
...
...
...
ENERGY:     499     20158.0880     19847.6155      5722.4116       178.6402        -337057.2604     23219.0226         0.0000         0.0000     45438.8148        -222492.6677       165.2939   -267931.4825   -222046.7421       165.2939          -2020.7666     -2376.2036    921491.4634     -2020.7666     -2376.2036

Warning: Energy evaluation is expensive, increase outputEnergies to improve performance.
Info: Benchmark time: 2 CPUs 0.0134606 s/step 0.155794 days/ns 685.648 MB memory
TIMING: 500  CPU: 6.86896, 0.0139479/step  Wall: 6.88698, 0.013933/step, 0 hours remaining, 685.648438 MB of memory in use.
ETITLE:      TS           BOND          ANGLE          DIHED          IMPRP               ELECT            VDW       BOUNDARY           MISC        KINETIC               TOTAL           TEMP      POTENTIAL         TOTAL3        TEMPAVG            PRESSURE      GPRESSURE         VOLUME       PRESSAVG      GPRESSAVG

ENERGY:     500     20974.9322     19756.5540      5724.4551       179.8309        -337740.2041     23250.9041         0.0000         0.0000     45358.6460        -222494.8818       165.0023   -267853.5278   -222060.5633       165.0023          -3197.5089     -2425.3239    921491.4634     -3197.5089     -2425.3239

WRITING EXTENDED SYSTEM TO OUTPUT FILE AT STEP 500
WRITING COORDINATES TO OUTPUT FILE AT STEP 500
The last position output (seq=-2) takes 0.002 seconds, 701.625 MB of memory in use
WRITING VELOCITIES TO OUTPUT FILE AT STEP 500
The last velocity output (seq=-2) takes 0.002 seconds, 701.641 MB of memory in use
====================================================

WallClock: 9.114665  CPUTime: 8.700677  Memory: 701.644531 MB
[Partition 0][Node 0] End of program

Notice the reduction in execution time when using the GPUs: the same 500-step run that took about 198 seconds on 2 CPU cores finished in about 9 seconds with GPU acceleration, a speedup of more than 20x.
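
A quick way to compare the two runs side by side is to extract the benchmark lines from both logs (the file names assume the default SLURM naming; substitute your own job IDs):

grep "Benchmark time" $SCRATCH/NAMD/apoa1/slurm-*.out $SCRATCH/NAMD/apoa1-CUDA/slurm-*.out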