Bioinformatics: Using Bowtie2

This tutorial shows how to use Job Arrays to manage the alignment of a pair of large sequencing data files.

For the purpose of the tutorial, consider that you receive a pair of files like this:

Estrogen_24_1.clean.fq
Estrogen_24_2.clean.fq

And you have a reference like:

Oncorhynchus_mykiss_chr.fa

This tutorial shows how to run the alignments concurrently by splitting the fq files and using BSseeker2 with bowtie2 as the aligner. When the jobs are done, we use samtools to merge the results into a single BAM file.

1. Transfer the files

The files referred to above are quite large, so the first step is to send them to the cluster. They are too big to store in your home account, so you have to use the scratch space. There are two ways of sending the files to the cluster: using a command line tool like rsync, or using the Globus connector to send them from your computer to the cluster.

If you are using Linux or Mac on your side, the command to send those files is:

rsync -av Estrogen_24_1.clean.fq  Estrogen_24_2.clean.fq Oncorhynchus_mykiss_chr.fa [username]@spruce.hpc.wvu.edu:/scratch/[username]
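Optionally, you can verify that the files arrived intact by comparing checksums on both sides. This is just a sanity check, assuming md5sum is available on both machines; for files this large it can take a while:

md5sum Estrogen_24_1.clean.fq Estrogen_24_2.clean.fq Oncorhynchus_mykiss_chr.fa
ssh [username]@spruce.hpc.wvu.edu 'cd /scratch/[username] && md5sum Estrogen_24_*.clean.fq Oncorhynchus_mykiss_chr.fa'

The checksums printed by the two commands should match.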

Globus is another alternative; follow the instructions in [Transferring_Files].

2. Preparing folders

It is always good practice to keep some structure in the files you use for your research. We suggest creating a few directories: one for the references, one for the large fq sequence files, and ones for output and temporary files.

cd /scratch/[username]
mkdir REFS OUTPUT TMP WGBS

Now we need to move the big fq files into WGBS and the reference into REFS:

mv Estrogen_24_1.clean.fq  Estrogen_24_2.clean.fq WGBS
mv Oncorhynchus_mykiss_chr.fa REFS

3. Splitting the files

Our first two jobs consist of splitting the two big fq files. We can get an idea of how big they are by counting the number of lines:

$ wc -l WGBS/Estrogen_24*.fq
  670407612 WGBS/Estrogen_24_1.clean.fq
  670407612 WGBS/Estrogen_24_2.clean.fq
 1340815224 total

They have the same number of lines, more than 670 million each. It is important to realize that you cannot simply cut these files at arbitrary points: every 4 lines form one record that must not be split apart. For example:

@ST-E00144:275:HW5VJCCXX:7:1101:2564:1327 1:N:0:NTGAGCCA
NGAGAATATTTTTGGTTATTTTTTTTAGTTTTTGATTTGGAAGATTTATTAGATAGAGAATATTGTTGGTTATTTTTTTTAGTTTTTGATTTGGAGGATGTAGTAAATAGATAATATTTTTGTTGTGTATTTTTTTTAGTTTTTGTGTTG
+
#AAAAA7-7A
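Since every record is 4 lines, both the total line count and the chunk size we use below (2,400,000 lines) must be multiples of 4; you can verify this quickly:

$ expr 670407612 % 4
0
$ expr 2400000 % 4
0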
The submission script (runjob_split.pbs) looks like this:

#!/bin/bash

#PBS -N SPLIT
#PBS -l nodes=1:ppn=1
#PBS -l walltime=4:00:00
#PBS -m ae
#PBS -q standby

cd $PBS_O_WORKDIR

mkdir -p WGBS/split
echo $file
split -d -a 3 -l 2400000 WGBS/${file} WGBS/split/${file}_

The submission script creates a job called “SPLIT” using one node (nodes=1) and one process per node (ppn=1). We set a walltime of 4 hours, the limit for the standby queue. We assume the job is submitted from the scratch directory, so the first step is to move into that directory with cd $PBS_O_WORKDIR; next we create a folder for the split files inside WGBS. The split command:

split -d -a 3 -l 2400000 WGBS/${file} WGBS/split/${file}_

The split command creates chunks named with 3-digit suffixes (-a 3), starting from 000, cutting every 2.4 million lines the file indicated by the file variable (which we set at submission time, see below). All the chunks are stored in WGBS/split with the same name as the original plus a suffix: an underscore “_” followed by the 3 digits mentioned before.
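For example, the chunks of the first file will be named:

WGBS/split/Estrogen_24_1.clean.fq_000
WGBS/split/Estrogen_24_1.clean.fq_001
...
WGBS/split/Estrogen_24_1.clean.fq_279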

We submit the jobs like this:

qsub runjob_split.pbs -v file=Estrogen_24_1.clean.fq
qsub runjob_split.pbs -v file=Estrogen_24_2.clean.fq
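You can follow the state of both jobs with qstat:

qstat -u [username]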

Those two jobs take around 30 minutes each on one of our compute nodes. Once the splitting is complete you will see many files populating the WGBS/split folder. Those are the files we will use in the next step.
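Each file yields 280 chunks (670,407,612 / 2,400,000 rounds up to 280, with the last chunk shorter than the rest), so after both jobs finish you should count 560 files:

$ ls WGBS/split | wc -l
560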

4. Using BSseeker2

The webpage of the project is https://github.com/BSSeeker/BSseeker2. Copy the link and download the software; at the time of writing this (April 2017), the latest version was 2.1.8 and the download URL was:

wget https://github.com/BSSeeker/BSseeker2/archive/BSseeker2-v2.1.8.tar.gz

Once you have the code, uncompress it with:

tar -zxvf BSseeker2-v2.1.8.tar.gz

The submission script (runjob.pbs) will look like:

#!/bin/sh

#PBS -N ALIGN
#PBS -l nodes=1:ppn=4
#PBS -l walltime=04:00:00
#PBS -l feature='!smp'
#PBS -m ae
#PBS -q standby
#PBS -t 0-279

module load compilers/python/2.7.13
module load genomics/bowtie2

if [ ${PBS_ARRAYID} -lt 10 ]
then
    JAID=00${PBS_ARRAYID}
elif [ ${PBS_ARRAYID} -lt 100 ]
then
    JAID=0${PBS_ARRAYID}
else
    JAID=${PBS_ARRAYID}
fi

SCRATCH=/scratch/[username]
REFS=${SCRATCH}/REFS
WGBS=${SCRATCH}/WGBS
TEMP=${SCRATCH}/TMP
OUTP=${SCRATCH}/OUTPUT
FQ1=${WGBS}/split/Estrogen_24_1.clean.fq_$JAID
FQ2=${WGBS}/split/Estrogen_24_2.clean.fq_$JAID
FA=${REFS}/Oncorhynchus_mykiss_chr.fa

echo 'Job start: '`date`
cd $PBS_O_WORKDIR
cd BSseeker2_v2.1.0  # adjust to match the directory created when you extracted the tarball

mkdir -p ${TEMP}

python bs_seeker2-align.py -1 $FQ1 -2 $FQ2 -m 3 --aligner=bowtie2 -o ${OUTP}/Estrogen_24.${JAID}.align -f sam -g $FA --temp_dir=${TEMP}

echo 'Job complete: '`date`

There are a number of details to mention here. We cut the fq files into 2.4-million-line chunks to be sure each one can be processed within our 4-hour walltime. With 670,407,612 lines per file, this gives 280 chunks (numbered 000 to 279), so the Job Array starts at 0 and ends at 279. This piece of code builds the 3-digit suffix, filling with leading zeros as needed:

if [ ${PBS_ARRAYID} -lt 10 ]
then
    JAID=00${PBS_ARRAYID}
elif [ ${PBS_ARRAYID} -lt 100 ]
then
    JAID=0${PBS_ARRAYID}
else
    JAID=${PBS_ARRAYID}
fi
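An equivalent, more compact alternative is the shell's printf builtin, which does the zero padding for you; either form works:

JAID=$(printf "%03d" ${PBS_ARRAYID})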

The rest of the script defines the locations of the references, the split files, and the output; finally, it launches BSseeker2. Remember to replace [username] with your username.

SCRATCH=/scratch/[username]
REFS=${SCRATCH}/REFS
WGBS=${SCRATCH}/WGBS
TEMP=${SCRATCH}/TMP
OUTP=${SCRATCH}/OUTPUT
FQ1=${WGBS}/split/Estrogen_24_1.clean.fq_$JAID
FQ2=${WGBS}/split/Estrogen_24_2.clean.fq_$JAID
FA=${REFS}/Oncorhynchus_mykiss_chr.fa

echo 'Job start: '`date`
cd $PBS_O_WORKDIR
cd BSseeker2_v2.1.0  # adjust to match the directory created when you extracted the tarball

mkdir -p ${TEMP}

python bs_seeker2-align.py -1 $FQ1 -2 $FQ2 -m 3 --aligner=bowtie2 -o ${OUTP}/Estrogen_24.${JAID}.align -f sam -g $FA --temp_dir=${TEMP}

echo 'Job complete: '`date`

The bowtie2 command creates several threads; to ensure it has enough cores we set ppn=4.

5. Checking completion for all jobs

Sometimes jobs fail for one reason or another, and it can be tedious to check the outputs one by one to see which jobs need to be executed again. This small Python script can help with that task (BSresubmit.py):

#!/usr/bin/env python

import argparse
import os

if __name__ == '__main__':

    parser = argparse.ArgumentParser(description='Check that all BSseeker2 output logs are complete.')
    parser.add_argument('-output_directory', '-d', metavar='PATH', type=str, help='Directory for search output files', default='.')
    parser.add_argument('-prefix', '-p', metavar='PREFIX', type=str, help='Filenames must be prefix+NUM+suffix', required=True)
    parser.add_argument('-suffix', '-s', metavar='SUFFIX', type=str, help='Filenames must be prefix+NUM+suffix', required=True)
    parser.add_argument('-min_split_number', '-m', metavar='N', type=int, help='min number of split', default=0)
    parser.add_argument('-max_split_number', '-n', metavar='N', type=int, help='max number of split', required=True)

    args = parser.parse_args()

    failures=[]
    for i in range(args.min_split_number, args.max_split_number+1):
        filename="%s%s%s%03d%s" % (args.output_directory, os.sep, args.prefix, i, args.suffix)
        # a missing log file means the job never ran or was killed early
        if not os.path.isfile(filename):
            failures.append(i)
            continue

        # the last line of a complete log should contain 'END';
        # guard against empty files, which would otherwise raise IndexError
        with open(filename) as rf:
            data = rf.readlines()
        if not data or 'END' not in data[-1]:
            failures.append(i)

    if len(failures)==0:
        print("All the jobs are complete in the range")
    else:
        print("Some jobs need to be resubmitted:")
        print(failures)

To check whether all the jobs are complete, make the script executable (chmod +x BSresubmit.py) and run:

./BSresubmit.py -d OUTPUT -p Estrogen_24. -s .align.bs_seeker2_log -n 279
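If the script reports failures, say indices 17 and 123 (hypothetical values), you do not need to rerun the whole array: Torque's -t option on the command line accepts a comma-separated list and overrides the #PBS -t directive in the script:

qsub runjob.pbs -t 17,123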

Once you have all the data you can move forward and merge the files using samtools:
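A minimal sketch of that last step, assuming the module is called genomics/samtools on the cluster (check with module avail) and using the output names produced above. The .align files are in SAM format because we passed -f sam, so each chunk is converted to a sorted BAM before merging (the exact samtools sort syntax may differ on older samtools versions):

module load genomics/samtools

cd /scratch/[username]/OUTPUT

# convert each SAM chunk to a coordinate-sorted BAM
# (samtools merge expects sorted inputs)
for f in Estrogen_24.*.align
do
    samtools view -bS ${f} | samtools sort -o ${f}.bam -
done

# merge the sorted chunks into a single BAM file
samtools merge Estrogen_24.bam Estrogen_24.*.align.bam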