Source: https://github.com/markziemann/bioinformatics_intro_workshop
Image credit: Futurama.
Previously we ran a fairly lightweight R-based workflow on HPC, but bioinformatics can also involve computationally intensive jobs such as processing terabytes of raw data from hundreds of samples, or characterising interactions between millions of SNPs.
Unlike the earlier workflow, this intensive work can usually be divided into many smaller and more manageable steps. These steps can then be distributed across many machines in an HPC/supercomputer to expedite the work.
HPC clusters use schedulers to decide what jobs are able to run and they are designed to distribute resources in a fair way while optimising utilisation. Typically, jobs are allocated until resources are at capacity and then jobs are placed in a queue. There are a few different schedulers including Sun Grid Engine, PBS and Slurm. Today we’ll be using Slurm.
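A couple of Slurm commands are handy for getting a feel for a cluster before submitting anything (a quick sketch; partition names and limits differ between systems):
sinfo # list the partitions and the state of their nodes
squeue # show jobs that are currently queued or running
squeue -u $USER # restrict the view to your own jobs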
Jargon that we’ll be using:
Head node: when you log into an HPC it is typically into a head node. Don’t run any heavy analysis here. It is intended for lightweight work like viewing, organising and transferring files/results.
Cluster: A bunch of computers all connected to a central scheduler.
Batch: a set of computational tasks that will be distributed.
Job: a non-interactive computational task, e.g. a script. Jobs are typically submitted in batches.
Compute node: A computer in the cluster whose purpose is to run jobs. Clusters have many compute nodes and different types such as high RAM, GPU or specific CPU architecture.
Cores/threads: A modern computer typically has multiple CPU cores, processors that are able to do independent tasks. Laptops may have 2-8 cores and high-end servers may have 64 or more. Most CPUs also support hyperthreading, which allows each core to process two parallel tasks (threads).
Parallel processing: As a modern computer has many CPU threads, jobs can be distributed across the threads of a machine. There is some overhead cost to this, so jobs of 8-16 threads tend to perform well, while beyond 32 threads there is little benefit to adding further parallelism.
Pipeline: A set of scripts that involves set-up of the jobs, execution and collection of the results.
I/O: Input/output operations, such as reading data from disk or SSD. Given the data-intensive nature of modern bioinformatics, I/O performance is just as important as CPU performance.
Memory: The amount of RAM allocated for the task. It is important that you know how much RAM will be used for your tasks, as this is a key scarce resource on many HPC systems.
Storage: Data storage space. This may have certain quotas and limitations on some HPCs.
CPU Cache: The small amount of short-term memory on the CPU itself. A larger cache means the CPU spends less time waiting on main memory, which generally gives better performance. The sketch after this list shows how to check a node’s cores, memory, storage and CPU details from the command line.
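A minimal sketch for checking these resources on whichever node you land on, using standard Linux commands:
nproc # number of CPU threads available
free -h # total and available RAM
df -h . # storage space on the current filesystem
lscpu | grep -i cache # CPU cache sizes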
Before we jump into today’s task, take a look at Burnet’s HPC documentation: http://burnethpcdocs.domain.internal.burnet.edu.au/getting-started
In the example workflow, we’ll demonstrate how to distribute BLAST searches over several worker nodes. But we will build up our workflow from the basics using these steps:
Make an interactive session to confirm a small analysis script works properly.
Use the scheduler to process a few small datasets.
Scale up the job to complete the task efficiently.
Collect the results for downstream analysis.
Start an interactive session with 4 threads and 16 GB RAM for 3 hours.
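One way to request such a session with Slurm is srun, sketched below; whether you also need to name a partition, and what it is called, depends on your system.
srun --cpus-per-task=4 --mem=16G --time=0-03:00:00 --pty bash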
We need some input files for BLAST to work. Inside a new working folder, run the following bash chunks.
First for the genome reference.
wget -N --no-check-certificate "https://ftp.ensemblgenomes.org/pub/bacteria/release-42/fasta/bacteria_0_collection/escherichia_coli_str_k_12_substr_mg1655/dna/Escherichia_coli_str_k_12_substr_mg1655.ASM584v2.dna_sm.toplevel.fa.gz"
gunzip -kf Escherichia_coli_str_k_12_substr_mg1655.ASM584v2.dna_sm.toplevel.fa.gz
makeblastdb -dbtype nucl -in Escherichia_coli_str_k_12_substr_mg1655.ASM584v2.dna_sm.toplevel.fa -input_type fasta -parse_seqids -title ecoli
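If you want to confirm the database built correctly, blastdbcmd from the BLAST+ suite can print a summary (a sketch, assuming blastdbcmd is installed alongside makeblastdb):
blastdbcmd -db Escherichia_coli_str_k_12_substr_mg1655.ASM584v2.dna_sm.toplevel.fa -info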
Now for the set of annotated genes. The perl one-liner below unwraps the sequences so that each record becomes a header line followed by a single sequence line, which makes the later splitting step much simpler.
wget -N --no-check-certificate "https://ftp.ensemblgenomes.org/pub/bacteria/release-42/fasta/bacteria_0_collection/escherichia_coli_str_k_12_substr_mg1655/cds/Escherichia_coli_str_k_12_substr_mg1655.ASM584v2.cds.all.fa.gz"
gunzip -kf Escherichia_coli_str_k_12_substr_mg1655.ASM584v2.cds.all.fa.gz
perl -pe '/^>/ ? print "\n" : chomp' \
Escherichia_coli_str_k_12_substr_mg1655.ASM584v2.cds.all.fa \
> Escherichia_coli_str_k_12_substr_mg1655.ASM584v2.cds.all.unwrap.fa
Your folder should look like this:
$ ls
Escherichia_coli_str_k_12_substr_mg1655.ASM584v2.cds.all.fa
Escherichia_coli_str_k_12_substr_mg1655.ASM584v2.cds.all.fa.gz
Escherichia_coli_str_k_12_substr_mg1655.ASM584v2.cds.all.unwrap.fa
Escherichia_coli_str_k_12_substr_mg1655.ASM584v2.dna_sm.toplevel.fa
Escherichia_coli_str_k_12_substr_mg1655.ASM584v2.dna_sm.toplevel.fa.gz
Escherichia_coli_str_k_12_substr_mg1655.ASM584v2.dna_sm.toplevel.fa.ndb
Escherichia_coli_str_k_12_substr_mg1655.ASM584v2.dna_sm.toplevel.fa.nhr
Escherichia_coli_str_k_12_substr_mg1655.ASM584v2.dna_sm.toplevel.fa.nin
Escherichia_coli_str_k_12_substr_mg1655.ASM584v2.dna_sm.toplevel.fa.nog
Escherichia_coli_str_k_12_substr_mg1655.ASM584v2.dna_sm.toplevel.fa.nos
Escherichia_coli_str_k_12_substr_mg1655.ASM584v2.dna_sm.toplevel.fa.not
Escherichia_coli_str_k_12_substr_mg1655.ASM584v2.dna_sm.toplevel.fa.nsq
Escherichia_coli_str_k_12_substr_mg1655.ASM584v2.dna_sm.toplevel.fa.ntf
Escherichia_coli_str_k_12_substr_mg1655.ASM584v2.dna_sm.toplevel.fa.nto
Great, now we can subset a few sequences. The sed 1d removes the blank first line left by the unwrapping step, and head keeps the next 10 lines (five records).
sed 1d Escherichia_coli_str_k_12_substr_mg1655.ASM584v2.cds.all.unwrap.fa \
| head > sample.fa
Then perform a BLASTN search.
blastn -evalue 0.001 \
-db Escherichia_coli_str_k_12_substr_mg1655.ASM584v2.dna_sm.toplevel.fa \
-query sample.fa \
-outfmt 6 > blastres1.tsv
Check the output with cat.
Now count the number of CDS sequences with grep.
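One way to do both, as a sketch:
cat blastres1.tsv # view the tabular BLAST results
grep -c "^>" Escherichia_coli_str_k_12_substr_mg1655.ASM584v2.cds.all.fa # count the CDS sequences (one header per record)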
Then divide the fasta file into 10-20 smaller chunks. Here is an example which first puts each record on a single (tab-separated) line, then creates files with 400 records each.
paste -s -d '\t' Escherichia_coli_str_k_12_substr_mg1655.ASM584v2.cds.all.unwrap.fa \
| sed 's/>/\n>/g' | split -l400 - seqchunk.fa.
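A quick sanity check of the chunking (the array script later assumes 11 chunk files):
ls seqchunk.fa.* | wc -l # number of chunk files
wc -l seqchunk.fa.* # records per chunk (one record per line at this stage)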
They’re still one line per record, so it is best to return them to the two-lines-per-record format for BLAST.
for SEQ in seqchunk.fa* ; do
  # derive the output name, then convert tabs back to newlines and drop blank lines
  SEQ2=$(echo $SEQ | sed 's/seqchunk.fa/seqchunk2.fa/')
  sed 's/\t/\n/g' $SEQ | grep -v ^$ > $SEQ2
done
Now run a BLAST search with one of the fasta files (split names the chunks with two-letter suffixes, so the first one is seqchunk2.fa.aa).
blastn -evalue 0.001 \
-num_threads 4 \
-db Escherichia_coli_str_k_12_substr_mg1655.ASM584v2.dna_sm.toplevel.fa \
-query seqchunk2.fa.aa \
-outfmt 6 > blastres.fa.aa
Inspect the results.
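For example, outfmt 6 is tab-separated with the query ID in column 1, so a quick look could be:
head blastres.fa.aa # first few alignments
wc -l blastres.fa.aa # total number of hits
cut -f1 blastres.fa.aa | sort -u | wc -l # distinct query sequences with at least one hit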
Use nano to save this text in a file and call it my_job1.sbatch
#!/bin/bash
#SBATCH --job-name=my_job # Name of the job
#SBATCH --partition=standard # Name of the partition to use
#SBATCH --ntasks=1 # One task per array job
#SBATCH --cpus-per-task=4 # Four CPU threads per job
#SBATCH --mem=16G # Memory per job
#SBATCH --time=0-01:00:00 # Max allowed time is 1 hr
#SBATCH --output=log/MyJob-%j.out # Path for standard output (Slurm does not expand $PWD in #SBATCH lines)
#SBATCH --error=log/MyJob-%j.err # Path for standard error
# ===============
# Explanation
# ===============
# This script processes one input file for BLAST search
# ===============
pwd
rm -f blastres.fa.aa # remove any previous result; -f avoids an error if the file is absent
time blastn -evalue 0.001 \
-num_threads 4 \
-db Escherichia_coli_str_k_12_substr_mg1655.ASM584v2.dna_sm.toplevel.fa \
-query seqchunk2.fa.aa \
-outfmt 6 > blastres.fa.aa
# check blastn's exit status; the redirection creates the output file even when blastn fails
if [ $? -eq 0 ] ; then
  echo "Job completed successfully!"
else
  echo "Error: Job failed."
fi
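The #SBATCH lines write the logs into a log folder, which Slurm will not create for you, so make it before submitting. Submission and monitoring use standard Slurm commands:
mkdir -p log
sbatch my_job1.sbatch # submit the job; Slurm prints the job ID
squeue -u $USER # check whether the job is queued or running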
Again, inspect the results. Also inspect the logs which will be in the log folder.
Use nano to make the sbatch script below and save it as my_array1.sbatch.
#!/bin/bash
#SBATCH --job-name=my_job # Name of the job
#SBATCH --partition=standard # Name of the partition to use
#SBATCH --ntasks=1 # One task per array job
#SBATCH --cpus-per-task=4 # Four CPU threads per job
#SBATCH --mem=16G # Memory per job
#SBATCH --time=0-01:00:00 # Max allowed time is 1 hr
#SBATCH --output=log/MyJob-%j.out # Path for standard output
#SBATCH --error=log/MyJob-%j.err # Path for standard error
#SBATCH --array=1-11 # Create 11 array tasks (indices 1-11)
# ===============
# Explanation
# ===============
# This script processes 11 input files for BLAST search
# ===============
# Get list of input files
INPUT_FILES=$(find . -name "seqchunk2.fa.*" | sort) # sort so every task sees the files in the same order
mkdir -p results # -p avoids an error when several tasks try to create the folder
INPUT_FILE=$(echo $INPUT_FILES | cut -d ' ' -f$SLURM_ARRAY_TASK_ID )
OUTPUT_FILE="results/result_task_${SLURM_ARRAY_TASK_ID}.tsv"
echo "Job Array Number: ${SLURM_ARRAY_TASK_ID}"
echo "Processing file: ${INPUT_FILE}"
# Do the work
time blastn -evalue 0.001 \
-num_threads 4 \
-db Escherichia_coli_str_k_12_substr_mg1655.ASM584v2.dna_sm.toplevel.fa \
-query ${INPUT_FILE} \
-outfmt 6 > ${OUTPUT_FILE}
echo "Finished successfully!"
For our work to be reproducible, it needs to be automated. Scripting each step is self-documenting and helps if you need to run it again in the future.
Your task is to make a script main.sh that does:
Preparation: the sequence preparation steps above, including splitting the CDS file into 11 chunks.
Execution: kick off the sbatch array and wait until all 11 results files have been completed.
Downstream: read all the results files and count how many sequences have 2 or more hits.
Finally, check the logs to ensure all jobs completed properly. One possible skeleton for main.sh is sketched below.
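Below is a minimal sketch of how main.sh could be structured. It assumes the file names used earlier in this workshop and that your sbatch supports the --wait flag, which blocks until every array task has finished; the final line counts query IDs that appear two or more times across the results.
#!/bin/bash
# main.sh - one possible skeleton (a sketch, not the definitive solution)
set -euo pipefail

# Preparation: split the unwrapped CDS file into chunks of 400 records
paste -s -d '\t' Escherichia_coli_str_k_12_substr_mg1655.ASM584v2.cds.all.unwrap.fa \
| sed 's/>/\n>/g' | split -l400 - seqchunk.fa.
for SEQ in seqchunk.fa.* ; do
  SEQ2=$(echo $SEQ | sed 's/seqchunk.fa/seqchunk2.fa/')
  sed 's/\t/\n/g' $SEQ | grep -v ^$ > $SEQ2
done

# Execution: make sure the output folders exist, then submit and wait for all tasks
mkdir -p log results
sbatch --wait my_array1.sbatch

# Downstream: column 1 of outfmt 6 is the query ID, so count IDs seen 2 or more times
cut -f1 results/result_task_*.tsv | sort | uniq -c | awk '$1 >= 2' | wc -l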
Another use of Slurm job arrays is to run the same script over a range of parameter values.
For simulations where you need to repeat a routine many times, the array index can also be used to set the seed for pseudorandom sampling.
Here is an example of a parameter called ALPHA being substituted.
# ----------------------------------------------------------------------
# Example 2: Use different parameters for each task
# ----------------------------------------------------------------------
# You can also use the array index to set different parameters
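# NOTE: bash arrays are zero-indexed, so these six values pair with #SBATCH --array=0-5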
ALPHA_VALUES=(0 0.1 0.5 1.0 2.0 5.0)
ALPHA=${ALPHA_VALUES[$SLURM_ARRAY_TASK_ID]}
echo "Using parameter alpha = ${ALPHA}"