Source: https://github.com/markziemann/bioinformatics_intro_workshop

Snakemake🤝SLURM.

Background

In previous sessions, we containerised an R-based workflow and used SLURM to make use of the many compute nodes of an HPC system.

For large, long-running workflows, a workflow manager is often preferred. Workflow managers have a few key advantages:

  • They support containers.
  • They support heterogeneous computing, including a mix of HPC, cloud and other resources.
  • They provide detailed reporting of job errors.
  • Re-running the workflow only repeats the necessary steps, i.e. steps that completed successfully before an error are not re-run.
  • If changes are made to the code or input data, only the affected steps are re-run.

This makes workflow managers extremely useful for large omics projects, for example when processing NGS data from hundreds or thousands of samples.

Today’s example dataset involves processing NGS data from malaria genome sequences to understand their population genetic structure. The steps involved are fastq trimming, mapping reads to the reference genome and variant calling.

Overview of today’s activity

  1. Use conda and pip to install Python, Snakemake and the SLURM executor plugin

  2. Set up cluster information in ~/.config/snakemake/slurm/config.yaml

  3. Get the “malaria” working folder which has the reference genome and raw data

  4. Build Apptainer-based dependencies: trimmomatic, bwa, samtools and bcftools

  5. Write the Snakefile

  6. Index the reference genome

  7. Execute the workflow and check the results

Part 1 - Setting up Snakemake

On the cluster, use sinteractive to start a 2-hour interactive session with 4 CPU threads and 16 GB RAM.

sinteractive -c 4 -p long --mem 16G --time 02:00:00

Snakemake isn’t installed on the cluster, but we can use conda to install it.

Let’s see if conda is installed:

module avail

or

module spider

Looks like miniconda is there.

module load miniconda

This will put us in the “base” environment.

Next, use conda deactivate to get back out of the base environment; if (base) remains in your command prompt, you won’t get the right Python version in the environment we create next.

conda deactivate

We should create a named environment with a recent Python release, where this workflow will be executed.

conda create -n py312 python=3.12

List available user environments.

conda env list

Now activate the new environment.

conda activate py312

Check the Python version.

python --version

Now Snakemake can be installed into this environment. The SLURM executor plugin allows Snakemake to submit jobs to the SLURM scheduler, which distributes the workload over the nodes of the cluster.

pip install snakemake
pip install snakemake-executor-plugin-slurm

Now try snakemake -h to show the help screen.

Part 2 - Setting up cluster information

Create the folder where the configuration file will live.

mkdir -p ~/.config/snakemake/slurm/

Now we need to create the configuration file itself. Here, I do it with nano.

nano ~/.config/snakemake/slurm/config.yaml

Here is the configuration that I have been using:

# ~/.config/snakemake/slurm/config.yaml

cluster:
  mkdir -p logs/{rule} &&
  sbatch
    --partition={resources.partition}
    --cpus-per-task={threads}
    --mem={resources.mem_mb}
    --time={resources.time}
    --job-name=smk-{rule}-{wildcards}
    --output=logs/{rule}/{rule}-{wildcards}-%j.out
    --error=logs/{rule}/{rule}-{wildcards}-%j.err
default-resources:
  - slurm_account=mark.ziemann
  - partition=standard
  - mem_mb=4000
  - time="02:00:00"
jobs: 100
use-conda: true
restart-times: 3
max-jobs-per-second: 1
max-status-checks-per-second: 10
local-cores: 1
latency-wait: 60
keep-going: true
rerun-incomplete: true
printshellcmds: true
scheduler: greedy

Save and exit nano with Ctrl+X.
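For reference, Snakemake looks up profiles by folder name under ~/.config/snakemake/, so a configuration stored at this path can be selected by name. A minimal usage sketch (in this workshop we will instead pass the key options explicitly on the command line in Part 7):

snakemake --profile slurm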

Part 3 - Get the “malaria” working folder

It contains the fastq sequence files, the reference genome and a few handy scripts.

wget https://ziemann-lab.net/public/bioinfo_workshop/malaria.zip
unzip malaria.zip
cd malaria
ls

Part 4 - Build apptainer tools

We will need to build the key dependencies that interact with the data.

I have written a script that does this automatically for us.

cat build_deps.sh

You will see that each tool has a tag name which gives us the version number.

You can see that Apptainer can be used for individual small tools in this way, which is different to last week, where we had a monolithic image that contained all of the apps. Each approach has some benefits and drawbacks.
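If a mental model helps before opening the script, a build script like this usually boils down to one apptainer build per tool, pulling a tagged image from a public registry. The sketch below is illustrative only; the registries and version tags are placeholders and the real build_deps.sh may use different sources.

# illustrative sketch only -- check build_deps.sh for the actual image sources and tags
apptainer build trimmomatic.sif docker://staphb/trimmomatic:0.39
apptainer build bwa.sif docker://staphb/bwa:0.7.17
apptainer build samtools.sif docker://staphb/samtools:1.17
apptainer build bcftools.sif docker://staphb/bcftools:1.17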

Execute the script

bash build_deps.sh

Part 5 - Prepare the Snakefile

The Snakefile that you see in the downloaded malaria folder details our workflow.

Let’s read through and understand it all.
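Before opening it, it can help to recall the anatomy of a Snakemake rule: each rule declares its inputs, outputs, thread count and a shell command, and Snakemake chains rules together by matching filenames. The sketch below is purely illustrative; the sample names, paths and container invocations in the real Snakefile will differ.

# illustrative sketch only -- not the actual Snakefile
SAMPLES = ["sampleA", "sampleB"]   # placeholder sample names

# the first rule lists the final targets of the workflow
rule all:
    input:
        expand("bam/{sample}.bam", sample=SAMPLES)

# a rule maps inputs to outputs via a shell command; Snakemake resolves the
# {sample} wildcard in the filenames and substitutes {input}, {output} and
# {threads} into the shell command
rule bwa_map:
    input:
        ref="ref/Plasmodium_falciparum.ASM276v2.dna_sm.toplevel.fa.gz",
        r1="trimmed/{sample}_R1.fastq.gz",
        r2="trimmed/{sample}_R2.fastq.gz"
    output:
        "bam/{sample}.bam"
    threads: 4
    shell:
        "apptainer run bwa.sif mem -t {threads} {input.ref} {input.r1} {input.r2} "
        "| apptainer run samtools.sif sort -o {output} -"

When reading the real file, look for the same pieces: a rule all with the final outputs, and one rule per processing step (trimming, mapping and so on).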

Part 6 - Index the genome

The workflow doesn’t include a bwa index step, so let’s do that now with our bwa Apptainer container.

With a normal bwa install it would be run like this:

bwa index ref/Plasmodium_falciparum.ASM276v2.dna_sm.toplevel.fa.gz

But with the apptainer it is done like this:

apptainer run --writable-tmpfs bwa.sif index ref/Plasmodium_falciparum.ASM276v2.dna_sm.toplevel.fa.gz

To confirm that the index was generated, list the files present in the ref folder:

ls ref

You should see five new files with different suffixes, including .amb, .ann, .bwt, .pac and .sa.

Part 7 - Run the workflow

If everything has been set up properly so far, you can try to execute the workflow.

First, we will do it without SLURM scheduling, allocating 4 CPU threads on a single machine.

snakemake --jobs 4

Now we’ll try running it on the SLURM cluster, allowing jobs to be distributed over multiple nodes.

In order to submit jobs to the cluster, we need to be working from the login node.

We also need to delete the output from the previous run.

snakemake --delete-all-output

Ensure you have the right conda environment loaded:

conda activate py312

Then execute it:

snakemake --executor slurm --jobs 10 --default-resources runtime=2h --latency-wait 60

It is also good practice to view the logs. These are written to a hidden folder called ‘.snakemake/log’. Use ls -la to list all folder contents including hidden files and folders, then cd .snakemake/log to navigate there and use less to inspect the log contents.
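For example, a quick way to open the most recent log (assuming the default log location described above):

ls -la                       # confirm the hidden .snakemake folder is present
cd .snakemake/log
less "$(ls -t | head -n 1)"  # open the most recently written log file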

Part 8 - Extending the work

The pipeline stops after generating the BAM files, but a typical genomics pipeline would do some downstream work, like variant calling. If you want to develop your skills further, I recommend you extend the pipeline by indexing the BAM files with samtools and calling variants with bcftools.
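As a starting point, here is a rough sketch of the commands that such rules would wrap, run through the containers built in Part 4. The BAM and VCF paths and the .sif names are placeholders, and the bcftools parameters should be tuned to your analysis; note also that bcftools needs a faidx-indexed reference, so the gzipped fasta may first have to be decompressed (or bgzip-recompressed) and indexed with samtools faidx.

# hypothetical sketch -- file names are placeholders; adapt them to the paths your Snakefile produces
apptainer run samtools.sif index bam/sampleA.bam
mkdir -p vcf
apptainer run bcftools.sif mpileup -f ref/genome.fa bam/sampleA.bam \
    | apptainer run bcftools.sif call -mv -Oz -o vcf/sampleA.vcf.gz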