Source: https://github.com/markziemann/bioinformatics_intro_workshop
Snakemakeđ¤SLURM.
In previous sessions, we containerised an R based workflow, and used Slurm to make use of the many compute nodes on a HPC system.
For huge and long-running workflows, a workflow manager might be preferred, and they have a few key advantages:
This makes workflow managers extremely useful for large omics projects, for example when processing NGS data from hundreds or thousands of samples.
The example dataset today involves processing NGS data of malaria genome sequences for the purpose of understanding the population genetic structure. The steps involved are fastq trimming, mapping reads to the reference genome and variant calling.
Use conda to install python, snakemake and plugin
Set up cluster information in ~/.config/snakemake/slurm/config.yaml
Get the âmalariaâ working folder which has the reference genome and raw data
Build apptainer-based dependencies trimmomatic, bwa, samtools and bcftools
Write the Snakefile
Index the reference genome
Execute the workflow and check the results
On the cluster, use sinteractive to make a 2 hour interactive session with 4 CPU threads and 16 GB RAM.
sinteractive -c 4 -p long --mem 16G --time 02:00:00
Snakemake isnât installed, but we can use conda to do that.
Letâs see if conda is installed
module avail
or
module spider
Looks like miniconda is there.
module load miniconda
This will bring us our âbaseâ environment.
Next, use conda deactivate to get back out of the base environment. In other words, if you have (base) in your command prompt, you wonât get the right python version in your environment.
conda deactivate
We should create a named environment for a recennt Python release, where this workflow will be executed.
conda create -n py312 python=3.12
List available user environments.
conda env list
Now activate the new environment.
conda activate py312
Check the python verson.
python --version
Now snakemake can be installed here. The slurm plugin will allow snakemake to use the slurm scheduler which will divide the workload over the nodes of the cluster.
pip install snakemake
pip install snakemake-executor-plugin-slurm
Now try snakemake -h to show the help screen.
Create the folder for the configuration file to exist.
mkdir -p ~/.config/snakemake/slurm/
Now we need to create a text file. Here, I do it with nano.
nano ~/.config/snakemake/slurm/config.yaml
Here is the configuration that I have been using:
# ~/.config/snakemake/slurm/config.yaml
cluster:
mkdir -p logs/{rule} &&
sbatch
--partition={resources.partition}
--cpus-per-task={threads}
--mem={resources.mem_mb}
--time={resources.time}
--job-name=smk-{rule}-{wildcards}
--output=logs/{rule}/{rule}-{wildcards}-%j.out
--error=logs/{rule}/{rule}-{wildcards}-%j.err
default-resources:
- slurm_account=mark.ziemann
- partition=standard
- mem_mb=4000
- time="02:00:00"
jobs: 100
use-conda: true
restart-times: 3
max-jobs-per-second: 1
max-status-checks-per-second: 10
local-cores: 1
latency-wait: 60
keep-going: true
rerun-incomplete: true
printshellcmds: true
scheduler: greedy
Exit and save nano with Ctrl+x.
It contains the fastq sequence files, the reference genome and a few handy scripts.
wget https://ziemann-lab.net/public/bioinfo_workshop/malaria.zip
unzip malaria.zip
cd malaria
ls
We will need to build the key dependencies that interact with the data.
I have written a script that does this automatically for us.
cat build_deps.sh
You will see that each tool has a tag name which gives us the version number.
You can see that apptainer can be used for individual small tools in this way, which is different to the use last week, where we had a monolithic image that contained all of the apps. Each approach has some benefits and drawbacks.
Execute the script
bash build_deps.sh
The Snakefile that you see in the downloaded malaria folder details our workflow.
Letâs read through and understand it all.
The workflow doesnât include a bwa index, so letâs do that now with our bwa apptainer.
With a normal bwa install it would be run like this:
bwa index ref/Plasmodium_falciparum.ASM276v2.dna_sm.toplevel.fa.gz
But with the apptainer it is done like this:
apptainer run --writable-tmpfs bwa.sif index ref/Plasmodium_falciparum.ASM276v2.dna_sm.toplevel.fa.gz
To check that the index was generated, check the files present in the ref folder:
ls ref
You should see five new files with different suffices including amb, ann, bwt, pac and sa.
If everything has been set up properly so far, you can try to execute the workflow.
Firstly we will do it without SLURM scheduling, allocating 4 CPU threads on a single machine.
snakemake --jobs 4
Now weâll try running it making use of SLURM cluster, allowing distribution of jobs over multiple nodes.
In order to submit jobs to the cluster, we need to be working from the login node.
We also need to delete the recent output.
snakemake --delete-all-output
Ensure you have the right conda environment loaded:
conda activate py312
Then execute it:
snakemake --executor slurm --jobs 10 --default-resources runtime=2h --latency-wait 60
It is also good practice to view the logs. These are in a hidden folder called â.snakefile/log.â Use ls -la to list all folder contents including hidden files and folders, then cd .snakefile/log to navigate there and use less to inspect the log contents.
The pipeline stops after generating the BAM files, but a typical genomics pipeline would do some downstream work on it, like variant calling. If you want to develop your skills further, I recommend you extend the pipeline further by index the bam files with samtools and call variants with bcftools.