Bioinformatics data skills workshop - Session 7: Containerising workflows on shared computers

Source: https://github.com/markziemann/bioinformatics_intro_workshop

Previously we developed a basic RNA-seq workflow for discovery of differential expression in lung cancer.

In this session we will use container technology to make the workflow reproducible, with the complication that on HPC and other shared systems we do not have administrator “root” permissions.

Why containerisation is important

Like other software, our analytical tools receive regular upgrades to fix bugs and include new features.

This results in research code that could yield different results if the dependancies are not exactly defined.

Although it is common practice to describe major version of our software stack including Python, R and packages in publications, there are still cases where differences in system dependancies impact final results [1].

In order to make the work reproducible, we need to capture the computing environment. This can be done with virtual machines and containers. Virtual machines are the ability to run a full operating system (guest) within another system (host). These require a lot of drive space and incur high compute overhead in terms of memory and CPU power. On the other hand, containers are more minimal and share most components of the operating system from the host. They also are more lightweight in terms of memory and incur virtually no reduction in performance as compared to native execution. Therefore, containers are a good way for researchers to encapsulate their research workflow. Each project gets it own container “image” which we can preserve into the future to ensure that our work will be reproducible. This means that we don’t need to stress too much about changes that will occur if we upgrade software versions running on the host, like R and Python. There is an argument that you may not need to install software like R and Python on your daily driver, as you can rely sufficiently on containers.

Approaching research as a software engineering problem

There are distinct phases of software development.

Software development stages

Development and Testing: here, the developers make changes to software to improve software. They may work individually or in small teams, and do some testing of the new code, working iteratively to meet their goals. New code is passed on to a separate individual or team for testing. Tests are made for individual components as well as the software stack as a whole. Changes may be suggested and will go back to step 1 for improvements.
Staging: here, the code is deployed to an identical server environment as the production, and the app is exposed to realistic workloads. For example, a social media app has a front end interface, back end database, and many concurrent users. Extensive stress testing occurs at this stage with a dedicated team of QA engineers.
Production: if tests are passed, then the software can be deployed.

So how does this relate to researchers?

When we write scripts in bash, R or another language, generating charts and results, we can call this “Development”. We can test this extensively on the same computing environment to make sure the figures look good and we are generating all the desired results to support the publication. You may send analysis reports to your supervisor, who is your “tester”. They will suggest changes and you will iteratively complete the workflow.

Once you think the elements of the workflow are working nicely, you can put it into its own container environment and run from start to end. This can be considered “staging”. It should complete without errors and with predictable dependable results. At this stage, the work could be part of a submitted manuscript. Peer-reviewers may suggest changes, which may send the project back to development.

Final publication can be considered “production”.

Options for containerised workflows

Docker is the most well known containerisation system and works well for research software with one caveat: it requires root (admin) permissions. This isn’t a problem if you have your own workstation, or if you have a permissive shared system, like a server that belongs to your lab, but most HPCs do not allow Docker.

There are ways to circumvent this, for example you could use another computer where you have root access to create the Docker image and then you can transfer that image as a file to the HPC, where you can use alternative softwares like Singularity or Udocker to execute rootless workflows.

But not everyone has another computer, so it would be convenient to be able to create such an image on the HPC from scratch. The Apptainer tool (a derivative of Singularity) is able to do this.

Apptainer logo

Here we will use Apptainer to create an image to run our transcriptome analysis.

Let’s take a look at the Apptainer Documentation about how to make a custom image from scratch.

What about conda you say?

Conda is no longer free for organisations with >200 employees, which includes most universities and research institutes.

You can use conda on your personal device for free, but be warned that if you use it on your organisation’s IT systems, that the organisation might be liable for unpaid licence fees.

Conda also doesn’t resolve the problem of system dependancies impacting final results.

Before we start

“Knit” your R Markdown script to ensure that it completes as expected. Inspect the html and fix any issues now.

Customising your container image

If you need R in your image, my recommendation is to start with an immage that already has R. Thankfully, there are many options:

Rocker - which has R and Rstudio installed.
Bioconductor - which has many of the necessary system dependancies for Bioconductor packages.

Let’s consider the transcriptome workflow we used previously, we need to catalog system packages and R packages we need in the container.

Here is an example definition file suitable for our transcriptome study. You can see that it installs system packages first, then a python based app called multiQC, then it installs R packages from CRAN and then Bioconductor. After all that stuff is installed, it then clones the project repo from GitHub. So that project repo will need to exist and be publicly available.

BootStrap: docker
From: bioconductor/bioconductor_docker:RELEASE_3_19

%post
  # Update apt-get
  apt-get update \
    && apt-get upgrade -y \
    && apt-get install -y nano git libncurses-dev xorg openbox \
    && pip install multiqc \
    && apt-get clean \
    && rm -rf /var/lib/apt/lists/*

  # Install CRAN packages
  Rscript -e 'install.packages(c("RhpcBLASctl","zoo","reshape2","gplots","MASS","scales","eulerr","kableExtra","vioplot","pkgload","beeswarm"))'

  # Install bioconductor packages
  Rscript -e 'BiocManager::install(c("DESeq2","fgsea","limma","topconfects","mitch"))'

  # get a clone of the codes
  git clone https://github.com/markziemann/myproject.git

  # Set the container working directory
  DIRPATH=/myproject
  mkdir -p $DIRPATH
  cd $DIRPATH

%environment
  export DIRPATH=/myproject

%runscript
  cd $DIRPATH
  exec /bin/bash "$@"

%startscript
  cd $DIRPATH

%labels
   Author Mark Ziemann
   Date 22 Aug 2024

If you don’t want to use GitHub resource like this, you can replace that line with a new %files section like this:

%files
myproject /

It will copy the myproject folder to the root directory of the image.

Save the definition file as Apptainer.def.

Build with Apptainer

Using the new definition file, build the image.

apptainer build Apptainer.sif Apptainer.def

It will create a singularity image file (sif). While the def file is very small, like 4 kB, the image could be up to 2 GB in size.

Launch the Apptainer

And once the image is built, we can start running the workflow. But before we do that we need to provision ourselves an interactive session on the hpc.

The following command will give you an interactive session with 4 cores, 16 GB RAM for 24 hrs.

srun -c 4 -p long --mem 16G --time 1-00:00:00 -I60 -N 1 -n 1 --pty bash -i

Now create a tmux session so that your session is not cancelled when disconnecting SSH. If you have not used tmux before here is a blog post showing the key commands.

tmux new -s myworkflow

Now you can launch the image and start interacting with it using bash.

apptainer run --writable Apptainer.sif

If you get an error, try replacing --writable with --writable-tmpfs. You will hopefully find yourself with a bash prompt in the Apptainer. Use ls to see the contents of the folder, you will hopefully see the R Markdown file.

To execute it, open an R command line and use the following command, substituting the name of the R Markdown file.

rmarkdown::render("myworkflow.Rmd")

From here, the results can be copied to the host filesystem like this:

file.copy("myworkflow.html","~/projects/myproject/")

Note that Apptainer automatically mounts the user’s home directory in the container, which makes transfers relatively easy.

Copy this file to your personal computer using scp and inspect it with a web browser.

If it worked correctly, you can exit R, the Apptainer, and interactive session. Alternatively, use squeue to find the JOBID of the interactive session and then scancel to terminate it.

Final thoughts

Congrats for getting this far. Now you have a highly reproducible workflow. So long as you have the sif file, input data and accessible code repository, this workflow will perform exactly the same on any computer that can run Apptainer.

Keep the html report as evidence that the work was conducted. Upload it to LabArchives.

Loading big datasets into the container is another challenge which has a few different solutions:

Download from central repository
Obtain from home directory
Create the Apptainer image with the full data set included

If your work is going to be published, try the following checklist.

Firstly put the source data into a suitable repository. For example RNA-seq data would be suited to NCBI GEO submission.

Secondly put the following onto Zenodo or FigShare for long-term archiving:

Code repository snapshot including instructions to future users on how to reproduce the whole thing.
HTML reports as evidence that the work was done.
Apptainer sif file to provide a stable computing environment.
A README file that documents these objects and how they relate to the published paper.

Session information

For reproducibility.

sessionInfo()

## R version 4.4.1 (2024-06-14)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 22.04.4 LTS
## 
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.10.0 
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## time zone: Australia/Melbourne
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## loaded via a namespace (and not attached):
##  [1] digest_0.6.36     R6_2.5.1          fastmap_1.2.0     xfun_0.45        
##  [5] cachem_1.1.0      knitr_1.48        htmltools_0.5.8.1 rmarkdown_2.27   
##  [9] lifecycle_1.0.4   cli_3.6.3         sass_0.4.9        jquerylib_0.1.4  
## [13] compiler_4.4.1    tools_4.4.1       evaluate_0.24.0   bslib_0.7.0      
## [17] yaml_2.3.9        rlang_1.1.4       jsonlite_1.8.8

1. Vallet N, Michonneau D, Tournier S. Toward practical transparent verifiable and long-term reproducible research using guix. Sci Data. 2022;9.