Source: https://github.com/markziemann/bioinformatics_intro_workshop
Proteomics involves the investigation of protein mixtures, normally derived from cell extracts and tissues. The most common approach is mass spectrometry (MS), which can measure many distinct peptide ions simultaneously. While it is possible to sample materials in the solid state using MALDI-TOF, for most biospecimens the preferred approach is to apply MS to solutions separated by high performance liquid chromatography (HPLC). The measurements obtained include the retention time on the column, the mass spectrum expressed as the mass-to-charge ratio (m/z), and the intensity. In tandem mass spectrometry, selected peptide (parent) ions are fragmented into smaller components and undergo a second round of MS, allowing each parent ion to be profiled in more detail. The fragmentation pattern, that is the m/z values of the various fragments and their intensities, is unique to each peptide. Data-dependent acquisition (DDA) is an acquisition technique that examines particular parts of the HPLC elution and particular ion species in high detail. In contrast, data-independent acquisition (DIA) scans all possible spectra consistently. DIA produces more complex data and allows discovery of novel species, which is why we will be focusing on DIA in this workshop.
Due to this complexity and the huge size of the datasets, proteomics analysis is particularly challenging. We won’t be able to do a deep dive into all aspects of proteomics in this series; however, we can introduce you to FragPipe, a pipeline designed to take care of most proteomics use cases. In this workshop, we will use FragPipe to process raw proteomics data, generate the intensity data and load this dataset into R.
We decided early on that we wanted to containerise our workflow to make it portable so that we could use it on our HPC.
FragPipe is meant to be an end-to-end pipeline, but containerising it poses some problems. Firstly, some of its critical dependencies are covered by a restrictive licence, which means we could not find a publicly available working image and we are unable to publish one ourselves. That said, I can show you the steps we used to make our image, and I can make it available to you internally if you work or study at Burnet or Deakin.
You can see that this image requires a great number of dependencies which took my group several weeks of work to iron out.
FROM ubuntu:24.04
USER root
ARG DEBIAN_FRONTEND=noninteractive
# update and upgrade packages
RUN apt-get -y update --fix-missing \
&& apt-get -y upgrade \
&& apt-get -y update \
&& apt-get -y autoremove
# install mono
RUN apt-get -y install ca-certificates gnupg
RUN gpg --homedir /tmp --no-default-keyring --keyring /usr/share/keyrings/mono-official-archive-keyring.gpg --keyserver hkp://keyserver.ubuntu.com:80 --recv-keys 3FA7E0328081BFF6A14DA29AA6A19B38D3D831EF
RUN echo "deb [signed-by=/usr/share/keyrings/mono-official-archive-keyring.gpg] https://download.mono-project.com/repo/ubuntu stable-focal main" | tee /etc/apt/sources.list.d/mono-official-stable.list
RUN apt-get -y update
RUN apt-get -y install mono-devel
RUN apt install -y software-properties-common
RUN add-apt-repository ppa:deadsnakes/ppa
RUN add-apt-repository ppa:dotnet/backports
RUN apt-get -y update
# install dependencies
RUN apt-get -y install \
git \
python3.11 \
python3-pip \
tar \
unzip \
wget \
openjdk-17-jdk \
vim \
dotnet-runtime-6.0
# install python packages
RUN pip uninstall --break-system-packages easypqp \
&& pip install --break-system-packages git+https://github.com/Nesvilab/easypqp.git@master \
&& pip install --break-system-packages lxml \
&& pip install --break-system-packages plotly \
&& pip install --break-system-packages kaleido \
&& pip install --break-system-packages narwhals \
&& pip install --break-system-packages pyarrow \
&& pip install --break-system-packages pypdf2
# create a directory with 777 permission and set it to the work directory
RUN mkdir /fragpipe_bin
RUN chmod 777 /fragpipe_bin
WORKDIR /fragpipe_bin
# create directories
RUN mkdir tmp
RUN chmod 777 tmp
# download and install fragPipe
RUN wget https://github.com/Nesvilab/FragPipe/releases/download/23.1/FragPipe-23.1-linux.zip -P fragpipe-23.1
RUN unzip fragpipe-23.1/FragPipe-23.1-linux.zip -d fragpipe-23.1
RUN chmod -R 777 /fragpipe_bin
# add dependencies
RUN mkdir -p /fragpipe_bin/dependencies/
COPY MSFragger-4.3 /fragpipe_bin/dependencies/MSFragger-4.3
COPY IonQuant-1.11.11 /fragpipe_bin/dependencies/IonQuant-1.11.11
COPY diaTracer-1.3.3 /fragpipe_bin/dependencies/diaTracer-1.3.3
# set environment variables
ENV JAVA_HOME=/usr/lib/jvm/java-17-openjdk-amd64/
This image is called mziemann/fragpipe_mod2. I saved it to a tar archive with the following command:
docker image save mziemann/fragpipe_mod2 > fragpipe_mod2.tar
I then converted the archive into an Apptainer image that we can use on shared systems like the Burnet HPC:
apptainer build fragpipe.sif docker-archive:fragpipe_mod2.tar
Raw data - we’re using files generated using the Orbitrap Astral Mass Spectrometer in DIA mode.
Manifest file - a list of the raw proteome data files to be processed.
Workflow file
Peptide library
Raw data for this workshop is available from the PRIDE database under accession number PXD058340. We also have a local copy, which you can use.
You can make a copy for yourself; just be aware that the raw data is huge (56GB). Here is the path: /home/mark.ziemann/projects/proteomics/raw_data
cp -r /home/mark.ziemann/projects/proteomics/raw_data .
Use nano or another editor to create a file called sampletest.manifest.fp-manifest as shown below. Ensure there is one tab between the first two fields and two tabs between the second and third fields. This file tells FragPipe where the input files are.
/home/projects/proteomics/PXD058340/20240911_AST0_NEO4_IAH_collab_SAG_U2OS_Frac_Control_Chrom_01.raw chr1 DIA
/home/projects/proteomics/PXD058340/20240911_AST0_NEO4_IAH_collab_SAG_U2OS_Frac_Control_Chrom_02.raw chr2 DIA
/home/projects/proteomics/PXD058340/20240911_AST0_NEO4_IAH_collab_SAG_U2OS_Frac_Control_Chrom_03.raw chr3 DIA
/home/projects/proteomics/PXD058340/20240911_AST0_NEO4_IAH_collab_SAG_U2OS_Frac_Control_Chrom_04.raw chr4 DIA
/home/projects/proteomics/PXD058340/20240911_AST0_NEO4_IAH_collab_SAG_U2OS_Frac_Control_Sol_01.raw sol1 DIA
/home/projects/proteomics/PXD058340/20240911_AST0_NEO4_IAH_collab_SAG_U2OS_Frac_Control_Sol_02.raw sol2 DIA
/home/projects/proteomics/PXD058340/20240911_AST0_NEO4_IAH_collab_SAG_U2OS_Frac_Control_Sol_03.raw sol3 DIA
/home/projects/proteomics/PXD058340/20240911_AST0_NEO4_IAH_collab_SAG_U2OS_Frac_Control_Sol_04.raw sol4 DIA
The workflow file is 375 lines, so I can’t describe it in detail here, but it has parameters for all the programs that FragPipe uses.
On the HPC, use the following command to copy it:
cp /home/mark.ziemann/projects/proteomics/fragpipe_work/sampletest.workflow .
Use nano or less to inspect it. You’ll see that the first parameter is the path to a peptide library. This needs to point to the location of your peptide library inside the container.
FragPipe requires a reference proteome to search against. Here I am using the human UniProt proteome with added decoys. Decoys are synthetic sequences made by reversing the human proteome, and they are used to estimate the false discovery rate of the search. The filename you should have is UP000005640_9606_withDECOYS.fasta. In this file, entries starting with >rev_sp are reversed (decoy) sequences, while those beginning with >sp| are normal sequences.
Use this Apptainer command to enter a container. Note that the current working directory is bound by default, so files in it are visible inside the container.
apptainer run --writable-tmpfs fragpipe.sif bash
Find the fragpipe executable.
cd /fragpipe_bin/fragpipe-23.1/fragpipe-23.1/bin
ls
Check that the locations of the manifest file, workflow file, raw data files and peptide library file are all correct.
Now we can execute the workflow (after provisioning a node with 64GB RAM and 8 CPUs).
sinteractive -c 8 -p standard --mem 64G --time 12:00:00 --nodelist=bnt-hpcn-04
Before running the container again, set the locale, otherwise we will get an error.
export LANG=C.UTF-8
export LC_ALL=C.UTF-8
apptainer run --writable-tmpfs fragpipe.sif bash
Now in the container, navigate to /fragpipe_bin/fragpipe-23.1/fragpipe-23.1/bin then execute the command:
./fragpipe --headless --workflow ~/projects/proteomics/sampletest.workflow \
--manifest ~/projects/proteomics/sampletest.manifest.fp-manifest \
--workdir ~/projects/proteomics/workingdir01 \
--config-tools-folder /fragpipe_bin/dependencies/ \
--ram 60 --threads 8
This might take a few hours to complete. Notice that alongside the raw files, you will see .mzML and fragtmp files appear. In the meantime, take a look at the dataset processed previously.
There are a lot of output files. The most important ones are:
easypqp_files (PDF reports)
MSBooster (charts and reports)
dia-quant-output: The protein quantifications are here
    * report.gg_matrix.tsv - intensities at the gene level (9386 genes detected)
    * report.pg_matrix.tsv - intensities at the protein level (9390 proteins detected)
    * report.pr_matrix.tsv - intensities at the peptide level (121010 peptides detected)
Essential steps:
Read report.pg_matrix.tsv into R, select the numerical data and decide on the row name schema.
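As a minimal sketch of this step: the file location (dia-quant-output/) and the annotation column name (Genes) are assumptions based on the typical DIA quant output, so adjust them to your own files.
# read the protein group matrix; check.names=FALSE keeps the raw file paths as column names
pg <- read.delim("dia-quant-output/report.pg_matrix.tsv", check.names = FALSE)
# one possible row name schema: gene symbols, with blanks and duplicates handled
ids <- as.character(pg$Genes)
ids[is.na(ids) | ids == ""] <- "unannotated"
rownames(pg) <- make.unique(ids)
# keep only the numeric per-sample intensity columns
z <- pg[, sapply(pg, is.numeric)]
head(z)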
Calculate missingness by sample and by gene.
Filter proteins and samples based on missingness.
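For example, treating NA values as missing, missingness can be summarised per sample and per protein and then used to filter; the 50% cut-offs below are arbitrary placeholders, not recommendations.
# proportion of missing values per sample (column) and per protein (row)
miss_by_sample <- colMeans(is.na(z))
miss_by_protein <- rowMeans(is.na(z))
barplot(miss_by_sample, las = 2, main = "Missingness by sample")
hist(miss_by_protein, main = "Missingness by protein")
# keep proteins quantified in at least half the samples, and samples with <50% missing values
z_filt <- z[miss_by_protein <= 0.5, miss_by_sample <= 0.5]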
Decide on imputation approach.
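There are many imputation options; one simple sketch is half-minimum imputation per protein, which produces the z_complete object used in the code further down.
# replace missing values with half of the smallest observed intensity for that protein
z_complete <- t(apply(z_filt, 1, function(x) {
  x[is.na(x)] <- min(x, na.rm = TRUE) / 2
  x
}))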
PCA, filter outlier samples.
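A quick base-R sketch; the log scaling inside the call is just to stabilise variance for visualisation, and the same code can be re-run later on the normalised matrix.
# PCA of samples; each point is one sample
pca <- prcomp(t(log10(z_complete + 1)))
plot(pca$x[, 1], pca$x[, 2], type = "n", xlab = "PC1", ylab = "PC2", main = "Sample PCA")
text(pca$x[, 1], pca$x[, 2], labels = rownames(pca$x))
# samples sitting far away from their group on this plot are candidates for removal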
Log transformation and quantile normalisation:
#Log10 transformation
z_log10 <- log10(z_complete + 1)
cat("Log10 transformation completed\n")
#Between-sample quantile normalisation (normalizeBetweenArrays is from the limma package)
library(limma)
z_normalized <- normalizeBetweenArrays(z_log10, method = "quantile",
                                       targets = NULL, cyclic.method = "fast")
PCA again on normalised data.
Make sample sheet and run differential analysis with limma.
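The sample sheet could be built from the column names of the matrix; the chr/sol grouping below is an assumption based on the sample names in the manifest above.
# hypothetical sample sheet: treatment group guessed from the sample names
sample_sheet <- data.frame(
  row.names = colnames(z_normalized),
  treatment = factor(ifelse(grepl("chr", colnames(z_normalized), ignore.case = TRUE),
                            "chr", "sol"))
)
sample_sheet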
# Design matrix needs a sample_sheet
design <- model.matrix(~ treatment, data = sample_sheet)
# Limma analysis
fit <- lmFit(z_normalized, design)
fit <- eBayes(fit)
# Results summary
summary(decideTests(fit))
# Extract all results
results <- topTable(fit, coef = 2, number = Inf)
Volcano plot, heatmap of top proteins.
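A base-R sketch, assuming the results and z_normalized objects from the steps above; note that fold changes are on the log10 scale here because the data were log10 transformed.
# volcano plot: fold change vs significance
plot(results$logFC, -log10(results$P.Value), pch = 19, cex = 0.5,
     xlab = "log10 fold change", ylab = "-log10 p-value", main = "Volcano plot")
# heatmap of the top 50 proteins in the topTable ranking
top <- rownames(results)[1:50]
heatmap(as.matrix(z_normalized[top, ]), scale = "row", margins = c(10, 6))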
Pathway enrichment analysis.
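There are many dedicated tools for this (for example clusterProfiler, fgsea or mitch); as a minimal illustration only, an over-representation test for a single hypothetical gene set can be done with a Fisher exact test.
# hypothetical gene set; in practice use curated collections such as Reactome or GO
geneset <- c("HSPA1A", "HSPA1B", "DNAJB1", "HSPB1")
universe <- rownames(results)
sig <- rownames(results)[results$adj.P.Val < 0.05]
# 2x2 table of gene set membership vs statistical significance
tab <- table(factor(universe %in% geneset, levels = c(FALSE, TRUE)),
             factor(universe %in% sig, levels = c(FALSE, TRUE)))
fisher.test(tab)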
Other analyses: for example, looking at post-translational modifications, alternative splicing, etc.