Abstract
Background
Inference of biological pathway activity via gene set enrichment
analysis is frequently used in the interpretation of clinical and other
omics data. With the proliferation of new omics profiling approaches
and ever-growing size of data sets generated, there is a lack of tools
available to perform and visualise gene set enrichments in analyses
involving multiple contrasts.
Results
To address this, we developed mitch, an R package for multi-contrast
gene set enrichment analysis. It uses a rank-MANOVA statistical
approach to identify sets of genes that exhibit joint enrichment across
multiple contrasts. Its unique visualisation features enable the
exploration of enrichments in up to 20 contrasts. We demonstrate the
utility of mitch with case studies spanning multi-contrast RNA
expression profiling, integrative multi-omics, tool benchmarking and
single-cell RNA sequencing. Using simulated data we show that mitch has
similar accuracy to state of the art tools for single-contrast
enrichment analysis, and superior accuracy in identifying
multi-contrast enrichments.
Conclusion
mitch is a versatile tool for rapidly and accurately identifying and
visualising gene set enrichments in multi-contrast omics data. Mitch is
available from Bioconductor
([27]https://bioconductor.org/packages/mitch).
Keywords: Bioconductor package, Differential expression, Gene
regulation, Multi-omics, Single-cell profiling, Pathway analysis, Gene
set enrichment analysis, Multivariate statistics
Background
Functional enrichment analysis describes the various ways that
summarised omics data can be used to infer differential expression (DE)
of molecular pathways, or more broadly sets of genes that are
functionally linked [[28]1]. Enrichment analysis is increasingly being
applied to understand patterns of regulation in diseases and may be
useful in better classification of patients into subgroups that could
benefit from more specific treatments [[29]2]. Indeed, it is reported
that measurement of sets of genes rather than individual genes provides
a better ratio of signal to noise and more accurate patient
classification [[30]3]. Commonly, gene sets are curated to have similar
molecular or biological function, or be part of the same biochemical or
signaling pathway; but can also be derived from empirical omics and in
silico studies.
Most commonly used pathway enrichment analysis methods fall into three
categories; over-representation analysis (ORA), functional class
sorting (FCS) and pathway topology (PT) methods [[31]1, [32]4, [33]5].
Over-representation analysis involved the intersection of genes meeting
a prespecified significance and/or fold change threshold with a library
of gene sets. Statistically higher or lower enrichment is determined
with hypergeometric, Fisher exact or other test. Functional class
scoring is different because it uses all detected genes in the
calculation of pathway regulation, as it does not involve a
significance cutoff. There are several valid approaches to this, but
all involve scoring of genes by their differential expression, followed
by a statistical test to detect enrichment at the upper and lower
extremes of the range. PT methods are similar to FCS methods except
they take into consideration additional information about how the genes
within a set relate to one another. For example taking into account
that biological pathways contain both activators and inhibitors, or
that genes in a set are correlated or anticorrelated. PT methods are
limited in some cases by a lack of fine-grained pathway knowledge as
well as differences in pathway mechanisms in cell types under study
[[34]4]. Although PT methods are extremely useful, they are not a focus
of this study.
One of the first FCS tools to be described was Gene Set Enrichment
Analysis (GSEA). In “preranked” mode, this method summarises DE
findings (eg: fold change and/or p-value) into a single value and then
detects enrichment of gene set members at the extremes of this profile.
Statistical significance is established by permuting the profile,
quantifying how frequent the detected enrichment is in a randomised
profile [[35]6]. Pathway analysis research has since been focused on
improving the usability, accuracy and efficiency of tools that analyse
single omics data sets. For example, FCS tools (geneSetTest, Roast,
CAMERA) have been added to the Limma package, providing a GSEA-like
functionality entirely in the R/Bioconductor environment [[36]7].
CAMERA is able to estimate and correct for inter-gene correlation that
biases enrichment tests [[37]8]. SetRank adjusts for false positives
that arise from overlaps in gene sets [[38]9]. Algorithmic advances
included in the FGSEA package have allowed a ~ 50 fold increase in
permutation calculation speed in pre-ranked enrichment detection in
contrast to GSEA [[39]10] which will be important as gene set databases
continue to grow. It has also been shown that ensemble methods of
enrichment analysis yield higher accuracy than any individual method
alone [[40]11]. Furthermore, there is an emerging interest in tools
that calculate pathway expression in individual samples, allowing for
granular analysis of variability between samples in large groups (eg:
PLAGE, GSVA and ssGSEA) [[41]12–[42]14].
Databases such as Gene Expression Omnibus (GEO) are expanding rapidly
[[43]15], enabling comparison of many omics studies at the same time
given the appropriate analysis tools. As omics profiling techniques
continue to diversify and become more widely used, multi-omics studies
are becoming more common. For example, the share of multi-omics data
sets, called “Superseries” in GEO has increased, from only 4.6% of
series in 2005–2009 to 8.1% in Jan 2016 to Aug 2019. In addition,
single-cell profiling has grown explosively in the past 5 years thanks
to developments in droplet and nanowell technology, facilitating the
deconvolution of cell identities in development and in response to
stimuli and disease. These trends highlight a need for tools capable of
analysing high-dimensional omics data involving many contrasts,
profiling technologies and cell types.
The first approach described for multi-contrast FCS analysis is based
upon Hotelling’s T^2 statistic with two contrasts, and more generally
Multiple Analysis of Variance (MANOVA) when considering more than two
contrasts. With simulated data, MANOVA test compares favourably with
respect to sensitivity and specificity in contrast to other
multivariate tests available at the time [[44]16]. An alternative
approach, Multi Dimensional Gene Set Analysis (MD-GSA) was later
devised, and proposes the use of logistic regression for bidimensional
FCS analysis [[45]17]. Although originally intended to analyse multiple
contrasts on the same experimental platform, the MANOVA test is equally
applicable to pathway level integrative analysis of multi-omics data.
For example, joint FCS analysis of ranked proteome and transcriptome
data [[46]18]. A MANOVA based FCS test was implemented in the MAVTgsa R
package, however it is slow due to the use of a computationally
intensive permutation procedure, and lacks visualisation features key
to interpreting high-dimensional data [[47]19].
To overcome these limitations, we developed mitch, an R/Bioconductor
package that facilitates multi-contrast FCS analysis using a
rank-MANOVA approach. We demonstrate the utility of mitch in a variety
of use cases including enrichment analysis of multi-omics and single
cell transcriptomics. Using simulated data, we benchmark accuracy and
execution time of mitch.
Implementation
Overview
We provide a schematic diagram and example code to demonstrate a
typical mitch workflow from DE tables through to enrichment results
(Fig. [48]1). mitch consists of five functions; mitch_import,
gmt_import, mitch_calc, mitch_plots and mitch_report, which are
described in the sections below.
Fig. 1.
[49]Fig. 1
[50]Open in a new tab
Overview of the mitch workflow. The mitch package consists of five
functions (left). The mitch_import function recognises the format of
commonly used differential omics tools such as DESeq2, edgeR and limma
and performs ranking of each contrast, to create a multi contrast rank
table. If user’s would like to use a different ranking scheme,
mitch_import can be bypassed in favour of a custom ranking approach.
The gmt_import function reads gene matrix transposed files (GMT). The
mitch_calc function determines the degree of enrichment of each gene
set in the multi contrast table, yielding a mitch result object. The
mitch_report function produces a single HTML format report of results
containing several tables and charts. The mitch_plots function
generates high resolution PDF containing charts derived from the mitch
results. Example minimal mitch analysis code (right) to determine the
enrichment of gene sets obtained from a GMT file in two dimensions,
represented as two edgeR top tables
DE scoring and import
To facilitate rank based FCS analysis, DE results for each gene need to
be summarised into a single score. mitch_import has the ability to
import data from a range of upstream DE packages used in
transcriptomics, epigenetics and proteomics (Table [51]1) [[52]7,
[53]20–[54]43]. Where available, mitch uses the DE test statistic for
each gene/protein if available, otherwise calculating the directional
significance score (D) defined as:
[MATH:
D=−log10nominalp−value
×signlog2FC :MATH]
Table 1.
mitch can import profiling data generated by a wide range of upstream
tools
Target application Tool Reference Function Ranking metric
RNA-seq (and other applications of count based quantification) edgeR
[[55]20] topTable() “logFC” and “PValue”
DESeq2 [[56]21] results() “stat”
ABSSeq [[57]22] results() “foldChange” and “pvalue”
topConfects [[58]23]
edger_confects()
limma_confects()
“confect”
fishpond/Swish [[59]24] swish() “stat”
NOIseq [[60]25] noiseq() “ranking”
Ballgown [[61]26] stattest() “fc” and “pval”
TCC [[62]27] getResult() “m.value” and “p.value”
Sleuth [[63]28] sleuth_results() “b” and “pval”
Cufflinks [[64]29] cuffdiff “test_stat”
Expression microarray limma [[65]8] topTable() “t”
DEDS [[66]30] topgenes() “t”
scRNA-seq (and other applications of barcoded cell based count
quantification) Seurat [[67]31] FindMarkers() “avg_logFC” and “p_val”
Muscat [[68]32] pbDS() “logFC” and “p_val”
scde [[69]33] scde.expression.difference() “Z”
MAST [[70]34] zlm() “Coef” and “Pr(>Chisq)”
DEsingle [[71]35] DEtype() “foldchange” and “pvalue”
Methylation array missMethyl [[72]36] topTable() “t”
DMRcate [[73]37] extractRanges() “meanbetafc” and “Stouffer”
Differential proteomics DEP [[74]38] get_results() “ratio” and “p.val”
msmsTests [[75]39] msms.glm.pois(), msms.glm.qlll() or msms.edgeR()
“LogFC” and “p.value”
plgem [[76]40] plgem.deg() “PLGEM.STN” and “p.value”
SDAMS [[77]41] SDA() “beta” and “pv_2part”
DEqMS [[78]42] DEqMS “t”
Differential binding DiffBind [[79]43] dba.report() “Fold” and
“p.value”
[80]Open in a new tab
If a different upstream tool is used or if users prefer to use a
different DE scoring approach, mitch allows import of “prescored” data.
By default, only the genes that are detected in all contrasts are
included, but this behaviour can be modified to accommodate sparse
datasets such as single cell transcriptomics. During import, users may
specify a two-column table that relates gene identifiers in the DE
analysis to those in the gene sets. Genes are then ranked from most
up-regulated to most down-regulated in each contrast. Gene ranks are
centred around the midpoint for each contrast, where the test
statistic/directional significance score is zero.
Gene set definition
A gene set library for use with this implementation must be a named
list of character vectors. The gmt_import function reads gene matrix
transposed (GMT) system files and is based upon a function originally
written for the clusterProfiler package [[81]44] that is interoperable
with FGSEA [[82]10].
Multi-contrast enrichment analysis
The mitch_calc function performs the calculation of multidimensional
enrichment and post-hoc univariate enrichments. Only gene sets with 10
or more members present in all contrasts are included by default
although the minimum set size threshold can be altered as desired. The
base R manova() and summary.manova() functions are used to calculate
and report the probability that genes in a set show a multidimensional
enrichment as compared to genes not in the set using the
Pillai–Bartlett test statistic [[83]45]. The maximum number of
contrasts (dimensions) handled by this function is 69. If only one DE
profile is provided, then mitch will perform an ANOVA test using the
aov() function. The p-values are adjusted for multiple comparisons
using the false discovery rate (FDR) method of Benjamini and Hochberg
[[84]46]. Separately, the enrichment score (s) of each gene set is
calculated in each contrast as described previously [[85]18].
[MATH: s=2R1−R2/
n :MATH]
Where R[1] is the mean rank of genes in the set, R[2] is the mean rank
of genes not in the set and n is the number of genes overall. With two
or more contrasts, S is defined as the higher dimensional but
non-directional enrichment score and is calculated as the Pythagorean
distance from the origin.
On Unix based systems, these calculations are distributed on multiple
cores to take advantage of multi-threaded computers and save time. End
users can prioritise results in three ways; (i) based on statistical
significance (low p-value), (ii) effect size (large S) or (iii)
standard deviation (SD) of s values across contrasts. SD prioritisation
may be of use when searching for gene sets with discordant regulation.
End users may also select the number of gene sets for which detailed
reports are to be generated downstream; with a default of 50.
Visualisation of results
The mitch_plots function generates several plots in high resolution
PDF. The mitch_report function generates an HTML report with the same
outputs, but in a lower resolution to facilitate easy sharing of
results. These visualisation functions are limited to 20 or fewer
contrasts. Outputs contain scatterplots of DE scores derived from the
directional p-value method, filled contour plots of ranked profiles,
histogram of gene set sizes, scatter plot of effect size measured by S
distance and statistical significance measured as -log[10](FDR MANOVA),
and a pairs plot of s values for all gene sets. In addition, detailed
plots are generated for a specified number of gene sets according to
the prioritisation approach selected. These include pairwise filled
contour plots, pairwise scatter plots and violin plots of enrichments
in each contrast. These plots are generated with base R tools or
ggplot2 [[86]45, [87]47]. The HTML output is a self contained report
with results tables and charts. Some of these are interactive charts
and are generated using the echarts4r package [[88]48].
Methods
Case study 1: multi-contrast enrichment analysis of RNA-seq data
RNA-seq data from a previous study with GEO accession [89]GSE109140
[[90]49] were obtained via DEE2 [[91]50]. Transcript-level counts were
aggregated to gene level counts using the Tx2gene function of the
getDEE2 R package (obtained 2019-10-25). Genes with fewer than 10 reads
per sample were excluded from analysis. Two DE contrasts were
performed. In contrast 1, normal (5.5 mM) and high (20 mM) glucose were
compared. In contrast 2, the effect of 1.0 mM valproic acid (VPA) was
assessed in the high glucose condition. DE analysis was performed with
DESeq2 v1.22.2 and profiles were imported with mitch. Gene sets used in
this study were obtained from Reactome [[92]51]. These and all
subsequent numerical analyses were performed in R (v3.6.1) [[93]45].
In order to test whether mitch controls type I errors (false positives)
appropriately, three types of randomisation were performed. (i) Shuffle
the names of genes in the profile. This retains the correlation
structure of the profile dataset. (ii) Shuffle the profile data values.
This doesn’t preserve profile correlation structure. (iii) Create
random gene sets by sampling gene names from the profile. Gene sets
sizes are equal to those obtained from Reactome. The above were
repeated 1000 times with the set seed varied between 1 and 1000.
Case study 2: multi-omics enrichment analysis
Datasets corresponding to A549 (adenocarcinomic human alveolar basal
epithelial cell) with and without exposure to 1 h 100 nM dexamethasone
were selected to showcase the application of mitch to multi-omics data
(listed in Supplementary Table [94]1) [[95]52]. ChIP-seq and ATAC-seq
alignment files in BAM format were downloaded from the ENCODE website.
FeatureCounts v1.6.4 [[96]53] was used to count reads mapped to regions
within 1 kbp of transcriptional start sites. These coordinates were
generated using GTFtools [[97]54] from GENCODE v29 annotations
[[98]55]. RNA-seq gene expression counts were downloaded from the
ENCODE web site. ChIP-seq, ATAC-seq and RNA-seq underwent differential
analysis with DESeq2 v1.22.2 after excluding genes with fewer than 10
reads per sample on average across each experiment. Data were imported
with mitch and enrichment analysis was performed with Reactome gene
sets as above.
Case study 3: comparing enrichment results downstream of different DE tools
RNA-seq data from a previous study with GEO accession [99]GSE93236
[[100]56] were obtained via DEE2. Transcript-level counts were
aggregated to gene level counts as above. Non-target control and Set7
knock down datasets were compared using different DE tools; DESeq2
(v1.22.2), edgeR glmLRT and QL (v3.24.3), voom-limma (v3.38.3) and
ABSSeq (v1.36.0). mitch was used for enrichment analysis using Reactome
gene sets. UpSetR package v1.3.3 was used to intersect gene sets that
were FDR < 0.05 in each DE tool profile [[101]57]. Pairwise
correlation, heatmap, violin and bar charts were generated in R.
Case study 4: enrichment analysis of single cell sequencing data
Single cell RNA-seq expression data derived from peripheral blood
mononuclear cells exposed to interferon beta or vehicle control
[[102]58] were obtained, preprocessed and underwent differential state
analysis using the “pseudobulk” method as described in the Muscat
v0.99.9 vignette [[103]32]. Spearman correlation (ρ) of DE values are
presented as a heatmap. Mitch was performed with Reactome gene sets and
sets with FDR MANOVA < 0.05 were prioritized based on significance,
magnitude of S and SD of s.
Accuracy of single and dual contrast enrichment detection
In order to establish the accuracy of mitch in comparison to other
tools for enrichment analysis, we used a simulated RNA-seq data
approach. A human RNA-seq data set with accession number ERR2539161
with 367 M reads assigned to genes was downloaded from DEE2. We
simulated a typical RNA-seq experiment with a control/case design with
3 replicates. The starting dataset was downsampled repeatedly to 10 M,
40 M and 100 M reads followed by multiplication by a random noise
factor. Noise factors were randomly generated by sampling with the
rnorm function with a mean of 2 and a set SD between 0 and 0.6 followed
by log[2] transformation. For enrichment analysis testing, 1000 gene
sets were created by randomly sampling 50 gene names. In each
simulation 25 randomly selected gene sets were set to be upregulated,
and another 25 were set to be downregulated. Fold changes of 2
and − 0.5 were incorporated into the ‘case’ profiles by multiplication
of the fold change with the gene counts. If a gene was selected to be
both up and downregulated, then no fold change was incorporated. Count
matrices then underwent differential analysis with DESeq2 and
downstream enrichment analysis with hypergeometric test (phyper, base R
v3.6.1), FGSEA (v1.11.1) and mitch. For hypergeometric test, genes with
DESeq2 FDR values < 0.05 were included in over-representation analysis.
For FGSEA and mitch, the DESeq2 test statistic was used for ranking. In
FGSEA, 1000 permutations were performed. Gene sets with FDR values
< 0.05 were considered significant and contributed to the calculation
of precision, recall and F1 score. Mean results are shown after 500
simulations.
For dual contrast enrichment, data were simulated as above, one control
group and two case groups were created to generate two contrasts.
FGSEA, mitch and MD-GSA (v1.18.0) were compared.
Accuracy in detecting multi-contrast enrichment
A random differential expression profile in five dimensions (contrasts)
was simulated by repeatedly shuffling ranks for the hyperglycemia data.
A library of 1000 gene sets, each with 50 members was created as above
and 20 of these gene sets were selected for differential expression.
The ranks of the gene set members were shifted based upon a
prespecified s value using the equation.
[MATH: R2−R1=n×s/2
:MATH]
Values for s in the five contrasts were generated from a normal
distribution with a mean of zero and SD varied between 0 and 0.25. This
equates to mean absolute values of s between 0 and 0.2. After mitch
analysis, a 5% FDR threshold was applied to calculate precision, recall
and F1 score. The simulation was repeated 1000 times for each value of
SD.
Execution time
A typical mitch analysis was defined as consisting of a profiling of
20,000 genes with a gene set library of 1000 sets, and each set
consisting of 50 members. Gene names, data points and gene sets were
randomly generated. Execution time was measured on a 3.8 GHz AMD Ryzen
Threadripper 1900 × 8-core (16 thread) processor with 64 GB RAM running
Ubuntu 18.04 and R v3.6.1. For comparison, FGSEA was run with 1000 or
2000 permutations. MAVTgsa v1.3 was run with 1000 permutations and
MD-GSA using default parameters. Parameters including number of
contrasts, number of genes in the profile, number of gene sets and size
of gene sets were also varied to determine impact on mitch execution
time.
Results
Case study 1: multi-contrast enrichment analysis of RNA-seq data
A common use case for mitch is the multi-contrast gene set enrichment
analysis of transcriptome data. To demonstrate this, we applied mitch
to RNA-seq data initially described by Felisbino et al. [[104]49],
consisting of two contrasts; (i) low glucose (LG) versus high glucose
(HG); and (ii) HG versus HG with valproic acid (HGVPA). The goal of
this study was to identify individual genes and Reactome gene sets
dysregulated due to high glucose stimulation in hepatocytes and
attenuated with VPA, a clinically prescribed histone deacetylase
inhibitor.
There were 15,240 genes with detectable expression in both contrasts.
DE scoring with the Wald test statistic provided by DESeq revealed gene
expression differences were larger in response to VPA (y-axis) as
compared to HG (x-axis) (Fig. [105]2a), although genes were evenly
distributed in all four quadrants (Fig. [106]2b) and the contrasts were
not strongly correlated (Spearman’s ρ = 0.010). After exclusion of 967
gene sets with fewer than 10 detected members, 1296 gene sets underwent
multi-contrast enrichment analysis with mitch. From the 1296 Reactome
gene sets considered, 561 gene sets received FDR MANOVA< 0.05 (Fig.
[107]2c). There were 372 sets with FDR MANOVA< 0.01. A plot of effect
size versus statistical significance for each gene set (Fig. [108]2d)
demonstrates the three gene sets with the greatest effect size, while
satisfying FDR MANOVA< 0.05, are not highly ranked when prioritising
solely on statistical significance. Bidimensional enrichment plots for
the top three gene sets based on statistical significance and effect
size further show differences in types of associations identified (Fig.
[109]2e). When prioritising by statistical significance, top gene sets
are likely to be larger (contain more genes) but more dispersed; while
prioritisation by effect size emphasizes smaller gene sets with larger
magnitude changes.
Fig. 2.
[110]Fig. 2
[111]Open in a new tab
Multi-contrast enrichment analysis of RNA-seq with mitch. a The scoring
metric, D, of every gene in the two contrasts, LG vs HG and HG vs
HGVPA. b A filled contour plot of all genes after ranking. c Enrichment
of Reactome gene sets in the two dimensional space. d Plot of gene set
effect size (S) and significance. S is defined as the Pythagorean
distance from the origin to each point in (c). Significance is measured
as the -log[10](FDR MANOVA). e Density plots for the three top
significant gene sets (blue box) and three gene sets with largest
effect size (red box)
To demonstrate appropriate control of false positives, three
randomisation procedures were performed on bidimensional profiling data
shown in Fig. [112]2. Shuffling gene names 1000 times resulted in an
average of 0.141 and 0.024 false positives per run at FDR < 0.05 and
FDR < 0.01 respectively (Fig. [113]3a and b). Shuffling the profile
data resulted in an average of 0.213 and 0.038 false positives per run
at FDR < 0.05 and FDR < 0.01 respectively (Fig. [114]3c and d).
Randomisation of gene sets resulted in an average of 0.028 and 0.007
false positives per run at FDR < 0.05 and FDR < 0.01 respectively (Fig.
[115]3e and f). Randomisation shows mitch appropriately controls for
false positives.
Fig. 3.
[116]Fig. 3
[117]Open in a new tab
Data randomisation demonstrates robust control of false positives. Data
shown in Fig. [118]2 underwent three types of randomisation. (a and b)
Results of shuffling of gene names in the profile followed by mitch
analysis with an FDR threshold of 0.05 and 0.01. c and d Results of
shuffling profile data points, two contrasts shuffled independently. e
and f Randomisation of gene sets by sampling gene names from the
profile. All procedures were repeated 1000 times
Case study 2: multi-omic enrichment analysis
Another common use case for mitch is in enrichment analysis of
multi-omics data. Previously, the ENCODE consortium have performed
multi-omics profiling of dexamethasone (an anti-inflammatory
corticosteroid drug) on adenocarcinomic human alveolar basal epithelial
cell line A549 [[119]52]. We obtained RNA-seq, ATAC-seq and promoter
based ChIP-seq for CTCF, H3K4me3, NR3C1 and POL2RA profiling data for
dexamethasone treated and control samples (datasets listed in
Supplementary Table [120]1), followed by differential analysis and then
mitch. We found that overall, promoter based NR3C1 occupancy was most
positively correlated with POL2RA and negatively correlated with CTCF
occupancy. As expected, RNA expression differences were positively
correlated with NR3C1, H3K4me3, POL2RA and ATAC-seq signal
(Fig. [121]4a). Selected gene sets with the largest effect size (S)
include peptide chain elongation, adenylate cyclase inhibition and
common pathway of clot formation, while the gene sets with the smallest
FDR adjusted p-values included metabolism of RNA, translation and
infectious disease (Fig. [122]4b). Adenylate cyclase inhibition genes
were associated with increased occupancy of CTCF and chromatin
accessibility (inferred from ATAC-seq), but lower RNA expression,
H3K4me3 and NR3C1 occupancy (Fig. [123]4c). Common pathway of fibrin
clot formation genes were elevated in H3K4me3, NR3C1, and to a lesser
extent in RNA expression. Metabolism of RNA and translation were
elevated in POL2RA occupancy and RNA expression (Fig. [124]4d).
Dysregulation of the adenylate cyclase, clot formation pathway and
effect on protein synthesis are consistent with the known effects of
glucocorticoids [[125]59–[126]62]. This highlights the ability of mitch
to identify enrichments in multi-omics datasets, and recover known
biology.
Fig. 4.
[127]Fig. 4
[128]Open in a new tab
Multi-omics analysis of A549 cells treated with dexamethasone. a
Pairwise filled contour plots of ranked profiles shows underlying
correlations. b Plot of gene set effect size and significance. c
Example gene sets that have large effect sizes. d Example gene sets
with small FDR values. For c and d, grey areas denote the distribution
of ranks of all detected genes, with median and quartiles depicted by
the wide boxplot. Distribution of gene set members is shown by the
black violin, with median and interquartile ranges given by the narrow
boxplot
Case study 3: comparing enrichment results downstream of different DE tools
When benchmarking an RNA-seq bioinformatic pipeline it is useful to
compare the gene set level results of a single RNA-seq contrast
analysed with different DE tools, to determine what effect tool
selection has on final results. To demonstrate this, bulk RNA-seq data
corresponding to Set7 knock-down and non-target control samples
[[129]56] was processed using DESeq2, edgeR glmLRT, edgeR QL, ABSSeq
and voom-limma followed by mitch analysis with Reactome gene sets.
After DE analysis, there were variable numbers of DE genes at the 5%
FDR cutoff (DESeq2: 5150, edgeR glmLRT: 5721, edgeR QL: 5910,
voom-limma: 5903 and ABSSeq: 2253). After mitch analysis also with a 5%
FDR cutoff, there were variable numbers of differentially regulated
gene sets, with ABSseq showing the fewest (Fig. [130]5a). Only 56 gene
sets were common to all DE tools, but the majority (108) were common to
all except ABSSeq. A pairs plot of gene set s values comparing data
from each tool shows the results of DESeq2, edgeR glmLRT, edgeR QL and
voom-limma are virtually identical (Pearson r > 0.99), while results
from ABSSeq are somewhat different (Pearson r ~ 0.95) (Fig. [131]5b).
Sorting gene sets by SD of s values reveals several gene sets that
exhibit stronger downregulation in DESeq2, edgeR glmLRT, edgeR QL and
voom-limma as compared to ABSSeq (Fig. [132]5c). The peptide chain
elongation gene set is a prime example, where the majority of genes are
downregulated when analysed with DESeq2, edgeR glmLRT, edgeR QL and
voom-limma, but appear unchanged when analysed with ABSSeq (Fig.
[133]5d). This difference in collective regulation by ABSSeq is clear
when the gene set is visualised as a violin plot (Fig. [134]5e). As a
consequence, the statistical significance of this gene set is lower for
ABSSeq compared to the other DE tools (Fig. [135]5f). These results are
generally consistent with previous findings that show ABSSeq is more
conservative than other differential RNA-seq tools [[136]22]. This
result highlights that choice of DE tools subtly impacts enrichment
results and these can be explored using mitch.
Fig. 5.
[137]Fig. 5
[138]Open in a new tab
Comparison of gene set enrichment enrichment results downstream of
different DE tools on the same RNA-seq dataset. a UpSet plot of gene
sets produced by different DE tools FDR < 0.05, as calculated by
unidimensional mitch. b Pairs plot of mitch s values for each gene set
as processed by different DE tools. Upper triangle shows Pearson’s r. c
A heatmap of 10 gene sets with the highest SD of s values across
different DE tools, scaled by row. d Heatmap of individual gene members
of the peptide chain elongation gene set, scaled by row. e Violin plot
of enrichments of the peptide chain elongation gene set. f Observed
nominal ANOVA p-value of the peptide chain elongation gene set after
analysis with different DE tools
Case study 4: enrichment analysis of single cell sequencing data
Single cell RNA sequencing (scRNA-seq) allows the parallel profiling of
hundreds to thousands of individual cells in a sample. As in standard
bulk RNA-seq, contrasts between experimental conditions can be made,
with the major difference that scRNA-seq provides information on cell
identity (also known as “cell type”). Generally, scRNA-seq DE tools
provide either test statistic or fold change and p-value information
for each gene of each cell identity. Here, mitch can be applied to
perform enrichment analysis by considering the DE profiles of each cell
identity as an independent contrast. In order to demonstrate this,
scRNA-seq data derived from peripheral blood mononuclear cells exposed
to interferon beta or vehicle control as described by Kang et al
[[139]58] underwent clustering and differential analysis with Muscat
[[140]30] to yield “pseudobulk” DE tables for each cell identity. After
scoring DE values, correlation analysis identified Spearman’s ρ between
0.23 and 0.57 between cell identities, with lymphocytes grouped
together, dendritic cells grouped with monocytes, and megakaryocytes
appearing as an outgroup (Fig. [141]6a).
Fig. 6.
[142]Fig. 6
[143]Open in a new tab
Applying mitch to perform enrichment analysis of scRNA-seq data. a
Heatmap of Spearman correlation values for DE scores for each cell
identity. b Heatmaps of mitch s scores for top 25 Reactome gene sets
after prioritisation by significance, magnitude of S and SD of s. All
gene sets shown are FDR MANOVA < 0.05. c Example enrichment plots for
three gene sets with high SD of s values
Again, mitch analysis was performed with Reactome gene sets. Of the
2263 gene sets present, 1629 were excluded due to the detection of
fewer than 10 members. Of the 607 sets remaining, 77 were
differentially regulated (FDR MANOVA< 0.05). Next, we prioritised the
results three ways; (i) significance, (ii) magnitude of S, and (iii) SD
of s values (Fig. [144]6b). When prioritising by significance,
interferon signaling was observed to be upregulated as expected, but
there were many housekeeping gene sets that were observed to be
downregulated, including TCA cycle, rRNA processing and translation.
When prioritising by magnitude of S, there was a larger number of
upregulated gene sets involved in immune responses observed. In
general, these gene sets demonstrated coordinated regulation in
response to interferon beta stimulation that was consistent between
cell identities. The value of scRNA-seq over bulk is the ability to
detect cell identities responding differently to a stimulus, so it may
be useful to prioritise by the observed SD of s across cell identities.
Using this approach, we identified several gene sets with discordant
cell identity responses to interferon beta, that would be impossible to
detect with bulk sequencing (Fig. [145]6c). For example “RNA polymerase
III chain elongation” was downregulated in monocytes specifically,
“Attenuation phase” was upregulated in monocytes and B cells but not in
megakaryocytes, and “Sema4D in semaphorin signaling” was downregulated
in megakaryocytes specifically. This result highlights the utility of
mitch in analysing single cell profiling data and the impact of
different prioritisation schemes.
Accuracy of single and dual contrast enrichment detection
To test the accuracy of mitch to detect single-contrast enrichments, we
undertook a simulation study. Our goal was to determine the performance
of mitch and other enrichment tests (FGSEA and hypergeometric test)
over a range of typical RNA-seq conditions by varying the sequencing
depth and degree of inter-sample variation. We simulated DE to 5% of
randomly generated gene sets with equal numbers of sets up and
down-regulated (see Methods). Members of those gene sets were given
log[2] fold changes of 1 and − 1 to simulate expression changes. Count
matrices underwent DE analysis and gene set enrichment testing with a
5% FDR threshold to calculate precision and recall of these tools,
calculated by comparing ground truth values to the observed results
(Fig. [146]7a). As expected, DE results from DESeq2 yielded smaller
p-values when sequencing depth was greater and inter-sample variation
was smaller. This resulted in overall better precision and recall of
gene set enrichment results in simulations involving greater sequencing
depth and smaller inter-sample variation. In tests with low variance,
the hypergeometric test was the most precise, however higher variance
caused a severe reduction in recall. In contrast, FGSEA and mitch were
more robust to higher variance especially with higher sequencing depth.
When variance was low, the accuracy of mitch was similar to FGSEA, but
with higher variance, mitch showed superior recall. Potentially,
FGSEA’s recall could be improved by using a greater number of
permutations.
Fig. 7.
[147]Fig. 7
[148]Open in a new tab
Precision, recall and F1 values for enrichment analysis of simulated
RNA-seq datasets. Simulations are based on n = 3 control and case
replicates with different sequencing depth (10, 40 and 100 million
reads) and different degrees of added variation (SD). The mean of 500
simulations is shown. a Evaluation of single contrast enrichment. b
Evaluation of dual contrast enrichment
Next, we applied this approach to the problem of identifying
enrichments in two contrasts. We planned to compare the accuracy of
FGSEA (run twice), with mitch, mdgsa and MAVTgsa, but thoroughly
evaluating MAVTgsa was impractical due to the long computational time
(Fig. [149]9b). We found that FGSEA recall was lower than expected, but
this was improved by increasing the number of permutations to 2000.
Averaged over the 9 different conditions, mitch demonstrated the
highest precision (0.956), recall (0.994) and F1 score (0.974) as
compared to the other tools (Fig. [150]7b). These results demonstrate
that mitch has slightly better accuracy than existing tools for single
and dual contrast enrichment analysis.
Fig. 9.
[151]Fig. 9
[152]Open in a new tab
Benchmarking mitch execution time. a single-contrast enrichment
analysis with mitch and FGSEA (1000 permutations) on up to 16 CPU
threads. b Comparison of mitch, MAVTgsa, MD-GSA and FGSEA execution
times with two to five contrasts. c Effect of number of contrasts and
genes in the profiling data. (D) Effect of gene set size and number of
sets
Accuracy of multi-contrast gene set enrichment detection
Next, we sought to quantify the accuracy of mitch in distinguishing DE
gene sets in a simulated five-contrast dataset. Member genes of 20 gene
sets were shifted by precomputed s values sampled from a range of
values with a mean of zero and SD varied from 0 to 0.25 (Methods).
After mitch analysis with a 5 and 1% FDR cutoff, precision, recall and
F1-score were calculated (Fig. [153]8). When SD = 0, ie in completely
random data, no false positive DE gene sets were found after 1000
replications. As expected, recall increased with larger SD values.
False positives showed a non linear relationship with SD. In the 5% FDR
trial, precision showed a minimum of 86% when SD values were set to
0.05, however at a more strict 1% FDR cutoff, the minimum precision
value was 95%. F1 scores indicated high accuracy with SD values above
0.15, which corresponds to mean absolute s values of 0.08 or higher. To
put this into context of a real dataset, from the 318 gene sets in case
study 2 with FDR < 0.05, there were 12.6% with mean absolute s values
lower than 0.08 (Supplementary Figure [154]1). For trials with SD set
to ≥0.15, F1 scores were 0.9991 or higher. These findings support the
accuracy of mitch in identifying DE gene sets in multidimensional data.
Fig. 8.
[155]Fig. 8
[156]Open in a new tab
mitch accuracy in distinguishing DE gene sets in a simulated
five-contrast dataset. The x-axis shows the set SD for sampling s
values for each gene set for each contrast. x-axis can be interpreted
as increasing enrichment strength. Values shown are the mean of 1000
replications
Execution time
As execution time is a consideration in big data applications, we
measured the execution times of mitch in typical applications.
Initially, single-contrast enrichment analysis was tested. A random
profile consisting of 20,000 genes and queried with a gene set library
of 1000 sets, and each set consisting of 50 members. On a single CPU
thread, single-contrast mitch analysis was completed in 10 s; but using
8 threads this was reduced to 2.5 s. Using an additional 8 threads did
not speed up execution further (Fig. [157]9a). FGSEA which is known for
its speed, completed the analysis in 0.51 s using the default 1000
permutations with a single thread. Next, the speed of mitch was
compared to MAVTgsa, MD-GSA and FGSEA for dual contrast enrichment. We
found mitch was 16–22 times faster than MD-GSA and 2000–2500 times
faster than MAVTgsa, although not as fast as running FGSEA twice (Fig.
[158]9b). Next, the effect of increasing the number of genes profiled
and the number of contrasts on mitch execution time was assessed. The
number of genes in the profiling data had a linear effect on mitch
execution time, but adding extra contrasts gave a sub-linear increase
in execution time (Fig. [159]9c). When the number and size of gene sets
was manipulated, we found the number of gene sets gave a linear
increase in mitch execution time whereas an increase in the size of
gene sets gave a sub-linear increase in execution time (Fig. [160]9d).
This result indicates that although mitch is slower than FGSEA for
single contrast analysis, mitch enables large-scale enrichment analyses
within a reasonable time.
Discussion
Previously, we have used the concept of rank MANOVA enrichment and
visualisation in several studies of cardiovascular disease,
pharmacology, aging and neurological disease (eg: [[161]63, [162]64]),
but only recently has the software become available as a package for
wider use. In the process of packaging the software, we have added
additional features that will enhance its utility. We have made mitch
interoperable with many popular upstream analysis tools, especially
those from the Bioconductor community [[163]65]. We have made use of
the many and varied visualisation features available in the R
environment including filled contour plots, heatmaps, violin plots and
taken advantage of interactive charts made possible with HTML embedded
JavaScript bindings provided by the echarts4r package [[164]48].
Although mitch was initially developed to compare gene expression
signatures in a multi-contrast RNA-seq experiment, it has applications
beyond this. Mitch is ideally suited to multi-omics data, as
demonstrated in case study 2 above that takes advantage of ENCODE
profiling data to identify pathway-level regulatory events associated
with dexamethasone exposure. In case study 3, we evaluated the impact
of DE tool selection on enrichment results, but this approach could
equally be applied to choices of other upstream data processing steps
such as choice of mapping, quality control and normalisation methods.
Perhaps the most exciting application for mitch is in the burgeoning
field of single cell biology as in case study 4. After data clustering
by cell identity and differential state analysis, this type of data can
undergo set enrichment analysis. Although unidimensional enrichment
tools such as GSEA are already being applied to scRNA-seq data, there
are some limitations. The MANOVA approach of mitch is better able to
detect enrichments that are subtle but consistent across profiles.
Moreover mitch natively summarises the results of its multi-contrast
analysis, which means less work for the end user. The different
prioritisation modes allows users to focus on findings that are
statistically robust, associated with large effect sizes or discordant
among cell identities.
Single contrast enrichment simulations show that mitch is as accurate
as FGSEA, and that both these methods have better performance over a
wider range of input data than the hypergeometric over-representation
test (Fig. [165]7a); similar findings have been noted previously
[[166]66]. Randomisation analysis of dual-contrast data shows that
mitch yields very few false positives (Fig. [167]3). In dual-contrast
analysis, mitch accuracy is superior to MD-GSA and FGSEA, although
FGSEA accuracy could potentially be improved by using more permutations
(Fig. [168]7b).
While in this paper we have limited our analyses here to human
biological pathways curated by Reactome, mitch is capable of using gene
sets from any source and organism for which such sets are available
(eg: [[169]67]). In summary, the functionality provided by mitch makes
it a versatile and powerful tool for rapidly distilling pathway level
information from large omics datasets.
Availability and requirements
Project name: mitch.
Project home page: [170]http://bioconductor.org/packages/mitch
Operating system(s): Linux, MacOS and Windows.
Programming language: R.
Other requirements: R v4.0, Bioconductor 3.11.
License: Creative Commons Attribution-ShareAlike 4.0 International
Public License.
Any restrictions to use by non-academics: None.
Supplementary information
[171]12864_2020_6856_MOESM1_ESM.pdf^ (112.2KB, pdf)
Additional file 1: Supplementary Figure 1. Mean absolute s scores in
case study 2. Results of the multi-omics analysis shown in Fig. [172]4
underwent filtering based on a 5% FDR threshold before calculating the
mean absolute s values in the six-dimensional analysis (left). Next,
the number of gene sets with mean absolute s values greater or lower
than 0.08 was counted (right) as below this level, mitch precision is
lower than expected.
[173]12864_2020_6856_MOESM2_ESM.docx^ (16KB, docx)
Additional file 2: Supplementary Table 1. Multi-omics data derived from
control and dexamethasone treated A549 cells obtained from the ENCODE
Project web page.
Acknowledgements