Abstract
Background
Pathway enrichment analysis is a useful tool in biology and
biomedicine because it screens well-defined biological processes
rather than individual molecules. The key issue when applying
enrichment analysis to a pathway is how to measure its malfunction
under a phenotype change, e.g., from normal to diseased. Conventional
analysis focuses on differentially expressed genes (DEGs), which
assumes that the samples are of high purity. However, disease samples
are usually heterogeneous, so genes with differential expression
variance (DEVGs) are becoming attractive and important indicators of
the specific state of a biological system. In the context of
differential expression variance, measuring the enrichment or status
of a pathway remains a challenge. To address this issue, we propose
Integrative Enrichment Analysis (IEA), which is based on a novel
enrichment measurement.
Results
The main strength of IEA is that it identifies dysregulated pathways
containing both DEGs and DEVGs, which are usually under-scored by
other methods. In addition, IEA provides two auxiliary approaches for
investigating such dysregulated pathways. One infers the associations
between the identified dysregulated pathways and expected target
pathways by estimating pathway crosstalks. The other recognizes
subtype-factors, i.e., dysregulated pathways associated with
particular clinical indices, according to the DEVGs’ relative
expressions rather than conventional raw expressions. Based on a
previously established evaluation scheme, we found that, in particular
cohorts (i.e., groups of real gene expression datasets from human
patients), a few target disease pathways are ranked significantly
higher by IEA than by other state-of-the-art methods. Furthermore, we
present a proof-of-concept study on diabetes showing that: IEA, rather
than conventional ORA or GSEA, can capture the under-estimated
dysregulated pathways full of DEVGs and DEGs; these newly identified
pathways can be significantly linked to prior-known disease pathways
by estimated crosstalks; and many candidate subtype-factors recognized
by IEA are also significantly related to the risk of subtypes of
genotype-phenotype associations.
Conclusions
In summary, IEA supplies a new tool for enrichment analysis in the
complicated context of clinical applications (i.e., disease
heterogeneity), as a necessary complement to, and cooperative partner
of, conventional approaches.
Electronic supplementary material
The online version of this article (doi:10.1186/s12864-015-2188-7)
contains supplementary material, which is available to authorized
users.
Background
As a computational approach based on prior knowledge, pathway
enrichment analysis is widely used in the study of genotype-phenotype
associations [1]. A biological pathway, as a set of interacting genes
(and a few of their interactions with biomolecules), produces a
particular cellular response or outcome by executing a series of
functional cascades. Pathways are curated by experts from a wide range
of scientific fields [2, 3] and can therefore supply more credible
functional details than general GO modules or network modules. Unlike
network modules, which explore unknown or indeterminate functions,
pathway-centered analysis aims to capture the alteration of
established functions (e.g., KEGG pathways [2, 3]) during a change of
phenotype (e.g., from normal to diseased). As a key approach of
pathway-centered analysis, pathway enrichment analysis, or the
well-known gene set enrichment analysis (GSEA) [1], can identify a
dysregulated pathway by qualitatively measuring the changed status of
the pathway [4].
In pathway enrichment analysis, the dysregulation of a pathway is the
most important issue [5], and it should be mathematically well defined
and measured [6]. Such a measurement estimates the conditional
enrichment or status of a pathway, which is assumed to be associated
with particular phenotypes. Current research generally uses genes with
significantly differential expression or differential correlation to
evaluate the extent of dysregulation of a pathway. One kind of
conventional method evaluates the dysfunction of pathways under
different conditions [7–9], such as FiDePa (Finding Deregulated Paths
Algorithm) [10], SPIA (Signaling Pathway Impact Analysis) [11] and
iPEAP (Integrative Pathway Enrichment Analysis Platform) [12]. The
other kind uses pathways to characterize individual samples [13, 14],
like CORGs [15] and Pathifier [16]. In general, all these methods
focus on genes with differential expression and their enrichment in
pathways (i.e., analysis in the context of differential expression)
[17, 18], which assumes that the samples are of good purity in a
genotype-phenotype association study. However, in the study of
complicated phenotypes, e.g., cancer, a relevant problem is that
samples with the same disease phenotype might comprise different
unknown subtypes due to disease heterogeneity [19]. It is therefore
necessary to detect genes with new features observable in complicated
disease samples, and to enhance pathway enrichment analysis so that it
is applicable in such previously unexpected situations [20].
Indeed, recent studies have extracted new expression features, e.g.,
genes with differential expression variance [21, 22]. In the context
of differential expression variance, measuring the enrichment or
status of a pathway remains a challenge. A solution to this problem
can improve the efficiency of pathway enrichment analysis for
genotype-phenotype association because it considers more complete
information about the expression changes of pathway genes. It can also
provide new insights into biological pathways by integrating
additional expression and network features. In this work, we propose a
multiple-label based enrichment analysis to detect such dysregulated
pathways, which simultaneously takes into account genes with
differential expression (one label, DEGs) and genes with differential
expression variance (the other label, DEVGs) (Fig. 1).
Fig. 1.
Major differences between the measurements of dysregulated pathways
used in conventional enrichment analysis and integrative enrichment
analysis (IEA)
The hypothesis underlying IEA is that the dysregulated pathways
involved in disease heterogeneity are full of DEGs and/or DEVGs. This
means that the pathways identified by IEA would be disease pathways or
their upstream/downstream pathways (e.g., heterogeneity-relevant
pathways or subtype-relevant pathways). However, current methods of
pathway enrichment analysis are only expected to give high ranks to
disease pathways (e.g., target pathways in method evaluation). When
IEA identifies upstream/downstream pathways of disease pathways, it
additionally supplies a network of pathways to recover a global
functional map and to infer the associations between disease pathways
and subtype-relevant pathways. Note that an edge in such a network of
pathways represents a pathway crosstalk, which is an important
biological mechanism or functional relationship among pathways
[23–26]. Conventional studies tend to determine a pathway crosstalk
simply by the genes shared by two pathways [27], which disregards the
statistical significance of the genes and interactions involved in the
crosstalk. By contrast, the DEGs and DEVGs in one pathway can be used
as seeds, and their interacting genes in the candidate crosstalking
pathways can then be detected by a random walk with restart algorithm
[28]. The significance of a pathway crosstalk can finally be evaluated
by the enrichment, in the two pathways, of the genes involved in this
crosstalk (i.e., the proposed multiple-label based enrichment).
Based on the above concepts and mathematical models, a new
pathway-centered analysis framework, integrative enrichment analysis
(IEA), is implemented as: (i) a pathway enrichment score calculated by
the hypergeometric test on differential genes (DEGs and DEVGs); (ii)
pathway crosstalks ranked by random walk and the hypergeometric test
on rewired molecular networks; and (iii) pathway-phenotype
associations and subtype-factors determined by the DEVGs in pathways.
According to a previously established evaluation scheme [29], we found
that, in particular cohorts (i.e., groups of real gene expression
datasets from human patients), a few target disease pathways are
ranked significantly high by IEA, which supplies evidence of
deviation-based disease characteristics (i.e., disease subtypes), and
IEA is more effective than other state-of-the-art methods under this
condition. Furthermore, in a proof-of-concept study, we show the
details of IEA in analyzing real transcriptional data related to
complex diseases, e.g., diabetes and colorectal cancer. IEA indeed
captures previously under-estimated pathways full of DEVGs and DEGs.
These newly identified dysregulated pathways would be
heterogeneity-relevant pathways and are found to be significantly
linked to disease pathways (i.e., target pathways in conventional
analysis) by estimated crosstalks. Many candidate subtype-factors are
also recognized as DEVGs or pathways associated with the risk of
subtypes of genotype-phenotype associations. In summary, IEA supplies
a new over-representation approach [30] for enrichment analysis in the
complicated context of clinical applications (i.e., differential
expression and differential expression variance). It could easily be
extended to functional class scoring or pathway topology based
approaches [31–34], and will be a necessary complement to, and
cooperative partner of, conventional approaches [35].
The Matlab scripts of the software, named IEApackage, and some
alternative R scripts have been deposited in GitHub and can be
accessed at
https://github.com/bluesky2009/integrative-enrichment-analysis.
The software has been developed and tested on Windows 7 or Windows 8
with Matlab 2010 or Matlab 2012.
Methods
In general, enrichment analysis includes three categories of methods:
over-representation approaches, functional class scoring and pathway
topology based approaches. Although these methods all aim to evaluate
phenotype-associated pathways, they are based on different hypotheses.
This work and the proof-of-concept study are based on the
over-representation approach, which measures the extent of
dysregulation of a pathway according to the number of dysregulated
genes in the pathway. Traditional methods evaluate only the DEGs in a
pathway; by contrast, IEA evaluates both the DEGs and the DEVGs. Thus,
the integrated statistic of IEA is meant to measure the extent of
dysregulation of a pathway as completely as possible according to the
number of dysregulated genes (DEGs and DEVGs) in the pathway, which
are defined and introduced as follows.
Differential gene expression and differential expression variance
Given that a gene x has expression profiles X and X’ in control and
case samples respectively, the expression variances of this gene under
the control and case conditions are E((X-u)^2) and E((X’-u’)^2)
respectively. Here, u and u’ are the average expressions of gene x in
control and case samples respectively. The conventional criterion for
genes with differential expression (named DEGs) is then:
$$H_0: E(X) = E(X');\quad H_0\ \text{rejected} \tag{1}$$
where X and X’ are the original (raw) expression levels. Note that
differential expression includes up-regulation (the expression of a
gene in case samples is higher than in control samples) and
down-regulation (the expression of a gene in case samples is lower
than in control samples).
Besides these DEGs (e.g., genes for which the null hypothesis is
rejected by Student’s t-test), genes with differential expression
variance are also discriminative features [21, 36]. Variance-related
expression features, e.g., bimodal gene expression, are already known
as important expression patterns in the control of transitions of
biological systems [37], such as disease development, cellular
differentiation and phase transition. However, to the best of our
knowledge, the differential expression variance of genes has not been
studied in a systematic way, especially regarding its use in pathway
enrichment analysis. Differential expression, as used in conventional
enrichment analysis, requires a gene’s expression under different
conditions to be distributed around different mean expression levels
(see formula 1 above). By contrast, genes with differential expression
variance (named DEVGs) can be defined as genes whose deviations differ
significantly between conditions (a deviation is the distance between
a gene’s original expression level and its mean expression level),
that is:
$$H_0: E(|X-u|) = E(|X'-u'|);\ H_0\ \text{rejected};\quad \text{and}\quad H_0: E(X) = E(X');\ H_0\ \text{not rejected} \tag{2}$$
where X or X’ is the original expression level, and |X-u| or |X’-u’|
is the relative expression level.
Note that differential expression variance includes tight-regulation
(the expression variance of a gene in case samples is smaller than in
control samples) and relax-regulation (the expression variance of a
gene in case samples is larger than in control samples). Importantly,
as defined above, DEVGs exclude DEGs, i.e., there is no overlap
between DEVGs and DEGs in this work. That is, when a gene has both
differential expression and differential expression variance, it is
treated as a DEG by priority in order to be consistent with
conventional analysis; such genes certainly deserve deeper study in
future work.
If X or X’ follows a normal distribution, |X-u| or |X’-u’| follows a
folded normal distribution; therefore, the Wilcoxon rank sum test
instead of Student’s t-test is used in the significance test of DEVGs.
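As a hedged illustration of the two criteria above, the following
Python sketch computes the relative expression |X-u| and applies a
rank sum test to it. It is not the IEApackage implementation: the
function names, the 0.05 threshold and the normal approximation to the
rank sum test (without tie or continuity correction) are assumptions
made for this example, and the P-value of the mean test (formula 1) is
taken as a precomputed input.

```python
from math import erfc, sqrt


def relative_expression(values):
    """Return |X - u| for each sample, where u is the mean of the values."""
    u = sum(values) / len(values)
    return [abs(v - u) for v in values]


def rank_sum_pvalue(a, b):
    """Two-sided Wilcoxon rank sum P-value via the normal approximation.

    Simplification (assumption of this sketch): tied values get average
    ranks, but no tie or continuity correction is applied.
    """
    pooled = sorted((v, g) for g, grp in enumerate((a, b)) for v in grp)
    ranks = [0.0] * len(pooled)
    i = 0
    while i < len(pooled):
        j = i
        while j + 1 < len(pooled) and pooled[j + 1][0] == pooled[i][0]:
            j += 1
        avg = (i + j) / 2 + 1  # average of the 1-based ranks i+1 .. j+1
        for t in range(i, j + 1):
            ranks[t] = avg
        i = j + 1
    w = sum(r for r, (_, g) in zip(ranks, pooled) if g == 0)
    n1, n2 = len(a), len(b)
    mean = n1 * (n1 + n2 + 1) / 2
    sd = sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (w - mean) / sd
    return erfc(abs(z) / sqrt(2))  # equals 2 * (1 - Phi(|z|))


def label_gene(p_mean, p_variance, alpha=0.05):
    """DEGs take priority, so a DEVG must NOT be rejected by the mean test."""
    if p_mean < alpha:
        return "DEG"
    if p_variance < alpha:
        return "DEVG"
    return "none"


# Hypothetical gene: same mean in both groups, much wider spread in cases.
control = [4.9, 5.1, 4.8, 5.2, 4.95, 5.05, 4.85, 5.15]
case = [2.0, 8.0, 3.0, 7.0, 2.5, 7.5, 1.0, 9.0]
p_var = rank_sum_pvalue(relative_expression(control),
                        relative_expression(case))
print(label_gene(p_mean=0.9, p_variance=p_var))
```

The ordering of the two checks in `label_gene` mirrors the statement
that a gene with both kinds of change is counted as a DEG.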
Integrative enrichment analysis in the context of differential expression
variance
The conventional enrichment analysis is limited to estimating the
extent of differential expression rather than differential expression
variance. When considering the contribution of DEVGs to a pathway’s
dysregulation, it is necessary to refine the conventional approach to
take DEGs and DEVGs into account together. The simplest strategy is to
pool DEGs and DEVGs as one class of dysregulated genes and use the
conventional hypergeometric test to obtain the P-value. However, this
disregards the respective distributions of DEGs and DEVGs in a target
pathway and in the whole transcriptome. Thus, we extended the
hypergeometric test to two kinds of enriched genes simultaneously, as
follows. Our approach, denoted HT2 (hypergeometric test on the model
of drawing two groups of balls), still depends on the hypergeometric
distribution and uses the P-value to measure the dysregulation of a
pathway in the context of differential expression variance.
As summarized in Table 1, suppose there are expression data on a total
of N genes, of which x[1] are selected as DEGs and x[2] as DEVGs. For
a given pathway, k[1] and k[2] of its member genes (y genes in total)
have differential expression and differential expression variance
respectively. The significance of the enrichment of dysregulated genes
(DEGs or DEVGs) in this pathway can then be estimated by formula 3.
This P-value ranges from zero to one. The smaller the P-value, the
greater the extent of dysregulation of the pathway, i.e., a
significantly larger number of genes in this pathway show differential
expression or differential expression variance.
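$$
\begin{aligned}
P(X_1 = k_1, X_2 = k_2) &= \frac{\binom{x_1}{k_1}\binom{x_2}{k_2}\binom{N-x_1-x_2}{y-k_1-k_2}}{\binom{N}{y}} \\
P(X_1 > k_1, X_2 > k_2) &= 1 - \sum_{\langle i_1, i_2\rangle \in [0,x_1]\times[0,x_2] \setminus (k_1,x_1]\times(k_2,x_2]} P(X_1 = i_1, X_2 = i_2) \\
&= 1 - \sum_{\langle i_1, i_2\rangle \in [0,x_1]\times[0,x_2] \setminus (k_1,x_1]\times(k_2,x_2]} \frac{\binom{x_1}{i_1}\binom{x_2}{i_2}\binom{N-x_1-x_2}{y-i_1-i_2}}{\binom{N}{y}}
\end{aligned}
\tag{3}
$$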
Table 1.
The statistic of DEGs and DEVGs for pathway enrichment analysis in the
context of differential expression variance
          Pathway        Others                        All
DEG       k[1]           x[1]-k[1]                     x[1]
DEVG      k[2]           x[2]-k[2]                     x[2]
Others    y-k[1]-k[2]    N+k[1]+k[2]-x[1]-x[2]-y       N-x[1]-x[2]
All       y              N-y                           N
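The HT2 P-value of formula 3 can be sketched in a few lines of Python.
This is an illustrative sketch rather than the IEApackage code: the
function names are hypothetical, the arguments follow the symbols of
Table 1, and the tail is summed directly over i1 > k1 and i2 > k2,
which equals one minus the sum over the complementary index set used
in formula 3.

```python
from math import comb


def joint_count(N, x1, x2, y, i1, i2):
    """Number of ways to draw i1 DEGs, i2 DEVGs and y-i1-i2 other genes."""
    rest = y - i1 - i2
    if i1 < 0 or i2 < 0 or rest < 0:
        return 0
    # math.comb(n, k) returns 0 when k > n, so impossible draws vanish.
    return comb(x1, i1) * comb(x2, i2) * comb(N - x1 - x2, rest)


def ht2_pvalue(N, x1, x2, y, k1, k2):
    """P(X1 > k1, X2 > k2): direct sum over the upper-right tail."""
    tail = sum(joint_count(N, x1, x2, y, i1, i2)
               for i1 in range(k1 + 1, x1 + 1)
               for i2 in range(k2 + 1, x2 + 1))
    return tail / comb(N, y)


def ht2_pvalue_complement(N, x1, x2, y, k1, k2):
    """Same quantity written as in formula 3: one minus the complement."""
    rest = sum(joint_count(N, x1, x2, y, i1, i2)
               for i1 in range(0, x1 + 1)
               for i2 in range(0, x2 + 1)
               if not (i1 > k1 and i2 > k2))
    return 1 - rest / comb(N, y)


# Toy numbers (hypothetical): 10 genes, 3 DEGs, 2 DEVGs, pathway of 4.
print(ht2_pvalue(10, 3, 2, 4, 1, 0))
print(ht2_pvalue_complement(10, 3, 2, 4, 1, 0))
```

Both functions return the same value up to floating-point rounding;
the direct tail sum touches fewer terms when k[1] and k[2] are large.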
Estimating pathway crosstalks to link the dysregulated pathways identified by
IEA and prior-known disease pathways
The first auxiliary downstream analysis of IEA links the dysregulated
pathways identified by IEA to prior-known disease pathways. IEA tends
to detect dysregulated pathways related to disease subtypes. These
pathways may be disease pathways as currently known, or pathways
upstream/downstream of the disease pathways. Conventional pathway
enrichment usually analyses a single pathway rather than multiple
ones. However, a pathway crosstalk, as a pair of pathways, also plays
an important role in the change of phenotypes [25]. An enrichment
analysis of such a pathway crosstalk requires evaluating the
enrichment of the interacting genes from the two pathways
correspondingly. The pathway map based on such estimated pathway
crosstalks is an additional computational means of bridging
subtype-relevant pathways (i.e., pathways recognized by IEA) and
disease-relevant pathways (i.e., target pathways from the disease
database KEGG).
Given several genes in a pathway as seeds, IEA uses a random walk to
find their partner genes in the other pathway. Random walk with
restart (RWR) is a well-known ranking algorithm for candidate gene
prioritization [28]. It supplies the probability of finding the random
walker at each node in the steady state, and thus gives a measure of
proximity between source nodes (e.g., seed genes in a pathway) and
other nodes in the molecular network (e.g., genes in the candidate
crosstalking pathway).
Let N be the adjacency matrix of a gene network with node set V and
edge set E, in which the element N[ij] equals one if e(i, j) ∈ E
(where e(i, j) represents the interaction between genes/nodes i and
j), and zero otherwise. Based on the topological structure of the gene
network, the transition matrix T can be calculated. Each element of
the transition matrix, denoted T[ij], represents the probability of a
transition from node i to node j. The value of T[ij] can be given in
one of the following two ways: the first is topology-weighted and the
second is correlation-weighted.
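$$T_{ij} = \begin{cases} \dfrac{N_{ij}}{d_i}, & \text{if } e(i,j) \in E \\ 0, & \text{otherwise} \end{cases}, \quad \text{where } d_i = \sum_{j \in V} N_{ij}$$
$$T_{ij} = \begin{cases} \dfrac{w_{ij} N_{ij}}{w_i}, & \text{if } e(i,j) \in E \\ 0, & \text{otherwise} \end{cases}, \quad \text{where } w_i = \sum_{j \in V} w_{ij} N_{ij}$$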
The RWR algorithm [28] updates the probability vectors by
$$P_{k+1} = (1-\lambda)\, T P_k + \lambda P_0, \quad k > 0$$
where T is the transition matrix, λ is the restart probability, and
P[0] is the initial probability vector, whose entries sum to one. In
P[0], all the source nodes are assigned equal probabilities and the
other nodes are given zero. P[∞] is obtained when the algorithm
converges. If P[∞](i) > P[∞](j), node i is considered to be more
proximate to the source nodes than node j.
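A minimal Python sketch of the topology-weighted RWR iteration is
given here for orientation only. The function names, the restart
probability of 0.3, the convergence tolerance and the toy network are
assumptions of this example and do not come from the paper. Because
T[ij] is defined as the probability of moving from node i to node j,
the sketch accumulates the mass arriving at node j over all i, which
is one reading of the product of T and the probability vector in the
update rule.

```python
def transition_matrix(adj):
    """Row-normalise a 0/1 adjacency matrix: T[i][j] = N[i][j] / d[i]."""
    n = len(adj)
    T = []
    for i in range(n):
        d = sum(adj[i])
        # An isolated node (d == 0) gets an all-zero row in this sketch.
        T.append([adj[i][j] / d if d else 0.0 for j in range(n)])
    return T


def rwr(adj, seeds, restart=0.3, tol=1e-10, max_iter=10000):
    """Iterate P_{k+1} = (1 - restart) * T-step(P_k) + restart * P_0."""
    n = len(adj)
    T = transition_matrix(adj)
    p0 = [1.0 / len(seeds) if i in seeds else 0.0 for i in range(n)]
    p = p0[:]
    for _ in range(max_iter):
        nxt = [(1 - restart) * sum(T[i][j] * p[i] for i in range(n))
               + restart * p0[j]
               for j in range(n)]
        if sum(abs(a - b) for a, b in zip(nxt, p)) < tol:
            return nxt
        p = nxt
    return p


# Toy network (hypothetical): a chain 0 - 1 - 2 - 3 with gene 0 as seed.
chain = [[0, 1, 0, 0],
         [1, 0, 1, 0],
         [0, 1, 0, 1],
         [0, 0, 1, 0]]
steady = rwr(chain, seeds={0})
print(steady)
```

In this toy chain the steady-state probability decreases with the
distance from the seed, which is the proximity ordering that the
ranking step relies on.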
Thus, a two-way RWR approach (twRWR) is proposed to search for the
genes involved in two interacting pathways and to estimate their
enrichment for evaluating the pathway crosstalk. The steps of the
two-way RWR are:
* (i)
For each pathway u, its DEGs and DEVGs are used as source
nodes/genes, and RWR is used to rank the genes in a known molecular
network, e.g., the protein association network collected from the
STRING database [38].
* (ii)
Among the high-ranked genes from the above RWR analysis, the genes
belonging to pathway v are the partner genes interacting with the
source genes. Based on the source genes and their partner genes,
the enrichment of those interacting genes (E[uv]) in pathways u and
v can be evaluated by our HT2 approach, i.e., the P-value in
formula 3.
* (iii)
The analysis in steps (i) and (ii) is repeated for every pathway.
Then a pathway pair (u,v) is a pathway crosstalk only when E[uv]
and E[vu] are both significant. Finally, the map of pathways
consists of the selected pathway crosstalks, where a node
represents a pathway and an edge represents a pathway crosstalk.
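Step (iii) above, the two-way significance rule, can be sketched as
follows. The sketch assumes that the directed enrichment P-values
E[uv] have already been computed; the function name, the dictionary
layout, the 0.05 threshold and the pathway labels are hypothetical.

```python
def crosstalk_map(enrichment_p, alpha=0.05):
    """Return the undirected crosstalk edges of the pathway map.

    enrichment_p[(u, v)] holds the P-value E[uv] obtained with the
    DEGs/DEVGs of pathway u as seeds and pathway v as the partner.
    A pair is kept only when both directions are significant.
    """
    edges = set()
    for (u, v), p_uv in enrichment_p.items():
        p_vu = enrichment_p.get((v, u))
        if u != v and p_vu is not None and p_uv < alpha and p_vu < alpha:
            edges.add(tuple(sorted((u, v))))
    return sorted(edges)


# Hypothetical P-values for three pathways A, B and C.
example = {
    ("A", "B"): 0.01, ("B", "A"): 0.02,   # significant both ways: kept
    ("A", "C"): 0.001, ("C", "A"): 0.30,  # one direction only: dropped
    ("B", "C"): 0.01,                     # reverse direction missing
}
print(crosstalk_map(example))
```

Requiring both directions makes the edge symmetric, so each crosstalk
is stored once as an unordered pair.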
Screening subtype-factor of genotype-phenotype associations based on DEVGs
and dysregulated pathways supplied by IEA
The second auxiliary downstream analysis of IEA screens
subtype-factors according to the available clinical indices. As stated
above, IEA focuses on DEVGs and the pathways they are involved in, and
these genes and pathways are regarded as signatures of potential
subtypes of heterogeneous samples. However, these hidden subtypes
might not yet have been identified or formalized in the clinic. To
evaluate such new signatures or subtypes, one direct strategy is to
measure the correlation between genetic signatures (e.g., DEVGs or
dysregulated pathways) and clinical indices (e.g., age or BMI). If a
signature is significantly related to some clinical index, the subtype
represented by this signature is medically meaningful in that it is
observable in the clinic, and the signature is called a subtype-factor
related to that clinical index. The approach to identifying such
subtype-factors is described below.
For each pathway, its DEVGs are used to group the case (or control)
samples into two clusters when the case (or control) samples have
highly varying expression compared to the control (or case) samples.
That is, these genes are over-expressed in one group of samples and
under-expressed in the other group. The pathway is a candidate
subtype-factor when these two sample clusters are discriminative on
some clinical index. In this case, a clinical subtype of samples,
represented by a subtype-factor (e.g., a DEVG or a dysregulated
pathway from IEA), is considered to be related to the given clinical
index. The clinical subtype of a particular sample might be
contributed by many subtype-factors (i.e., many pathways). Given a
known phenotype (e.g., a clinical index), a few subtype-factors
correlated with this phenotype can be found, although they reveal only
the tip of the iceberg of the subtypes of genotype-phenotype
associations.
In particular, unlike conventional unsupervised clustering for subtype
identification, a supervised-like clustering approach (SLC) is
proposed to identify subtype-factors at the level of pathways. First,
the case samples are grouped into two clusters according to their
feature values (i.e., DEVG expressions) compared to those of control
samples: on each feature (DEVG), one group of samples has larger
values than the controls while the other group has smaller values than
the same controls, or vice versa. That is, a hyperplane determined by
a few control samples separates the sample space into two sub-spaces,
and the case samples in each sub-space are grouped into one cluster.
Second, clinical information on the samples is used to evaluate the
potential subtype represented by these two clusters of case samples.
If the clinical values of the two groups of samples differ
significantly, a clinical subtype of genotype-phenotype association
(e.g., the correlation between clinical indices and pathway DEVGs) is
identified, and the corresponding pathway is a subtype-factor for the
given clinical index.
In practice, the SLC algorithm on a pathway is implemented as follows:
* (i)
Discretize the expression of the DEVGs in case samples into binary
vectors based on the values of the controls: for a DEVG, if its
expression value is larger than the mean of the controls, the entry
in the binary vector is one; otherwise, it is zero.
* (ii)
Cluster the case samples based on the binary vectors by a
conventional method such as hierarchical clustering or K-means,
which yields two sample clusters.
* (iii)
Calculate the significance of the difference in the clinical index
between the two sample clusters. If the difference is significant,
the pathway is identified as a subtype-factor of the association
between the given pathway and the clinical index.
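The three SLC steps can be sketched in Python as shown here. Several
choices are assumptions of this example, not statements about the
authors' implementation: the function names, the Hamming distance, a
simple "assign to the nearer of the two most distant samples" rule
standing in for hierarchical clustering or K-means, and an exact
permutation test on the difference of means standing in for the
unspecified significance test of step (iii). The data are made up.

```python
from itertools import combinations


def binarize(case_expr, control_expr):
    """Step (i): one binary vector per case sample.

    case_expr[g][s] is the expression of DEVG g in case sample s;
    control_expr[g] lists the control values of the same DEVG.
    """
    means = [sum(row) / len(row) for row in control_expr]
    n_samples = len(case_expr[0])
    return [[1 if case_expr[g][s] > means[g] else 0
             for g in range(len(case_expr))]
            for s in range(n_samples)]


def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))


def two_clusters(vectors):
    """Step (ii), simplified: split around the two most distant samples."""
    pairs = combinations(range(len(vectors)), 2)
    a, b = max(pairs, key=lambda p: hamming(vectors[p[0]], vectors[p[1]]))
    return [0 if hamming(v, vectors[a]) <= hamming(v, vectors[b]) else 1
            for v in vectors]


def permutation_pvalue(values, labels):
    """Step (iii), stand-in test: exact permutation test on mean difference."""
    group0 = [v for v, lab in zip(values, labels) if lab == 0]
    n0, n = len(group0), len(values)
    if n0 == 0 or n0 == n:
        return 1.0  # only one cluster, nothing to compare
    total = sum(values)

    def mean_gap(s0):
        return abs(s0 / n0 - (total - s0) / (n - n0))

    observed = mean_gap(sum(group0))
    hits = count = 0
    for idx in combinations(range(n), n0):
        count += 1
        if mean_gap(sum(values[i] for i in idx)) >= observed - 1e-12:
            hits += 1
    return hits / count


# Made-up pathway with two DEVGs, six case samples and two controls.
case_expr = [[9, 9, 9, 1, 1, 1],
             [8, 8, 8, 2, 2, 2]]
control_expr = [[5, 5], [5, 5]]
clinical_index = [30, 31, 32, 50, 51, 52]  # e.g., a hypothetical index
labels = two_clusters(binarize(case_expr, control_expr))
p_value = permutation_pvalue(clinical_index, labels)
print(labels, p_value)
```

The exact permutation test enumerates every split of the samples, so
it is only practical for small cohorts; for larger ones a rank sum
test or a sampled permutation test would be the usual replacement.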
Results and discussion
The evaluation of biological meaning of IEA by method comparison
IEA is proposed to evaluate dysregulated pathways by differential gene
expression and differential expression variance together. Differential
expression variance has been reported as a new and important
expression change during a phenotype change [36], e.g., disease. In
this work, the biological hypothesis underlying IEA is that
dysregulated pathways full of genes with differential expression
variance are subtype-relevant pathways. Although the subtype-relevant
pathways of a particular complex disease are unclear in current
pathway databases, e.g., KEGG, it is still possible to investigate
whether prior-known disease pathways in KEGG are subtype-relevant and
whether IEA can identify them. In a previous study of gene-set
analysis [29], a comparison scheme was built to evaluate the
performance of different enrichment analysis methods (e.g., ORA or
GSEA) based on multiple expression datasets on complex diseases.
Unlike the previous general comparison, we focus on comparisons on
approach-specific datasets, in order mainly to evaluate the biological
meaning of IEA.
According to the comparison protocol [29], we ran a total of eight
representative enrichment analysis methods on 36 GEO datasets with
target pathways in KEGG, and obtained the rank of the target pathway
estimated by each method on each dataset. Then, for each dataset, we
ranked the eight methods according to their prioritization performance
or sensitivity performance [29], and assigned the dataset as specific
data for the Top-K methods (K is set to 3). Thus, all the specific
data of one method constitute its K-order approach-specific dataset.
In general, on a method’s approach-specific dataset, this method
should perform best or comparably to the other methods, so that the
biological characteristics assumed by the method are significantly
displayed in these datasets. Therefore, we can use this strategy to
investigate the biological meaning of IEA in real datasets. Below, we
first summarize the biological hypotheses held by different
state-of-the-art enrichment analysis methods and their respective
quantitative measurements, and then discuss the comparison between IEA
and the others.
* (i)
PLAGE: it assumes the activity of pathway rather than the
expression of pathway genes determines the activated or inhibited
status of pathways under different conditions; and the pathway
activity is measured by an activity score as the weights of a
metagene extracted from all pathway genes by SVD (singular value
decomposition) [39].
* (ii)
GSVA: it proposes the change of pathway activity between control
and case should be evaluated at the level of samples, e.g.,
considering the variation of pathway activity over a sample
population; and the pathway activity is measured by so-called GSVA
score as a function of the expressions of genes inside and outside
the pathway, and these scores are assessed similarly as GSEA by
using the Kolmogorov-Smirnov (KS) like random walk statistic
[40].
* (iii)
PADOG: it assumes that, if the genes highly specific to a given
pathway are differentially expressed, the respective pathway is
truly relevant in that condition; thus, a new gene set score is
calculated as the mean of the absolute values of weighted moderated
gene t-scores, where the gene weights are designed to be large for
genes appearing in few pathways and small for genes appearing in
many pathways [41].
* (iv)
GLOBALTEST: it holds an assumption that, if a group of genes (e.g.,
pathway genes) can be used to predict the clinical outcome, the
expression patterns of such gene group must differ for dissimilar
clinical outcomes; thus, it uses generalized linear model to give
one P-value for a group of genes, not a P-value for each gene,
which can be applied to estimate the enrichment of a given pathway
[42].
* (v)
MRGSE: it proposes that the high ranks of expression changes (e.g.,
fold-change) of genes can indicate the differential expression of a
set of genes (e.g., pathway genes); and the enrichment score or the
test statistic of a pathway is the mean rank of this gene set,
i.e., the average of the ranks of t-statistics of pathway genes
[43].
* (vi)
GSA: it is similar to GSEA, and proposes two improvements, namely
the maximal average statistic for summarizing gene-sets and
restandardization for accurate enrichment inferences [44].
* (vii)
ORA: it takes into account the number of differentially expressed
genes observed in a pathway as indicators of pathway states;
generally, it uses a basic contingency table to test the
association between the differential expression status of a gene
(e.g., differentially expressed gene, or not) and its membership in
a given gene set (e.g., pathway gene, or not), which can be
measured by the P-value of a hypergeometric test [45].
* (viii)
IEA: it is proposed in this work to consider, in general, the
contribution of expression variance to a dysregulated pathway; as
one implementation, this work takes the numbers of DEGs and DEVGs
observed in a pathway as indicators of the pathway state; it is
designed to test the association between the differential
expression/differential expression variance status of a gene and
its membership in a given gene set, which is measured by the
P-value from the HT2 approach proposed in this work.
First, we cluster the above eight approaches by their performance on
all datasets to investigate the general associations among the
methods. As shown in Figs. 2 and 3, the similarity between any two
methods is measured by four criteria: the first is whether the ranks
given by the two methods on the same dataset are the same (i.e.,
Euclidean distance on ranks in Fig. 2a); the second is whether the
ranks given by the two methods show the same trend across datasets
(i.e., correlation distance on ranks in Fig. 2b); the third is whether
the P-values given by the two methods on the same dataset are the same
(i.e., Euclidean distance on P-values in Fig. 3a); and the last is
whether the P-values given by the two methods show the same trend
across datasets (i.e., correlation distance on P-values in Fig. 3b).
GSA and PADOG are both based on conventional GSEA, so they are
similar; the proposed IEA is based on ORA, so these two also perform
similarly on different datasets; PLAGE and GLOBALTEST cluster closely
together, one reason being that both estimate a score from all pathway
genes rather than from individual genes (i.e., PLAGE uses the weights
of a metagene extracted from all pathway genes by SVD, and GLOBALTEST
uses a generalized linear model to give one P-value for a group of
genes); in addition, MRGSE and GSVA differ considerably from each
other and from the other methods, possibly because they follow
specific design principles in measuring pathway dysfunction, i.e.,
MRGSE combines the t-statistics of individual pathway genes, whereas
GSVA uses a score that is a function of the expression of genes inside
and outside a pathway.
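The two distance criteria used to compare methods can be sketched in
Python. This is an illustration under assumptions: the function names
and the toy rank profiles are hypothetical, and the correlation
distance is taken to be one minus the Pearson correlation, which is
the common definition but is not spelled out in the text.

```python
from math import sqrt


def euclidean_distance(a, b):
    """Small when two methods give similar values on the same datasets."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def correlation_distance(a, b):
    """1 - Pearson r: small when two profiles change in the same way.

    Undefined (division by zero) if either profile is constant.
    """
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return 1 - cov / sqrt(var_a * var_b)


def distance_matrix(profiles, distance):
    """Pairwise distances between methods, keyed by (method, method)."""
    methods = sorted(profiles)
    return {(m1, m2): distance(profiles[m1], profiles[m2])
            for m1 in methods for m2 in methods}


# Hypothetical target-pathway ranks of three methods on four datasets.
ranks = {
    "M1": [1, 2, 3, 4],
    "M2": [2, 3, 4, 5],   # shifted copy of M1: same trend
    "M3": [4, 3, 2, 1],   # opposite trend
}
euclid = distance_matrix(ranks, euclidean_distance)
corr = distance_matrix(ranks, correlation_distance)
print(euclid[("M1", "M2")], corr[("M1", "M2")], corr[("M1", "M3")])
```

The toy profiles show why both criteria are useful: M1 and M2 differ
in value but not in trend, so only the Euclidean distance separates
them. Either matrix could then be passed to a hierarchical clustering
routine.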
Fig. 2.
Category of representative gene set analysis approaches based on
clustering of prioritization performance. a Method clustering based on
Euclidean distance of ranks of all pathways. b Method clustering based
on Correlation distance of ranks of all pathways
Fig. 3.
Category of representative gene set analysis approaches based on
clustering of sensitivity performance. a Method clustering based on
Euclidean distance of P-values of all pathways. b Method clustering
based on Correlation distance of P-values of all pathways
Then, we directly grouped the datasets according to the performance of
a given method; e.g., a dataset is included in the K-order
IEA-specific datasets only when the performance of IEA ranks in the
Top-K among all methods on that dataset, where K is set to 3 in this
study. To quantify performance, sensitivity (i.e., P-value) and
prioritization (i.e., rank) are adopted as before [29]. In the
previous evaluation on these datasets, PADOG displayed consistently
comparable performance with the other methods, while PLAGE, GLOBALTEST
and MRGSE had the best performance on some categories of datasets
[29], which already suggests the existence of approach preferences.
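The K-order approach-specific grouping described above can be sketched
as follows. The function name, the dataset labels, the scores and the
alphabetical tie-break are hypothetical; the score can be either the
rank or the P-value of the target pathway, since a lower value means a
better performance in both cases.

```python
def approach_specific(target_score, k=3):
    """Assign each dataset to the Top-k methods on that dataset.

    target_score[dataset][method] is the rank (or P-value) that the
    method gives to the target pathway; lower is better.
    Returns {method: [datasets on which the method is in the Top-k]}.
    """
    specific = {}
    for dataset, by_method in target_score.items():
        # Ties are broken alphabetically, an arbitrary choice.
        ordered = sorted(by_method, key=lambda m: (by_method[m], m))
        for method in ordered[:k]:
            specific.setdefault(method, []).append(dataset)
    return specific


# Hypothetical target-pathway ranks from four methods on two datasets.
scores = {
    "D1": {"IEA": 1, "ORA": 2, "GSEA": 5, "PADOG": 3},
    "D2": {"IEA": 9, "ORA": 4, "GSEA": 1, "PADOG": 2},
}
groups = approach_specific(scores, k=3)
print(groups["IEA"])
```

With these made-up scores, only D1 is IEA-specific, because IEA falls
outside the Top-3 on D2.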