Abstract

Background

   Pathway enrichment analysis is a useful tool for studying biology and
   biomedicine because it screens well-defined biological processes
   rather than separate molecules. The key issue when applying enrichment
   analysis to a pathway is measuring how the pathway malfunctions under
   a phenotype change, e.g., from normal to diseased. Conventional
   analysis focuses on differentially expressed genes (DEGs), which
   presumes highly pure samples. However, disease samples are usually
   heterogeneous, so genes with differential expression variance (DEVGs)
   are becoming attractive and important indicators of the specific
   state of a biological system. In the context of differential
   expression variance, measuring the enrichment or status of a pathway
   remains a challenge. To address this issue, we propose Integrative
   Enrichment Analysis (IEA), based on a novel enrichment measurement.

Results

   The main strength of IEA is its ability to identify dysregulated
   pathways that contain DEGs and DEVGs simultaneously, which are usually
   under-scored by other methods. IEA also provides two auxiliary
   approaches for investigating such dysregulated pathways. The first
   infers the association between identified dysregulated pathways and
   expected target pathways by estimating pathway crosstalks. The second
   recognizes subtype-factors, i.e., dysregulated pathways associated
   with particular clinical indices, according to the DEVGs’ relative
   expressions rather than conventional raw expressions. Based on a
   previously established evaluation scheme, we found that, in particular
   cohorts (i.e., groups of real gene expression datasets from human
   patients), a few target disease pathways are ranked significantly
   high by IEA, which is more effective than other state-of-the-art
   methods. Furthermore, a proof-of-concept study on diabetes indicates
   that: IEA, rather than conventional ORA or GSEA, can capture
   under-estimated dysregulated pathways full of DEVGs and DEGs; these
   newly identified pathways can be significantly linked to previously
   known disease pathways by estimated crosstalks; and many candidate
   subtype-factors recognized by IEA are significantly related to the
   risk of subtypes of genotype-phenotype associations.

Conclusions

   Overall, IEA provides a new tool for enrichment analysis in the
   complicated context of clinical applications (i.e., disease
   heterogeneity), as a necessary complement to conventional approaches.

Electronic supplementary material

   The online version of this article (doi:10.1186/s12864-015-2188-7)
   contains supplementary material, which is available to authorized
   users.

Background

   As a computational approach based on prior knowledge, pathway
   enrichment analysis is widely used in the study of genotype-phenotype
   associations [1]. A biological pathway, as a set of interacting genes
   (and some of their interactions with other biomolecules), produces a
   particular cellular response or outcome by executing a series of
   functional cascades. Pathways are curated by experts from a wide range
   of scientific fields [2, 3], so they supply more credible functional
   details than a general GO module or network module. Unlike
   network-module analysis, which explores unknown or indeterminate
   functions, pathway-centered analysis aims to capture the perturbation
   of established functions (e.g., KEGG pathways [2, 3]) during a change
   of phenotype (e.g., from normal to diseased). As a key approach of
   pathway-centered analysis, pathway enrichment analysis, or the
   well-known gene set enrichment analysis (GSEA) [1], can identify a
   dysregulated pathway by qualitatively measuring the changed status of
   the pathway [4].

   In pathway enrichment analysis, the dysregulation of a pathway is the
   most important issue [5] and should be mathematically well defined and
   measured [6]. Such a measure estimates the conditional enrichment or
   status of a pathway, which is assumed to be associated with particular
   phenotypes. Current research generally uses genes with significantly
   differential expression or differential correlation to evaluate the
   extent of pathway dysregulation. One kind of conventional method
   evaluates the dysfunction of pathways under different conditions
   [7–9], such as FiDePa (Finding Deregulated Paths Algorithm) [10], SPIA
   (Signaling Pathway Impact Analysis) [11] and iPEAP (Integrative
   Pathway Enrichment Analysis Platform) [12]. The other kind uses
   pathways to characterize individual samples [13, 14], like CORGs [15]
   and Pathifier [16]. In general, all these methods focus on genes with
   differential expression and their enrichment in pathways (i.e.,
   analysis in the context of differential expression) [17, 18], which
   assumes that the samples are of good purity in a genotype-phenotype
   association study. However, in the study of complicated phenotypes,
   e.g., cancer, a relevant problem is that samples with the same disease
   phenotype might comprise different unknown subtypes due to disease
   heterogeneity [19]. It is necessary to detect genes with new features
   observable in complicated disease samples, and to enhance pathway
   enrichment analysis so that it is applicable in such previously
   unexpected situations [20].

   In fact, new expression features have been extracted in recent
   studies, e.g., genes with differential expression variance [21, 22].
   In the context of differential expression variance, measuring the
   enrichment or status of a pathway is still a challenge. A solution to
   this problem can improve the efficiency of pathway enrichment analysis
   for genotype-phenotype association because it considers more complete
   information about the expression changes of pathway genes. It can also
   provide new insights into biological pathways by integrating
   additional expression and network features. In this work, we propose a
   multiple-label based enrichment analysis to detect such dysregulated
   pathways, which simultaneously takes into account genes with
   differential expression (one label, DEGs) and genes with differential
   expression variance (the other label, DEVGs) (Fig. 1).

Fig. 1.


   Major differences between the measurements of dysregulated pathways
   used in conventional enrichment analysis and integrative enrichment
   analysis (IEA)

   The hypothesis underlying IEA is that the dysregulated pathways
   involved in disease heterogeneity are full of DEGs and/or DEVGs. This
   means that the pathways identified by IEA would be disease pathways or
   pathways upstream/downstream of them (e.g., heterogeneity-relevant
   pathways or subtype-relevant pathways). However, current methods in
   pathway enrichment analysis only aim to rank disease pathways highly
   (e.g., target pathways in approach evaluation). When IEA identifies
   pathways upstream/downstream of disease pathways, it additionally
   supplies a network of pathways to recover a global functional map and
   infer the associations between disease pathways and subtype-relevant
   pathways. Note that the biological meaning of an edge in such a
   network of pathways is pathway crosstalk, which is an important
   biological mechanism or functional relationship among pathways
   [23–26]. Conventional studies tend to determine a pathway crosstalk
   simply by the overlapping genes of two pathways [27], which disregards
   the statistical significance of the genes and interactions involved in
   the crosstalk. By contrast, in IEA the DEGs and DEVGs in one pathway
   are used as seeds, and their interacting genes in the candidate
   crosstalking pathways are detected by a random walk with restart
   algorithm [28]. The significance of a pathway crosstalk is finally
   evaluated by the enrichment, in the two pathways, of the genes
   involved in this crosstalk (i.e., the proposed multiple-label based
   enrichment).

   Based on the above concepts and mathematical models, a new
   pathway-centered analysis framework, integrative enrichment analysis
   (IEA), is implemented as: (i) a pathway enrichment score calculated by
   the hypergeometric test on differential genes (DEGs and DEVGs); (ii)
   pathway crosstalks ranked by random walk and the hypergeometric test
   on rewired molecular networks; and (iii) pathway-phenotype
   associations and subtype-factors determined by DEVGs in pathways.
   According to a previously established evaluation scheme [29], we found
   that, in particular cohorts (i.e., groups of real gene expression
   datasets from human patients), a few target disease pathways are
   ranked significantly high by IEA, which supplies evidence of
   deviation-based disease characteristics (i.e., disease subtypes), and
   that IEA is more effective than other state-of-the-art methods in this
   condition. Furthermore, in a proof-of-concept study, we show the
   details of IEA when analyzing real transcriptional data related to
   complex diseases, e.g., diabetes and colorectal cancer. IEA indeed
   captures previously under-estimated pathways full of DEVGs and DEGs.
   These newly identified dysregulated pathways would be
   heterogeneity-relevant pathways and are found to be significantly
   linked to disease pathways (i.e., target pathways in conventional
   analysis) by estimated crosstalks. Many candidate subtype-factors are
   also recognized as DEVGs or pathways associated with the risk of
   subtypes of genotype-phenotype associations. Overall, IEA supplies a
   new over-representation approach [30] for enrichment analysis in the
   complicated context of clinical applications (i.e., differential
   expression and differential expression variance), and could easily be
   extended to functional class scoring or pathway topology based
   approaches [31–34]; it will be a necessary complement to conventional
   approaches [35]. The Matlab scripts of the software, named IEApackage,
   and some alternative R scripts have been deposited in GitHub and can
   be accessed at
   https://github.com/bluesky2009/integrative-enrichment-analysis.
   This software has been developed and tested on Windows 7 or Windows 8,
   with Matlab 2010 or Matlab 2012.

Methods

   In general, enrichment analysis includes three categories of methods:
   over-representation approaches, functional class scoring, and pathway
   topology based approaches. Although these methods all focus on
   evaluating phenotype-associated pathways, they are based on different
   hypotheses. This work and the proof-of-concept study are based on the
   over-representation approach, which measures the extent of
   dysregulation of a pathway according to the number of dysregulated
   genes in the pathway. Traditional methods evaluate only the DEGs in a
   pathway; by contrast, IEA evaluates both the DEGs and the DEVGs. Thus,
   the integrated statistic of IEA is meant to measure the dysregulation
   extent of a pathway as completely as possible according to the number
   of dysregulated genes (DEGs and DEVGs) in this pathway, as defined and
   introduced in the following.

Differential gene expression and differential expression variance

   Given that a gene x has expression profiles X and X’ in control and
   case samples, respectively, the expression variances of this gene in
   the control and case conditions are E((X-u)^2) and E((X’-u’)^2),
   respectively. Here, u and u’ are the average expressions of gene x in
   control and case samples, respectively. The conventional criterion for
   genes with differential expression (named DEGs) is then:
   \[
     H_0 : \mathrm{E}(X) = \mathrm{E}(X'); \quad
     H_0 \ \text{rejected}
     \tag{1}
   \]

   where X or X’ is the original (raw) expression level. Note that
   differential expression includes up-regulation (the expression of a
   gene in case samples is higher than in control samples) and
   down-regulation (the expression of a gene in case samples is lower
   than in control samples).

   Besides these DEGs (e.g., genes for which H0 is rejected by Student’s
   t-test), genes with differential expression variance are also
   discriminative features [21, 36]. Variance-related expression
   features, e.g., bimodal gene expression, are already known as
   important expression patterns in the control of transitions of
   biological systems [37], such as disease development, cellular
   differentiation, and phase transition. However, to the best of our
   knowledge, the differential expression variance of genes has not been
   studied systematically, especially regarding its use in pathway
   enrichment analysis. Differential expression, as used in conventional
   enrichment analysis, requires a gene’s expressions under different
   conditions to be distributed around different mean expression levels
   (see formula 1 above). By contrast, genes with differential expression
   variance (named DEVGs) can be defined as genes whose deviations differ
   significantly between conditions (a deviation is the distance between
   a gene’s original expression level and its mean expression level),
   such that:
   \[
     \begin{aligned}
       & H_0 : \mathrm{E}(|X - u|) = \mathrm{E}(|X' - u'|); \quad
         H_0 \ \text{rejected}; \\
       & \text{and} \quad H_0 : \mathrm{E}(X) = \mathrm{E}(X'); \quad
         H_0 \ \text{not rejected}
     \end{aligned}
     \tag{2}
   \]

   where X or X’ is the original expression level, |X-u| or |X’-u’| is the
   relative expression level.

   Note that differential expression variance includes tight-regulation
   (the expression variance of a gene in case samples is smaller than
   that in control samples) and relax-regulation (the expression variance
   of a gene in case samples is larger than that in control samples).
   Importantly, as defined above, the DEVGs exclude DEGs, i.e., there is
   no overlap between DEVGs and DEGs in this work. That is, when a gene
   has both differential expression and differential expression variance,
   it is treated as a DEG with priority in order to be consistent with
   conventional analysis; such genes are worth deeper study in future
   work.

   If X or X’ follows a normal distribution, |X-u| or |X’-u’| follows a
   folded normal distribution; therefore, the Wilcoxon rank-sum test,
   instead of Student’s t-test, is used in the significance test of DEVGs.
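   The DEG/DEVG labelling above can be illustrated with a minimal sketch.
   It is not the IEApackage code: the function names, the 0.05 threshold,
   and the use of a normal approximation (without tie correction) for the
   rank-sum test are assumptions made for illustration, and the p-value
   of the mean test (e.g., from a t-test) is assumed to be supplied by
   the caller.

```python
from statistics import NormalDist, mean


def relative_expression(values):
    """Return |x - mean(values)| for each expression value."""
    center = mean(values)
    return [abs(v - center) for v in values]


def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while (end + 1 < len(order)
               and values[order[end + 1]] == values[order[start]]):
            end += 1
        tied_rank = (start + end) / 2 + 1  # 1-based average rank
        for pos in range(start, end + 1):
            ranks[order[pos]] = tied_rank
        start = end + 1
    return ranks


def rank_sum_pvalue(a, b):
    """Two-sided rank-sum p-value (normal approximation, no tie correction)."""
    n1, n2 = len(a), len(b)
    ranks = average_ranks(list(a) + list(b))
    w = sum(ranks[:n1])                      # rank sum of the first group
    expected = n1 * (n1 + n2 + 1) / 2
    sd = (n1 * n2 * (n1 + n2 + 1) / 12) ** 0.5
    z = (w - expected) / sd
    return 2 * NormalDist().cdf(-abs(z))


def classify_gene(control, case, deg_pvalue, alpha=0.05):
    """Label a gene 'DEG', 'DEVG', or 'none'; DEGs take priority (formula 2)."""
    if deg_pvalue < alpha:
        return "DEG"
    devg_pvalue = rank_sum_pvalue(relative_expression(control),
                                  relative_expression(case))
    return "DEVG" if devg_pvalue < alpha else "none"
```

   Checking the mean test first mirrors the rule that a gene with both
   kinds of change is counted as a DEG.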

Integrative enrichment analysis in the context of differential expression
variance

   Conventional enrichment analysis is limited to estimating the extent
   of differential expression rather than differential expression
   variance. When considering the contribution of DEVGs to a pathway’s
   dysregulation, it is necessary to refine the conventional approach to
   take the DEGs and DEVGs into account together. The simplest strategy
   is to pool DEGs and DEVGs as one set of dysregulated genes and use the
   conventional hypergeometric test to obtain the P-value. However, this
   disregards the respective distributions of DEGs and DEVGs in a target
   pathway and in the whole transcriptome. Thus, we extended the
   hypergeometric test to two kinds of enriched genes simultaneously, as
   described below. Our approach, denoted HT2 (hypergeometric test on the
   model of drawing two groups of balls), still relies on the
   hypergeometric distribution and uses a P-value to measure the
   dysregulation of a pathway in the context of differential expression
   variance.

   As summarized in Table 1, suppose that expression data are available
   for N genes in total, of which x[1] are selected as DEGs and x[2] as
   DEVGs. For a given pathway with y member genes, k[1] and k[2] of its
   members show differential expression and differential expression
   variance, respectively. The significance of the enrichment of
   deregulated genes (DEGs or DEVGs) in this pathway can then be
   estimated by formula 3. This P-value ranges from zero to one. The
   smaller the P-value, the greater the extent of dysregulation of the
   pathway, i.e., a significantly larger number of genes in this pathway
   show differential expression or differential expression variance.
   \[
     \begin{aligned}
       P(X_1 = k_1, X_2 = k_2)
         &= \frac{\binom{x_1}{k_1}\binom{x_2}{k_2}
                  \binom{N - x_1 - x_2}{y - k_1 - k_2}}{\binom{N}{y}} \\
       P(X_1 > k_1, X_2 > k_2)
         &= 1 - \sum_{\langle i_1, i_2 \rangle \in [0, x_1] \times [0, x_2]
                      - (k_1, x_1] \times (k_2, x_2]}
                P(X_1 = i_1, X_2 = i_2) \\
         &= 1 - \sum_{\langle i_1, i_2 \rangle \in [0, x_1] \times [0, x_2]
                      - (k_1, x_1] \times (k_2, x_2]}
                \frac{\binom{x_1}{i_1}\binom{x_2}{i_2}
                      \binom{N - x_1 - x_2}{y - i_1 - i_2}}{\binom{N}{y}}
     \end{aligned}
     \tag{3}
   \]

Table 1.

   The statistic of DEGs and DEVGs for pathway enrichment analysis in the
   context of differential expression variance
            Pathway       Others                      All
   DEG      k[1]          x[1]-k[1]                   x[1]
   DEVG     k[2]          x[2]-k[2]                   x[2]
   Others   y-k[1]-k[2]   N+k[1]+k[2]-x[1]-x[2]-y     N-x[1]-x[2]
   All      y             N-y                         N
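
   Formula 3 can be evaluated directly for small inputs. The sketch below
   is an illustration rather than the IEApackage implementation; the
   function names are hypothetical, and the tail uses the strict
   inequalities exactly as written in formula 3. Argument names follow
   Table 1 (N genes in total, x1 DEGs, x2 DEVGs, y pathway members).

```python
from math import comb


def ht2_pmf(i1, i2, N, x1, x2, y):
    """P(X1 = i1, X2 = i2): i1 DEGs and i2 DEVGs among y pathway genes."""
    rest = y - i1 - i2  # pathway genes that are neither DEG nor DEVG
    if i1 < 0 or i2 < 0 or rest < 0:
        return 0.0
    # math.comb returns 0 when the lower index exceeds the upper one.
    return (comb(x1, i1) * comb(x2, i2) * comb(N - x1 - x2, rest)
            / comb(N, y))


def ht2_pvalue(k1, k2, N, x1, x2, y):
    """P(X1 > k1, X2 > k2) as one minus the mass outside the joint tail."""
    outside_tail = sum(
        ht2_pmf(i1, i2, N, x1, x2, y)
        for i1 in range(x1 + 1)
        for i2 in range(x2 + 1)
        if not (i1 > k1 and i2 > k2)
    )
    return 1.0 - outside_tail
```

   Because of floating-point rounding, a result that should be exactly
   zero may come out as a tiny positive or negative number.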

Estimating pathway crosstalks to link the dysregulated pathways identified by
IEA and prior-known disease pathways

   The first auxiliary downstream analysis method of IEA links the
   dysregulated pathways identified by IEA to previously known disease
   pathways. IEA tends to detect the dysregulated pathways related to
   disease subtypes. These pathways would be disease pathways as
   currently known, or pathways upstream/downstream of the disease
   pathways. Conventional pathway enrichment usually analyses a single
   pathway rather than multiple ones. However, a pathway crosstalk, as a
   pair of pathways, also plays an important role in the change of
   phenotypes [25]. An enrichment analysis of such a pathway crosstalk
   requires evaluating the enrichment of interacting genes from the two
   pathways correspondingly. The pathway map based on such estimated
   pathway crosstalks is an additional computational method that
   supplies a bridge between subtype-relevant pathways (i.e.,
   IEA-recognized pathways) and disease-relevant pathways (i.e., target
   pathways from the disease database KEGG).

   Given several genes in a pathway as seeds, IEA uses a random walk to
   find their partner genes in the other pathway. Random walk with
   restart (RWR) is a well-known ranking algorithm for candidate gene
   prioritization [28]. It supplies the probability of finding the random
   walker at each node in the steady state, and thus gives a measure of
   proximity between source nodes (e.g., seed genes in a pathway) and
   other nodes in the molecular network (e.g., genes in the candidate
   crosstalking pathway).

   Let N be the adjacency matrix of a gene network with node set V and
   edge set E, in which the element N[ij] equals one if e(i, j) ∈ E (where
   e(i, j) represents the interaction between genes/nodes i and j), and
   zero otherwise. Based on the topological structure of the gene network,
   the transition matrix T can be calculated. Each element of the
   transition matrix, denoted T[ij], represents the probability of
   transition from node i to node j. The value of T[ij] can be given in
   one of two ways: the first is topology-weighted and the second is
   correlation-weighted.
   \[
     T_{ij} =
     \begin{cases}
       \dfrac{N_{ij}}{d_i}, & \text{if } e(i, j) \in E \\
       0, & \text{otherwise}
     \end{cases},
     \quad \text{where } d_i = \sum_{j \in V} N_{ij}
   \]
   \[
     T_{ij} =
     \begin{cases}
       \dfrac{w_{ij} N_{ij}}{w_i}, & \text{if } e(i, j) \in E \\
       0, & \text{otherwise}
     \end{cases},
     \quad \text{where } w_i = \sum_{j \in V} w_{ij} N_{ij}
   \]

   The RWR algorithm [28] updates the probability vectors by
   \[
     P_{k+1} = (1 - \lambda)\, T P_k + \lambda P_0, \quad k > 0
   \]

   where T is the transition matrix, λ is the restart probability, and
   P[0] is the initial probability vector, whose entries sum to one. In
   P[0], all source nodes are assigned equal probabilities and all other
   nodes are given zero. P[∞] is obtained when the algorithm converges.
   If P[∞](i) > P[∞](j), node i is considered more proximate to the
   source nodes than node j.
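
   The RWR iteration with the topology-weighted transition matrix can be
   sketched as follows. This is an illustrative sketch, not the
   IEApackage code: the function names, the restart value 0.3, the
   tolerance, and the iteration cap are assumptions. Since T[i][j] is
   defined as the probability of moving from node i to node j, the update
   below moves probability mass from i to j (i.e., it applies the
   transpose of T) so that the vector remains a probability distribution;
   this reading of the update rule is also an assumption. Seeds are
   assumed to be distinct node indices, and every node is assumed to have
   at least one neighbour.

```python
def transition_matrix(adjacency):
    """Topology-weighted T[i][j] = N[i][j] / d[i]; isolated rows stay zero."""
    matrix = []
    for row in adjacency:
        degree = sum(row)
        matrix.append([value / degree if degree else 0.0 for value in row])
    return matrix


def random_walk_with_restart(adjacency, seeds, restart=0.3,
                             tol=1e-10, max_iter=1000):
    """Iterate P <- (1 - restart) * T^T P + restart * P0 until it settles."""
    n = len(adjacency)
    T = transition_matrix(adjacency)
    p0 = [0.0] * n
    for s in seeds:
        p0[s] = 1.0 / len(seeds)  # equal probability on every source node
    p = p0[:]
    for _ in range(max_iter):
        nxt = [
            (1 - restart) * sum(T[i][j] * p[i] for i in range(n))
            + restart * p0[j]
            for j in range(n)
        ]
        change = sum(abs(a - b) for a, b in zip(nxt, p))
        p = nxt
        if change < tol:
            break
    return p
```

   On a four-node path graph seeded at one end, the returned scores
   decrease with distance from the seed, which matches the proximity
   interpretation of P[∞].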

   Thus, a two-way RWR approach (twRWR) is proposed to search for the
   genes involved in two interacting pathways and to estimate their
   enrichment for evaluating the pathway crosstalk. The steps of two-way
   RWR are:
     * (i)
       For each pathway u, its DEGs and DEVGs are used as source
       nodes/genes, and RWR is used to rank the genes in a known molecular
       network, e.g., the protein association network collected from the
       STRING database [38].
     * (ii)
       Among the high-ranked genes from the above RWR analysis, the genes
       belonging to pathway v are the partner genes interacting with the
       source genes. Based on the source genes and their partner genes,
       the enrichment of those interacting genes (E[uv]) in pathways u and
       v can be evaluated by our HT2 approach, i.e., the P-value in
       formula 3.
     * (iii)
       For every pathway, the analysis in steps (i) and (ii) is repeated.
       Then, a pathway pair (u,v) is a pathway crosstalk only when E[uv]
       and E[vu] are both significant. Finally, the map of pathways
       consists of those selected pathway crosstalks, where a node
       represents a pathway and an edge represents a pathway crosstalk.
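
   Steps (ii) and (iii) above can be sketched as two small helpers. This
   is an illustration under stated assumptions, not the IEApackage code:
   the function names are hypothetical, the cut-off top_n for
   "high-ranked" genes and the significance level alpha are assumptions,
   seeds are excluded from the ranking by assumption, and the
   directional P-values E[uv] are taken as precomputed inputs (e.g., from
   the HT2 test).

```python
def partner_genes(scores, seeds, pathway_v, top_n):
    """Genes of pathway v found among the top_n RWR-ranked non-seed genes."""
    seed_set = set(seeds)
    ranked = sorted(
        (gene for gene in scores if gene not in seed_set),
        key=lambda gene: scores[gene],
        reverse=True,
    )
    members = set(pathway_v)
    return [gene for gene in ranked[:top_n] if gene in members]


def crosstalk_edges(pvalues, alpha=0.05):
    """Keep a pair (u, v) only when both E[uv] and E[vu] are significant."""
    edges = set()
    for (u, v), p_uv in pvalues.items():
        p_vu = pvalues.get((v, u))
        if u != v and p_vu is not None and p_uv < alpha and p_vu < alpha:
            edges.add(frozenset((u, v)))  # undirected edge of the pathway map
    return edges
```

   Requiring significance in both directions makes the resulting pathway
   map undirected, which is why each edge is stored as an unordered pair.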

Screening subtype-factor of genotype-phenotype associations based on DEVGs
and dysregulated pathways supplied by IEA

   The second auxiliary downstream analysis method of IEA screens
   subtype-factors according to the available clinical indices. As stated
   above, IEA focuses on the DEVGs and the pathways involving them, and
   these genes and pathways are regarded as signatures of potential
   subtypes of heterogeneous samples. However, these hidden subtypes
   might not yet have been identified or formalized in the clinic. To
   evaluate such new signatures or subtypes, one direct strategy is to
   measure the correlation between genetic signatures (e.g., DEVGs or
   dysregulated pathways) and clinical indices (e.g., age or BMI). If a
   signature is significantly related to some clinical index, the subtype
   represented by the signature would be medically meaningful, i.e.,
   observable in the clinic, and the signature is called a subtype-factor
   related to that clinical index. The approach to identifying such
   subtype-factors is described below.

   For each pathway, its DEVGs are used to group case (or control) samples
   into two clusters when the case (or control) samples have highly
   variable expression compared with control (or case) samples. That is,
   these genes are over-expressed in one group of samples and
   under-expressed in the other group. The pathway is a candidate
   subtype-factor when these two sample clusters differ on some clinical
   index. In this situation, a clinical subtype of samples is considered
   to be related to the given clinical index and is represented by a
   subtype-factor (e.g., a DEVG or a dysregulated pathway from IEA). The
   clinical subtype of a particular sample might be contributed to by
   many subtype-factors (i.e., many pathways). Given a known phenotype
   (e.g., a clinical index), a few subtype-factors correlated with this
   phenotype can be found, although this reveals only the tip of the
   iceberg of the subtypes of genotype-phenotype associations.

   In particular, unlike conventional unsupervised clustering for subtype
   identification, a supervised-like clustering approach (SLC) is
   proposed to identify subtype-factors at the level of pathways. First,
   the case samples are grouped into two clusters according to their
   feature values (i.e., DEVG expressions) relative to those of the
   control samples: on each feature (DEVG), one group of samples has
   larger values than the controls while the other group has smaller
   values than the same controls, or vice versa. In other words, a
   hyperplane determined by a few control samples separates the sample
   space into two sub-spaces, and the case samples in each sub-space are
   grouped into one cluster. Second, clinical information on the samples
   is used to evaluate the potential subtype represented by these two
   clusters of case samples. If the clinical values of the two groups
   differ significantly, a clinical subtype of genotype-phenotype
   association (e.g., the correlation between clinical indices and pathway
   DEVGs) is identified, and the corresponding pathway is a subtype-factor
   for the given clinical index.

   In practice, the SLC algorithm on a pathway is implemented as follows:
     * (i)
       Discretize the expression of the DEVGs in the case samples into
       binary vectors based on the values of the controls: for a DEVG, if
       its expression value in a case sample is larger than the mean of
       the controls, the corresponding entry in the binary vector is one;
       otherwise, it is zero.
     * (ii)
       Cluster the case samples based on their binary vectors with a
       conventional method such as hierarchical clustering or K-means, to
       obtain two sample clusters.
     * (iii)
       Calculate the significance of the difference in the clinical index
       between the two sample clusters. If the difference is significant,
       the pathway is identified as a subtype-factor of the association
       between the given pathway and the clinical index.
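
   The three SLC steps can be sketched with the Python standard library
   alone. This is an illustration under stated assumptions, not the
   authors' implementation: a plain two-centroid clustering stands in for
   hierarchical clustering or K-means, a permutation test on the
   difference in means stands in for the significance test (the text does
   not name one), and all function names and toy values are hypothetical.

```python
import random
from statistics import mean


def binarize(case_expr, control_expr):
    """Step (i): 1 where a case value exceeds the control mean of that DEVG."""
    n_genes = len(control_expr[0])
    ctrl_mean = [mean(row[g] for row in control_expr) for g in range(n_genes)]
    return [[1 if row[g] > ctrl_mean[g] else 0 for g in range(n_genes)]
            for row in case_expr]


def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def two_clusters(vectors, n_iter=20):
    """Step (ii): simple 2-means on binary vectors (assumed stand-in)."""
    n = len(vectors)
    # Deterministic start: the two most dissimilar samples.
    i, j = max(((i, j) for i in range(n) for j in range(i + 1, n)),
               key=lambda ij: sq_dist(vectors[ij[0]], vectors[ij[1]]))
    centers = [list(vectors[i]), list(vectors[j])]
    labels = [0] * n
    for _ in range(n_iter):
        labels = [0 if sq_dist(v, centers[0]) <= sq_dist(v, centers[1]) else 1
                  for v in vectors]
        for k in (0, 1):
            members = [v for v, lab in zip(vectors, labels) if lab == k]
            if members:
                centers[k] = [mean(col) for col in zip(*members)]
    return labels


def permutation_pvalue(values, labels, n_perm=2000, seed=0):
    """Step (iii): two-sided permutation test on the difference in means."""
    if len(set(labels)) < 2:
        return 1.0  # a single cluster carries no subtype information

    def diff(lab):
        a = [v for v, l in zip(values, lab) if l == 0]
        b = [v for v, l in zip(values, lab) if l == 1]
        return abs(mean(a) - mean(b))

    observed = diff(labels)
    rng = random.Random(seed)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if diff(shuffled) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)


# Toy data: 3 DEVGs of one hypothetical pathway.
controls = [[5.0, 5.2, 4.9], [5.1, 4.8, 5.0], [4.9, 5.0, 5.1]]
cases = [[8.0, 7.5, 8.2], [7.9, 8.1, 7.7], [8.3, 7.8, 8.0], [7.6, 8.2, 7.9],
         [2.1, 2.4, 1.9], [2.0, 1.8, 2.2], [1.7, 2.3, 2.0], [2.2, 2.0, 2.1]]
bmi = [30, 31, 32, 33, 21, 22, 23, 24]  # hypothetical clinical index

bits = binarize(cases, controls)
labels = two_clusters(bits)
p_value = permutation_pvalue(bmi, labels)
print(labels, p_value)
```

   Because the binary vectors only record the direction of change relative
   to the controls, the clustering is driven by over- versus
   under-expression and not by the magnitude of the expression values.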

Results and discussion

The evaluation of biological meaning of IEA by method comparison

   IEA is proposed to evaluate dysregulated pathways by differential gene
   expression and differential expression variance together. Differential
   expression variance has been reported as a new and important type of
   expression change during a phenotype change, e.g., disease [[74]36].
   In this work, the biological hypothesis underlying IEA is that
   dysregulated pathways rich in genes with differential expression
   variance are subtype-relevant pathways. Although subtype-relevant
   pathways for a particular complex disease are not annotated in current
   pathway databases, e.g., KEGG, it is still possible to investigate
   whether the known disease pathways in KEGG are subtype-relevant and
   whether IEA can identify them. In a previous study of gene-set analysis
   [[75]29], a comparison scheme was built to evaluate the performances of
   different enrichment analysis methods (e.g., ORA or GSEA) based on
   multiple expression datasets on complex diseases. Unlike that general
   comparison, we focus on comparisons on approach-specific datasets, in
   order to evaluate mainly the biological meaning of IEA.

   According to the comparison protocol [[76]29], we ran a total of eight
   representative enrichment analysis methods on 36 GEO datasets with
   target pathways in KEGG, and obtained the rank of the target pathway
   estimated by each method on each dataset; then, for each dataset, we
   ranked the eight methods according to their prioritization performance
   or sensitivity performance [[77]29], and the dataset was assigned as a
   specific-data for the Top-K methods (K is set to 3); thus, all
   specific-data for one method constitute its K-order approach-specific
   dataset. Generally, on one method's approach-specific dataset, this
   method should perform best or comparably to the other methods, so that
   the biological characteristics assumed by the method would be
   significantly displayed on these datasets. Therefore, we can use this
   strategy to investigate the biological meaning of IEA in real datasets.
   Below, we first summarize the biological hypotheses held by different
   state-of-the-art enrichment analysis methods and their respective
   quantitative measurements, and then discuss the comparison between IEA
   and the others.
     * (i)
       PLAGE: it assumes that the activity of a pathway, rather than the
       expression of pathway genes, determines the activated or inhibited
       status of the pathway under different conditions; the pathway
       activity is measured by an activity score given by the weights of
       a metagene extracted from all pathway genes by SVD (singular value
       decomposition) [[78]39].
     * (ii)
       GSVA: it proposes that the change of pathway activity between
       control and case should be evaluated at the level of samples, e.g.,
       by considering the variation of pathway activity over a sample
       population; the pathway activity is measured by the so-called GSVA
       score as a function of the expressions of genes inside and outside
       the pathway, and these scores are assessed similarly to GSEA by
       using the Kolmogorov-Smirnov (KS) like random walk statistic
       [[79]40].
     * (iii)
       PADOG: it assumes that, if the genes highly specific to a given
       pathway are differentially expressed, the respective pathway is
       truly relevant in that condition; thus, a new gene set score is
       calculated as the mean of the absolute values of weighted moderated
       gene t-scores, where the gene weights are designed to be large for
       genes appearing in few pathways and small for genes that appear in
       many pathways [[80]41].
     * (iv)
       GLOBALTEST: it assumes that, if a group of genes (e.g., pathway
       genes) can be used to predict the clinical outcome, the expression
       patterns of this gene group must differ between dissimilar clinical
       outcomes; thus, it uses a generalized linear model to give one
       P-value for a group of genes, not a P-value for each gene, which
       can be applied to estimate the enrichment of a given pathway
       [[81]42].
     * (v)
       MRGSE: it proposes that the high ranks of expression changes (e.g.,
       fold-change) of genes can indicate the differential expression of a
       set of genes (e.g., pathway genes); and the enrichment score or the
       test statistic of a pathway is the mean rank of this gene set,
       i.e., the average of the ranks of t-statistics of pathway genes
       [[82]43].
     * (vi)
       GSA: it is similar to GSEA and proposes two improvements: the
       maximal average statistic for summarizing gene-sets, and
       restandardization for accurate enrichment inference [[83]44].
     * (vii)
       ORA: it takes into account the number of differentially expressed
       genes observed in a pathway as indicators of pathway states;
       generally, it uses a basic contingency table to test the
       association between the differential expression status of a gene
       (e.g., differentially expressed gene, or not) and its membership in
       a given gene set (e.g., pathway gene, or not), which can be
       measured by the P-value of a hypergeometric test [[84]45].
     * (viii)
       IEA: it is proposed in this work to consider, in general, the
       contribution of expression variance in a dysregulated pathway; as
       one implementation, this work takes into account the number of DEGs
       and DEVGs observed in a pathway as indicators of pathway states; it
       is designed to test the association between the differential
       expression/differential expression variance status of a gene and
       its membership in a given gene set, which can be measured by the
       P-value from the HT2 approach proposed in this work.
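
   As a concrete reference point for items (vii) and (viii), the
   hypergeometric test that underlies ORA can be written in a few lines.
   This is a generic sketch of the standard upper-tail test; the function
   name, argument names and toy counts are illustrative assumptions, and
   the sketch does not reproduce the HT2 statistic of IEA (formula 3).

```python
from math import comb


def ora_pvalue(n_universe, n_pathway, n_deg, n_overlap):
    """Upper-tail hypergeometric P-value, P(X >= n_overlap).

    n_universe: number of genes measured
    n_pathway:  number of measured genes that belong to the pathway
    n_deg:      number of differentially expressed genes (DEGs)
    n_overlap:  number of DEGs that belong to the pathway
    """
    total = comb(n_universe, n_deg)
    upper = min(n_pathway, n_deg)
    tail = sum(comb(n_pathway, k) * comb(n_universe - n_pathway, n_deg - k)
               for k in range(n_overlap, upper + 1))
    return tail / total


# Toy counts: 20 genes measured, 5 in the pathway, 4 DEGs, 3 of them in
# the pathway.
p = ora_pvalue(n_universe=20, n_pathway=5, n_deg=4, n_overlap=3)
print(round(p, 4))
```

   The contingency-table view in item (vii) and this tail sum describe the
   same test: the P-value is the probability of drawing at least the
   observed number of pathway genes when the DEGs are picked at random
   from all measured genes.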

   First of all, we clustered the above eight approaches by their
   performances on all datasets to investigate the general association
   among different methods. As shown in Figs. [85]2 and [86]3, the
   similarity between any two methods is measured by four criteria: the
   first is whether the ranks given by two methods on the same dataset are
   the same (i.e., Euclidean distance on ranks in Fig. [87]2a); the second
   is whether the ranks given by two methods show the same trend across
   different datasets (i.e., correlation distance on ranks in
   Fig. [88]2b); the third is whether the P-values given by two methods on
   the same dataset are the same (i.e., Euclidean distance on P-values in
   Fig. [89]3a); and the last is whether the P-values given by two methods
   show the same trend across different datasets (i.e., correlation
   distance on P-values in Fig. [90]3b). GSA and PADOG are both based on
   conventional GSEA, so they are similar; the proposed IEA is based on
   ORA, so these two also perform similarly on different datasets; PLAGE
   and GLOBALTEST are closely clustered together, one reason being that
   they both estimate a score from all pathway genes rather than from
   individual genes (i.e., PLAGE uses the weights of a metagene extracted
   from all pathway genes by SVD, and GLOBALTEST uses a generalized linear
   model to give one P-value for a group of genes); in addition, MRGSE and
   GSVA differ considerably from each other and from the other methods,
   possibly because they follow specific design principles for measuring
   pathway dysfunction, i.e., MRGSE combines the t-statistics of
   individual pathway genes whereas GSVA uses a score that is a function
   of the expressions of genes inside and outside a pathway.
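
   The difference between the two kinds of distance can be seen on a
   small example. The sketch below assumes that "correlation distance"
   means one minus the Pearson correlation, which the text does not state
   explicitly; the method names and rank profiles are hypothetical.

```python
from math import sqrt


def euclidean_distance(x, y):
    """Small when two methods give nearly the same value on each dataset."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def correlation_distance(x, y):
    """1 - Pearson correlation: small when two profiles share a trend."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return 1 - cov / (sx * sy)


# Hypothetical target-pathway ranks of three methods on four datasets.
method_a = [1, 2, 3, 4]
method_b = [11, 12, 13, 14]  # same trend as A, but consistently worse ranks
method_c = [4, 3, 2, 1]      # ranks close to A, but the opposite trend

print(euclidean_distance(method_a, method_b),
      correlation_distance(method_a, method_b))
print(euclidean_distance(method_a, method_c),
      correlation_distance(method_a, method_c))
```

   Here A and B are far apart in Euclidean distance but have a correlation
   distance close to zero, whereas A and C are close in Euclidean distance
   but have the largest possible correlation distance, which is why the
   two panels of each figure can group the methods differently.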

Fig. 2.

   Categorization of representative gene set analysis approaches based on
   clustering of prioritization performance. a Method clustering based on
   the Euclidean distance of the ranks of all pathways. b Method
   clustering based on the correlation distance of the ranks of all
   pathways.

Fig. 3.

   Categorization of representative gene set analysis approaches based on
   clustering of sensitivity performance. a Method clustering based on
   the Euclidean distance of the P-values of all pathways. b Method
   clustering based on the correlation distance of the P-values of all
   pathways.

   Then, we directly grouped the datasets according to the performance of
   a given method; e.g., datasets are included as K-order IEA-specific
   datasets only when the rank of IEA's performance among all methods is
   in the Top-K on these datasets, where K is set to 3 in this study. To
   quantify the performance, sensitivity (i.e., P-value) and
   prioritization (i.e., rank) are adopted as previously [[93]29]. In the
   previous evaluation on these datasets, PADOG displays performance
   consistently comparable with the other methods, while PLAGE, GLOBALTEST
   and MRGSE have the best performances on some categories of datasets
   [[94]29], which already suggests the existence of approach preferences.