Abstract

   Detecting associations between an input gene set and annotated gene
   sets (e.g., pathways) is an important problem in modern molecular
   biology. In this paper, we propose two algorithms, termed NetPEA and
   NetPEA’, for conducting network-based pathway enrichment analysis. Our
   algorithms consider not only shared genes but also gene–gene
   interactions. Both algorithms utilize a protein–protein interaction
   network and a random walk with restart procedure to identify hidden
   relationships between an input gene set and pathways, but they use
   different randomization strategies to evaluate statistical significance
   and, as a result, emphasize different pathway properties. Compared to an
   over-representation-based method, our algorithms can identify more
   statistically significant pathways. Compared to an existing
   network-based algorithm, EnrichNet, our algorithms have a higher
   sensitivity in revealing the true causal pathways while at the same
   time achieving a higher specificity. A literature review of selected
   results indicates that some of the novel pathways reported by our
   algorithms are biologically relevant and important. While the
   evaluations are performed only with KEGG pathways, we believe the
   algorithms can be valuable for general functional discovery from
   high-throughput experiments.

   Keywords: pathway, protein–protein interaction network, enrichment
   analysis, gene sets, random walk with restart

1. Introduction

   Modern molecular biology has been revolutionized by the emergence of
   high-throughput experimental technologies such as microarrays and
   next-generation DNA sequencing. While many exciting discoveries have
   been made by data-driven analysis of such whole-genome data sets, an
   important problem that many biologists face every day is how to
   interpret such large-scale data sets. A typical output from such a
   high-throughput experiment is a list of genes that are observed to be
   associated with a certain phenotype, such as those differentially
   expressed in tumors compared to normal tissues. Although obtaining the
   gene list is easy, the bottleneck usually lies in understanding the
   meaning of the genes and generating new testable hypotheses in the
   hope of revealing the underlying molecular cause of the phenotype.

   Biologists are knowledge-driven. A principled way to interpret such
   gene lists is to compare them with a database of well-annotated gene
   sets, such as biological pathways. For example, one of the most widely
   used approaches, Over Representation Analysis (ORA) [1], counts the
   number of common genes shared by an input gene set and each annotated
   gene set and applies a statistical test, such as the cumulative
   hypergeometric test, to calculate the statistical significance of the
   overlap. A p-value cutoff, e.g., 0.05, is then applied to select the
   annotated gene sets that share a statistically significant overlap with
   the input gene set. ORA is very easy to implement, and the idea behind
   it is straightforward to biologists. A popular extension of ORA, known
   as Gene Set Enrichment Analysis (GSEA) [2], tries to eliminate the
   need for an ad hoc cutoff (e.g., expression fold change), which is
   often used in defining the input gene set. GSEA works by ranking all
   genes in the genome according to, say, level of differential
   expression, and tests whether any annotated gene set is ranked
   unexpectedly high or low through a running-sum statistic. While GSEA is
   becoming more popular, it is sensitive to noise and may report too many
   pathways that are hard for biologists to interpret. In addition, GSEA
   is not applicable in cases where a completely ranked gene list is
   unavailable. As a result, ORA is still widely used by biologists.
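
   As a minimal sketch of the ORA calculation described above, the
   following Python snippet computes the cumulative hypergeometric
   p-value for an observed overlap. The function name, its parameters,
   the background size, and the example numbers are illustrative
   assumptions and are not taken from [1].

```python
from math import comb


def ora_p_value(n_universe, n_pathway, n_input, n_overlap):
    """Probability of seeing at least n_overlap shared genes by chance.

    n_universe: number of genes in the background (assumed known)
    n_pathway:  number of genes in the annotated gene set
    n_input:    number of genes in the input gene set
    n_overlap:  number of genes shared by the two sets
    """
    max_overlap = min(n_pathway, n_input)
    total = comb(n_universe, n_input)
    # Upper tail of the hypergeometric distribution: P(X >= n_overlap).
    tail = sum(
        comb(n_pathway, k) * comb(n_universe - n_pathway, n_input - k)
        for k in range(n_overlap, max_overlap + 1)
    )
    return tail / total


# Hypothetical numbers: 20,000 background genes, a 100-gene pathway,
# a 200-gene input list, and 8 shared genes.
p = ora_p_value(20000, 100, 200, 8)
is_significant = p < 0.05  # the 0.05 cutoff mentioned in the text
```

   Note that the p-value depends only on set sizes and the overlap count,
   which is why ORA cannot see interactions between non-shared genes.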

   Both ORA and GSEA depend on the availability of trusted gene
   annotations, such as gene ontology (GO) or metabolic pathways, which
   limits their applicability to well-annotated species. In addition,
   gene annotations in databases such as GO or the Kyoto Encyclopedia of
   Genes and Genomes (KEGG) pathway database may be strongly biased
   toward classes of genes or phenotypes that are popular targets, such
   as cancer. More importantly, it is increasingly recognized that such
   enrichment-based analyses, including both ORA and GSEA, have very low
   discriminative power, as they treat genes as independent functional
   units. In reality, genes function in a highly coordinated way. For
   example, two gene sets may share few genes but can be involved in
   similar functional pathways, or they can represent two sub-modules of
   the same pathway. Common enrichment-based analyses may not be able to
   detect such relationships between gene sets.

   To address the aforementioned issues, several studies have proposed the
   use of biological networks, such as protein–protein interaction (PPI)
   networks, as a more unbiased tool to investigate the biological meaning
   of gene sets [3,4,5,6]. The rationale is that genes
   that are located within a short distance in the same network are likely
   involved in similar biological processes. As such networks are
   typically obtained from high-throughput experiments, they are less
   likely to be biased by existing knowledge and can probably provide
   better coverage to different classes of genes and phenotypes.
   Furthermore, network-based analysis allows the relationship between
   genes to be explicitly modeled, instead of treating them as independent
   entities. While conceptually interesting, such methods have had limited
   success because high-throughput biological networks are usually very
   noisy, and still have many missing edges. Furthermore, the results
   obtained by such analyses, often in the form of PPI subnetworks, can be
   difficult to interpret because functional connections to biological
   processes are missing.

   Another strategy, which seems to be more successful in practice, is to
   combine both biological networks and pathways in an analysis. For
   example, Alexeyenko et al. proposed a network-based method to
   investigate the associations between input gene sets and annotated gene
   sets by counting the number of network links between members of two
   gene sets [7]. Later, Glaab et al. proposed an algorithm called
   EnrichNet, which extends the method of [7] to include gene pairs
   that are not necessarily direct neighbors but are within close
   proximity in the network [8]. These approaches take gene
   correlations and interactions into consideration and agree with the
   fact that genes function in a coordinated way, which is a meaningful
   improvement over ORA. However, EnrichNet only provides scores to
   measure the functional associations and does not provide information
   about the statistical significance of the scores.

   In this paper, we propose two network-based pathway enrichment analysis
   algorithms, NetPEA and NetPEA’. Our algorithms consider not only shared
   genes but also gene–gene interactions. The two algorithms share some
   common features to identify hidden relationships between an input gene
   set and pathways, but each uses a different randomization strategy to
   evaluate statistical significance and, as a result, emphasizes
   different pathway properties.

   The remainder of this paper is organized as follows. We present the
   details of the two algorithms and the data sets in Section 2. In
   Section 3, we present the test results of our methods on multiple
   data sets and discuss the significance of our findings. We conclude
   with some remarks on future improvements in Section 4.

2. Methods and Materials

2.1. Overview of Our Algorithms

   We propose two network-based pathway enrichment analysis algorithms,
   NetPEA and NetPEA’. NetPEA treats gene interactions as being as
   important as shared genes, while NetPEA’ is devised to find hidden
   causal pathways that are not enriched within the input gene set.

   The two algorithms share a first step, which calculates similarity
   scores between the input gene set and pathways based on their
   “closeness” on a PPI network, measured by a random walk with restart
   procedure. The two methods then adopt different randomization
   strategies as their background models to evaluate the statistical
   significance of the similarity scores. NetPEA uses randomized gene sets
   as its background; NetPEA’ employs randomized gene sets and randomized
   networks as its background. Both algorithms take gene interactions into
   account but focus on different pathways. Considering not only shared
   genes but also interacting genes, NetPEA extends the search scope of
   ORA, and therefore it reports more significant pathways. On the other
   hand, NetPEA’ attempts to de-emphasize pathways that are considered
   statistically significant simply because of their overlaps with the
   input gene set, and, as a result, it is able to identify pathways that
   are within close proximity to the input genes but do not have a
   significant overlap with them. As shown in Section 3, these hidden
   pathways, which are typically ignored by ORA and similar approaches,
   may very likely be the actual causal pathways and can be robust among
   experiments.

2.2. Random Walk-Based Similarity Measure

   Figure 1 shows the main component of our methods, the calculation of
   similarity scores, which measure the closeness of pathways to the
   input gene set. First, we map the genes in the input gene set and
   pathways to a relevant biological network (a PPI network in this
   study). Then all nodes are assigned an initial value of 0, except the
   ones in the input gene set, which are assigned a value of 1. The
   Random Walk with Restart (RWR) procedure [9] is then used to spread
   the nonzero initial values to other nodes in the network. RWR is a
   well-known machine learning algorithm used to measure the similarities
   between nodes by imagining that, starting from each nonzero node, there
   is a random walker that, at each step, either moves to a randomly
   chosen neighbor or jumps back to the starting node. We formulate the
   procedure in Equation (1). V denotes the vector of initial node
   values; p represents the restart probability, which indicates the
   probability for random walkers to jump back to their starting nodes
   (fixed at 0.5 in this study); M is the PPI network transition matrix;
   and S_n is the vector of node values after n rounds of propagation,
   which measures the similarity between each node in the network and
   the nodes in the input gene set. At the very beginning, S_0 is
   initialized with V. After sufficient rounds of propagation, S_n
   reaches a steady state in which the node values converge and become
   stable. For each pathway, we take the average of its member gene
   values as its similarity score to the input gene set.
   $$S_n = (1 - p)\, M\, S_{n-1} + p\, V \tag{1}$$
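
   The propagation in Equation (1) can be sketched in a few lines of
   Python. This is only an illustration under stated assumptions: the
   network is given as a hypothetical dictionary of undirected neighbor
   sets, the transition matrix M is assumed to be column-normalized
   (each node splits its value equally among its neighbors), and the
   convergence tolerance and iteration cap are arbitrary choices that the
   text does not specify.

```python
def random_walk_with_restart(neighbors, seeds, restart_prob=0.5,
                             tol=1e-10, max_iter=1000):
    """Iterate S_n = (1 - p) * M * S_{n-1} + p * V until the values settle.

    neighbors: dict mapping each node to the set of its neighbors
               (assumed symmetric, i.e., an undirected network)
    seeds:     nodes of the input gene set (initial value 1, others 0)
    """
    initial = {node: (1.0 if node in seeds else 0.0) for node in neighbors}
    scores = dict(initial)  # S_0 = V
    for _ in range(max_iter):
        updated = {}
        for node in neighbors:
            # Each neighbor passes on an equal share of its current value
            # (assumed column-normalized transition matrix M).
            inflow = sum(scores[nb] / len(neighbors[nb])
                         for nb in neighbors[node])
            updated[node] = ((1 - restart_prob) * inflow
                             + restart_prob * initial[node])
        change = sum(abs(updated[n] - scores[n]) for n in neighbors)
        scores = updated
        if change < tol:
            break
    return scores


def pathway_similarity(scores, pathway_genes):
    """Average propagated value over pathway genes found in the network."""
    members = [g for g in pathway_genes if g in scores]
    return sum(scores[g] for g in members) / len(members) if members else 0.0


# Toy network A - B - C with gene A as the only input gene.
toy_network = {"A": {"B"}, "B": {"A", "C"}, "C": {"B"}}
toy_scores = random_walk_with_restart(toy_network, seeds={"A"})
toy_similarity = pathway_similarity(toy_scores, ["B", "C"])
```

   In the toy example the values decay with distance from the input gene,
   which is the behavior the similarity score relies on.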

Figure 1.

   The workflow for calculating similarity scores between the input gene
   set and pathways. RWR refers to a Random Walk with Restart procedure.

2.3. Network-Based Pathway Enrichment Analysis

   To evaluate the statistical significance for pathways to achieve such
   similarity scores, we introduce two algorithms based on different
   randomization strategies.

2.3.1. Algorithm NetPEA

   NetPEA randomizes only the input gene set: we randomly choose the same
   number of genes as in the input gene set and calculate the similarity
   score for each pathway. Repeating this randomization 1000 times gives
   1000 similarity scores for each pathway, which serve as its
   background. Equation (2) is used to calculate z-scores for pathway
   significance. D is the similarity score using the gene set of interest
   as input, while R is the set of 1000 similarity scores obtained with
   randomized gene sets as inputs.
   $$z\text{-score} = \frac{D - \mathrm{mean}(R)}{\mathrm{std}(R)} \tag{2}$$

   We rank the pathways in descending order according to their z-scores.
   As the distribution of the z-scores roughly follows a normal
   distribution, we also convert the z-scores to p-values under a normal
   distribution assumption. We then select pathways with z-scores greater
   than 1.65, which corresponds to a one-sided p-value of about 0.05, as
   statistically significant pathways.
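
   The NetPEA significance test can be sketched as follows. The sketch
   assumes a hypothetical callable `similarity` that maps a gene list to
   the similarity score of one pathway (for example, an RWR-based score),
   assumes that random gene sets are drawn without replacement from a
   user-supplied gene pool, and uses the sample standard deviation; none
   of these details is specified in the text.

```python
import random
from statistics import NormalDist, mean, stdev


def netpea_z_score(similarity, input_genes, gene_pool, n_random=1000, seed=0):
    """z-score of Equation (2) for a single pathway.

    similarity: hypothetical function, gene list -> similarity score
    gene_pool:  genes from which random input sets are drawn (assumption)
    """
    rng = random.Random(seed)
    pool = list(gene_pool)
    observed = similarity(input_genes)  # D
    background = [                      # R: scores of randomized gene sets
        similarity(rng.sample(pool, len(input_genes)))
        for _ in range(n_random)
    ]
    return (observed - mean(background)) / stdev(background)


def z_to_p(z):
    """One-sided p-value under the normal distribution assumption."""
    return 1.0 - NormalDist().cdf(z)


# Toy stand-in for the similarity score: fraction of genes in a pathway.
toy_pathway = set(range(10))


def toy_similarity(genes):
    return len(set(genes) & toy_pathway) / len(genes)


toy_z = netpea_z_score(toy_similarity, list(range(10)), range(100),
                       n_random=200)
toy_significant = toy_z > 1.65
```

   A background with zero variance would make the z-score undefined, so a
   real implementation would need to handle that case explicitly.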

2.3.2. Algorithm NetPEA’

   NetPEA’ randomizes both the input gene set and PPI network to calculate
   the statistical significance of the associations between the input gene
   set and annotated gene sets. To randomize the network, we rewire the
   network connections randomly but ensure all nodes in the network
   maintain the same degrees as in the original PPI network. Repeating
   this rewiring 10 times, we have 11 networks including the original
   human PPI network. For each network, we perform a random walk with
   restart procedure to calculate the similarity scores between the true
   input gene set and each of the annotated gene sets. This is also
   repeated for 1000 randomized input gene sets.
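
   The text does not say how the degree-preserving rewiring is carried
   out. One common way to do it, shown here purely as an assumed
   illustration, is repeated edge swapping: two edges (a, b) and (c, d)
   are replaced by (a, d) and (c, b) whenever this creates neither a
   self-loop nor a duplicate edge. The function name, the number of
   swaps, and the attempt limit are hypothetical choices.

```python
import random


def rewire_preserving_degrees(edges, n_swaps, seed=0):
    """Randomize an undirected network while keeping every node's degree."""
    rng = random.Random(seed)
    edge_list = [tuple(e) for e in edges]
    edge_set = {frozenset(e) for e in edge_list}
    done = 0
    attempts = 0
    while done < n_swaps and attempts < 100 * n_swaps:
        attempts += 1
        i, j = rng.sample(range(len(edge_list)), 2)
        a, b = edge_list[i]
        c, d = edge_list[j]
        if len({a, b, c, d}) < 4:
            continue  # the swap would create a self-loop
        new1, new2 = frozenset((a, d)), frozenset((c, b))
        if new1 in edge_set or new2 in edge_set:
            continue  # the swap would create a duplicate edge
        edge_set -= {frozenset((a, b)), frozenset((c, d))}
        edge_set |= {new1, new2}
        edge_list[i], edge_list[j] = (a, d), (c, b)
        done += 1
    return edge_list


# Toy ring network with 10 nodes, each of degree 2.
ring = [(i, (i + 1) % 10) for i in range(10)]
rewired = rewire_preserving_degrees(ring, n_swaps=20, seed=1)
```

   Each swap changes which nodes are connected but leaves the number of
   edges at a, b, c and d untouched, so the degree sequence is preserved.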

   The z-score of the association between an input gene set and each
   annotated gene set is calculated by Equation (3). For each pathway,
   DN represents the similarity score taking the real input gene set and
   the human PPI network as input; DR is a set of 10 similarity scores
   using the real input gene set and the randomized networks as input; RN
   represents a set of 1000 similarity scores taking randomized gene sets
   and the human PPI network as input; and RR is 10 sets (each set
   corresponding to one randomized network) of 1000 similarity scores
   taking randomized gene sets and randomized networks as input. As in
   NetPEA, we rank all pathways according to their z-scores and use a
   cutoff of 1.65 to select statistically significant pathways.
   $$z\text{-score} = \frac{(\mathit{DN} - \mathrm{mean}(\mathit{DR})) - \mathrm{mean}(\mathit{RN} - \mathrm{mean}(\mathit{RR}))}{\mathrm{std}(\mathit{RN} - \mathrm{mean}(\mathit{RR}))} \tag{3}$$

   Suppose that there is a modest overlap between an input gene set and a
   particular pathway, and that none of the genes in the input gene set,
   other than those overlapping, are located within close proximity to the
   pathway genes in the PPI network. In this case, the expected values in
   the vector DR are close to DN. As a result, the z-score for this
   particular pathway will be “corrected” and become insignificant in
   NetPEA’ compared to that in NetPEA. Therefore, we expect NetPEA’ to
   report fewer pathways than NetPEA but, at the same time, to promote
   the ranking of the true causal pathways, which may not be ranked high
   in NetPEA or other overlap-based methods.
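
   A small sketch may help to make Equation (3) and the correction
   effect concrete. It rests on one interpretive assumption: mean(RR) is
   taken per randomized gene set, averaging over the randomized networks,
   so that RN - mean(RR) is a vector with one entry per randomized gene
   set. The function name and the toy numbers are hypothetical.

```python
from statistics import mean, stdev


def netpea_prime_z_score(dn, dr, rn, rr):
    """z-score of Equation (3) for a single pathway.

    dn: score for the real gene set on the real network (number)
    dr: scores for the real gene set on each randomized network (list)
    rn: scores for each randomized gene set on the real network (list)
    rr: one list per randomized network, each holding the scores of the
        same randomized gene sets on that network
    """
    corrected_observed = dn - mean(dr)
    # Assumed reading of mean(RR): average over networks for each gene set.
    corrected_background = [
        rn_i - mean(per_network[i] for per_network in rr)
        for i, rn_i in enumerate(rn)
    ]
    return ((corrected_observed - mean(corrected_background))
            / stdev(corrected_background))


# Toy background shared by both cases below.
toy_rn = [0.1, 0.2, 0.3]
toy_rr = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]

# Overlap-only case: rewiring barely changes the score (DR close to DN).
z_overlap_only = netpea_prime_z_score(0.9, [0.9, 0.9], toy_rn, toy_rr)

# Network-driven case: the score drops sharply on rewired networks.
z_network_driven = netpea_prime_z_score(0.9, [0.1, 0.1], toy_rn, toy_rr)
```

   With identical DN, only the pathway whose score depends on the real
   network topology keeps a high z-score, which is the intended effect.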

2.4. Data Sets

   The annotated gene sets used in this study are from KEGG pathways
   [10]. We test our algorithms using input gene sets from several
   well-cited sources, including a Parkinson’s disease gene set [11],
   a lymphoma gene set [12], two breast cancer gene sets obtained by two
   groups [13,14], two lung cancer gene sets [15,16], a diabetes gene set
   [17], a leukemia gene set [18], and two unpublished gene sets,
   “gender” and “p53”, from the GSEA website [2]. All gene sets except
   “Parkinson” are from high-throughput experiments. The Parkinson’s
   disease gene set is from a literature search. For four of the data
   sets (p53, gender, diabetes, leukemia), genes are further categorized
   into “up” and “down” according to the direction in which their
   expression levels change in the experiments. The human PPI network
   was downloaded from the Human Protein Reference Database (HPRD,
   version 9) [19].

3. Results and Discussion

   Validation of associations between gene sets is difficult because of
   the lack of ground truth and the biases inherent in different
   evaluation standards. To evaluate the performance of our methods and
   to compare them fairly with existing ones, we adopted and designed
   multiple evaluation methods.

3.1. Validation Using KEGG Pathways as Input Genes

   To validate that our algorithms can indeed identify the most relevant
   pathways, we first used KEGG pathways as input genes to identify the
   most significantly associated KEGG pathway for each pathway. The
   rationale is that each pathway should be most closely related to
   itself. Indeed, NetPEA ranks each pathway itself as the most enriched
   pathway with a very significant z-score. Moreover, some between-pathway
   associations found by NetPEA are also reasonable. For example, the top
   three pathways associated with “DNA replication” are “DNA replication”,
   “mismatch repair”, and “nucleotide excision repair”, while the top
   three pathways associated with “chemokine signaling” are “chemokine
   signaling”, “cytokine–cytokine receptor interaction”, and “gap
   junction”.

   On the other hand, NetPEA’ ranks the pathway itself as the most
   enriched pathway for only 62% of pathways, and ranks the pathway itself
   in the top 10% for 91% of pathways. The deviation from the ground truth is
   because NetPEA’ is intended to explore hidden pathways by
   de-emphasizing pathways that are considered significant simply because
   of their overlaps with the input gene set. Indeed, for a number of
   cases where the input gene set itself is not ranked as the top gene
   set, we found evidence of the association between the reported top gene
   set and the input gene set, such as “maturity onset diabetes of the
   young” and “methane metabolism”. Note also that the association between
   the two pathways is not significant according to NetPEA.

3.2. Validation Using GSEA Outputs as Benchmarks

   For four data sets for which we have access to the coupled microarrays
   and ranked gene lists, we applied GSEA to rank the pathways and used
   the rankings as benchmarks. GSEA is widely used as a benchmark to
   validate gene set rankings, and [8] uses it to check pathway rankings.
   While the results may be biased, they provide partial evidence that our
   algorithms achieve better performance. Here we calculate Spearman
   correlation coefficients between each method and GSEA. Table 1 shows
   that for each input gene set, the largest correlation coefficient is
   from NetPEA or NetPEA’, which suggests that the pathway rankings of
   our algorithms agree with GSEA more closely than those of ORA and
   EnrichNet.

Table 1.

   Spearman correlation coefficient between GSEA and four other
   approaches.
   Input Gene Set p53 (down) Gender (down) Diabetes (down) Leukemia (down)
       NetPEA       0.4653      0.2713         0.2373          0.3253
      NetPEA’       0.3978      0.1265         0.2401           0.182
        ORA         0.3968      0.2406         0.1602           0.264
     EnrichNet      0.2967       0.219         0.1779          0.2726
   Input gene set  p53 (up)   Gender (up)   Diabetes (up)   Leukemia (up)
       NetPEA       0.3427      0.4349          0.227          0.1195
      NetPEA’       0.2911      0.2756         0.1419          0.1438
        ORA         0.2507       0.332         0.1421          0.0583
     EnrichNet      0.2167      0.3067         0.0823          0.0599

3.3. Evaluation Based on Number of Enriched Pathways

   We apply our algorithms, NetPEA and NetPEA’, to each of the data sets
   mentioned in Section 2.4. Meanwhile, we run ORA on these input gene
   sets and compare the significant pathways found by the three methods.

3.3.1. NetPEA vs. ORA

   Table 2 shows the significant pathways reported only by NetPEA and not
   present in the results for ORA. In most cases (11/14), ORA does not
   identify any pathway that is not found by NetPEA. For common
   significant pathways discovered by both methods, we define
   NetPEA ≪ ORA, NetPEA ≫ ORA and NetPEA ≈ ORA by the ratios of their
   p-values. NetPEA ≫ ORA means that the p-value ratio (NetPEA/ORA) is
   less than 0.001; NetPEA ≪ ORA means that the ratio is greater than
   1000; otherwise it is NetPEA ≈ ORA. Strikingly, no pathways fall into
   the range NetPEA ≪ ORA. Overall, NetPEA can successfully identify
   nearly all significant pathways reported by ORA. Moreover, NetPEA
   reports many significant pathways not found by ORA. The superiority of
   NetPEA over ORA can be explained by the fact that NetPEA not only
   considers the pathway enrichment caused by common genes but also takes
   gene interactions into account. Through the gene interactions, some
   pathways not enriched in ORA are elevated to be significant. In
   Section 3.6, we will show that these additional pathways are
   biologically meaningful.
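
   The three categories for common pathways follow directly from the
   p-value ratio and can be written as a short rule. The function name
   and the returned labels are illustrative assumptions; the thresholds
   are the ones stated in the text.

```python
def compare_p_values(p_netpea, p_ora, factor=1000.0):
    """Classify a common pathway by the p-value ratio NetPEA/ORA."""
    ratio = p_netpea / p_ora  # assumes p_ora is strictly positive
    if ratio < 1.0 / factor:
        return "NetPEA >> ORA"  # NetPEA is far more significant
    if ratio > factor:
        return "NetPEA << ORA"  # ORA is far more significant
    return "NetPEA ~ ORA"       # comparable significance


# Hypothetical p-values for three pathways.
labels = [
    compare_p_values(1e-8, 1e-3),
    compare_p_values(0.01, 0.02),
    compare_p_values(0.04, 1e-6),
]
```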

Table 2.

   Significant pathways: NetPEA vs. ORA.
                          # Unique Pathways                    # Common Pathways
   Input Gene Set         NetPEA    ORA      P_NetPEA ≫ P_ORA   P_NetPEA ≪ P_ORA   P_NetPEA ≈ P_ORA
   Parkinson                18       0              19                  0                 18
   Lymphoma                 18       0              10                  0                  5
   Breast cancer [13]       13       0               6                  0                 12
   Breast cancer [14]        4       1               4                  0                 16
   Lung cancer [15]         28       0               3                  0                 12
   Lung cancer [16]         25       1               6                  0                 17
   Diabetes (down)          13       0               4                  0                  2
   Diabetes (up)             7       0               0                  0                  2
   Leukemia (down)          22       0               1                  0                  2
   Leukemia (up)             7       0               3                  0                  7
   Gender (down)             7       0               0                  0                  1
   Gender (up)              10       0               1                  0                  2
   p53 (down)               14       0               1                  0                  5
   p53 (up)                 24       1               5                  0                 18

3.3.2. NetPEA’ vs. ORA

   As NetPEA’ is devised to complement NetPEA, it provides new significant
   pathway information that is not present in the results of ORA or
   NetPEA. Compared with Table 2, the number of common significant
   pathways decreases for each input gene set (Table 3). At the same
   time, NetPEA’ produces some significant pathways not present in ORA.
   This difference arises from the network randomization in NetPEA’,
   which eliminates some pathways that are significant in ORA or NetPEA
   but have loose gene interactions and lifts pathways that are
   insignificant in ORA but have close gene interactions. As shown in
   Section 3.4, these pathways are often preserved between different
   experiments for the same disease, signifying the importance of the
   pathways for the disease.

Table 3.

   Significant pathways: NetPEA’ vs. ORA.
                          # Unique Pathways                    # Common Pathways
   Input Gene Set         NetPEA’   ORA      P_NetPEA’ ≫ P_ORA   P_NetPEA’ ≪ P_ORA   P_NetPEA’ ≈ P_ORA
   Parkinson                 7      28               3                   1                   5
   Lymphoma                 16       5               4                   0                   6
   Breast cancer [13]       11       9               0                   0                   9
   Breast cancer [14]        5       9               1                   0                  11
   Lung cancer [15]         28      12               1                   0                   2
   Lung cancer [16]         27      12               1                   0                  11
   Diabetes (down)          19       4               0                   0                   2
   Diabetes (up)            13       2               0                   0                   0
   Leukemia (down)          16       2               0                   0                   1
   Leukemia (up)            15       6               0                   0                   4
   Gender (down)             5       1               0                   0                   0
   Gender (up)              20       3               0                   0                   0
   p53 (down)               13       5               0                   0                   1
   p53 (up)                 27      15               0                   0                   9

3.4. Evaluation Using Cross-Data Stability Analysis

   It is well known that agreement is often poor between different
   high-throughput experiments concerning the same disease performed by
   different groups. Among other reasons, this is because many of the
   genes identified by these experiments reflect downstream effects,
   which can vary significantly among experiments. It is reasonable to
   assume that if we can indeed find true causal genes/pathways, agreement
   between experiments will be improved. Therefore, for the genes
   identified from the two breast cancer data sets and the two lung cancer
   data sets, we compare the significant pathways from different data sets
   reported by NetPEA, NetPEA’ and ORA. We calculate the ratios of common
   significant pathways to the total number of unique significant pathways
   identified from the two experiments as well as the p-values of the
   overlap under the hypergeometric distribution using 208 total KEGG
   pathways as background. Table 4 shows that NetPEA can find more
   common significant pathways than ORA and NetPEA’ as a result of the
   increased number of significant pathways reported by NetPEA, which
   partially suggests that the additional pathways reported by NetPEA are
   reasonable. Remarkably, while NetPEA’ reports far fewer pathways than
   NetPEA, the number of common pathways remains almost unchanged. As
   shown in Figure 2, the statistical significance of overlap between
   the pathways detected from different data sets is the highest in
   NetPEA’, compared to NetPEA and ORA. Therefore, we believe that the
   pathways reported by NetPEA’ may have a greater chance of containing
   the pathways that are directly associated with the phenotype instead of
   downstream effects. In addition, while the overlap of pathways detected
   by NetPEA is not as significant as in ORA, this is mainly due to the
   increased number of detected pathways and the limited number of
   candidate pathways as background. Since almost all pathways reported by
   ORA are also reported by NetPEA, we removed the ORA-detected pathways
   from the NetPEA results and reanalyzed the overlap. As shown in Figure
   2 (“NetPEA unique”), the pathways found by NetPEA but not by ORA do
   have an increased level of overlap compared to ORA, suggesting that the
   additional pathways identified by NetPEA are biologically relevant.
   Collectively, the results suggest that our algorithms have an advantage
   in interpreting the results of high-throughput experiments performed by
   different groups and can potentially discover the key pathways
   underlying the diseases.

Table 4.

   Common significant pathways analysis.

   Common pathways between two breast cancer data sets ([13,14])
     NetPEA:  glycolysis/gluconeogenesis, homologous recombination,
              oocyte meiosis, p53 signaling, progesterone-mediated
              oocyte maturation, base excision repair, cell cycle
     NetPEA’: lipoic acid metabolism, progesterone-mediated oocyte
              maturation, cell cycle, proteasome, ubiquitin mediated
              proteolysis, oocyte meiosis
     ORA:     glycolysis/gluconeogenesis, homologous recombination,
              progesterone-mediated oocyte maturation, cell cycle,
              oocyte meiosis

   Common pathways between two lung cancer data sets ([15,16])
     NetPEA:  DNA replication, ECM-receptor interaction, focal adhesion,
              mismatch repair, nucleotide excision repair, pancreatic
              cancer, pathways in cancer, prostate cancer, small cell
              lung cancer, base excision repair, bladder cancer
     NetPEA’: antigen processing and presentation, base excision repair,
              DNA replication, ErbB signaling, Fc epsilon RI signaling,
              Fc gamma R-mediated phagocytosis, lysosome, mismatch
              repair, nucleotide excision repair, prostate cancer,
              Vibrio cholerae infection
     ORA:     focal adhesion, mismatch repair, pathways in cancer, small
              cell lung cancer

Figure 2.

   Common pathways from two different datasets of the same disease. (a)
   Overlap between two breast cancer data sets; (b) Overlap between two
   lung cancer data sets.

3.5. Pathways Cross Verification Analysis

   Checking whether pathways ranked at the top by one method are also
   ranked at the top by other methods can provide additional confidence to
   biologists and help them narrow down new hypotheses to test. Here we
   use two cross verification methods, positive cross verification and
   negative cross verification, to compare our algorithms with ORA,
   EnrichNet and GSEA. For positive cross verification, we examine how
   many pathways out of the top 20 by one method appear in the top 20
   pathways determined by all the other methods. For negative cross
   verification, we check how many pathways out of the top 20 by one
   method are ranked below the top 100 by the other three methods. We
   verify NetPEA and NetPEA’ separately because, if we verified them
   together, they might vote for each other, which would bias the results
   in favor of our algorithms.
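
   The two counts can be sketched as below. The wording leaves some room
   for interpretation, so this sketch makes two explicit assumptions: a
   pathway counts as positively verified if it appears in the top 20 of
   at least one other method, and as negatively verified if every other
   method ranks it outside its top 100 (a pathway missing from a ranking
   is treated as outside). The function name and the toy rankings are
   hypothetical.

```python
def cross_verification(rankings, method, top_n=20, bottom_cutoff=100):
    """Return (positive, negative) counts for one method.

    rankings: dict mapping a method name to its pathways, best first
    """
    others = [m for m in rankings if m != method]
    top = rankings[method][:top_n]
    # Positive: supported by the top list of at least one other method.
    positive = sum(
        any(p in rankings[m][:top_n] for m in others) for p in top
    )
    # Negative: ranked outside the cutoff by every other method.
    negative = sum(
        all(p not in rankings[m][:bottom_cutoff] for m in others)
        for p in top
    )
    return positive, negative


# Toy rankings with a top-2 list and a cutoff of 3 instead of 20 and 100.
toy_rankings = {
    "A": ["p1", "p2", "p3", "p4"],
    "B": ["p1", "p4", "p3", "p2"],
    "C": ["p3", "p4", "p1", "p2"],
}
toy_counts = cross_verification(toy_rankings, "A", top_n=2, bottom_cutoff=3)
```

   Passing only one of NetPEA or NetPEA’ in `rankings` at a time mirrors
   the separate verification described above.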

3.5.1. NetPEA

   Table 5 shows that ORA receives the most recognition and reports
   only one pathway that the other methods do not agree with, which is
   understandable because ORA is the most conservative method and most of
   its results are also reported by NetPEA and EnrichNet. The two
   network-based approaches obtain similar results on positive cross
   verification, while NetPEA is better than its counterpart, with fewer
   negative results. Moreover, the only negative verification result of
   NetPEA is “taste transduction” for diabetes, an association that has
   been reported previously [20]. The negative verification results of
   EnrichNet include “thyroid cancer”, “basal cell carcinoma”,
   “melanogenesis”, “endometrial cancer” and “hedgehog signaling”. Our
   limited literature search does not reveal enough evidence of their
   associations with diabetes. GSEA receives the least recognition, as it
   exploits whole microarrays and, from a methodological point of view,
   differs most from the other three methods. Its negative cross
   verification results include “olfactory transduction”, “mismatch
   repair” and “SNARE interactions in vesicular transport” for diabetes.
   These associations claimed by GSEA are hard to understand. Therefore,
   NetPEA has an advantage over the other methods in ranking meaningful
   pathways at the top.

Table 5.

   Pathways cross verification analysis for NetPEA.
      Input Gene Set            Positive                  Negative
                        NetPEA ORA EnrichNet GSEA NetPEA ORA EnrichNet GSEA
   Lung cancer [[90]15]   16   16     16      7     0     0      0      2
   Lung cancer [[91]16]   14   16     16      4     0     0      1      8
     Diabetes (down)      17   19     16      5     0     0      0      3
      Diabetes (up)       15   16     13      5     1     1      5      3
     Leukemia (down)      18   18     15      5     0     0      0      7
      Leukemia (up)       15   19     15      8     0     0      1      4
      Gender (down)       18   19     15      8     0     0      1      2
       Gender (up)        17   19     15      7     0     0      0      1
        p53 (down)        19   19     13      10    0     0      1      2
         p53 (up)         16   15     16      6     0     0      2      3

3.5.2. NetPEA’

   Compared with [93]Table 5, [94]Table 6 shows that NetPEA’ receives less
   support than NetPEA. This is reasonable since NetPEA’ eliminates some
   pathways that feature prominently in the ORA results. On the other
   hand, NetPEA’ appears to share some similarities with GSEA, because
   GSEA’s positive verification counts increase while its negative
   verification counts decrease.

Table 6.

   Pathways cross verification analysis for NetPEA’.
      Input Gene Set             Positive                  Negative
                        NetPEA’ ORA EnrichNet GSEA NetPEA’ ORA EnrichNet GSEA
   Lung cancer [[95]15]    7    14     16      9      0     0      0      2
   Lung cancer [[96]16]    4    15     15      4      4     0      1      7
     Diabetes (down)       6    19     16      6      4     0      0      2
      Diabetes (up)        8    15     13      5      1     1      4      2
     Leukemia (down)       7    15     16      5      1     0      0      6
      Leukemia (up)        7    18     15      9      1     0      0      2
      Gender (down)        8    17     15      9      2     0      1      2
       Gender (up)         9    18     15      10     2     0      0      1
        p53 (down)         6    14     14      11     1     0      1      3
         p53 (up)          6    13     15      8      0     0      2      2

3.6. Novel Pathways

   As shown in [98]Section 3.3, our algorithms usually report more
   significant pathways than ORA. A careful inspection of these additional
   pathways suggests that many of them are biologically relevant and
   important. Here we only discuss a few of these pathways.

   For the diabetes down-regulated input gene set, NetPEA ranks the
   pathway “glycerolipid metabolism” 5th with a significant z-score of
   3.5 (p-value = $2.3 \times 10^{-4}$), and NetPEA’ ranks it 1st with a
   significant z-score of 7.9 (p-value = $1.4 \times 10^{-15}$). The same
   pathway has a p-value of 0.12 in ORA and is ranked 37th by EnrichNet.
   An extensive literature review shows that “glycerolipid metabolism”
   plays an important role in the pathogenesis of obesity and type 2
   diabetes [[99]21,[100]22].
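
   The two p-values quoted for “glycerolipid metabolism” are consistent
   with a one-sided (upper-tail) p-value from the standard normal
   distribution, p = P(Z >= z). This section does not spell out the
   conversion, so the sketch below treats it as an assumption; the
   function name upper_tail_p is hypothetical.

```python
import math


def upper_tail_p(z: float) -> float:
    """One-sided p-value P(Z >= z) for a standard normal variable Z."""
    # erfc keeps precision in the far tail, where 1 - cdf(z) would
    # round to zero in floating point.
    return 0.5 * math.erfc(z / math.sqrt(2.0))


# z-scores reported above for "glycerolipid metabolism".
for label, z in [("NetPEA", 3.5), ("NetPEA'", 7.9)]:
    print(f"{label}: z = {z}, p = {upper_tail_p(z):.1e}")
```

   Under this assumption, a z-score of 3.5 corresponds to a p-value of
   about 2.3e-04 and a z-score of 7.9 to about 1.4e-15, in line with the
   values reported above.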

   Another good example is the leukemia up-regulated input gene set, where
   NetPEA’ ranks the pathway “chronic myeloid leukemia” 4th with a
   significant z-score of 3.12 (p-value = $9.0 \times 10^{-14}$). ORA
   ranks it 63rd with an insignificant p-value, and EnrichNet ranks it
   77th. This pathway is missed by NetPEA (z-score = 0.23). For the
   leukemia down-regulated gene set, both NetPEA and NetPEA’ rank “folate
   biosynthesis” as the most significant pathway (z-score = 5.0 and 6.5,
   respectively), while the same pathway is ranked 22nd by EnrichNet and
   has a p-value of 0.05 in ORA. A literature search confirms the
   relationship between “folate biosynthesis” and leukemia [[101]23].

   Other verifiable significant associations that are identified by our
   methods but missed by both EnrichNet (rank > 30) and ORA (p-value >
   0.05) include “pathways in cancer” for the p53 up-regulated gene set,
   “steroid hormone biosynthesis” and “sphingolipid metabolism” in
   Parkinson’s disease, “natural killer cell mediated cytotoxicity” in
   diabetes, as well as “MAPK signaling”, “ERBB signaling”, “PPAR
   signaling”, “focal adhesion” and “ECM receptor interaction” for various
   cancer gene sets, to mention a few.

4. Conclusions

   In this paper, we propose two novel network-based algorithms to analyze
   functional associations between input gene sets and annotated gene sets
   (e.g., KEGG pathways). The two algorithms apply different randomization
   strategies to evaluate the statistical significance of the associations
   and often return complementary results. Compared to the widely adopted
   over representation analysis (ORA), our methods extend beyond
   overlap-based comparison, and as a result they are able to identify
   more significant pathways, report more common pathways shared by
   different gene sets of the same disease and gain more GSEA support for
   their pathway rankings. Compared to another network-based approach,
   EnrichNet, our algorithms usually report fewer false negative pathways,
   have a better discriminative power and provide statistical
   significance. We demonstrate that novel significant pathways reported
   by our algorithms are biologically meaningful and are confirmed by
   previous publications. In the future, we plan to extend the methods to
   multiple heterogeneous terms and contexts.

Acknowledgments