Abstract

Background

   In a complex disease, the expression of many genes can be significantly
   altered, leading to the appearance of a differentially expressed
   "disease module". Some of these genes directly correspond to the
   disease phenotype (i.e. "driver" genes), while others represent
   closely related first-degree neighbours in gene interaction space. The
   remaining genes consist of further removed "passenger" genes, which are
   often not directly related to the original cause of the disease. For
   prognostic and diagnostic purposes, it is crucial to be able to
   separate the group of "driver" genes and their first-degree neighbours
   (i.e. the "core module") from the general "disease module".

Results

   We have developed COMBINER: COre Module Biomarker Identification with
   Network ExploRation. COMBINER is a novel pathway-based approach for
   selecting highly reproducible discriminative biomarkers. We applied
   COMBINER to three benchmark breast cancer datasets for identifying
   prognostic biomarkers. COMBINER-derived biomarkers exhibited 10-fold
   higher reproducibility than other methods, with up to 30-fold greater
   enrichment for known cancer-related genes, and 4-fold enrichment for
   known breast cancer susceptibility genes. More than 50% and 40% of the
   resulting biomarkers were cancer and breast cancer specific,
   respectively. The identified modules were overlaid onto a map of
   intracellular pathways that comprehensively highlighted the hallmarks
   of cancer. Furthermore, we constructed a global regulatory network
   intertwining several functional clusters and uncovered 13 confident
   "driver" genes of breast cancer metastasis.

Conclusions

   COMBINER can efficiently and robustly identify disease core module
   genes and construct their associated regulatory network. In the same
   way, it is potentially applicable in the characterization of any
   disease that can be probed with microarrays.

Background

   In recent years, gene expression signatures based on DNA microarray
   technology have proven useful for predicting the risk of breast cancer.
   Agendia's MammaPrint has become the first FDA-cleared breast cancer
   prognosis marker chip; it contains a 70-gene signature [[30]1]. Many
   other microarray-based biomarkers, such as Wang's 76-gene signature
   [[31]2], have been derived using independent data sources. However,
   only three genes overlap between MammaPrint's 70-gene and Wang's 76-gene
   signatures. Furthermore, many of these markers are functionally
   unrelated to breast cancer. In order to identify robust, functionally
   relevant disease biomarkers, it is crucial to find gene signatures that
   are consistent in various data sources.

   A complex disease such as breast cancer results in many differentially
   expressed genes (DEGs), which together can be used to construct a
   "disease module" network [[32]3]. Some of these DEGs directly
   correspond to the disease phenotype (i.e. "driver" genes). The
   expression changes enacted on the driver genes lead to a cascade of
   changes of other genes: initially to their first-degree interaction
   neighbors [[33]4], followed by downstream effects to so-called
   "passenger" genes. Due to their direct relevance to the biology of the
   disease in question, the expression changes of the driver genes and
   their first-degree neighbours (i.e. members of the "core module"),
   should be more consistent than those of the passenger genes when
   compared across independent cohorts. However, it is often difficult to
   separate the core module from the passenger genes for a given disease
   [[34]5,[35]6]. In this paper, we aim to isolate the core module from
   the more general disease module and further identify the driver genes
   using network analysis.

   The most intuitive way of finding the disease core module is to
   identify the differentially expressed genes (DEGs) that overlap across
   cohorts. Unfortunately, because passenger genes typically outnumber
   core module genes in each cohort, they account for the majority of
   gene overlaps purely by statistical chance. A more biologically
   motivated technique for identifying the core module is to find
   overlapping differentially expressed pathways. However, a pathway may
   contain hundreds of genes, of which only a functional submodule (a
   small group of genes) is differentially expressed in the disease in
   question. These submodules are often overlooked in pathway enrichment
   analysis.

   In light of the aforementioned challenges, we propose to identify
   Pathway Activities (PAs) from cohorts of data and use supervised
   classification to isolate a consistent core module. Each PA is a vector
   aggregating the information of a few genes expressed in a pathway
   [[36]7,[37]8]. The use of PAs for biomarker identification has been
   shown to improve the reproducibility and disease-related functional
   enrichment of the resulting biomarkers [[38]7]. The main idea behind our
   method is
   to infer the most significant PAs in each data cohort, and validate
   these PAs using classification methods in other cohorts. If a PA also
   scores highly in all the other cohorts, we consider it to be
   consistently differentially expressed in the disease of interest.
   Furthermore, we would consider the genes that make up the PA to belong
   to the disease core module.

   In this work, we develop a novel biomarker identification framework
   entitled COre Module Biomarker Identification with Network ExploRation
   (COMBINER). COMBINER identifies "core modules" (Figure [39]1) that are
   consistently differentially expressed as a whole in the data cohorts of
   interest. COMBINER uses a Core Module Inference (CMI) component to
   infer candidate PAs from a pathway database, a Consensus Feature
   Elimination (CFE) component to filter out irreproducible PAs, and a
   multi-level reproducibility validation framework to find the consistent
   PAs, which in turn make up the complete core module. In its final step,
   COMBINER uses known pathways and protein networks to identify the
   driver genes within this core module.

Figure 1.


   Schematic overview of COMBINER. COMBINER uses Core Module Inference
   (CMI) to infer candidate pathway activities from each pathway in an
   inference dataset, Consensus Feature Elimination (CFE) to filter out
   irreproducible activities in validation datasets, and a multi-level
   reproducibility validation framework to conduct pair-wise validations
   to find common reproducible activities which make up the "core module".
   To identify the driver genes, we reassemble the resulting core module
   markers in both intracellular signalling pathways and a large overall
   regulatory network reflecting interactions between pathways.

   To illustrate its utility, we apply COMBINER to three benchmark breast
   cancer datasets. We evaluate the resulting core module for accuracy,
   reproducibility, and enrichment for known cancer-related genes. We then
   explore the roles of the COMBINER-identified core module in the
   hallmarks of cancer, and we reconstruct a breast cancer-specific
   interaction network composed of functionally coherent modules. Finally,
   we summarize our analyses by identifying 13 high confidence driver
   genes from COMBINER markers.

Results and Discussion

Overview

   COMBINER is a multi-level optimization framework for identifying core
   module markers (Figure [42]1 and Methods). Briefly, COMBINER infers
   candidate submodules from known pathways, identifies the reproducible
   "core module" using independent cohorts, and uses intracellular
   signaling pathways and protein networks to identify the "driver" genes
   from the "core module".

   We applied COMBINER to three independent breast cancer datasets to
   evaluate its effectiveness: Netherlands [[43]9], USA [[44]2], and
   Belgium [[45]10]. We obtained pathway information from the MsigDB v3.0
   Canonical Pathways subset [[46]11]. To decrease redundancy, we applied
   pathway filtering to remove bulky pathways such as KEGG Pathways of
   Cancer. This resulted in a pathway dataset containing 624 pathways with
   5,155 genes assayed in all three benchmark datasets.

Core Module Inference improves reproducibility and classification accuracy

   A primary challenge of pathway inference is to find pathway subsets
   that are reproducible between independent datasets. We compared Core
   Module Inference (CMI) with five other inference methods as well as
   individual genes (see Methods). Across a range of numbers of inferred
   Pathway Activities (PAs), CMI showed two-fold higher reproducibility
   than the related CORG method and about a 10-fold improvement over the
   other methods (Figure [47]2).

Figure 2.


   Reproducibility power of pathway inference methods. The reproducibility
   power of a pathway inference method for a pair of inference and
   validation datasets is measured by
   $$C_{score}(N) = \frac{1}{N} \sum_{i=1}^{N} t_{score}\bigl(P_I^i\bigr) \cdot t_{score}\bigl(P_V^i\bigr),$$
   where $P_I^i$ is the i-th PA in descending order in the inference
   dataset, $P_V^i$ is its corresponding PA in the validation dataset, and
   N is the number of selected inferred pathways. The overall
   reproducibility is then defined as the average $C_{score}$ of the
   selected top inferred pathway activities over all six
   inference-validation pairs. We compared CMI with five inference
   methods, including CORG, mean, median, and the first component score of
   PCA, as well as the individual-gene method without inference. Across
   different ranges of top inferred activities, CMI showed significantly
   better overall reproducibility than the other methods.

   We then compared the classification accuracy of CMI and the other
   inference methods using Linear Discriminant Analysis-Consensus Feature
   Elimination (LDA-CFE) classifiers focused on the top 100 inferred PAs
   (Methods). As shown in Figure [50]3, COMBINER run using PA vectors
   identified by CMI (CMI-COMBINER) exhibits better overall accuracy than
   the other methods coupled with COMBINER. Similarly, CMI also shows good
   overall accuracy using the SVM classifier (Additional file [51]1,
   Figure S1).

Figure 3.


   Comparison of CMI and other inference methods-based COMBINER using
   LDA-CFE classifiers focused on the top 100 inferred pathways. Seven
   methods were compared here, including CMI, CORG, Mean, Median, PCA, LLR
   and Individual Gene. (a) Classification accuracy for best feature set:
   pair-wise comparisons. Starting from all 100 inferred pathway
   activities, we recursively removed the activity with the lowest average
   weight from 500 LDA classifiers, until the maximum average AUC was
   reached. The process was repeated 100 times and the most frequently
   occurring marker set was regarded as the ultimate marker. We measured
   classification accuracy of each method by computing AUC mean ± standard
   error for the final feature set. (b) Classification accuracy overall.
   The overall classification accuracy was measured by computing the
   average maximum mean AUC of all six inference-validation pairs. On
   average, CMI was superior to the other methods, even though its
   activity vector consisted of expression values from only a few genes in
   each pathway.

Core module markers enrich cancer-related genes

   We compared the enrichment of known cancer genes in the biomarkers
   discovered by CMI-COMBINER (93 genes); CORG-COMBINER, i.e. COMBINER
   run using CORG activity vectors (123 genes); Subnetwork markers (1162
   genes) ([[54]7], [55]http://www.cellcircuits.com); MammaPrint's
   70-gene signature (G70, 70 genes) [[56]1]; and Wang's 76-gene
   signature (G76, 76 genes) [[57]2]. Seven known cancer gene datasets
   were compared (see Methods). Both CMI-COMBINER and CORG-COMBINER
   showed much higher enrichment of cancer-related genes in their
   biomarker signatures than the other signatures (Table [58]1).
   Specifically, CMI- and CORG-COMBINER showed up to 4-fold increased
   enrichment over subnetwork markers and up to 30-fold enrichment over
   other gene signatures. For known breast cancer genes in Census in
   particular, they exhibited up to 4-fold enrichment over the other
   signatures. More than 50% and 40% of the
   resulting biomarkers are cancer and breast cancer specific,
   respectively. Additionally, CMI-COMBINER showed greater enrichment than
   CORG-COMBINER with respect to the Atlas of Cancer Genes, which is the
   largest cancer gene collection. Consistent with Chuang et al.'s results
   [[59]7], we also found no significant enrichment in the CANgene dataset,
   which includes 122 mutated genes from 11 breast cancer cell lines. A
   possible explanation is that "the cancer cell lines capture a different
   disease state than that found in the population of patients surveyed by
   microarray profiling." [[60]7] The COMBINER core module markers with
   associated pathways are summarized in Additional file [61]2, Table S1
   and Additional file [62]3, Table S2. Additional file [63]4, Table S3
   lists the overlaps between CMI-/CORG-COMBINER and KEGG pathways of
   cancer, along with up-/down-regulation information.

Table 1.

   Cancer Gene Enrichment rate of various breast cancer gene signatures
           CMI-COMBINER CORG-COMBINER Subnetwork G70    G76
   NetPath 54.17%*      50.41%*       26.33%*    10.00% 10.53%
   Atlas   60.42%*      46.34%        32.87%     15.71% 18.42%
   Census  11.46%*      13.82%*       5.42%*     2.86%  0.00%
   CANgene 1.04%        1.63%         0.52%      0.00%  0.00%
   G2SBC   43.75%*      46.34%*       19.02%     21.43% 10.53%
   COSMIC  16.67%       17.89%*       7.06%      4.29%  1.32%
   KEGG    35.42%*      29.27%*       9.90%*     8.57%  1.32%

   * p-value < 0.05 for hypergeometric tests
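
   The hypergeometric test behind the starred entries can be sketched as
   follows. This is a minimal illustration rather than the authors' code:
   the function name, its arguments, and the toy numbers are hypothetical,
   and the choice of background gene population is an assumption that the
   text does not specify.

```python
from math import comb


def hypergeom_enrichment_p(population: int, known: int, drawn: int, hits: int) -> float:
    """Upper-tail hypergeometric p-value P(X >= hits).

    population: size of the background gene set (assumed, not given in the text)
    known:      background genes annotated in a reference cancer gene list
    drawn:      number of genes in the biomarker signature
    hits:       signature genes that appear in the reference list
    """
    total = comb(population, drawn)
    upper = min(known, drawn)
    # Sum the probabilities of observing `hits` or more annotated genes.
    tail = sum(
        comb(known, k) * comb(population - known, drawn - k)
        for k in range(hits, upper + 1)
    )
    return tail / total


# Toy numbers only (not taken from Table 1): 20 background genes, 5 of them
# annotated, and a 4-gene signature containing 2 annotated genes.
p_value = hypergeom_enrichment_p(population=20, known=5, drawn=4, hits=2)
enrichment_rate = 2 / 4  # fraction of the signature found in the reference list
```

   The enrichment rate reported in each table cell corresponds to
   hits / drawn, while the asterisk reflects whether the tail probability
   falls below 0.05.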

Core module markers highlight the hallmarks of cancer

   As shown in Figure [65]4, the COMBINER-discovered biomarkers are
   overlaid on the hallmarks of cancer [[66]12,[67]13], which integrate
   the common intracellular signalling pathways of all subtypes of cancer.
   The components of the core module markers from CMI and CORG along with
   eighteen common markers are listed in different fonts. The remaining
   proteins (most were not differentially expressed) in the pathways are
   consolidated into unlabeled nodes. Figure [68]4 shows that the
   identified core module genes comprehensively highlight the hallmarks,
   demonstrating the high specificity of COMBINER. In particular, 18
   common markers, which we regard as the most reliable predictors,
   describe well-characterized processes involving growth factors,
   survival factors, the cell cycle, and the ExtraCellular Matrix (ECM).
   The modules unique to CMI-COMBINER include anti-apoptosis and JAK-STAT
   cascades, while pathways describing anti-growth factors and death
   factors were unique to CORG-COMBINER. A few well-known mutant proteins,
   including cyclin D1 and p53, may play an important role in connecting
   other signatures [[69]7], but they showed only limited predictive
   ability in the three breast cancer datasets.

Figure 4.


   COMBINER biomarkers overlap with well-known cancer-related signalling
   pathways. The core module markers from CMI and CORG are listed in
   normal and italic fonts, respectively, while the common markers are in
   bold. Red/green color denotes up-/down-regulation. The remaining
   proteins in the circuit are abstracted as unlabeled nodes. The common
   core module markers of CMI- and CORG-COMBINER describe growth factors,
   survival factors, the cell cycle, and the extracellular matrix. Unique
   pathways to CMI-COMBINER include the anti-apoptosis and JAK-STAT
   cascade, while anti-growth factor and death factor pathways were
   discovered uniquely by CORG-COMBINER.

Core module markers in predicted protein-protein interaction networks
underpin functional modules

   Figure [72]5 shows how a regulatory network was constructed using the
   interactome of the core module markers. The regulatory network was
   divided into a few functional modules, including cell cycle and ECM.
   These functional modules were interconnected by 20 "hub" genes (larger
   pink/green nodes), 13 of which overlapped with the common marker genes
   (Additional file [73]2, Table S1). Our results imply that these 13
   "hub" markers are the essential "driver" genes of breast cancer
   metastasis (Table [74]2). For example, BRCA1 is among the most
   well-characterized genes whose mutation gives rise to breast cancer. In
   addition, low E2F1 transcript levels strongly predicted good prognosis
   based on quantitative RT-PCR in 317 primary breast cancer patients
   [[75]14]. We further enlarged the nodes of three standard breast cancer
   indicators TP53, BRCA1, and ERBB2, which connect many of the
   surrounding hub genes. Although TP53 and ERBB2 are useful for a
   mechanistic understanding of breast cancer, they were not identified as
   discriminative gene markers. A regulatory network was also created
   representing CORG-COMBINER (Additional file [76]5, Figure S2), but no
   additional "hub" markers were found.

Figure 5.


   Regulatory networks of CMI-COMBINER biomarkers. The pink/green nodes
   denote up-/down-regulation of gene expression. The orange nodes
   indicate contradictory regulation in different datasets. Larger nodes
   are highly connected in the network; most are overlaps between CMI- and
   CORG-COMBINER. The three well-known oncogenes for breast cancer
   metastasis (TP53, BRCA1, and ERBB2) were enlarged further. The core
   module markers were reassembled into an overall interaction network.
   Known functional modules neatly overlay well-connected clusters. Many
   of the highly connected genes are known "driver" genes playing an
   important role in breast cancer metastasis.

Table 2.

   Confident "driver" genes for breast cancer metastasis
   Symbol            Entrez  Description
   MAP2K1 [[79]32]   5604    mitogen-activated protein kinase kinase 1
   E2F1 [[80]14]     1869    E2F transcription factor 1
   GRB2 [[81]33]     2885    growth factor receptor-bound protein 2
   NFKB1 [[82]34]    4790    nuclear factor of kappa light polypeptide gene enhancer in B-cells 1
   RB1 [[83]35]      5925    retinoblastoma 1
   BRCA1 [[84]36]    672     breast cancer 1, early onset
   FOS [[85]37]      2353    v-fos FBJ murine osteosarcoma viral oncogene homolog
   SOS1 [[86]38]     6654    son of sevenless homolog 1 (Drosophila)
   PIK3CA [[87]39]   5290    phosphoinositide-3-kinase, catalytic, alpha polypeptide
   JAK1 [[88]40]     3716    Janus kinase 1
   SHC1 [[89]41]     6464    SHC (Src homology 2 domain containing) transforming protein 1
   MYC [[90]42]      4609    v-myc myelocytomatosis viral oncogene homolog (avian)
   CCNA2 [[91]37]    890     cyclin A2

Conclusions

   Identifying accurate and reproducible disease biomarkers is an
   important challenge for gene expression analysis. To facilitate this
   task, we developed COMBINER, a novel pathway-based biomarker
   identification method that extracts the essential "core module" of
   disease from known biological networks. Compared to existing methods,
   COMBINER substantially improves the reproducibility and cancer-specific
   enrichment of its resulting biomarkers. We examined the identified
   markers in intracellular signalling networks highlighting the hallmarks
   of cancer. Reassembling the core module genes into a regulatory
   network, we found 13 "driver" genes connecting eight functional
   modules. We anticipate such molecular descriptions to prove even more
   useful when applied to diseases that are less well-characterized; our
   current work focuses on several such applications.

Methods

Gene expression, pathways, cancer gene databases, and interactome

   We used three breast cancer datasets from different countries of origin
   to evaluate our method: Netherlands [[93]9], USA [[94]2], and Belgium
   [[95]10]. Each dataset recorded whether the assayed patients developed
   metastasis within 5 years after surgery. The Netherlands, USA, and
   Belgium datasets contain expression profiles for 295, 286, and 198
   patients, respectively, with 78, 107, and 35 patients experiencing
   metastasis. All of the patients in the USA and Belgium datasets had
   lymph-node-negative disease, although their estrogen receptor (ER)
   types differed. The Netherlands data contained both lymph-node-positive
   and lymph-node-negative patients with differing ER types, 130 of whom
   received adjuvant systemic therapy including chemotherapy and hormonal
   therapy. We performed a two-tailed t-test on the gene expression values
   of each dataset to distinguish between metastatic and non-metastatic
   patients, considering genes with p-value ≤ 0.05 as differentially
   expressed (DE).

   The reference cancer genes for enrichment analysis were collected from
   datasets including NetPath [[96]15] (all cancers,
   [97]http://www.netpath.org/), Atlas of Cancer Genes [[98]16] (all
   cancers, [99]http://atlasgeneticsoncology.org/), Census Genes [[100]17]
   (all cancers), CANgenes [[101]18] (breast cancer), G2SBC [[102]19]
   (breast cancer, [103]http://www.itb.cnr.it/breastcancer/), and KEGG
   Pathways of Cancer [[104]20] (all cancers, KEGG hsa05200
   [105]http://www.genome.jp/kegg/pathway/hsa/hsa05200.html).

   Pathway information was obtained from the MsigDB v3.0 Canonical
   Pathways subset [[106]11,[107]21]. This collection contains 880
   pathways collected from seven hand-curated pathway databases including
   KEGG, Reactome, and Biocarta.

   Predicted protein-protein interaction information was obtained from
   STRING 9 [[108]22].

Core Module Inference

   The CMI method adopts the strategy of the CORG method [[109]8] of
   finding the genes with the most discriminative power, but differs from
   it in three ways. First, the CORG method collects CORGs only from the
   up- or downregulated subset of genes in a pathway, so some key genes
   can be discarded. In contrast, CMI considers both up- and
   downregulation together. Second, CMI improves the greedy search for the
   discriminative set of genes. Third, CMI considers only differentially
   expressed genes. As illustrated in Figure [110]1, given a pathway
   consisting of genes $\{g_1, \ldots, g_i, \ldots, g_n\}$ ranked in
   descending order of their absolute t-scores, with normalized expression
   values $\{z(g_1), \ldots, z(g_n)\}$, determining a core module
   $\{g_1, \ldots, g_K\}$ is equivalent to finding the K-th component such
   that
   $$K = \arg\max \bigl( t_{score}(P_j) \bigr), \tag{1}$$

   where
   $$P_j = \begin{cases}
   \dfrac{\sum_{i=1}^{j} z(g_i)\, \mathrm{sign}\bigl(t_{score}(g_i)\bigr)}{\sqrt{j}}, \quad 1 \le j \le \min\bigl(|g_i \in DEGs|,\ 20\bigr), & |g_i \in DEGs| > 0, \\
   0, & |g_i \in DEGs| = 0.
   \end{cases} \tag{2}$$

   $g_i$ is the i-th DEG in descending order and $P_j$ is the PA
   containing genes $g_1$ to $g_j$. $|g_i \in DEGs|$ denotes the number of
   DEGs in the pathway. By default, the DEGs are the genes with p-value
   ≤ 0.05 in a two-tailed t-test. We limit the largest marker size to 20
   DEGs. In practice, all marker sets have fewer than 20 components.
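
   The greedy search defined by Equations 1 and 2 can be sketched in a few
   lines of Python. This is an illustrative sketch under stated
   assumptions, not the authors' implementation: the function names and
   the toy data are hypothetical, the t-score is taken to be Welch's
   two-sample statistic (the text says only "t-test"), and the list of
   DEGs is assumed to have been determined beforehand.

```python
from math import sqrt
from statistics import mean, variance


def t_score(values, labels):
    """Welch two-sample t-statistic of class +1 versus class -1 (assumed variant).

    Needs at least two samples per class and non-zero pooled variance.
    """
    pos = [v for v, y in zip(values, labels) if y == 1]
    neg = [v for v, y in zip(values, labels) if y == -1]
    se = sqrt(variance(pos) / len(pos) + variance(neg) / len(neg))
    return (mean(pos) - mean(neg)) / se


def infer_core_module(z, labels, degs, max_size=20):
    """Return (core genes g_1..g_K, activity vector P_K) for one pathway.

    z:    dict mapping gene name -> z-normalized expression, one value per sample
    degs: names of the pathway genes already called differentially expressed
    """
    n_samples = len(labels)
    gene_t = {g: t_score(z[g], labels) for g in degs}
    # Rank DEGs by descending absolute t-score and cap the module at max_size.
    ranked = sorted(gene_t, key=lambda g: abs(gene_t[g]), reverse=True)[:max_size]
    if not ranked:
        return [], [0.0] * n_samples  # Eq. 2, second case: no DEGs in the pathway
    running = [0.0] * n_samples
    best_k, best_t, best_activity = 0, float("-inf"), None
    for j, g in enumerate(ranked, start=1):
        sign = 1.0 if gene_t[g] >= 0 else -1.0
        running = [r + sign * v for r, v in zip(running, z[g])]
        activity = [r / sqrt(j) for r in running]  # P_j from Eq. 2
        t = t_score(activity, labels)
        if t > best_t:  # Eq. 1: keep the j with the largest t-score
            best_k, best_t, best_activity = j, t, activity
    return ranked[:best_k], best_activity


# Hypothetical toy pathway: two informative genes and one noisy gene.
labels = [1, 1, 1, -1, -1, -1]
z = {
    "A": [1.0, 1.2, 0.8, -1.0, -0.9, -1.1],
    "B": [-0.9, -1.1, -1.0, 1.0, 1.1, 0.9],
    "C": [0.5, -0.5, 0.2, 0.1, -0.3, 0.0],
}
core, activity = infer_core_module(z, labels, degs=["A", "B", "C"])
```

   Because every gene is multiplied by the sign of its own t-score, up-
   and downregulated genes reinforce rather than cancel each other in the
   activity vector, which is the first difference from CORG described
   above. In this toy pathway the noisy gene C lowers the t-score of the
   activity, so the search stops growing the module after the two
   informative genes.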

Reproducibility power

   We consider a pair of inference and validation datasets to be
   reproducible if their pathway activities provide similar discriminative
   power. First, we rank the PAs inferred from the inference dataset in
   descending order of their t-scores. Then, we define reproducibility by
   $$C_{score}(N) = \frac{1}{N} \sum_{i=1}^{N} t_{score}\bigl(P_I^i\bigr) \cdot t_{score}\bigl(P_V^i\bigr), \tag{3}$$

   where $P_I^i$ is the i-th PA in descending order in the inference
   dataset, and $P_V^i$ is its corresponding PA in the validation dataset.
   For the breast cancer datasets, the overall reproducibility is then
   given by the average $C_{score}$ of the inferred pathways over all six
   inference-validation pairs.
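
   As a worked illustration of Equation 3, the following sketch computes
   the C-score from two dictionaries of pathway t-scores. The function
   name, the dictionary layout, and the toy t-scores are hypothetical; the
   six inference-validation pairs are read here as the ordered pairs of the
   three datasets, which is an interpretation of the text.

```python
from itertools import permutations


def c_score(t_inference, t_validation, n):
    """Eq. 3: mean product of paired t-scores over the top-n PAs.

    t_inference / t_validation: dicts mapping a pathway name to the t-score
    of its activity in the inference / validation dataset.
    """
    if not 1 <= n <= len(t_inference):
        raise ValueError("n must lie between 1 and the number of PAs")
    # Rank PAs by descending t-score in the inference dataset and keep the top n.
    top = sorted(t_inference, key=t_inference.get, reverse=True)[:n]
    return sum(t_inference[p] * t_validation[p] for p in top) / n


# Hypothetical t-scores for three pathways.
t_inf = {"p1": 4.0, "p2": 3.0, "p3": 1.0}
t_val = {"p1": 2.0, "p2": -1.0, "p3": 0.5}
score_top2 = c_score(t_inf, t_val, n=2)  # (4.0 * 2.0 + 3.0 * -1.0) / 2

# Ordered (inference, validation) pairs of the three cohorts.
pairs = list(permutations(["Netherlands", "USA", "Belgium"], 2))
```

   A PA whose t-score changes sign between the two datasets (p2 above)
   contributes a negative term, so the score rewards activities that are
   both strong and consistent in direction.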

   Six methods were compared in this work: CMI, CORG [[111]8], Mean
   [[112]23], Median [[113]23], PCA [[114]24], and Individual Gene. LLR
   (Log-Likelihood Ratio [[115]25]) was not compared here, because it is
   not defined in the same gene expression space.

Consensus Feature Elimination (CFE)

   In this work, gene expression and activity vectors are generalized as
   features for classification. Given a set of features
   $\{x_1, x_2, \ldots, x_n\}$ with class labels
   $\{y_1, y_2, \ldots, y_n\} \in \{-1, +1\}$, the task of binary
   classification is to find a decision function
   $$D(x) \begin{cases}
   > 0 \Rightarrow x \in class(+) \\
   < 0 \Rightarrow x \in class(-) \\
   = 0 \Rightarrow x \in \textit{decision boundary},
   \end{cases} \tag{4}$$

   We choose a linear decision function, which can be described as a
   separating hyperplane:
   \[ D(x) = w \cdot x + b, \]
   (5)

   with w the weight vector and b the bias value.

   Linear classifiers such as Linear Discriminant Analysis (LDA) [[116]26]
   and linear Support Vector Machines (SVM) [[117]27] use differing
   optimization criteria to estimate the weight vector. Intuitively, the
   weights indicate the importance of the associated features. Guyon et al.
   proposed Recursive Feature Elimination (RFE), which removes features
   recursively based on their weights [[118]28]. However, classical RFE
   lacks stability in feature selection [[119]29]. In contrast
   to binary classification tasks that emphasize maximization of
   classification accuracy, biomarker identification requires features
   that are both accurate and reproducible across multiple experiments.
   Thus, we propose a Consensus Feature Elimination (CFE) approach to
   improve the stability of RFE. As illustrated in Figure [120]6, we first
   generated 100 alternative 5-fold random splits of the samples, upon
   which we constructed 500 classifiers and recorded their AUCs (Area
   Under the Receiver Operating Characteristic Curve) and weight vectors.
   Each feature was then ranked by its average squared weight
   \( \bar{\mathbf{w}} = \sum_{j=1}^{500} (\mathbf{w}^{j})^{2} / 500 \),
   where \( \mathbf{w}^{j} \) is the weight vector of the j-th classifier
   and the square is taken element-wise.
   The lowest ranking feature was removed recursively until the maximum
   average AUC was achieved. This process, which has also been called
   Multiple RFE [[121]30] or ensemble feature selection [[122]31], is known
   to increase biomarker reproducibility and accuracy by as much as 30%
   and 15%, respectively. For the breast cancer datasets described in this
   work, we found the maximum AUC to be very stable, while the
   corresponding biomarker set was not always unique. Thus we chose to
   repeat the above procedure 100 times, selecting the most frequently
   occurring biomarkers as the final marker set.
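
   The elimination loop described above can be sketched in a few lines of
   Python. This is a minimal illustration under stated assumptions, not
   the released Matlab implementation: the function names (fit_linear,
   auc, cfe) are hypothetical, a diagonal-covariance LDA stands in for the
   LDA and SVM classifiers, folds that lack one class are skipped, and
   "until the maximum average AUC was achieved" is read as keeping the
   feature subset with the highest mean AUC along the elimination path.
   The outer repetition of the whole procedure (100 runs followed by a
   frequency count) is omitted.

```python
import random
from statistics import mean

def fit_linear(X, y):
    """Diagonal LDA stand-in (assumption): returns (w, b) for D(x) = w . x + b."""
    pos = [x for x, t in zip(X, y) if t == +1]
    neg = [x for x, t in zip(X, y) if t == -1]
    w, b = [], 0.0
    for k in range(len(X[0])):
        mp, mn = mean(x[k] for x in pos), mean(x[k] for x in neg)
        ss = sum((x[k] - mp) ** 2 for x in pos) + sum((x[k] - mn) ** 2 for x in neg)
        var = ss / max(len(X) - 2, 1) + 1e-9   # pooled variance, guarded against 0
        w.append((mp - mn) / var)
        b -= w[-1] * (mp + mn) / 2             # boundary passes through the class midpoint
    return w, b

def auc(scores, y):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for s, t in zip(scores, y) if t == +1]
    neg = [s for s, t in zip(scores, y) if t == -1]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def cfe(X, y, n_splits=100, n_folds=5, seed=0):
    """Return (selected feature indices, their mean cross-validated AUC)."""
    rng = random.Random(seed)
    active = list(range(len(X[0])))
    best_set, best_auc = list(active), -1.0
    while active:
        sq_w, aucs = [0.0] * len(active), []
        for _ in range(n_splits):
            idx = list(range(len(X)))
            rng.shuffle(idx)                   # one random n_folds-fold split
            for f in range(n_folds):
                test = set(idx[f::n_folds])
                tr = [i for i in idx if i not in test]
                te = sorted(test)
                if len({y[i] for i in tr}) < 2 or len({y[i] for i in te}) < 2:
                    continue                   # fold lacks one class; skip it
                Xtr = [[X[i][k] for k in active] for i in tr]
                w, b = fit_linear(Xtr, [y[i] for i in tr])
                scores = [sum(wk * X[i][k] for wk, k in zip(w, active)) + b for i in te]
                aucs.append(auc(scores, [y[i] for i in te]))
                sq_w = [s + wk ** 2 for s, wk in zip(sq_w, w)]
        if not aucs:
            break
        mean_auc = sum(aucs) / len(aucs)
        if mean_auc >= best_auc:               # ties favour the smaller feature set
            best_set, best_auc = list(active), mean_auc
        # Summed squared weights rank features exactly as their average would.
        worst = min(range(len(active)), key=lambda k: sq_w[k])
        active.pop(worst)
    return best_set, best_auc
```

   Calling cfe(X, y) with X as a list of per-sample feature lists and y as
   labels in {-1, +1} returns the retained feature indices and their mean
   AUC. Breaking ties toward the smaller set is a design choice of this
   sketch; the text does not specify how ties are resolved.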

Figure 6.

   Diagram of Consensus Feature Elimination. We first generated 100
   alternative 5-fold random splits of the samples, upon which we
   constructed 500 classifiers and recorded their AUCs and weight vectors.
   Each feature was then ranked by its average squared weight. The lowest
   ranking feature was removed recursively until the maximum average AUC
   was achieved. The procedure was repeated 100 times, and the most
   frequently occurring markers were taken as the final marker set.

   Seven methods were compared in this work, including CMI, CORG [[125]8],
   Mean [[126]23], Median [[127]23], PCA [[128]24], LLR [[129]25], and
   Individual Gene.

Cancer gene enrichment analysis

   The cancer gene enrichment analysis examines over-representation of
   known cancer genes in a gene signature. Given a total of N genes, of
   which M are known cancer genes, and a signature of J genes, the
   probability of observing more than K cancer genes in the signature
   follows a hypergeometric distribution:
   \[
   P(\#\text{ of cancer genes} > K) = 1 - \sum_{i=0}^{K}
   \frac{\binom{J}{i}\binom{N-J}{M-i}}{\binom{N}{M}}.
   \]
   (6)
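
   As a rough illustration, Equation (6) can be evaluated with exact
   integer binomials in Python. The function name enrichment_pvalue is
   hypothetical. The sketch sums the upper tail (i from K+1 upward)
   directly; this equals one minus the lower-tail sum in Equation (6)
   because the hypergeometric probabilities add up to 1, and it avoids
   losing precision when the p-value is very small.

```python
from math import comb

def enrichment_pvalue(N, M, J, K):
    """P(# of cancer genes in the signature > K) under Equation (6).

    N: total number of genes, M: known cancer genes,
    J: signature genes, K: observed threshold.
    """
    upper = min(J, M)  # a signature cannot contain more cancer genes than this
    tail = sum(comb(J, i) * comb(N - J, M - i)
               for i in range(max(K + 1, 0), upper + 1))
    return tail / comb(N, M)
```

   For example, with N = 10, M = 4, J = 3 and K = 0, the function returns
   the probability that the signature contains at least one cancer gene,
   1 - C(7,4)/C(10,4) = 5/6.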

Software

   COMBINER was implemented in MATLAB R2010a with Bioinformatics Toolbox
   v3.5. The source code is available at [130]http://www.ruotingyang.com.

Authors' contributions

   RY, BJD, LRP, and FJD conceived and designed the research. RY and BJD
   performed the analysis and the statistical computations and wrote the
   paper. RY implemented the programs. All authors read and approved the
   final manuscript.

Supplementary Material

   Additional file 1

   Figure S1: Comparison of CMI and other pathway inference methods using
   SVM-CFE classifiers subject to top 100 inferred pathways.
   [131]Click here for file^ (433.5KB, TIFF)
   Additional file 2

   Table S1: List of core module genes identified by CMI and CORG.
   [132]Click here for file^ (19.9KB, XLSX)
   Additional file 3

   Table S2: Pathway markers identified by all methods.
   [133]Click here for file^ (28.5KB, XLSX)
   Additional file 4

   Table S3: List of core module genes overlaid in KEGG pathway of
   cancers.
   [134]Click here for file^ (13.7KB, XLSX)
   Additional file 5

   Figure S2: Unique core module of cancer pathway identified by
   CORG-COMBINER method.
   [135]Click here for file^ (712.1KB, TIFF)

Contributor Information

   Ruoting Yang, Email: ruoting@engineering.ucsb.edu.

   Bernie J Daigle, Jr, Email: bdaigle@gmail.com.

   Linda R Petzold, Email: petzold@engineering.ucsb.edu.

   Francis J Doyle, III, Email: doyle@engineering.ucsb.edu.

Acknowledgements