Abstract

Background

   Mining novel biomarkers from gene expression profiles for accurate
   disease classification is challenging due to small sample size and high
   noise in gene expression measurements. Several studies have proposed
   integrated analyses of microarray data and protein-protein interaction
   (PPI) networks to find diagnostic subnetwork markers. However, the
   neighborhood relationship among network member genes has not been fully
   considered by those methods, leaving many potential gene markers
   unidentified. The main idea of this study is to take full advantage of
   the biological observation that genes associated with the same or
   similar diseases commonly reside in the same neighborhood of molecular
   networks.

Results

   We present EgoNet, a novel method based on egocentric network-analysis
   techniques, to exhaustively search and prioritize disease subnetworks
   and gene markers from a large-scale biological network. When applied to
   a triple-negative breast cancer (TNBC) microarray dataset, the top
   selected modules contain both known gene markers in TNBC and novel
   candidates, such as RAD51 and DOK1, which play a central role in their
   respective ego-networks by connecting many differentially expressed
   genes.

Conclusions

   Our results suggest that EgoNet, which is based on the ego network
   concept, allows the identification of novel biomarkers and provides a
   deeper understanding of their roles in complex diseases.

   Keywords: Gene expression, Network medicine, Machine learning, Cancer
   biology, Biological networks, Microarray

Background

   Complex human diseases, e.g. cancer, diabetes, or autism, are caused by
   dysregulations of biological networks. Genetic analysis approaches
   focused on individual genetic determinants are unlikely to characterize
   the network architecture of complex diseases comprehensively. Creating
   effective therapies for these diseases requires a thorough
   understanding of how cells integrate enormous amounts of genomic,
   proteomic, and environmental information to produce specific cellular
   functions, and furthermore, how such functions are perturbed in the
   disease state. Transcriptomics, metabolomics, proteomics and other
   -omics technologies have the potential to provide insights into complex
   disease pathogenesis and heterogeneity, especially if they are applied
   within a network biology framework. “Network medicine” is the rapidly
   developing field which applies systems biology and network science
   methods to human disease [[31]1-[32]3].

   In the past decade, extensive work has been done to identify
   differentially expressed genes across different phenotypes, which can
   be used as diagnostic markers for classifying different disease states
   or predicting clinical outcomes [[33]4-[34]7]. However, gene markers
   based on expression data alone are still not reliable [[35]8]. To meet
   this challenge, many have turned to network medicine to gain a
   comprehensive understanding of the complex disease process. In contrast
   to studying individual genes in isolation, mapping human
   disease-associated genes to interactome data has greatly empowered our
   understanding of human disease mechanisms [[36]9]. Network-based
   approaches have multiple potential biological and clinical
   applications, including a better understanding of the effects of
   interconnection of disease genes and disease pathways, which, in turn,
   may offer better targets for drug development. These advances may also
   lead to more reliable biomarkers to monitor the functional integrity of
   networks that are perturbed by diseases.

   To date, many computational methods have been developed to integrate
   gene expression profiles with protein-protein interaction maps or
   pathway databases, with the goal of identifying significant subnetwork
   markers for predicting biological or clinical outcomes [[37]10-[38]18].
   More recently, different machine learning and data mining strategies
   for feature selection have been applied to identifying a subset of
   genes that can maximize the prediction performance [[39]19]. Dutkowski
   et al. [[40]20] proposed Network-Guided Forests (NGF), which integrates
   the key ideas of Random Forests (RF) into the selection of disease
   modules. However, it involves a random search over subnetworks, leading
   to possibly different results from different runs with no guarantee of
   the optimality of the final result. Zhu et al. [[41]21] applied a
   network-based Support Vector Machine (SVM) to the classification of
   microarray samples, but the method only worked for small subnetworks.
   More importantly, the above methods are largely heuristic, and the
   definition of output subnetworks is ambiguous without a formal
   topological feature. Hence, selected network modules tend to include
   only significant genes based on their expression profiles, but exclude
   the non-differentially expressed genes despite the fact that they are
   functionally linked to many differentially expressed disease genes.

   In this study, we developed a novel method called EgoNet to identify
   significant subnetworks that are functionally associated with diseases,
   and to predict clinical outcomes accurately. The type of subnetwork
   sought by our method is called an ego-network, which is well-defined in
   the study of social networks [[42]22]. In particular, an ego-network is
   the part of a network that involves a particular node we are focusing
   on, which we call the ego. In addition to the ego, the network consists
   of a neighborhood including all nodes to which the ego is connected
   within a certain path length. The one-step neighborhood contains the
   nodes the ego is directly connected to (referred to as the ego’s
   alters), and the links between the ego’s alters. In studying
   ego-networks, we are
   interested in examining how egos make use of or are influenced by their
   alters in terms of associating with disease outcomes. It has been
   reported that the ego-network played an important role in the inference
   of novel disease genes and supported predictions in pathogenesis
   studies [[43]23].
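
   As an illustration of this definition, the following minimal Python
   sketch extracts an ego-network of a given level from an undirected
   network stored as an adjacency dictionary. The function name
   `ego_network`, the adjacency-dictionary representation, and the toy gene
   names are assumptions made for illustration; they are not taken from the
   EgoNet implementation.

```python
from collections import deque


def ego_network(adjacency, ego, level=1):
    """Return (nodes, edges) of the ego-network of `ego` up to `level` steps.

    `adjacency` maps each node to the set of its neighbors (undirected).
    """
    # Breadth-first search records each reached node's distance from the ego.
    distance = {ego: 0}
    queue = deque([ego])
    while queue:
        node = queue.popleft()
        if distance[node] == level:
            continue  # do not expand beyond the requested level
        for neighbor in adjacency.get(node, ()):
            if neighbor not in distance:
                distance[neighbor] = distance[node] + 1
                queue.append(neighbor)

    nodes = set(distance)
    # Keep every edge with both endpoints inside the ego-network; this
    # includes the links among the ego's alters.
    edges = set()
    for u in nodes:
        for v in adjacency.get(u, ()):
            if v in nodes and v != u:
                edges.add(frozenset((u, v)))
    return nodes, edges


# Toy undirected network with hypothetical gene names.
toy_network = {
    "EGO": {"A", "B"},
    "A": {"EGO", "B", "C"},
    "B": {"EGO", "A"},
    "C": {"A", "D"},
    "D": {"C"},
}

level1_nodes, level1_edges = ego_network(toy_network, "EGO", level=1)
level2_nodes, level2_edges = ego_network(toy_network, "EGO", level=2)
```

   In this toy network the level-one ego-network contains the ego and its
   two alters together with the link between the alters, and the level-two
   ego-network additionally contains gene C.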

   The underlying assumption of our model is that if the majority of
   neighbors of a central disease gene are disease genes, then its other
   neighbors are likely to be involved in the disease pathway
   (Figure [44]1A). Alternatively, if most neighbors of the ego node are
   associated with a disease, the ego gene itself is considered highly
   likely to play a role in the disease (Figure [45]1B). We intend to find
   the hidden genes that show no significance by themselves but are
   clustered in a subnetwork module whose genes collectively are highly
   predictive of the disease status. The ego-network model has been used
   for network module over-representation analysis in ConsensusPathDB
   [[46]24]. In this study, we use machine-learning techniques to assess
   the association between an ego-network and the clinical outcome. This
   approach allows compensatory effects between the genes in an
   ego-network, as well as nonlinear relations between the genes and the
   clinical outcome.

Figure 1.

   Two illustrative ego-networks. Red nodes are putative disease genes,
   white nodes are hidden disease genes either as alter nodes (A) or ego
   node (B).

   We evaluated the performance of EgoNet on a human protein-protein
   interaction network and a triple-negative breast cancer (TNBC)
   microarray data set. The method not only successfully identified the
   known breast cancer susceptibility genes TP53, BRCA1 and BRCA2 from
   significant ego-networks, but also detected several novel targets, such
   as ABL1 and RAD51, as predictive factors for TNBC patients. We expect
   that EgoNet
   can be widely used to infer novel biomarkers for phenotypic outcome
   prediction of many human diseases.

Results and discussion

Overview of EgoNet algorithm

   The goal of the EgoNet algorithm is to identify significant ego-networks
   from gene expression and large-scale biological network data. As
   outlined in Figure [48]2, the algorithm takes the network and gene
   expression data as input. The input biological network can be a gene
   regulatory network, a signaling pathway network, or a protein-protein
   interaction network. The gene expression data needs to be associated
   with a certain biological or clinical outcome, which can be a
   categorical, continuous, or survival outcome.

Figure 2.

   Workflow of the EgoNet algorithm.

   EgoNet iteratively scans through all genes with two or more neighbors
   in the network. With each initial gene (the ego node), it first finds
   the score of the level-one ego-network based on how well the genes as a
   collection predict the clinical outcome. Then it spreads outward from
   the ego node progressively to involve more genes in the predictive
   model. The spreading stops when the prediction accuracy drops
   (Figure [50]2; Methods). The above process of growing an ego-network is
   also known as snowball sampling [[51]25]. After obtaining the score of
   an ego-network, its significance is evaluated by a permutation test.
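
   The scan-and-grow procedure described above can be sketched in Python as
   follows. This is a minimal sketch under stated assumptions, not the
   published implementation: the scoring function is passed in as a
   hypothetical callable `score` (in the paper it would be the prediction
   accuracy of a classifier trained on the genes), ties in score are
   allowed to keep growing, and `max_level` is an added safety bound.

```python
def neighborhood(adjacency, ego, level):
    """Return the set of genes within `level` steps of `ego` (ego included)."""
    reached = {ego}
    frontier = {ego}
    for _ in range(level):
        # Expand one step outward from the current frontier.
        frontier = {n for g in frontier for n in adjacency.get(g, ())} - reached
        if not frontier:
            break
        reached |= frontier
    return reached


def grow_ego_network(adjacency, ego, score, max_level=5):
    """Grow an ego-network level by level until the score drops.

    Returns (genes, level, score) of the last ego-network before the drop.
    `score` is a hypothetical callable mapping a set of genes to a number.
    """
    best_genes = neighborhood(adjacency, ego, 1)
    best_score = score(best_genes)
    best_level = 1
    for level in range(2, max_level + 1):
        candidate = neighborhood(adjacency, ego, level)
        if candidate == best_genes:
            break  # the network has no further genes to add
        candidate_score = score(candidate)
        if candidate_score < best_score:
            break  # stop spreading when the score drops
        best_genes, best_score, best_level = candidate, candidate_score, level
    return best_genes, best_level, best_score


def scan_egos(adjacency, score):
    """Apply the growth procedure to every gene with two or more neighbors."""
    return {
        gene: grow_ego_network(adjacency, gene, score)
        for gene, neighbors in adjacency.items()
        if len(neighbors) >= 2
    }


# Toy example with hypothetical genes; the toy score is the fraction of
# genes in the set that belong to an assumed "informative" group.
toy_network = {
    "EGO": {"A", "B"},
    "A": {"EGO", "B", "C"},
    "B": {"EGO", "A"},
    "C": {"A", "D"},
    "D": {"C"},
}
informative = {"A", "B", "C"}


def toy_score(genes):
    return len(genes & informative) / len(genes)


genes, level, final_score = grow_ego_network(toy_network, "EGO", toy_score)
all_results = scan_egos(toy_network, toy_score)
```

   In the toy example the ego-network of EGO grows from level 1 to level 2
   because the score rises, and stops before level 3 because adding gene D
   lowers the score.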

Simulation studies

   To evaluate the capability of an ego network to predict the clinical
   outcome, a machine-learning method needs to be chosen. In this study,
   we selected three widely used methods: support vector machines (SVM)
   [[52]26], K-nearest neighbors (KNN) [[53]27] and random forests (RF)
   [[54]28], and compared their performance for subnetwork identification
   through a simulation study.

   In each simulation, a scale-free network was generated, and one
   subnetwork was selected as the ground truth. The subnetwork was linked
   to the outcome variable through a linear or nonlinear relationship. We
   applied the EgoNet algorithm in conjunction with the three classifiers
   for subnetwork selection, and inspected whether the top identified
   ego-network(s) recovered the true subnetwork. In general, SVM
   performed the best (Table [55]1). In both linear and non-linear
   settings, if we only selected the top ego-network in every simulation,
   SVM successfully recovered the true subnetwork more than 50% of the
   time. When we increased the number of identified ego-networks to top 5,
   SVM was able to recover the true subnetwork over 80% of the time. Thus
   we chose SVM for the subsequent data analysis.

Table 1.

   Percentage of top identified ego-networks successfully matching true
   subnetworks in simulations using different classification algorithms^ *

   Method   Top 1                        Top 5
            Linear (%)   Nonlinear (%)   Linear (%)   Nonlinear (%)
   SVM      68           53              89           83
   RF       50           42              83           69
   KNN      62           46              91           70

   ^*Bold numbers denote the best performing method in each simulation
   setting (column).

   Next we compared the performance of EgoNet with the method proposed by
   Chuang et al. [[57]11], which scores subnetworks using the mutual
   information between aggregated gene Z-scores and class labels. We
   simulated two scenarios: (1) All genes in an ego-network, including the
   ego gene, are associated with the clinical outcome; and (2) All genes
   in an ego-network, except the ego gene, are associated with the
   clinical outcome. The second scenario was motivated by our
   consideration that sometimes a gene functionally related to a disease
   may not be differentially expressed, while it is surrounded by
   differentially expressed genes in the network (Figure [58]1B). In each
   of the scenarios, we further simulated both linear and nonlinear
   associations between gene expression and clinical outcome.

   The methods were compared in two ways. The first is the accuracy in
   predicting the clinical outcome, and the second is the rate of
   correctly recovering the true ego network. For prediction accuracy, we
   employed the area under the ROC curve (AUC) as the metric to evaluate
   performance. Additional file [59]1: Figure S1A shows EgoNet
   outperformed Chuang et al.’s method in terms of classification
   accuracy, albeit the difference is relatively small. For true ego
   network recovery, we calculated the rate of the top selected subnetwork
   capturing the true ego node. We found EgoNet showed substantially
   higher proportions of recovering the true ego node (Additional file
   [60]1: Figure S1B). As expected, the difference was most pronounced in
   the scenarios where the ego node itself was not directly associated
   with the clinical outcome.
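
   For reference, the AUC used as the prediction metric above can be
   computed directly from its rank interpretation: it is the probability
   that a randomly chosen case receives a higher classifier score than a
   randomly chosen control. The short Python sketch below illustrates this
   definition; the function name `auc` and the example scores are
   hypothetical and are not taken from the simulation study.

```python
def auc(scores, labels):
    """Area under the ROC curve from classifier scores and 0/1 labels.

    Counts, over all (case, control) pairs, how often the case scores
    higher than the control; ties count as one half.
    """
    cases = [s for s, y in zip(scores, labels) if y == 1]
    controls = [s for s, y in zip(scores, labels) if y == 0]
    if not cases or not controls:
        raise ValueError("both classes must be present")
    wins = 0.0
    for c in cases:
        for d in controls:
            if c > d:
                wins += 1.0
            elif c == d:
                wins += 0.5
    return wins / (len(cases) * len(controls))


# Hypothetical classifier scores for two cases (1) and two controls (0).
perfect = auc([0.9, 0.8, 0.3, 0.1], [1, 1, 0, 0])
partial = auc([0.9, 0.2, 0.8, 0.1], [1, 1, 0, 0])
```

   A value of 1 means every case is ranked above every control, while 0.5
   corresponds to random ranking.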

Gene modules differentiate breast cancer subtypes

   We applied EgoNet to analyze a human PPI network with the expression
   profiles of the two cohorts of breast cancer patients previously
   reported by Li et al. [[61]29], which compared the gene expression of
   24 sporadic triple negative breast cancer (TNBC) samples against 51
   primary breast tumor samples representing all subtypes (NCBI
   [62]GSE18864). TNBC is characterized by the lack of expression of
   estrogen receptor (ER), progesterone receptor (PgR), and the human
   epidermal growth factor receptor 2 (ERBB2, or HER2) [[63]30]. It
   largely overlaps with the basal-like subtype of breast cancer [[64]31].

   The PPI network was obtained from the HINT database [[65]32], which
   collects data from several databases and filters them both systematically
   and manually to remove low-quality/erroneous interactions. The network
   contained 8292 human proteins and 27493 high-quality binary physical
   interactions.

   We applied our algorithm to this dataset. We allowed only nodes with
   more than one connection to serve as egos. From every ego node, we
   progressively grew the ego-network by levels and tested its
   predictive power. For every ego-network, the procedure stopped when the
   predictive power dropped with the growth. Following this procedure, a
   total of 5375 ego-networks were examined, with an average of 30 nodes
   per ego-network. The level of an ego-network is the maximum network
   distance from the ego to its alters; ~76% of the generated ego-networks
   were level 1 and ~24% were level 2 (Additional file [66]2: Figure S2).
   The prediction accuracy of these ego-networks for the phenotypic
   outcome varied between 0.63 and 0.95. We identified the top 50
   discriminative ego-networks by setting the accuracy cutoff at 0.9. All
   were significant with p < 0.001 in permutation tests with 1000
   permutations.

   BRCA1 and BRCA2 are well-known breast cancer susceptibility genes that
   belong to tumor suppressor genes [[67]33]. TP53 is a tumor suppressor
   gene whose mutation is associated with a variety of cancers. Distinct
   mutation patterns of TP53 were found between the luminal subtypes of
   breast cancer and TNBC [[68]31]. We explored the three genes in our
   identified subnetworks. Interestingly, we found they were clustered in
   one ego-network in which BRCA2 was the ego node (Figure [69]3A). This
   observation is consistent with the local property of disease networks –
   proteins involved in the same disease have an increased tendency to
   interact with each other [[70]2]. We conducted single-gene level
   differential expression analysis. At the FDR cutoff of 0.05, none of
   the three genes showed differential expression between TNBC and
   non-TNBC breast cancer patients. We further evaluated the importance of
   each gene to the classification accuracy using a tree-based feature
   selection algorithm (Methods). We found genes with high importance
   scores were mostly differentially expressed. In the BRCA2 ego-network,
   the breast cancer susceptibility genes ABL1 and RAD51 [[71]34,[72]35]
   fit this pattern.

Figure 3.

   Identified ego-networks in the TNBC breast cancer dataset. Module (A)
   contains major breast cancer genes BRCA1, BRCA2 and TP53. Modules (B)
   and (C) contain ERBB2 and ESR1 respectively. Examples of other
   top-scoring modules are shown in (D-F). The area of each node scales
   with its importance in the classification of the phenotype. Red color
   indicates differential expression (FDR <0.05 based on a two-tailed
   t-test with Benjamini & Hochberg FDR adjustment).

   The ABL1 proto-oncogene encodes a cytoplasmic and nuclear protein
   tyrosine kinase that has been implicated in processes such as cell
   differentiation and cell division [[74]36]. ABL1 is activated
   into an oncogene and forms a fusion gene with the breakpoint cluster
   region (BCR) gene due to missense mutations within the ABL1 kinase
   domain. The chimeric oncogene BCR-ABL1 has been reported to play a
   critical role in the development of chronic myelogenous leukemia
   [[75]37]. The over-expressed BCR-ABL1 gene increases transmembrane
   plasma protein expression and constitutively activates downstream
   signaling molecules such as Src family kinases [[76]38], including DOK1
   and NCOA2, which we discuss below. Thus it is plausible that ABL1
   is a critical factor in breast cancer development. A detailed
   examination of the expression level of ABL1 revealed it was
   substantially over-expressed in TNBC, as compared to other primary
   breast cancer subtypes (Additional file [77]3: Figure S3a). Our study
   suggests ABL1 may be regarded as a predictive factor for
   differentiating TNBC from other primary breast cancer.

   RAD51 encodes the major eukaryotic homologous recombinase [[78]39],
   which assists in the repair of DNA double strand breaks. The RAD51
   protein has been demonstrated to interact with the ssDNA-binding
   protein BRCA2, a well-known breast cancer susceptibility gene [[79]40].
   BRCA2 controls and regulates both the intracellular localization and
   DNA–binding ability of RAD51 [[80]41,[81]42]. Some reports suggest
   that dysfunctional variants of RAD51 are associated with
   breast cancer risk. One recent study suggested an association of RAD51
   polymorphisms with DNA repair in BRCA1 mutation carriers and sporadic
   breast cancer risk [[82]43]. Smolarz et al. reported that there was a
   significant positive association between RAD51 polymorphisms and TNBC
   [[83]44]. In our current study, RAD51 is significantly under-expressed
   in the TNBC samples (Additional file [84]3: Figure S3b).

   TNBC lacks the expression of three receptors, ER, ERBB2 and PgR
   [[85]30]. We found two of the corresponding genes from our identified
   subnetworks, of which ERBB2 was in the DOK1 ego-network (Figure [86]3B)
   and ESR1 in the NCOA2 ego-network (Figure [87]3C). DOK1 is known to be
   a tumor suppressor gene in epithelial ovarian cancer [[88]45] and lung
   cancer [[89]46]. It is a substrate of several non-receptor tyrosine
   kinases [[90]47,[91]48], including breast tumor kinase (BRK) [[92]49].
   Since most of DOK1’s alters were differentially expressed, DOK1 may
   play a role in the molecular pathways of TNBC. DOK1 itself showed a
   minor under-expression in TNBC (Additional file [93]3: Figure S3c),
   though not statistically significant at the FDR level of 0.05. ERBB2 is
   a member of the DOK1 ego-network. Because the receptor itself is not
   expressed in TNBC, as expected, the ERBB2 gene was under-expressed in
   TNBC as compared with other primary breast cancer subtypes (Additional
   file [94]3: Figure S3d). ESR1 showed a similar pattern (Additional file
   [95]3: Figure S3e).

   Our results also suggest NCOA2 could be an important factor in the TNBC
   gene regulatory pathways. NCOA2, the nuclear receptor coactivator 2,
   which belongs to the steroid receptor coactivator (SRC) family, has
   been reported to be broadly involved in many cancers [[96]50]. The SRC
   family comprises three members, SRC-1 (NCOA1), SRC-2 (NCOA2) and SRC-3
   (NCOA3), which are known to be overexpressed in breast cancer and
   essentially involved in estrogen mediated cancer cell proliferation
   [[97]51]. Currently, most research on the SRC family has been focused
   on NCOA1 and NCOA3. Clinical and preclinical studies have demonstrated
   that overexpressed NCOA1 and NCOA3 are linked to resistance to
   therapies in breast cancers [[98]52]. For example, overexpression of
   NCOA3, especially in conjunction with high levels of EGF receptor (EGFR)
   and HER2 (ERBB2), is associated with poor outcome after tamoxifen
   treatment [[99]53,[100]54]. In ERBB2–overexpressing breast cancer
   cells, overexpression of NCOA3 also contributes to resistance against
   the ERBB2-targeting drug trastuzumab [[101]55]. In the current study,
   NCOA2 is significantly under-expressed in the TNBC samples as compared
   with other subtypes of primary breast cancer (Additional file [102]3:
   Figure S3f). Our results indicate that NCOA2 could be as important as
   the other two members and play an important role in the TNBC gene
   regulation.

   We note that the current study compares TNBC with the pool
   of other subtypes of breast cancer. Thus the resulting sub-networks
   have more to do with the differences between TNBC and other subtypes,
   as opposed to directly explaining the clinical characteristics of TNBC
   itself. Although EgoNet pointed to DOK1 and NCOA2 ego-networks as among
   the best to separate TNBC from other primary breast cancers, it is
   still far from establishing a mechanistic explanation. This limitation
   has to be addressed by future biological studies.

   Given an ego-network, a “structural hole” is the absence of an edge
   between a pair of nodes in the ego-network. A well-established
   proposition in social network analysis is that egos with lots of
   structural holes are better performers in certain competitive settings
   [[103]22]. Among our identified ego-networks, we found examples
   containing few structural holes (Figure [104]3C-D), and those
   containing many (Figure [105]3E-F). This binding pattern may imply that
   ego genes such as ERCC8 and GGA1, whose ego-networks include many
   structural holes, are key factors in distinguishing TNBC patients.
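
   To make the notion of a structural hole concrete, the following Python
   sketch lists the pairs of alters in a level-one ego-network that are not
   directly linked. It assumes an undirected network stored as an adjacency
   dictionary and restricts attention to the one-step neighborhood; the
   function name `structural_holes` and the toy gene names are hypothetical.

```python
from itertools import combinations


def structural_holes(adjacency, ego):
    """Return the pairs of the ego's alters that share no direct edge."""
    alters = sorted(adjacency[ego])
    return [
        (u, v)
        for u, v in combinations(alters, 2)
        if v not in adjacency.get(u, ())
    ]


# Toy ego-network: alters A and B are linked, alter C is linked only to the ego.
toy_network = {
    "EGO": {"A", "B", "C"},
    "A": {"EGO", "B"},
    "B": {"EGO", "A"},
    "C": {"EGO"},
}

holes = structural_holes(toy_network, "EGO")
```

   In this toy example two of the three alter pairs, (A, C) and (B, C), are
   structural holes; an ego whose alters are all mutually linked would have
   none.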

Network-based ranking of marker genes

   Next, we evaluated the importance of individual genes by considering
   all the subnetworks together. An important property of disease genes in
   a molecular network is that nodes with much higher degrees of
   linkage, so-called hubs, tend to be associated with disease
   genes [[106]19]. We assume that a putative disease hub is important,
   and thus should be included in more identified disease subnetworks. For
   each ego-network, a classification accuracy score is available, and the
   relative importance values are calculated for genes included in the
   ego-network. We propose a metric that is the summation of the product
   of subnetwork score (S[i]) and node importance (V[ij]) over all the
   considered subnetworks, namely
   $$M_j = \sum_{i=1}^{N} S_i V_{ij},$$

   where i is the ego-network index, N is the number of considered
   subnetworks, and V[ij] is the importance score of the j^th gene in the
   i^th subnetwork, which takes the value zero if the gene is not in the
   subnetwork. Node importance (V[ij]) is calculated using a tree-based
   feature selection method (Methods).
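
   As a worked illustration of the metric M, the Python sketch below sums
   the product of subnetwork score and node importance over a list of
   subnetworks. The function name `gene_ranking`, the input layout, and the
   numerical values are assumptions chosen for illustration only; they are
   not results from this study.

```python
from collections import defaultdict


def gene_ranking(subnetworks):
    """Rank genes by M_j, the sum over subnetworks of S_i * V_ij.

    `subnetworks` is a list of (S_i, importance) pairs, where `importance`
    maps each gene in subnetwork i to its importance V_ij. Genes absent
    from a subnetwork contribute zero for that subnetwork.
    """
    m_values = defaultdict(float)
    for s_i, importance in subnetworks:
        for gene, v_ij in importance.items():
            m_values[gene] += s_i * v_ij
    # Highest M value first.
    return sorted(m_values.items(), key=lambda item: item[1], reverse=True)


# Two hypothetical subnetworks with made-up scores and importances.
example = [
    (0.9, {"G1": 0.5, "G2": 0.5}),
    (0.8, {"G1": 1.0}),
]
ranking = gene_ranking(example)
```

   Here gene G1 appears in both subnetworks and obtains M = 0.9 x 0.5 +
   0.8 x 1.0 = 1.25, whereas G2 appears in only one and obtains M = 0.45,
   so genes shared by many high-scoring subnetworks rise to the top.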

   Table [107]2 shows the top 20 ranked genes based on their M values. We
   found the list included both differentially expressed (DE) genes and
   non-DE genes. In the DE group, EGFR [[108]56], a notable biomarker gene
   in TNBC, is present, which suggests the ranking derived by
   our proposed metric is sensible. The non-DE genes could not have been
   identified based on the gene expression data alone. However, by
   integrating the network and gene expression profiles, we could identify
   these putative biomarker genes that were not differentially expressed.

Table 2.

   The top 20 genes for classifying TNBC patients based on gene ranking
   metric
   Gene name   M value   Differentially expressed
   ABL1        58.5      YES
   GRB2        27.7      NO
   FYN         26        YES
   CSNK2B      24.3      YES
   NCK1        17.6      YES
   TRAF2       15.1      YES
   TGFBR1      12.3      NO
   MDFI        12.2      NO
   EGFR        11.9      YES
   ATXN1       11.5      NO
   SMAD1       11.3      NO
   CCDC85B     11.2      NO
   UBQLN4      10.9      NO
   PRKCA       10.6      YES
   CHD3        10        YES
   CRK         9.8       NO
   FXR2        9.7       YES
   PIK3R1      9.7       YES
   EP300       9.5       YES
   MAPK6       9.5       NO

   For the non-DE genes in Table [110]2, previous reports have linked
   the TGFBR1 and SMAD1 signaling pathways to breast
   cancer [[111]57,[112]58]. Previous studies also showed the MAPK signaling
   pathway to be activated in triple-negative breast cancer [[113]59].
   Gene Ontology (GO) and KEGG pathway enrichment analysis for the top 100
   genes by their M values was carried out using the DAVID tool [[114]60].
   The identified genes were highly enriched in cancer processes or
   pathways (Additional file [115]4: Table S1). We further investigated
   the network degree distribution for the 100 genes. The results showed
   that these genes tend to be higher degree nodes in the large PPI
   network (Additional file [116]5: Figure S4). Our results suggest
   that disease-associated genes have higher connectivity in
   the PPI network. Similar conclusions have also been reported in the
   literature [[117]61,[118]62].

   EgoNet can be viewed as a feature selection technique that identifies
   sets of genes to build a predictive model. Specifically, the gene sets
   considered are an ‘ego’ and its neighboring genes that can be reached
   from the ego at a certain path length. We leveraged the EgoNet method
   to search for subnetworks that can distinguish triple negative breast
   cancer tumors from other breast cancer subtypes, recovering several
   known breast cancer-related genes. Importantly, our results revealed a
   list of novel candidate genes that may provide a deeper understanding
   of breast cancer.

Conclusions

   In this study, we proposed EgoNet, an algorithm for selecting
   subnetworks whose gene expression is predictive of a disease phenotype.
   The key advantage of EgoNet is its capability to discover potential
   markers that are not differentially expressed, but are functionally
   associated with many differentially expressed genes. EgoNet is a
   general framework for ego-network selection. Here, we paired EgoNet
   with an SVM to solve a two-class (case/control) decision problem.
   However, when paired with an appropriate machine learning approach,
   EgoNet can be readily applied to datasets with continuous, multi-class,
   or survival outcome variables.

Methods

EgoNet algorithm

   The EgoNet algorithm is described in the following pseudocode.
   [Figure: EgoNet pseudocode (image 1471-2164-15-314-i2.gif)]

Assessing the significance of the identified ego-network

   When an ego-network is identified, a permutation test is performed to
   assess its statistical significance. The null distribution of
   classification accuracy is derived by randomly permuting the phenotypic
   labels B times and recalculating the score for the same ego-network
   each time. The actual score of the ego-network is then compared against
   the null distribution to obtain a p-value (Figure 2).
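
   The label-permutation procedure can be sketched as follows. This is an
   illustration under stated assumptions rather than the published code:
   centroid_accuracy is a toy stand-in for the classifier score (the
   study used SVM classification accuracy), the add-one correction in the
   p-value is one common convention that the text does not specify, and
   the toy samples are made up.

```python
import random


def centroid_accuracy(samples, labels):
    """Toy stand-in scorer: nearest-class-mean accuracy on summed expression."""
    totals = [sum(row) for row in samples]
    means = {}
    for cls in (0, 1):
        values = [t for t, y in zip(totals, labels) if y == cls]
        means[cls] = sum(values) / len(values)
    correct = 0
    for t, y in zip(totals, labels):
        predicted = 1 if abs(t - means[1]) < abs(t - means[0]) else 0
        correct += int(predicted == y)
    return correct / len(labels)


def permutation_p_value(score_fn, samples, labels, n_permutations=999, seed=0):
    """Permute the labels B times and locate the observed score in the null."""
    rng = random.Random(seed)
    observed = score_fn(samples, labels)
    shuffled = list(labels)
    at_least_as_good = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)                      # break the sample/label link
        if score_fn(samples, shuffled) >= observed:
            at_least_as_good += 1
    # Add-one correction (an assumption) keeps the p-value strictly positive.
    return observed, (at_least_as_good + 1) / (n_permutations + 1)


# Made-up expression values for one ego-network: 10 controls, then 10 cases.
samples = [[-1.0 - 0.1 * i, -0.5] for i in range(10)]
samples += [[1.0 + 0.1 * i, 0.5] for i in range(10)]
labels = [0] * 10 + [1] * 10

observed, p_value = permutation_p_value(centroid_accuracy, samples, labels)
```

   Because the score function is recomputed on every permuted labeling,
   any classifier that returns an accuracy can be substituted for the toy
   scorer without changing the test itself.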

Computation of ego-network node importance

   We employ Random Forest to rank the importance of variables, in this
   case the importance of the nodes of an ego-network for predicting
   disease outcome. The relative importance (RI) of a predictor in a
   Random Forest model is obtained from the out-of-bag (OOB) error
   estimate, as the increase in mean squared error (MSE) when the
   predictor values are permuted.

   For each tree $t$, let $OOB_t$ be the associated out-of-bag sample and
   $errOOB_t$ be the error of $t$ on this $OOB_t$ sample. Randomly permute
   the values of predictor $X^j$ in $OOB_t$ to get a perturbed sample
   denoted by $\widetilde{OOB}_t^j$, and compute $err\widetilde{OOB}_t^j$.
   The variable importance score of predictor $X^j$ is derived by

   $$VI(X^j) = \frac{1}{T} \sum_t \left( err\widetilde{OOB}_t^j - errOOB_t \right)$$

   where $T$ is the number of trees. We used the Python package “sklearn”
   to implement this procedure.
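
   To make the formula concrete, the sketch below computes the
   importance score from scratch for a list of already fitted trees. It
   is not the sklearn call used in the study. The representation of a
   tree as a (predict function, OOB indices) pair, the helper names, and
   the toy decision stump are assumptions made for illustration.

```python
import random


def mse(predict, rows, targets):
    """Mean squared error of `predict` on the given rows."""
    return sum((predict(r) - y) ** 2 for r, y in zip(rows, targets)) / len(rows)


def permutation_importance(trees, X, y, feature, seed=0):
    """Average increase in OOB error after permuting one predictor.

    `trees` is a list of (predict, oob_indices) pairs, one per tree.
    """
    rng = random.Random(seed)
    total = 0.0
    for predict, oob in trees:
        rows = [X[i] for i in oob]
        targets = [y[i] for i in oob]
        base_error = mse(predict, rows, targets)            # errOOB_t
        column = [r[feature] for r in rows]
        rng.shuffle(column)                                  # permute X^j within OOB_t
        permuted = [r[:feature] + [v] + r[feature + 1:] for r, v in zip(rows, column)]
        total += mse(predict, permuted, targets) - base_error
    return total / len(trees)                                # divide by T


def stump(row):
    """Hypothetical fitted tree: a single split on predictor 0."""
    return 1.0 if row[0] > 0 else 0.0


rng = random.Random(0)
X = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(30)]
y = [1.0 if row[0] > 0 else 0.0 for row in X]
# Each toy tree gets its own slice of samples as its OOB set.
trees = [(stump, range(0, 10)), (stump, range(10, 20)), (stump, range(20, 30))]

vi_used = permutation_importance(trees, X, y, feature=0)
vi_unused = permutation_importance(trees, X, y, feature=1)
```

   A predictor that no tree splits on leaves every prediction unchanged
   when permuted, so its score is exactly zero, whereas permuting a
   predictor the trees rely on can only raise the error in this toy
   setting.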

The design of simulation study

   We simulated each scenario 100 times. In each simulation, we generated
   a scale-free, undirected network with 500 nodes and no self-loops.
   Together with the network, a gene expression dataset with 500 genes
   and 100 samples was generated by randomly sampling expression values
   from the standard normal distribution. An ego-network was selected by
   first randomly choosing an ego node with network degree between 5 and
   20, and then taking the level 1 ego-network of that node. Eighty
   percent of the nodes in the ego-network were marked as disease genes,
   and the phenotypic outcomes were generated from the expression values
   of those disease genes using linear and nonlinear models. The linear
   relationship was formulated as $Y = \sum_i X_i$, and the nonlinear
   relationship as $Y = \sum_i X_i^3$, where $X_i$ is the expression value
   of disease gene $i$. Finally, $Y$ was dichotomized to 0 if $Y < 0$ and
   to 1 if $Y \ge 0$.
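
   One run of this simulation design could look like the sketch below.
   Several details are assumptions, since the text does not specify
   them: the scale-free network is grown by preferential attachment
   (Barabasi-Albert style) with two edges per new node, the ego itself
   counts as a member of its level 1 ego-network, and the 80% count is
   rounded to the nearest integer. All function names are hypothetical.

```python
import random


def scale_free_graph(n_nodes, edges_per_node, rng):
    """Undirected preferential-attachment graph without self-loops."""
    adjacency = {i: set() for i in range(n_nodes)}
    endpoints = []  # each node appears once per incident edge
    core = edges_per_node + 1
    for i in range(core):                      # fully connected seed core
        for j in range(i + 1, core):
            adjacency[i].add(j)
            adjacency[j].add(i)
            endpoints += [i, j]
    for new in range(core, n_nodes):
        targets = set()
        while len(targets) < edges_per_node:   # degree-proportional sampling
            targets.add(rng.choice(endpoints))
        for t in targets:
            adjacency[new].add(t)
            adjacency[t].add(new)
            endpoints += [new, t]
    return adjacency


def simulate(n_genes=500, n_samples=100, nonlinear=False, seed=0):
    """Generate network, expression matrix, ego, disease genes, and labels."""
    rng = random.Random(seed)
    graph = scale_free_graph(n_genes, 2, rng)
    expression = [[rng.gauss(0.0, 1.0) for _ in range(n_genes)]
                  for _ in range(n_samples)]
    candidates = [v for v, nbrs in graph.items() if 5 <= len(nbrs) <= 20]
    ego = rng.choice(candidates)
    members = sorted(graph[ego] | {ego})       # level 1 ego-network
    disease = rng.sample(members, round(0.8 * len(members)))
    power = 3 if nonlinear else 1
    labels = []
    for sample in expression:
        y = sum(sample[g] ** power for g in disease)
        labels.append(1 if y >= 0 else 0)      # dichotomize at zero
    return graph, expression, ego, disease, labels


graph, expression, ego, disease, labels = simulate(nonlinear=True, seed=1)
```

   Repeating simulate with different seeds, for both the linear and the
   nonlinear setting, would give the replicate datasets on which a
   subnetwork search method can be scored for recovering the true ego.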

Availability

   The EgoNet algorithm is implemented as Python scripts and is available
   at https://github.com/cauyrd/EgoNet.

Abbreviations

   TNBC: Triple-negative breast cancer; PPI: Protein-protein interaction;
   RF: Random forest; SVM: Support vector machine; KNN: K-nearest
   neighbor; FDR: False discovery rate; DE: Differentially expressed; MSE:
   Mean squared error; RI: Relative importance; OOB: Out of bag.

Competing interests

   The authors declare that they have no competing interests.

Authors’ contributions

   RY, ZQ and TY conceived and designed the study. RY implemented the
   method and conducted the simulation study. RY and TY conducted the data
   analysis. RY and YB interpreted the biological results. RY, YB and TY
   wrote the manuscript. All authors read and approved the final
   manuscript.

Supplementary Material

   Additional file 1: Figure S1

   Classification performance (A) and proportion of ego node coverage (B)
   for the proposed EgoNet method and Chuang et al.’s method in different
   simulation settings.

   Additional file 2: Figure S2

   The distribution of ego-network levels of the identified subnetworks.

   Additional file 3: Figure S3

   Boxplots of the expression levels of some important genes.

   Additional file 4: Table S1

   Enriched GO and KEGG categories for the top 100 disease-associated
   genes ranked by M value.

   Additional file 5: Figure S4

   Network degree distribution of the top 100 identified
   disease-associated genes ranked by M value (red curve) and all genes
   from the human PPI network (blue curve).

Contributor Information

   Rendong Yang, Email: yang4414@umn.edu.

   Yun Bai, Email: yunba@pcom.edu.

   Zhaohui Qin, Email: zhaohui.qin@emory.edu.

   Tianwei Yu, Email: tianwei.yu@emory.edu.

Acknowledgements