Abstract

   Complex diseases, such as breast cancer, are often caused by mutations
   of multiple functional genes. Identifying disease-related genes is a
   critical and challenging task for unveiling the biological mechanisms
   behind these diseases. In this study, we develop a novel computational
   framework to analyze the network properties of the known breast
   cancer–associated genes, based on which we develop a
   random-walk-with-restart (RCRWR) algorithm to predict novel disease
   genes. Specifically, we first curated a set of breast cancer–associated
   genes from the Genome-Wide Association Studies catalog and Online
   Mendelian Inheritance in Man database and then studied the distribution
   of these genes on an integrated protein–protein interaction (PPI)
   network. We found that the breast cancer–associated genes are
   significantly closer to each other than random, which confirms the
   modularity property of disease genes in a PPI network as revealed by
   previous studies. We then retrieved PPI subnetworks spanning top breast
   cancer–associated KEGG pathways and found that the distribution of
   these genes on the subnetworks are non-random, suggesting that these
   KEGG pathways are activated non-uniformly. Taking advantage of the
   non-random distribution of breast cancer–associated genes, we developed
   an improved RCRWR algorithm to predict novel cancer genes, which
   integrates network reconstruction based on local random walk dynamics
   and subnetworks spanning KEGG pathways. Compared with the disease gene
   prediction without using the information from the KEGG pathways, this
   method has a better prediction performance on inferring breast
   cancer–associated genes, and the top predicted genes are better
   enriched on known breast cancer–associated gene ontologies. Finally, we
   performed a literature search on top predicted novel genes and found
   that most of them are supported by at least wet-lab experiments on cell
   lines. In summary, we propose a robust computational framework to
   prioritize novel breast cancer–associated genes, which could be used
   for further in vitro and in vivo experimental validation.

   Keywords: disease-gene prediction, protein-protein interactions, KEGG
   pathway, breast cancer, network propagation

Introduction

   Complex diseases, such as cancers, are often caused by dysfunction of
   multiple genes. The pathogenic mechanism is often due to molecular
   abnormalities, which affect the biological function of the body through
   biomolecular networks, resulting in complex and diverse diseases
   (Taherian-Fard et al., [37]2015). The gene families of RAS, MYC, ERBB,
   and FGFR are common proto-oncogenes (Bi et al., [38]2018). Although
   chemoradiotherapy remains the standard treatment for some cancers, the
   majority of patients, who are sensitive initially, develop resistance
   after multiple relapses, for example, platinum resistance (Guan and Lu,
   [39]2018). Besides this, molecular targeted therapy is expected to be
   more effective and less toxic compared with chemoradiotherapy. The Food
   and Drug Administration has approved several targeted medicines. The
   research and wide application of EGFR-TKI (Tyrosine kinase inhibitors)
   drugs, mainly including Gefitinib, Erlotinib, Icotinib, Afatinib,
   Dasatinib, and Osimertinib, have greatly improved the overall survival
   of patients with lung cancer with the EGFR gene mutation. In this case,
   molecular targeted therapy has brought us much closer to personalized
   therapy, which will improve the therapeutic effect and prognosis for
   patients (Colli et al., [40]2017). Therefore, identifying
   disease-related genes is a critical and challenging task for the study
   of complex diseases, which can help us understand the mechanisms of
   diseases, identify treatment targets, and develop novel treatment
   strategies (Aitman, [41]2002; Gill et al., [42]2014).

   Traditional approaches to identification of disease-related genes, such
   as linkage analysis, involves a candidate list consisting of hundreds
   of genes, requiring a lot of cost and time for in-depth validation
   (Gill et al., [43]2014; Opap and Mulder, [44]2017). As such,
   disease-gene prediction has attracted much attention in past decades,
   and many computational algorithms have been developed to predict
   disease-related genes to minimize the cost and time for the study of
   disease-related genes (Chen et al., [45]2014; Gill et al., [46]2014;
   Opap and Mulder, [47]2017; Luo et al., [48]2019a,[49]b). Many studies
   show that genes associated with the same or similar diseases often are
   more similar in function than others (Goh et al., [50]2007). Functional
   similar genes as well as their products often have physical
   interactions or functional associations. At present, with the rapid
   development of high-throughput technology, a large number of physical
   and functional relationships between biomolecules have been revealed,
   and these form complex biomolecular networks, e.g., protein–protein
   interaction (PPI) networks (Keshava Prasad et al., [51]2009), gene
   co-expression networks, and pathway networks (Kanehisa and Goto,
   [52]2000). It is found that a gene is more likely to be related to a
   disease if there exists direct physical interactions or strong
   functional associations between it and known disease-related genes.
   Therefore, “guilt by association” becomes a popular strategy for
   disease-gene prediction (Oliver, [53]2000; Wu et al., [54]2008; Hu et
   al., [55]2018), and network propagation, such as random walk, has
   become a widely used approach for disease-gene prediction (Cowen et
   al., [56]2017). However, the existing PPI network is still incomplete,
   and there is a lot of data noise. How to improve the PPI network so as
   to enhance the ability to predict disease genes is still a problem that
   needs further study.

   Breast cancer is one of the common malignant tumors among women all
   over the world. Surgery is still the preferred treatment for breast
   cancer. However, patients with poor systemic conditions, such as
   serious diseases in the main organs, are prohibited from using surgical
   treatment. Therefore, to expand the benefit population and improve the
   treatment effect of breast cancer patients, targeted therapy occupies
   the most important position in the treatment of breast cancer (Valencia
   et al., [57]2017). To identify breast cancer–related genes more
   effectively, we conduct analysis and prediction of breast
   cancer–related genes based on the PPI network and KEGG pathway because
   PPIs are proven to be very useful in disease-gene prediction, and the
   physical and functional relationships between genes in the KEGG
   pathways are stronger and more reliable than others. After collecting
   disease-gene associations for breast cancer as well as many other
   diseases, PPIs and KEGG pathway data, we first analyze breast
   cancer–related genes from two aspects: network and enrichment analysis.
   Then, to enhance the ability for disease-gene prediction, we propose an
   improved algorithm (RCRWR), which consists of network reconstruction
   based on local random walk dynamics and random walk with restart.
   Further, we also improve the prediction ability for disease-related
   genes by integrating KEGG pathway data. Finally, we conduct extensive
   analysis for candidate genes.

   The rest of the paper is organized as follows. Section Materials and
   Methods describes the materials and methods used in the study,
   including the improved algorithm (RCRWR) for disease-gene prediction.
   Section Results conducts the analysis of disease-related genes by
   network and enrichment analysis and then evaluates the performance of
   RCRWR when predicting genes related to breast cancer and other
   diseases. The results confirm the effectiveness of RCRWR and the
   important roles of KEGG pathway data in enhancing the ability of
   disease-gene prediction. Finally, Section Conclusion draws conclusions.

Materials and Methods

   Here, we first prepare the following data sets: known disease-gene
   associations, PPIs, and KEGG pathway data. Then, we introduce the
   methods for statistics of breast cancer-related genes and the improved
   algorithm for predicting disease-related genes.

Data SOURCES

Disease-Gene Associations

   The disease/trait associated genes were retrieved from the National
   Institutes of Health Genome-Wide Association Studies (GWAS) catalog
   ([58]https://www.ebi.ac.uk/gwas/) (Danielle et al., [59]2013) and
   Online Mendelian Inheritance in Man (OMIM) ([60]https://omim.org/)
   (Hamosh, [61]2004). Some GWAS catalog disease categories are closely
   related but named differently by different investigators, some of which
   have many overlapping genes (e.g., see [62]Supplementary Tables 1,
   [63]2). It is helpful to merge the related groups of diseases. For that
   purpose, a hierarchical clustering of diseases is applied to cluster
   these diseases according to their common disease-related genes. Similar
   diseases in GWAS and OMIM are manually merged based on disease names.
   The data set was obtained from the previous study (Yang et al.,
   [64]2016).

PPIs

   In the various types of data that have been used for the prediction of
   disease genes, PPIs are the most widely used data. The PPI network was
   obtained from the database of STRING ([65]https://string-db.org) (von
   Mering et al., [66]2003), which quantitatively incorporates several
   studies and interaction types. In this study, we consider only the
   undirected and weighted network.

KEGG Pathways

   We downloaded the KEGG pathway data set from KEGG (Kanehisa and Goto,
   [67]2000) ([68]https://www.genome.jp/) and MSigDB
   ([69]https://www.gsea-msigdb.org) (Liberzon et al., [70]2011). The KEGG
   pathway database is a collection of manually drawn pathway maps
   representing our knowledge on the molecular interaction, reaction, and
   relation networks for metabolism, genetic information processing,
   environmental information processing, cellular processes, organismal
   systems, human diseases, and drug development. MSigDB provides gene
   sets of canonical KEGG pathways derived from the KEGG pathway database.
   This data set contains 5,267 unique genes.

   Data preparation: We prepare the disease-gene associations, PPI
   network, and pathway data. Analysis of breast cancer-related genes: We
   conduct two types of analysis for disease-related genes (network and
   enrichment). Prediction of breast cancer-related genes: We evaluate the
   prediction performance based on the PPI network and PPI & KEGG pathway,
   and then we prioritize the candidate genes related to breast cancer by
   using all known disease-related genes as a training set. Analysis of
   candidate genes for breast cancer: We conduct three types of analysis
   for the candidate genes related to breast cancer (enrichment analysis
   of GO and KEGG as well as literature validation).

Statistics of Breast Cancer–Related Genes

Network Analysis

   First, we extract the disease-gene subnetwork related to a specific
   disease by retaining genes related to this disease and removing all
   other genes from the PPI network. We calculate six statistical measures
   of the network to evaluate the disease-gene subnetwork: (a) the number
   of genes; (b) the number of edges; (c) the average degrees of nodes;
   (d) clustering coefficient in the subnetwork; (e) link density, which
   is defined as ratio of the number of existing interactions to its
   maximum of possible edges; and (f) a p-value is given to evaluate the
   significance of interaction enrichment in the subnetwork.

   Then, we analyze the distribution of breast cancer–related genes in
   KEGG pathways (e.g., gastric cancer, cellular senescence, human T cell
   leukemia virus 1 infection, breast cancer, melanoma) by calculating (a)
   the number of common genes between the pathway and the breast
   cancer–related gene set; (b) the number of genes in KEGG pathway; (c)
   the number of edges in the subnetwork of the KEGG pathway; (d) the
   average degrees of nodes; (e) the clustering coefficient in the
   subnetwork; and (f) the link density, which is defined as ratio of the
   number of existing interactions to its maximum of possible edges as
   well as (g) a p-value indicating the significance of gene enrichment in
   the KEGG pathway.

   To demonstrate the higher connectivity of the related subnetworks, we
   compare these statistical quantities to those of random subsets of
   genes mapped on the PPI network with the same number of genes and same
   degree distribution.

Enrichment Analysis

   Enrichment analysis is a widely used approach to identify biological
   themes. We analyze the enrichment of the gene set in GO and the
   pathway. P-values using the hypergeometric distribution are defined as
   [MATH: <mrow><mi>p</mi><mo>=</mo><mn>1</mn><mo>−</mo><mstyle
   displaystyle="true"><munderover><mo>∑</mo><mrow><mi>i</mi><mo>=</mo><mn
   >0</mn></mrow><mrow><mi>k</mi><mo>−</mo><mn>1</mn></mrow></munderover><
   mrow><mfrac><mrow><mrow><mo>(</mo><mrow><mtable><mtr><mtd><mi>M</mi></m
   td></mtr><mtr><mtd><mi>i</mi></mtd></mtr></mtable></mrow><mo>)</mo></mr
   ow><mrow><mo>(</mo><mrow><mtable><mtr><mtd><mrow><mi>N</mi><mo>−</mo><m
   i>M</mi></mrow></mtd></mtr><mtr><mtd><mrow><mi>n</mi><mo>−</mo><mi>i</m
   i></mrow></mtd></mtr></mtable></mrow><mo>)</mo></mrow></mrow><mrow><mro
   w><mo>(</mo><mrow><mtable><mtr><mtd><mi>N</mi></mtd></mtr><mtr><mtd><mi
   >n</mi></mtd></mtr></mtable></mrow><mo>)</mo></mrow></mrow></mfrac></mr
   ow></mstyle><mo>,</mo></mrow> :MATH]
   (1)

   where N is the total number of genes in the background distribution, M
   is the number of genes with given annotations in that distribution, n
   is the size of the list of genes of interest, and k is the number of
   genes with the annotations in this list. P-values are adjusted for
   multiple comparisons, and q-values are also calculated for FDR control.

   The clusterProfiler package was used to perform the enrichment analysis
   for GO terms and KEGG pathways (Yu et al., [71]2012). As such, the
   background genes are dependent on the databases used by this package.
   This package depends on the bioconductor annotation data GO.db and
   KEGG.db to obtain the maps of the entire GO and KEGG corpus. It
   provides functions, enrichGO and enrichKEGG, to perform the enrichment
   test for GO terms and KEGG pathways based on hypergeometric
   distribution. According to the description of clusterProfiler, the
   background genes should be all genes within a given annotation file,
   e.g., the GO annotation file. However, the version of the specific
   annotation file is dependent on the clusterProfiler package.

Improved Algorithm for Predicting Breast Cancer-Related Genes

   As shown in Figure 2, breast cancer–related genes tend to be connected
   with each other in the PPI network. As such, the network-based
   algorithms can often provide useful insight to infer breast
   cancer–related (candidate) genes. In this case, the PPI network is
   critical. Despite the rapid development of biotechnologies, there is
   still a large amount of data noise in the existing PPI network.
   Therefore, we propose an improved algorithm (RCRWR), which consists of
   network reconstruction based on local random walk dynamics and random
   walk with restart (see Algorithm 1 for the workflow of RCRWR). We try
   to use local random walks to extract the feature vectors of nodes
   (i.e., genes or proteins) and then use the feature vectors to calculate
   the similarity between nodes and reconstruct the PPI network to reduce
   the impact of data noise so as to improve the ability of disease-gene
   prediction based on the PPI network. Furthermore, we use KEGG pathways
   to enhance the ability to predict disease-related genes because the
   connections in the KEGG pathways tend to be stronger and more reliable
   than others.

   Algorithm 1 RCRWR Algorithm.
   Input: PPIs, known disease genes, and number (k) of nearest neighbors.
   Output: Probability scores.
   1: Calculate behavior vectors (i.e., feature vectors) of all nodes by
   local random walk dynamics in the PPI network.
   2: Calculate similarity scores between all nodes by the behavior
   vectors.
   3: Generate a reconstructed PPI network by only retaining similarity
   scores between each node i (= 1~n) and its k-nearest neighbors.
   4: Calculate probability scores of all nodes by applying network
   propagation based on random walk with restart to the reconstructed
   network, where known disease genes are used as seed nodes.
   [72]Open in a new tab

Network Reconstruction Based on Local Random Walks

Similarity Measure Based on Local Random Walk Dynamics

   Generally, similar behavior patterns appear when the dynamic processes
   are triggered on similar nodes. Therefore, we applied the local random
   walk dynamics to infer the similarity measure between nodes (Lai et
   al., [73]2010; Xiang et al., [74]2016). The probability of a walker
   from one node to others in k-step random walk is determined by
   probability matrix P^k (k is random walk length, determining the range
   of the local structure that will be explored). Due to the small-world
   effect, good results can generally be generated by using a small
   k-value (k = 2, 3, …). The element P[ij] of the transition matrix P is
   the ratio between the weight of link (i, j) and the weighted degree of
   vertex i, P[ij] = w[ij]/∑[j]w[ij], where w[ij] is the weight of edge
   (i, j). The behaviors of the random walk dynamics from a node can be
   quantified by a n-dimensional vector v[i] (i = 1~n; n is the number of
   nodes in a network), which is defined as the row of the matrix
   [MATH: <munderover accentunder="false"
   accent="false"><mrow><mo>∑</mo></mrow><mrow><mi>τ</mi><mo>=</mo><mn>1</
   mn></mrow><mrow><mi>k</mi></mrow></munderover><msup><mrow><mi>P</mi></m
   row><mrow><mi>τ</mi></mrow></msup> :MATH]
   . Here, all random walks whose steps vary from 1 to k are taken into
   consideration to reinforce the contributions from the nodes near the
   target nodes. The similarity measure between nodes based on the local
   random walk dynamics can be calculated by,
   [MATH: <mtable class="eqnarray" columnalign="right center
   left"><mtr><mtd><msub><mrow><mi>S</mi></mrow><mrow><mi>i</mi><mi>j</mi>
   </mrow></msub><mo>=</mo><mfrac><mrow><mrow><mo
   stretchy="false">(</mo><mrow><msub><mrow><mi>v</mi></mrow><mrow><mi>i</
   mi></mrow></msub><mo>,</mo><msub><mrow><mi>v</mi></mrow><mrow><mi>j</mi
   ></mrow></msub></mrow><mo
   stretchy="false">)</mo></mrow></mrow><mrow><msqrt><mrow><mrow><mo
   stretchy="false">(</mo><mrow><msub><mrow><mi>v</mi></mrow><mrow><mi>i</
   mi></mrow></msub><mo>,</mo><msub><mrow><mi>v</mi></mrow><mrow><mi>i</mi
   ></mrow></msub></mrow><mo
   stretchy="false">)</mo></mrow></mrow></msqrt><msqrt><mrow><mrow><mo
   stretchy="false">(</mo><mrow><msub><mrow><mi>v</mi></mrow><mrow><mi>j</
   mi></mrow></msub><mo>,</mo><msub><mrow><mi>v</mi></mrow><mrow><mi>j</mi
   ></mrow></msub></mrow><mo
   stretchy="false">)</mo></mrow></mrow></msqrt></mrow></mfrac></mtd></mtr
   ></mtable> :MATH]
   (2)

   where, if the behavior vectors v[x] and v[y] are highly consistent,
   then
   [MATH:
   <msubsup><mrow><mi>s</mi></mrow><mrow><mi>i</mi><mi>j</mi></mrow><mrow>
   </mrow></msubsup><mo>→</mo><mn>1</mn> :MATH]
   ; otherwise,
   [MATH:
   <msubsup><mrow><mi>s</mi></mrow><mrow><mi>i</mi><mi>j</mi></mrow><mrow>
   </mrow></msubsup><mo>→</mo><mn>0</mn> :MATH]
   .

Network Reconstruction

   We denote an undirected and weighted network by G = (V, E, W), where V
   is a set of proteins, E is a set of interactions, and W is a set of
   confidence scores of interactions in the original network. By using the
   above similarity measure based on local random walk dynamics (Equation
   2), we calculate the similarity scores between all nodes in the
   original PPI network and obtain a similarity matrix S, where S[ij]
   records the similarity score between nodes i and j. Then, we use the
   similarity scores to reconstruct the PPI network by retaining only the
   connections/similarity scores between each node i and its k-nearest
   neighbors (that is, its k neighbors with the highest similarity scores
   to the node i). The mathematical description of the reconstruction
   process is as follows.

   Definition 1. For each node i, according to the similarity scores
   between the node and other nodes, all nodes are sorted in a descending
   order. By the descending order of all nodes, we define a ranking index
   vector,
   [MATH:
   <msubsup><mrow><mi>R</mi></mrow><mrow><mo>▪</mo><mo>,</mo><mi>i</mi></m
   row><mrow></mrow></msubsup><mo>=</mo><mrow><mo>{</mo><mrow><msubsup><mr
   ow><mi>R</mi></mrow><mrow><mi>j</mi><mo>,</mo><mi>i</mi></mrow><mrow></
   mrow></msubsup><mo>|</mo><mi>j</mi><mo>=</mo><mn>1</mn><mo>,</mo><mo>.<
   /mo><mo>.</mo><mo>.</mo><mo>,</mo><mi>n</mi></mrow><mo>}</mo></mrow>
   :MATH]
   , to record ranking indices of all nodes about the node i (note that
   node i itself is given a largest ranking index), where R[j, i] records
   the ranking index of node j in this case, and n is the number of nodes
   in the network.

   Definition 2. By combining the ranking vectors about all nodes, we
   define a ranking matrix
   [MATH: <mi>R</mi><mo>=</mo><mrow><mo
   stretchy="false">(</mo><mrow><msubsup><mrow><mi>R</mi></mrow><mrow><mo>
   ▪</mo><mo>,</mo><mn>1</mn></mrow><mrow></mrow></msubsup><mo>,</mo><msub
   sup><mrow><mi>R</mi></mrow><mrow><mo>▪</mo><mo>,</mo><mn>2</mn></mrow><
   mrow></mrow></msubsup><mo>,</mo><mo>.</mo><mo>.</mo><mo>.</mo><mo>,</mo
   ><msubsup><mrow><mi>R</mi></mrow><mrow><mo>▪</mo><mo>,</mo><mi>n</mi></
   mrow><mrow></mrow></msubsup></mrow><mo stretchy="false">)</mo></mrow>
   :MATH]
   , where n is the number of nodes in the network.

   Definition 3. By using the ranking matrix and the similarity matrix S,
   we define a reconstructed and undirected network
   [MATH: <mi>Ĝ</mi><mo>=</mo><mrow><mo
   stretchy="false">(</mo><mrow><mover
   accent="true"><mrow><mi>V</mi></mrow><mo>^</mo></mover><mo>,</mo><mi>Ê<
   /mi><mo>,</mo><mi>Ŵ</mi></mrow><mo stretchy="false">)</mo></mrow>
   :MATH]
   , where
   [MATH: <mover
   accent="true"><mrow><mi>V</mi></mrow><mo>^</mo></mover><mo>=</mo><mi>V<
   /mi> :MATH]
   , Ê and Ŵ denote the set of edges and the set of weights of edges in
   the reconstructed network, respectively:
   [MATH: <mtable
   columnalign="left"><mtr><mtd><mi>Ê</mi><mo>=</mo><mrow><mo>{</mo><mrow>
   <mrow><mo
   stretchy="false">(</mo><mrow><mi>j</mi><mo>,</mo><mi>i</mi></mrow><mo
   stretchy="false">)</mo></mrow><mo>|</mo><mi>i</mi><mo>=</mo><mn>1</mn><
   mo>~</mo><mtext> </mtext><mi>n</mi><mo>,</mo><mtext> </mtext><mi>j</mi>
   <mo>=</mo><mn>1</mn><mo>~</mo><mtext> </mtext><mi>n</mi><mo>,</mo><mtex
   t> </mtext><msubsup><mrow><mi>R</mi></mrow><mrow><mi>j</mi><mo>,</mo><m
   i>i</mi></mrow><mrow></mrow></msubsup><mo>≤</mo><mi>k</mi><mtext> </mte
   xt></mrow><mo>}</mo></mrow><mo>,</mo></mtd></mtr><mtr><mtd><mi>Ŵ</mi><m
   o>=</mo><mrow><mo>{</mo><mrow><msub><mrow><mi>S</mi></mrow><mrow><mi>j<
   /mi><mo>,</mo><mi>i</mi></mrow></msub><mo>|</mo><mi>i</mi><mo>=</mo><mn
   >1</mn><mo>~</mo><mi>n</mi><mo>,</mo><mi>j</mi><mo>=</mo><mn>1</mn><mo>
   ~</mo><mi>n</mi><mo>,</mo><msub><mrow><mi>R</mi></mrow><mrow><mi>j</mi>
   <mo>,</mo><mi>i</mi></mrow></msub><mo>≤</mo><mi>k</mi></mrow><mo>}</mo>
   </mrow><mo>,</mo></mtd></mtr></mtable> :MATH]

   where S[j, i] = S[i, j], and k denotes the number of the nearest
   neighbors (k = 50 for default).

   In the reconstruction process for a given k-value, the newly added
   edges can be denoted by
   [MATH:
   <msubsup><mrow><mi>Ê</mi></mrow><mrow><mi>a</mi><mi>d</mi><mi>d</mi></m
   row><mrow></mrow></msubsup><mo>=</mo><mrow><mo>{</mo><mrow><mrow><mo
   stretchy="false">(</mo><mrow><mi>j</mi><mo>,</mo><mi>i</mi></mrow><mo
   stretchy="false">)</mo></mrow><mo>|</mo><mi>i</mi><mo>=</mo><mn>1</mn><
   mo>~</mo><mtext> </mtext><mi>n</mi><mo>,</mo><mtext> </mtext><mi>j</mi>
   <mo>=</mo><mn>1</mn><mo>~</mo><mtext> </mtext><mi>n</mi><mo>,</mo><mtex
   t> </mtext><msubsup><mrow><mi>R</mi></mrow><mrow><mi>j</mi><mo>,</mo><m
   i>i</mi></mrow><mrow></mrow></msubsup><mo>≤</mo><mi>k</mi><mtext> </mte
   xt><mi>a</mi><mi>n</mi><mi>d</mi><mtext> </mtext><mrow><mo
   stretchy="false">(</mo><mrow><mi>j</mi><mo>,</mo><mi>i</mi></mrow><mo
   stretchy="false">)</mo></mrow><mo>∉</mo><mi>E</mi></mrow><mo>}</mo></mr
   ow> :MATH]
   ; the removed edges can be denoted by
   [MATH:
   <msubsup><mrow><mi>Ê</mi></mrow><mrow><mi>r</mi><mi>e</mi><mi>m</mi><mi
   >o</mi><mi>v</mi><mi>e</mi></mrow><mrow></mrow></msubsup><mo>=</mo><mi>
   E</mi><mo>\</mo><mi>Ê</mi> :MATH]
   ; the retained edges can be denoted by
   [MATH:
   <msubsup><mrow><mi>Ê</mi></mrow><mrow><mi>r</mi><mi>e</mi><mi>t</mi><mi
   >a</mi><mi>i</mi><mi>n</mi></mrow><mrow></mrow></msubsup><mo>=</mo><mi>
   E</mi><mo>⋂</mo><mi>Ê</mi> :MATH]
   ; and the weights of the retained edges are substituted by the
   similarity scores obtained by the similarity measure based on local
   random walk dynamics.

   By using the reconstruction process, we can generate a reconstructed
   and undirected network. The reconstructed network may enhance our
   ability for disease-gene prediction because it can improve the original
   PPI network. To show the effect of the reconstruction process on the
   PPI network, we have generated a set of reconstructed PPI networks by
   using a series of k-values, and then we calculate the mean score (in
   the String database) of retained edges
   [MATH:
   <msubsup><mrow><mi>Ê</mi></mrow><mrow><mi>r</mi><mi>e</mi><mi>t</mi><mi
   >a</mi><mi>i</mi><mi>n</mi></mrow><mrow></mrow></msubsup> :MATH]
   and removed edges
   [MATH:
   <msubsup><mrow><mi>Ê</mi></mrow><mrow><mi>r</mi><mi>e</mi><mi>m</mi><mi
   >o</mi><mi>v</mi><mi>e</mi></mrow><mrow></mrow></msubsup> :MATH]
   for each k value. The results show that the mean score (in the String
   database) of the retained edges tends to be larger than that of the
   removed edges (see [75]Supplementary Figure 1). This is consistent with
   our expectation: By using the reconstruction process, PPIs with high
   reliability in the String database tend to be retained, and PPIs with
   low reliability in the String database tend to be removed, and the
   reconstruction process also supplements some edges with high similarity
   scores that do not exist in the original PPI network. Moreover, we have
   provided an example figure to compare the original network with the
   reconstructed one, which shows the effect of network reconstruction on
   the original network, so the reader can more clearly see what is being
   done (see [76]Supplementary Figure 2).

   As a whole, this reconstruction process may reduce data noise to a
   certain extent to optimize the PPI network so as to improve the network
   data environment for disease-gene prediction. In the following step, we
   apply network propagation to the reconstructed network to predict
   disease-related genes more effectively.

Network Propagation Based on Random Walk With Restart

   The random walk with restart can been seen as performing multiple
   random walks over the PPI network, each starting from a seed node
   associated to a known disease gene, iteratively moving from one node to
   a random neighbor, and the stationary distribution can be considered as
   a measure of the proximity between the seed(s) and all the other nodes
   in the network. More formally, the random walk with restart is defined
   as
   [MATH: <mtable class="eqnarray" columnalign="right center
   left"><mtr><mtd><msubsup><mrow><mi>p</mi></mrow><mrow><mi>t</mi><mo>+</
   mo><mn>1</mn></mrow><mrow><mi>T</mi></mrow></msubsup><mo>=</mo><mrow><m
   o stretchy="false">(</mo><mrow><mn>1</mn><mo>-</mo><mi>r</mi></mrow><mo
   stretchy="false">)</mo></mrow><mi>M</mi><msubsup><mrow><mi>p</mi></mrow
   ><mrow><mi>t</mi></mrow><mrow><mi>T</mi></mrow></msubsup><mo>+</mo><mi>
   r</mi><msubsup><mrow><mi>p</mi></mrow><mrow><mn>0</mn></mrow><mrow><mi>
   T</mi></mrow></msubsup></mtd></mtr></mtable> :MATH]
   (3)

   Here, p[0] is the initial probability distribution. M is the
   column-normalized adjacency matrix of the graph. r∈(0, 1) is the
   restart probability, and it is set to be 0.7 as suggested by previous
   studies (Zhao et al., [77]2015). p[t] is the probability vector of the
   random walker reaching all nodes at the end of the tth step. After
   several iterations, the difference between the vectors p[t+1] and p[t]
   becomes negligible, the stationary probability distribution is reached,
   and the element in the vector represents a proximity measure between
   every graph node and the seed(s). In this work, iterations are repeated
   until the difference between p[t] and p[t+1] falls below 10^-6 as used
   by previous studies (Zhao et al., [78]2015).

   Note that for cross-validation, the known disease-related genes in the
   training set are used as seed nodes to conduct the random walk with
   restart, and all known disease-related genes are used as seed nodes
   when predicting novel candidate genes.

Prediction Based on PPI Network

   We first prepare the PPI network. The PPI network from the String
   database retains edges with confidence scores >400, and we normalize
   the confidence scores to be between zero and one by dividing a value of
   1,000. The PPI network is used as the original PPI network. We use a
   weighted graph G = (V, E, W) to denote the PPI network comprising a set
   of proteins V, a set of interactions E, and a set of confidence scores
   W. Then, we map known breast cancer–related genes into the PPI network
   and conduct the random walk with restart to predict disease-related
   genes. Finally, the probabilities of nodes are used to rank candidate
   genes.

Prediction Based on PPI Network and KEGG Pathway

   Similarly, we prepare the related data sets, including the PPI network,
   breast cancer–related genes, and KEGG pathway. The PPI network still
   retains edges with confidence >400. We map known breast cancer–related
   genes to the PPI network. Then, KEGG pathways are mapped into the PPI
   network and intersect with the above network. Finally, we perform the
   random walk with restart to predict breast cancer–related genes.

Performance Evaluation

   To evaluate the prediction performance of the algorithm, we apply
   traditional 3-fold cross-validation in the benchmark. Each time, the
   known disease genes are randomly split into three parts. Each part is,
   in turn, used as test set and the rest as a training set. Then, we use
   the genes in the training set as seeds to perform the random walk with
   restart to predict disease-related genes. Note that, in the process of
   predicting disease genes, only genes in the training set are used as
   seed genes. For the cross-validation, the training set made up of two
   thirds of all disease genes randomly selected. For the prediction of
   novel genes, all known disease genes are used as the training set.

   For a disease d in disease set D, T[d] denotes the set of genes in test
   set. The disease-gene prediction algorithm provides a ranking list of
   candidate genes for disease d. We denote by R[d](k) the set of top k
   candidate genes in the ranking list. Then recall in the top k ranking
   list is defined as
   [MATH: <mtable class="eqnarray" columnalign="right center
   left"><mtr><mtd><mi>R</mi><mi>e</mi><mi>c</mi><mi>a</mi><mi>l</mi><mi>l
   </mi><mtext> </mtext><mrow><mo
   stretchy="false">(</mo><mrow><mi>k</mi></mrow><mo
   stretchy="false">)</mo></mrow><mo>=</mo><mfrac><mrow><mo>|</mo><msub><m
   row><mi>T</mi></mrow><mrow><mi>d</mi></mrow></msub><mo>∩</mo><msub><mro
   w><mi>R</mi></mrow><mrow><mi>d</mi></mrow></msub><mrow><mo
   stretchy="false">(</mo><mrow><mi>k</mi></mrow><mo
   stretchy="false">)</mo></mrow><mo>|</mo></mrow><mrow><mo>|</mo><msub><m
   row><mi>T</mi></mrow><mrow><mi>d</mi></mrow></msub><mo>|</mo></mrow></m
   frac></mtd></mtr></mtable> :MATH]
   (4)

   This metric is used to evaluate the performance of prediction
   algorithms.

Results

   Here, we first conduct two types of analysis for breast cancer–related
   genes: (1) network analysis of the breast cancer–related subnetwork and
   KEGG pathways and (2) enrichment analysis of GO and the pathway of
   breast cancer–related genes. Then, we predict breast cancer–related
   genes on the (reconstructed) PPI network with and without the KEGG
   pathways and analyze the prediction performance, including (1)
   quantitative evaluation on the known breast cancer–related gene set,
   (2) enrichment analysis of GO and the pathway of candidate genes, and
   (3) a literature validation of candidate genes. [79]Figure 1 shows the
   workflow.

Figure 1.

   [80]Figure 1
   [81]Open in a new tab

   Workflow of the work.

Analysis of Breast Cancer-Related Genes

Network Analysis

Subnetwork of Breast Cancer-Related Genes

   Breast cancer–related genes were obtained from Yang et al. ([82]2016).
   After mapping breast cancer–related genes into the PPI network, there
   are only 127 breast cancer–related genes. We first analyze the
   distribution of breast cancer–related genes in the PPI network as well
   as KEGG pathways ([83]Figure 2). [84]Supplementary Figures 3–[85]7
   provide larger plots so that gene names can be identified more easily.
   [86]Figure 2A displays the subnetwork of breast cancer–related genes.
   The subnetwork is extracted from the PPI network by only retaining
   breast cancer–related genes. We quantitatively analyze the breast
   cancer–related subnetwork by calculating six statistical measures of
   networks (see [87]Table 1). We find that the breast cancer–related
   subnetwork has a higher value of the clustering coefficient (CC) and
   higher link density compared with random sampling on the whole network,
   showing significantly more interactions than expected. These results
   quantitatively suggest that the breast cancer–related genes/proteins
   tend to interact with each other, forming disease module with higher
   link density than expected.

Figure 2.

   [88]Figure 2
   [89]Open in a new tab

   (A) Subnetwork of breast cancer–related genes extracted from the PPI
   network. (B–F) Subnetworks extracted from the PPI network by using the
   sets of genes of five KEGG pathways, respectively: gastric cancer,
   cellular senescence, human T cell leukemia virus 1 infection, breast
   cancer, and melanoma. Note that green nodes with larger size denote
   known breast cancer–related genes.

Table 1.

   Statistics of disease-gene subnetworks related to breast cancer as well
   as other diseases.
   Disease #Genes #Interactions Degree CC Link density p-value
   Breast cancer 130 477 (232 ± 24) 7.3 0.55 (0.42 ± 0.04) 5.7% (2.8% ±
   0.3%) <1.0e-16
   Rheumatoid arthritis 115 607 (87 ± 15) 10.6 0.45 (0.36 ± 0.04) 9.3%
   (1.3% ± 0.2%) <1.0e-16
   Cholesterol 221 1,152 (245 ± 27) 10.4 0.47 (0.37 ± 0.03) 4.7% (1.0% ±
   0.1%) <1.0e-16
   Obesity 102 764 (65 ± 14) 15.0 0.62 (0.35 ± 0.05) 14.8% (1.3% ± 0.3%)
   <1.0e-16
   Hypertension 104 234 (64 ± 9) 4.5 0.44 (0.35 ± 0.05) 4.4% (1.2% ± 0.2%)
   <1.0e-16
   Metabolic traits 135 439 (70 ± 10) 6.5 0.38 (0.34 ± 0.04) 4.9% (0.8% ±
   0.1%) <1.0e-16
   Crohn's disease 194 847 (198 ± 27) 8.7 0.50 (0.38 ± 0.04) 4.5% (1.1% ±
   0.1%) <1.0e-16
   Inflammatory bowel disease 220 1,653 (251 ± 32) 15.0 0.52 (0.38 ± 0.03)
   6.9% (1.1% ± 0.1%) <1.0e-16
   Metabolite levels 95 366 (44 ± 10) 7.7 0.50 (0.34 ± 0.05) 8.2% (1.0% ±
   0.2%) <1.0e-16
   Prostate cancer 238 589 (300 ± 24) 5.0 0.44 (0.39 ± 0.03) 2.1% (1.1% ±
   0.1%) <1.0e-16
   [90]Open in a new tab

   Disease-gene subnetworks are extracted from the PPI network by
   retaining genes related to specific disease, e.g., breast cancer.
   #Genes and #Interactions denote the number of genes and edges in the
   subnetworks, respectively. Degree and CC denote the average degrees of
   all nodes and CCs in the subnetwork, respectively. Link density is
   defined as ratio of the number of existing interactions to its maximum
   of possible edges in the subnetwork. p-value evaluates the significance
   of interaction enrichment in the subnetwork. “(x ± y)” denotes the mean
   and standard deviation of statistics in random sampling.

   As we know, in PPI networks, proteins with similar functions tend to
   connect or interact with each other. The occurrence and development of
   disease is usually due to the abnormal function of related genes or
   proteins, which leads to the change of related signal pathways. These
   proteins usually have functional similarity or correlation. Therefore,
   genes of the same disease or similar diseases tend to connect with each
   other in the PPI network to form disease modules.

   We calculate the six statistical measures for subnetworks of other
   diseases, such as rheumatoid arthritis, cholesterol, and obesity (see
   [91]Table 1). Similar conclusions can be obtained for other diseases.
   Clearly, these diseases also have similar modular property. This again
   confirms the modular property of disease-related genes (Ghiassian et
   al., [92]2015; Xiang et al., [93]2016; Chen et al., [94]2018; Hu et
   al., [95]2018, [96]2020; Choobdar et al., [97]2019; Dwivedi et al.,
   [98]2020). This is why guilt by association can become a useful
   strategy in disease-gene prediction based on PPI networks.

Subnetworks of KEGG Pathways Related to Breast Cancer

   Moreover, we study subnetworks of KEGG pathways related to breast
   cancer. We analyze the distribution of breast cancer–related genes in
   KEGG pathways (also, see [99]Supplementary Figures 1–[100]5).

   We extract the subnetworks of the KEGG pathways from the PPI network by
   using the sets of genes of the KEGG pathways and calculate the
   statistical measures of networks for these subnetworks. [101]Table 2
   lists five KEGG pathways significantly related to breast cancer along
   with the statistical measures of the subnetworks. The results show that
   these subnetworks have similarly higher values of CC and higher link
   density than the whole network, and it has significantly more
   interactions than expected (p <1.0e-16). This means the genes/proteins
   in these KEGG pathways also tend to interact with each other, forming
   modules with higher link density than expected.

Table 2.

   Statistics of KEGG pathways related to breast cancer.
   Pathway ID Pathway Name #Matched Genes #Genes #Interactions Degree CC
   Link density p-value
   hsa04218 Cellular senescence 13 156 2,377 (1,136 ± 69) 30.5 0.65(0.52 ±
   0.02) 19.7% (9.6% ± 0.6%) <1.0e-16
   hsa05224 Breast cancer 12 147 3,169 (1,059 ± 79) 43.1 0.72(0.57 ± 0.03)
   29.5% (9.9% ± 0.7%) <1.0e-16
   hsa05226 Gastric cancer 15 149 3,042 (953 ± 74) 40.8 0.71(0.55 ± 0.03)
   27.6% (9.0% ± 0.7%) <1.0e-18
   hsa05166 Human T-cell leukemia virus 1 infection 13 217 3,872 (1,516 ±
   96) 35.7 0.63(0.49 ± 0.02) 16.5% (6.6% ± 0.4%) <1.0e-17
   hsa05218 Melanoma 9 72 1,112 (385 ± 39) 30.9 0.77(0.61 ± 0.04) 43.5%
   (15.1% ± 1.5%) <1.0e-16
   [102]Open in a new tab

   The KEGG pathways used in analysis are selected based on the number of
   matched genes between the pathways and known disease gene set.

   #Matched Genes denotes the number of common genes between gene set of
   pathway and breast cancer–related gene set; #Genes in Pathway denotes
   the number of genes in pathway; #Edges denotes the number of
   interaction in the PPI subnetwork consisting of genes in pathway;
   Degree and CC denote the average degrees of all nodes and CCs in the
   subnetwork, respectively. Link density is defined as ratio of the
   number of existing interactions to its maximum of possible edges in the
   subnetwork. p-value evaluates the significance of interaction
   enrichment in the subnetwork. “(x ± y)” denotes the mean and standard
   deviation of statistics in random sampling.

   The values of CC and link density for most KEGG pathways are higher
   than those of the above subnetwork of breast cancer–related genes (see
   [103]Tables 1, [104]2). This means the genes in the KEGG pathways are
   more modular than breast cancer–related genes. The reason may be that
   genes in these KEGG pathways are more closely related than other genes
   in functions. Moreover, we can find that there exist submodule
   structures in the subnetworks of the KEGG pathways (see [105]Figures
   2B–F). This means that there exist functional subunits in the KEGG
   pathways.

   We label known breast cancer–related genes in the subnetworks of the
   KEGG pathways. Other unlabeled genes in the KEGG pathways are also
   likely to be related to breast cancer because they are likely to
   jointly affect breast cancer–related functions. One can see that some
   subunits have more breast cancer–related genes. This means that the
   known breast cancer–related genes may be non-randomly distributed in
   the subnetworks of KEGG pathways, and some subunits in the KEGG
   pathways may be more related to breast cancer.

   Overall, the physical and functional connections between genes in the
   KEGG pathways are stronger and more reliable than others. Therefore, we
   make use of information of KEGG pathways in disease-gene prediction.

Enrichment Analysis

   To analyze the relatedness of disease-gene sets to functional units, we
   perform GO enrichment analysis and KEGG pathway enrichment analysis.
   [106]Figure 3 shows the results of GO enrichment analysis and KEGG
   pathway enrichment analysis (obtained by clusterProfiler; Yu et al.,
   [107]2012).

Figure 3.

   [108]Figure 3
   [109]Open in a new tab

   Enrichment analysis of known breast cancer–related genes: (A) GO
   enrichment analysis and (B) pathway enrichment analysis.

   According to the GO terms in [110]Figure 3, breast cancer–related genes
   are enriched in the following GO terms, e.g., “double-strand break
   repair,” “replicative senescence,” “cell aging,” “aging,” “cell cycle
   checkpoint,” “cell cycle arrest,” “gland development,” “signal
   transduction by p53 class mediator,” “mitotic cell cycle checkpoint,”
   and “protein kinase B signaling.”

   According to the KEGG pathways in [111]Figure 3, breast cancer–related
   genes are enriched in cancer-related KEGG pathways, e.g., gastric
   cancer, endometrial cancer, colorectal cancer, thyroid cancer,
   pancreatic cancer, prostate cancer, central carbon metabolism in
   cancer, proteoglycans in cancer, bladder cancer.

   BRCA gene mutations, which are commonly present in breast cancer, are
   associated with significantly increased susceptibility to tumors,
   including prostate, pancreatic, gallbladder/cholangioma, and stomach
   cancer as well as malignant melanoma. These tumors share a common
   pathogenic gene network in which the BRCA gene plays an important role
   as it is a member of the mismatch repair gene family. The prediction of
   breast cancer–related genes can discover the interaction between tumors
   and enrich the relationship network, which is of great significance for
   finding therapeutic targets for tumors.

Prediction of Breast Cancer Genes Based on PPI Network

   To evaluate the prediction performance of our algorithm, we first apply
   RCRWR to the PPI network. The results show that RCRWR significantly
   outperforms the original RWR algorithm (Wu et al., [112]2008) on the
   PPI network for the top 1, 5, and 10% lists of candidate genes (see
   [113]Figure 4). This means that the network reconstruction indeed can
   improve the PPI network so as to enhance the ability to predict breast
   cancer–related genes. Moreover, it is clear that RCRWR and RWR are
   significantly better than that in the random case.

Figure 4.

   Figure 4
   [114]Open in a new tab

   Top-k Recall (k = 1, 5, 10, 20%) of the original and improved
   algorithms in the PPI network.

Prediction of Breast Cancer Genes Based on PPI Network and KEGG Pathway

   Further, we intersect genes in the KEGG pathways with genes in the PPI
   to obtain a more reliable PPI network and then apply RCRWR to the PPI
   network. The results show that RCRWR is significantly better than the
   RWR algorithm on the PPI network for top 1, 5, 10, and 20% lists of
   candidate genes (see [115]Figure 5). This again proves that the network
   reconstruction can indeed enhance the ability to infer breast
   cancer–related genes on the PPI network. Moreover, it is clear that the
   results of RCRWR and RWR are also significantly better than in the
   random case.

Figure 5.

   Figure 5
   [116]Open in a new tab

   Top-k Recall (k = 1, 5, 10, 20%) of the original and improved
   algorithms in the PPI network with KEGG pathway (PPI_ KEGG).

   Compared with the results on the PPI network with and without KEGG
   pathway data (see [117]Figure 6), it is very clear that the prediction
   performance of both RWR and RCRWR can be enhanced due to the addition
   of information of the KEGG pathway. The information of the KEGG pathway
   is very helpful for the prediction of disease-related genes.

Figure 6.

   [118]Figure 6
   [119]Open in a new tab

   Comparison of top-k Recall (k = 1, 5, 10, 20%) in the PPI network with
   and without KEGG pathway by the (A) original algorithm and (B) improved
   algorithm.

Analysis of Candidate Genes of Breast Cancer

   Here, we use all known breast cancer–related genes as training set to
   predict candidate genes. We map breast cancer–related genes into the
   PPI network and map the KEGG pathway onto the PPI network because the
   KEGG pathway is helpful for disease-gene prediction. We perform our
   improved algorithm RCRWR in the network to score all candidate genes.
   Then, we generate a ranking list of candidate genes for breast cancer.
   The higher the ranking of genes, the more likely they are to be
   associated with breast cancer.

   We list the top 10 predicted genes in [120]Table 3, which are
   considered to be most closely associated with breast cancer according
   to the scores from prediction algorithm. To check the effectiveness of
   prediction for the candidate genes, we search the literature and try to
   find the connections between these genes and breast cancer.

Table 3.

   Predicted top 10 candidate genes for breast cancer using PPI and KEGG
   pathway.
   Gene References