Abstract

   Although numerous approaches have been proposed to discern driver from
   passenger, identification of driver genes remains a critical challenge
   in the cancer genomics field. Driver genes with low mutated frequency
   tend to be filtered in cancer research. In addition, the accumulation
   of different omics data necessitates the development of algorithmic
   frameworks for nominating putative driver genes. In this study, we
   presented a novel framework to identify driver genes through
   integrating multi-omics data such as somatic mutation, gene expression,
   and copy number alterations. We developed a computational approach to
   detect potential driver genes by virtue of their effect on their
   neighbors in network. Application to three datasets (head and neck
   squamous cell carcinoma (HNSC), thyroid carcinoma (THCA) and kidney
   renal clear cell carcinoma (KIRC)) from The Cancer Genome Atlas (TCGA),
   by comparing the Precision, Recall and F1 score, our method
   outperformed DriverNet and MUFFINN in all three datasets. In addition,
   our method was less affected by protein length compared with DriverNet.
   Lastly, our method not only identified the known cancer genes but also
   detected the potential rare drivers (PTPN6 in THCA, SRC, GRB2 and PTPN6
   in KIRC, MAPK1 and SMAD2 in HNSC).

   Keywords: driver genes, protein interaction network, integrative data

INTRODUCTION

   With the development of high throughput sequencing, amounts of cancer
   omics data have allowed us to better understand cancer biology [[26]1].
   The Cancer Genome Atlas (TCGA) project stores omics data of more than
   20 cancer types, thus allows us to study cancer driver genes (driver
   gene is a specific type of cancer gene). However, the key question is
   how to distinguish the driver genes, which confer a selective advantage
   to tumor growth, from passengers, which provide no fitness advantage to
   the tumor [[27]2]. Besides, how to subsequently integrate omics data,
   including exploit protein interaction networks to detect cancer driver
   genes remains a challenge. Computational approaches and tools have been
   developed to identify driver genes. These methods can be categorized
   into gene level and module level approaches [[28]3]. Gene level
   approaches for identifying drivers mainly rely on the hypothesis that
   driver has a more chance to be mutated across a set of tumors [[29]4,
   [30]5]. These approaches including the mutational significant in cancer
   (MuSiC) [[31]6], OncodriverCLUST [[32]7], and MuSigCV [[33]8], which
   can identify the genes that harbor significantly more mutations than
   background mutation rate. Although the gene level approaches can be
   used to distinguish driver genes from passengers, rare mutations played
   functional roles in later stages of tumor progression are failed to be
   detected [[34]9]. What's more, cells are made up of multiple molecular
   structures that form dynamic networks [[35]10]. Under a network, a
   genetic aberration may affect its connection within the network
   [[36]10]. The module level approaches using the network or pathway
   information can be effective in identifying drivers. An example is that
   Hamed and his colleges used the protein interaction network for
   identifying cancer drivers [[37]11]. DawnRank, which used PageRank
   algorithm to detect driver genes [[38]12], is an affective module level
   method. Besides, DriverNet identifies drivers by estimating their
   effect on mRNA expression [[39]13]. MUFFINN prioritizes driver genes
   based on genes and their neighbors in functional network [[40]14]. In
   addition, some approaches were developed to identify driver modules or
   pathways, such as MEMCover and MEMo [[41]15, [42]16]. Although these
   approaches are effective in detecting drivers, they have limitations.
   First, the network is built based on static network, rather being
   condition specific. Second, network edges are often not considered in
   the majority of the aforementioned approaches.

   In view of the functional relationship between gene pairs in a network
   may radically boost the detection of drivers [[43]17]. The network
   consists of nodes (which may stand for genes or proteins) and edges
   (which may present the functional links that connect them). Merid et
   al. performed a network-based algorithm to identify driver genes by
   considering the relationship between the mutation events and functional
   gene sets (FGS), the result showed a complementary to frequency-based
   driver analyses [[44]17]. However, their method did not consider gene
   expression data. In a biological network, an interacting gene pair
   tends to present positive or negative correlations which are reflected
   by the gene expression data. Alteration of these gene correlations
   indicates system's condition (such as normal or disease state)
   [[45]18]. Therefore, differentially correlated gene pairs can
   distinguish tumor and normal sample [[46]19]. Thus, candidate potential
   cancer genes can be detected by studying dynamic regulation between
   genes, which may improve the detection of driver genes [[47]17].

   In this paper, we presented a network-based approach which integrated
   gene expression, mutation data, and PPI (protein-protein interaction)
   data to distinguish between driver and passenger genes. Our approach
   built on a hypothesis that a driver gene can be determined by its
   neighbors. We firstly built a relationship network among the DCGs
   (differentially coexpressed genes), functional genes, and then
   calculated the impact of DCGs by weighting the relationship between the
   DCGs and its connective functional genes on a bipartite graph. Finally,
   we combined the mutation information to improve the effect of screening
   drivers. In order to evaluate the performance of our approach, we
   applied it to three datasets (KIRC (kidney renal clear cell carcinoma),
   THCA (thyroid carcinoma), and HNSC (head and neck squamous cell
   carcinoma)) to identify driver genes. We detected some potential rare
   drivers that previously could not be identified by DriverNet such as
   SMAD2 in HNSC [[48]13]. Besides, we also detected some known cancer
   driver genes in each cancer dataset. All in all, our computational
   method is effective to detect potential cancer drivers to improve
   cancer-specific therapeutic targets.

RESULTS

Performance comparison

   In order to assess the performance of our method's ability to detect
   known driver genes, DriverNet and DawnRank and MUFFINN, and
   frequency-based methods were used to be compared with our method in CGC
   (the Cancer Gene Cense database) and driver gene list defined by 20/20
   rules [[49]20] as benchmark of known drivers. We performed the
   comparisons as follows: we used the same datasets to perform the
   DriverNet, DawnRank, MUFFINN, and frequency-based method and our
   method, respectively. We input these datasets into DriverNet, DawnRank,
   and MUFFINN. Then ran the program with the default settings, and we ran
   our method with the settings mentioned above. We calculated the DCGs
   Z-score and the mutated gene Z-score respectively. Combining the two
   scores and using the total score as the driver gene score. In CGC
   benchmark dataset, to evaluate the comparison, we used the three
   measures (Precision, Recall, and F1 score) mentioned in the Method
   Section. Based on these measures, our approach showed a better
   performance than MUFFINN and DriverNet and frequency-based method. We
   first evaluated the performance of our method. In Figure [50]1,
   Precision, Recall and F1 score curves of our method are both higher
   than those with DriverNet, MUFFIN and frequency-based method, but
   slightly worse than those with DawnRank method in THCA and HNSC
   datasets. Although DriverNet performed comparably in ranking the top 5
   genes in THCA and KIRC, it has poorer performance in all driver genes.
   A potential explanation of the difference may lie in the CGC is not
   cancer-specific and the cancer gene listed in CGC is not complete. In
   Table [51]1, we can observe LYN is not a CGC gene, while they have been
   reported to have an association with THCA [[52]21].

Figure 1. A comparison of the precision, recall, and F1 score for the top
ranking genes in our method and DriverNet.

   Figure 1
   [53]Open in a new tab

   The X-axis represents the number of top ranking genes involved in the
   precision, recall, and F1 score calculation. The Y-axis represents the
   score of the given metric.

Table 1. Top 5 cancer–associated driver genes in each three cancer type.

   Ranking Gene Driver gene score Annotation The number of mutated gene
   THCA 1 EGFR 11.58 Sanger cancer gene 3
   2 EP300 11.20 Sanger cancer gene 3
   3 NRAS 9.97 Sanger cancer gene 31
   4 LYN 9.78 3
   5 PTPN11 9.42 Sanger cancer gene 11
   HNSC 1 TP53 39.63906 Sanger cancer gene 172
   2 PIK3CA 19.19902 Sanger cancer gene 100
   3 EGFR 16.93629 Sanger cancer gene 40
   4 EP300 16.62027 Sanger cancer gene 22
   5 FADD 15.40049 81
   KIRC 1 PBRM1 41.84646 Sanger cancer gene 138
   2 SETD2 14.98057 Sanger cancer gene 51
   3 BAP1 11.83806 Sanger cancer gene 42
   4 SRC 11.66549 2
   5 EP300 11.39506 Sanger cancer gene 6
   [54]Open in a new tab

   In benchmarking 20/20 rule dataset, we used the top ranked driver genes
   (top 89, 100, and 100 driver genes for THCA, HNSC, and KIRC,
   respectively (see [55]Supplementary Table 1) to compare with other
   methods, which is shown in Figure [56]2. It can be seen that our method
   outperformed the other four methods on the top 100 genes in HNSC and
   KIRC dataset. In THCA, our method has a remarkably better performance
   than frequency-based method, MUFFIN, and DriverNet, but slightly worse
   than DawnRank. However, in the top 22 gene list, our method presents
   advantage than DawnRank.

Figure 2. Cumulative numbers of retrieved cancer genes annotated by 20/20
rule within top 25, 50, 75, and 100 of HNSC and KIRC, top 22, 44, 66, and 89
of THCA using four different methods.

   Figure 2
   [57]Open in a new tab

   For THCA, our method identified 89 genes (driver gene score ≥2) which
   includes 28 genes found in CGC. 11 genes found in CGC (NRAS, HRAS,
   PTPN, PTEN, RB1, EP300, ATM, PIK3R1, TP53, HSP90AA1, PML) were also
   among the driver genes nominated by DriverNet approach. TG
   (thyroglobulin) is the 13th-ranked driver in our method, and it is
   altered in 20/435 THCA samples. TG plays a role in the pathogenesis of
   papillary thyroid carcinoma and its malignant evolution [[58]22].
   However, it ranked 478 and was not detected by DriverNet as top ranking
   drivers.

   For KIRC, DriverNet identified 100 driver genes, of which 21 driver
   genes were found in CGC. Our method identified 127genes (driver gene
   score ≥2), 52 genes out of them were found in CGC. 12 genes found in
   CGC (PTEN, TP53, EGFR, EP300, SMARCA4, ATM, CREBBP, APC, NCOR1, XPO1,
   AKT1, HSP90AA1) were also among the driver genes nominated by
   DriverNet. We detected PBRM1 as the first ranked gene (mutated in 138
   cases) [[59]23], whereas it was not detected by DriverNet.

   For HNSC, DriverNet identified 202 driver genes, of which 23 were found
   in CGC. Our method identified 202 genes (driver gene score≥2), among
   them, 49 genes were included in CGC. 17 genes (TP53, PIK3CA, EGFR,
   EP300, CREBBP, NOTCH1, SMAD4, FGFR1, HRAS, CASP8, NFE2L2, MDM2, RAC1,
   HSP90AA1,AKT1, PLCG1, and CDH1) found in CGC were also among the driver
   genes nominated by DriverNet. TP53, PIK3CA, HRAS, and EGFR are known
   oncogenic drivers and it indicates that the efficiency of integrating
   multi-omics data to detect drivers. In addition, STAT3 mediates the
   cell cycle, regulates apoptosis, which has been reported to be
   constitutively active in HNSC [[60]24]. However, STAT3 which was ranked
   27th was not found by DriverNet.

   Our method is less affected by noise than DriverNet. An illustration is
   that TTN was ranked 23th in KIRC and 18th in HNSC by DriverNet. TTN is
   the longest gene in the human genome, which has a higher mutation rate
   and a potential to be artifacts [[61]25]. However, TTN was not detected
   as driver genes in any three cancers according to our method.

Infrequent (rare) driver mutations identified in three cancer types (THCA,
KIRC, and HNSC)

   In this section, we validated the ability of detecting rare drivers by
   our method. We adopted the three criteria to identify rare driver
   genes. Firstly, only the top 30 of driver genes in various samples were
   considered as drivers. Secondly, the alteration frequency should be
   lower than 2% in tumor samples. And finally the gene should not be
   reported in CGC [[62]12].

   In THCA, we found some novel rare driver genes. Of them, PTPN6 is the
   most promising. PTPN6 has been described as a tumor suppressor gene
   [[63]26]. PTPN6 participates in several cancer related pathways,
   including adherens junction, T cell receptor signaling pathway, B cell
   receptor signaling pathway, and Jak-STAT signaling pathway (see
   [64]Supplementary Table 2). PTPN6 is the 11th-ranked driver in our
   method, and it is altered in 0.23% THCA samples.

   In KIRC, there were three candidate novel drivers including SRC, GRB2
   and PTPN6. SRC is 5th-ranked driver in our method, and it is altered in
   0.48% KIRC samples. SRC is human proto-oncogene, which was reported as
   a novel therapeutic target in renal cell carcinoma [[65]27]. GRb2 (The
   adapter protein growth factor receptor-bound 2), a scaffolding adaptor
   protein, has recently been involved in a critical crosstalk between RTK
   signals and the intracellular signals [[66]28]. The enrichment analysis
   shows that GRb2 participates in multiple cancer related pathway, such
   as chemokine signaling pathway, ErbB signaling pathway, MAPK signaling
   pathway, Jak-STAT signaling pathway (see [67]Supplementary Table 2).
   There is a significant association between the protein tyrosine
   phosphatase PTPN6 (SHP-1) and GRB2 expression, which may amplify
   tyrosine kinase signaling in human breast cancer [[68]29].
   Nevertheless, direct demonstration of the relationship between the
   PTPN6 (SHP-1) and GRB2 in KIRC has been reported. In the cluster 1 of
   the Figure [69]3, we can observe that PTPN6 and GRB2 connect with each
   other. GRB2 is the 6th-ranked driver in our method, and it is altered
   in 0.24% KIRC samples. PTPN6 is the 12th-ranked driver in our method,
   and it is altered in 0.72% HNSC samples. These findings suggest
   potential crosstalk between mutant PTPN6 and GRB2.

Figure 3. The gene modules identified using the top 50 genes and their
corresponding interaction partners in KIRC.

   Figure 3
   [70]Open in a new tab

   Genes in green ellipse represent the detected driver gene, while genes
   in blue ellipse represent the driver genes’ interaction partners.

   In HNSC, two potential novel drivers we found are MAPK1 and SMAD2.
   SMAD2 is 28th-ranked driver in our method, and it is altered in 0.59%
   HNSC samples. The algorithm also identified a well-established tumor
   suppressor gene SMAD2. According to the pathway enrichment analysis,
   SMAD2 is involved in TGF-beta signaling pathway, Cell cycle, and Wnt
   signaling pathway (see [71]Supplementary Table 2). In addition, SMAD2
   mutations in human head and neck cancer have been reported [[72]30].
   Additionally, we identified one MAP kinase MAPK1, which is 19th-ranked
   driver in our method, and it is altered in 0.79% HNSC samples. We can
   observe that MAPK1 is involved in multiple pathways (see
   [73]Supplementary Table 2). In a recent study, MAPK1 (p38) mediates
   epithelial-mesenchymal transition to drive HNSC metastasis [[74]31].

Confirmation of cancer genes

   In total, we found 89, 127, and 202 drivers in THCA, KIRC, and HNSC
   respectively (see [75]Supplementary Table 1). The identified driver
   genes were overlapped with CGC and shown in Figure [76]4. In THCA, it
   can be seen that 28 driver genes out of top 89 driver genes are known
   driver genes in CGC (p-value < 2.2e-16). In HNSC, of these top 202
   driver genes, 49 driver genes are in CGC (p-value < 2.2e-16). In KIRC,
   of these top 127 driver genes, 52 are identified in CGC. Our result
   indicates that the detected driver genes are enriched among known
   cancer related genes and cannot be selected randomly.

Figure 4. Overlap of the known cancer genes identified for three cancers.

   Figure 4
   [77]Open in a new tab

   Venn diagram represents the overlap between each cancer-specific driver
   genes and CGC. 572 known cancer genes were obtained from the CGC
   database, 49, 28, 52 of which appear in HNSC drivers, THCA drivers, and
   KIRC drivers respectively.

   We examined the top 5 ranked in THCA, HNSC, KIRC respectively (Table
   [78]1). In addition, 12 genes have been functionally linked to cancer
   in multiple reports.

   In THCA, EGFR is especially intriguing. It is a member of the protein
   kinase superfamily, and ranked first in the predicted THCA driver
   genes. EGFR mediated downstream signal transduction and was
   overexpressed in an aplastic thyroid cancer cell lines, rendering this
   receptor a potential target for molecular therapy [[79]32]. The
   third-ranked gene, NRAS, encodes membrane-associated proteins that play
   a vital role in the transduction of signals [[80]33], which has been
   reported in thyroid cancer [[81]34]. The EP300 gene encodes p300, which
   is significant in the processes of cell proliferation and
   differentiation [[82]35]. It has altered protein expression in thyroid
   cancer [[83]36]. LYN has been previously reported in thyroid cancer
   [[84]21]. PTPN11 encodes protein-tyrosine phosphatase SHP2, and their
   domains are involved in cellular signaling. Besides, PTPN11 was
   significantly increase expressed in human thyroid carcinoma [[85]37].
   All told, all the top 5 predicted driver genes have evidenced in the
   literature for their roles in THCA.

   In HNSC, TP53 was the top ranked driver in our analysis. Three (TP53,
   PIK3CA, and EGFR) of the top 5 genes were previously known HNSC genes
   [[86]38]. Daniel Martin and his colleges suggested that EP300 genomic
   alternations may promote HNSC initiation and progression [[87]39]. It
   is noted that FADD was ranked 5th in our result. FADD can recruit other
   proteins to active NFκB and MAPK pathways [[88]40]. It was
   overexpressed and considered to be a driver gene in HNSC [[89]41].

   In KIRC, PBRM1 was the top ranked driver in our analysis. PBRM1, SETD2,
   and BAP1 were previously known HNSC genes and recently found to be
   altered in clear cell renal cell carcinoma [[90]23, [91]42]. They are
   highly mutated (Table [92]1). SRC was identified as a novel rare gene.
   It is one of the markers for low-grade renal cell carcinoma [[93]43].
   EP300 is ranked 5th (Table [94]1) and we observed that it is mutated in
   6 KIRC samples. It has been reported that EP300 behaves as a classical
   tumor-suppressor gene in human cancers [[95]35].

   EP300 was identified as a top gene in three datasets simultaneously
   (Table [96]1). It was 25th-ranked, 14th-ranked, and 9th-ranked in THCA,
   HNSC, and KIRC respectively by DawnRank method. In DriverNet, it seems
   the same situation. EP300 was 16th-ranked, 15th-ranked, 19th-ranked in
   THCA, HNSC, and KIRC respectively by DriverNet method. These results
   further suggest that the important roles of EP300 in cancers.

DISCUSSION

   In recent years, various computational approaches and tools have been
   developed to identify drivers, however, there are some limitations in
   detecting driver genes, for example, DriverNet has a bias toward long
   mutated genes. In two benchmark datasets, our method outperforms
   DriverNet, MUFFINN and frequency-based method in all three cancer
   types, although the performance of DawnRank is slightly higher than
   that of our approach in one of three databases. Our method is simple
   and parameter free. Meanwhile, the result indicates that our pipeline
   shows its ability to detect driver genes, even rare driver genes. Two
   potential reasons may contribute to the result. First, we consider the
   dynamic of network. Second, we construct a bipartite graph, which
   represents the relationship between the DCGs and the functional genes
   impacted by them in network. If a DCG impacts the more functional
   genes, the more likely it may be the driver. Our method has the
   following advantages. Firstly, this approach detects common and known
   drivers with a better performance than previous method. Secondly, it
   can filter out long genes with a higher mutation rate, such as the TTN
   gene which can't contribute to cancer. Thirdly, it can find some
   potential co-expression gene pairs, for instance, PTPN6 and GRB2 in
   KIRC. Our benchmarking analysis suggests that our algorithm is robust
   to noise and works well in three TCGA cancer types, making it general
   to different cancer types if the mutation data and gene expression data
   are available. In essence, the approach demands information on the
   context of the DCG (differentially coexpressed gene) of interest,
   functional gene set that constitute known cancer related pathways, and
   the connections between genes in the global network. There are also,
   however, limitations. One of the limitations is that HPRD is not large
   enough, DCGs or functional genes not in the PPI were filtered out and
   it may ignore some candidates. This can be improved with the growth of
   the database of HPRD in the future. Another limitation is that the
   network used in our method is not patient-specific or cancer-specific.
   Therefore, perturbations specific to patient may be obscured by this
   pipeline. The last limitation is that our approach detects potential
   drivers rely on the common effect of DCGs altering the functional gene.
   However, this may not be the case for all drivers. As more and more
   efforts are being devoted into understanding cancer genomes, we expect
   that our method's ability to detect drivers would also improve.

   Taken together, we developed a practical analysis pipeline to predict
   potential driver genes in cancer. Although this study focuses on THCA,
   KIRC and HNSC, the method is broadly applicable to any other cancer
   types for which mutation and expression data are available. In
   addition, our results demonstrate the efficiency of integrative
   analysis across three cancer types, not only the known cancer genes
   were identified, but also the potential rare drivers were detected. In
   future, we will combine the gene expression, copy number variation, and
   methylation data to construct a heterogeneous network. In addition, we
   will apply machine learning method to improve the performance. We
   expect this approach can generalize well to perform the future studies,
   including determine the optimal treatment tactics for each patient
   through integrating patient-specific omics data.

MATERIALS AND METHODS

Datasets and pre-processing

   The RNASeqV2 data (level three), gene mutation data (level two) and CNV
   data were downloaded from TCGA data portal. RNAseq expression levels,
   available as RSEM (RNAseq by Expectation Maximization) were transformed
   to log2 (RSEM+1). The GISTIC (version 2) was applied to the DNA copy
   number data. The information of three cancer types used in our method
   was provided in Table [97]2. For gene expression dataset, NA values
   were replaced by mean value.

Table 2. Overview of the number of samples for three cancer types with gene
expression and mutation data.

   Cancer type Tumor expression samples Normal expression samples Somatic
   mutation samples
   KIRC 534 72 417
   HNSC 522 44 509
   THCA 513 59 435
   [98]Open in a new tab

Mutation matrix

   Mutation matrix combined somatic mutation data and CNV data by
   extracting genes from deleted and amplified fragments in CNV data. The
   common samples between the mutation data and CNV data were retained.
   Mutation matrix (i, j) is a binary matrix where M (i, j) =1 indicates
   sample j have a gene i mutated and M (i, j) =0 indicates sample j don't
   have a gene i mutated.

Differentially coexpressed genes

   Differential co-expression analysis is designed to examine the
   alternation in gene expression correlation between the tumor samples
   and the normal samples, which is developed as a complementary approach
   to traditional differential expression analysis [[99]44]. DCGs were
   obtained by using Differential coexpression profile (DCp)function in
   DCGL (differentially coexpressed genes and links) package [[100]44]. We
   used the Pearson Correlation Coefficient (PCC) to measure the
   relationships between the expression profiles of all gene pairs and
   calculated the false discovery rate (FDR) by Benjamini–Hochberg method
   to adjust the raw p values [[101]45]. Gene with threshold of FDR less
   than 0.25 were selected as DCGs [[102]46].

Network construction and functional gene sets

   The network is an undirected graph G (V, E) where V stands for the
   genes and edges (i, j) E are weighted by PCC. Protein interaction
   network was sourced from HPRD ([103]http://www.hprd.org) database and
   protein self-interactions were removed, resulting in 39240 interactions
   among 9616 proteins. Functional gene sets (FGS) were obtained from
   [[104]47], which contain all of the KEGG pathways [[105]48] and 15 GO
   terms [[106]49], which could be related to hallmarks of tumor
   [[107]50]. Then a matrix was used to leverage to DCGs to their
   consequent effect on functional gene. The associations between DCGs and
   functional genes were built using a bipartite graph where left nodes
   represent DCGs and right nodes stand for functional genes. We formulate
   the network with DCGs = {g[1],g[2],…g[n]}, functional genes =
   {g[1],g[2],…g[m]}. Nodes g[n] in the left partition and nodes g[m] in
   the right interact have an edge, if g[g] is DCG, g[m] is a member of
   functional gene set, and g[n] and g[m] interact according to known PPI
   network.

Details of our algorithm

   In this study, we developed a pipeline to identify drivers. Figure
   [108]5 shows the schematic overview of approaches used in our study.
   Firstly, using the DCp function in DCGL package, we picked out DCGs. To
   improve statistical confidence, DCGs must have FDR<0.25. Secondly, we
   computed the DCGs score (Z-score) as:
   [MATH:
   <mrow><mtext>z</mtext><mo>=</mo><mfrac><mrow><msub><mi>d</mi><mrow><mte
   xt>AF</mtext></mrow></msub><mo>−</mo><msub><mi>μ</mi><mrow><mtext>AF</m
   text></mrow></msub></mrow><mrow><msub><mi>σ</mi><mrow><mtext>AF</mtext>
   </mrow></msub></mrow></mfrac><mo>,</mo></mrow> :MATH]

Figure 5. Identification of cancer related genes based on protein-protein
interactions (PPIs).

   Figure 5
   [109]Open in a new tab

   where d[AF] is the total score of weighted network between genes in the
   DCGs and the FGS, μ[AF] is the expected mean of d[AF], and σ[AF] is the
   standard deviation of d[AF]. Thirdly, in order to improve the accuracy
   of detecting driver genes, mutation information was combined.
   Therefore, we calculated the number of each mutated gene in mutation
   matrix. We normalized it and got the mutated gene Z-score. Finally, the
   driver gene score was assigned by summing the corresponding DCG Z-score
   and mutated gene Z-score. We only considered genes with driver gene
   score ≥2 as potential driver genes.

Performance benchmarking analysis of our method

   In order to metric the efficiency of our results, we took CGC
   ([110]http://cancer.sanger.ac.uk/cancergenome/projects/census/) as a
   benchmark to evaluate the effect of our method. In practice, the gold
   standard of known drivers is impractical in the absence of ground
   truth. However, well-studied CGC provides an approximate benchmark of
   known drivers [[111]12, [112]13]. Besides, we consider additional
   dataset to assess the method. As defined by the 20/20 rule, only 138
   driver genes have been discovered to date. Both of these datasets were
   used to assess the accuracy of our method. We compared our method with
   the DriverNet, DawnRank, frequency-based method and MUFFINN method.

   In accordance with both approaches mentioned above, we used the same
   datasets to perform the analysis and restricted the comparisons to
   three tumor types only (HNSC, KIRC, and THCA). We adopted three
   measures (Precision, Recall and F1 score) as follows:
   [MATH:
   <mrow><mtext>Precision</mtext><mo>=</mo><mfrac><mrow><mrow><mo>(</mo><m
   row><mtext>#Mutated genes in CGC</mtext></mrow><mo>)</mo></mrow><mo>∩</
   mo><mrow><mo>(</mo><mrow><mtext>#Genes found in our method</mtext></mro
   w><mo>)</mo></mrow></mrow><mrow><mrow><mo>(</mo><mrow><mo>#</mo><mtext>
   Genes found in our method</mtext></mrow><mo>)</mo></mrow></mrow></mfrac
   ></mrow> :MATH]
   [MATH:
   <mrow><mtext>Recall</mtext><mo>=</mo><mfrac><mrow><mrow><mo>(</mo><mrow
   ><mo>#</mo><mtext>Mutated genes in CGC</mtext></mrow><mo>)</mo></mrow><
   mo>∩</mo><mrow><mo>(</mo><mrow><mo>#</mo><mtext>Genes found in our meth
   od</mtext></mrow><mo>)</mo></mrow></mrow><mrow><mrow><mo>(</mo><mrow><m
   o>#</mo><mtext>Mutated genes in CGC</mtext></mrow><mo>)</mo></mrow></mr
   ow></mfrac><mo>,</mo></mrow> :MATH]
   [MATH:
   <mrow><mtext>F1 score</mtext><mo>=</mo><mn>2</mn><mo>×</mo><mfrac><mrow
   ><mtext>Precision</mtext><mo>×</mo><mtext>Recall</mtext></mrow><mrow><m
   text>Precision</mtext><mo>+</mo><mtext>Recall</mtext></mrow></mfrac></m
   row><mo>.</mo> :MATH]

Significance estimation of the potential driver genes

   In order to evaluate the significance of the identified driver genes,
   we performed a hypergeometric test to calculate the probability of a
   random overlap:
   [MATH:
   <mrow><mtext>P</mtext><mrow><mo>(</mo><mrow><mtext>X</mtext><mo>≥</mo><
   mtext>x</mtext></mrow><mo>)</mo></mrow><mo>=</mo><mn>1</mn><mo>−</mo><m
   underover><mstyle><mo>∑</mo></mstyle><mrow><mtext>k</mtext><mo>=</mo><m
   n>0</mn></mrow><mrow><mtext>x</mtext><mo>−</mo><mn>1</mn></mrow></munde
   rover><mfrac><mrow><mrow><mo>(</mo><mrow><mtable><mtr><mtd><mtext>M</mt
   ext></mtd></mtr><mtr><mtd><mtext>k</mtext></mtd></mtr></mtable></mrow><
   mo>)</mo></mrow><mrow><mo>(</mo><mrow><mtable><mtr><mtd><mrow><mtext>N<
   /mtext><mo>−</mo><mtext>M</mtext></mrow></mtd></mtr><mtr><mtd><mrow><mt
   ext>n</mtext><mo>−</mo><mtext>k</mtext></mrow></mtd></mtr></mtable></mr
   ow><mo>)</mo></mrow></mrow><mrow><mrow><mo>(</mo><mrow><mtable><mtr><mt
   d><mtext>N</mtext></mtd></mtr><mtr><mtd><mtext>n</mtext></mtd></mtr></m
   table></mrow><mo>)</mo></mrow></mrow></mfrac><mo>,</mo></mrow> :MATH]

   where N is the total number of genes, M and n are the number of genes
   in two sets, and k is the number of the overlapped genes of the two
   sets.

Functional enrichment analysis of driver genes and recognition modules

   In order to annotate driver genes detected in our result, we used the
   online DAVID [[113]51] website and observed significant enrichment of
   these genes in the term of KEGG pathway. Briefly, KEGG pathway terms
   were annotated to statistical significance in the gene set. Enrichment
   was calculated through the hyper-geometric test using a FDR less than
   0.05. Molecular Complex Detection (MCODE) [[114]52] that detects
   densely connected regions in large protein interaction networks were
   used to recognize modules. MCODE weights all nodes depended on their
   local network density by setting the highest k-core of the vertex
   neighborhood. We set the highest k-core is 2.

SUPPLEMENTARY MATERIALS FIGURES AND TABLES

   [115]oncotarget-08-58050-s001.pdf^ (1,006KB, pdf)
   [116]oncotarget-08-58050-s002.xls^ (57.5KB, xls)
   [117]oncotarget-08-58050-s003.xls^ (57.5KB, xls)

Acknowledgments