Abstract
Background
Cancer, a disease of genomic alteration, claims many lives each year.
The biggest challenge in overcoming cancer is to identify the driver
genes that promote cancer development from among a huge number of
passenger mutations that have no effect on the selective growth
advantage of cancer cells. To address this problem, some researchers
have started to identify driver genes by integrating networks with
other biological information. However, more effort is needed to
improve prediction performance.
Methods
Driver genes affect the expression of their downstream genes, they
tend to interact with each other to form functional modules, and those
modules tend to be expressed similarly in the same tissue. Based on
these observations, we propose a novel model named DyTidriver that
identifies driver genes by integrating dysregulated gene expression,
tissue-specific expression and variation frequency into the human
functional interaction network (human FIN).
Results
The method was applied to 974 breast, 316 prostate and 230 lung cancer
patients. The results show that our method outperformed five existing
methods in terms of Fscore, Precision and Recall. The enrichment and
cociter analyses illustrate that DyTidriver not only identifies driver
genes enriched in significant pathways but is also able to identify
previously unknown driver genes.
Conclusion
The final results imply that driver genes are those that affect more
dysregulated genes and are expressed similarly in the same tissue.
Keywords: Driver genes, Dysregulated expression, Tissue-specific
expression, Human functional interaction network, Variation frequency
Background
Cancer, a disease of genomic alteration, claims many lives each year
[1–3]. It is acknowledged that cancer arises from the accumulation of
mutations in a subgroup of genes, which confers a growth advantage,
allows uncontrolled proliferation and enables the avoidance of
apoptosis [4, 5]. With the development of next-generation sequencing
technology, several large-scale cancer projects, such as The Cancer
Genome Atlas (TCGA) [6] and the International Cancer Genome Consortium
(ICGC) [7], have generated a large amount of cancer genomic data,
which enables the detection of thousands of mutations. However, not
all mutations contribute to cancer initiation and progression.
Mutations that are important to cancer development and provide a
selective growth advantage are called driver mutations; the rest are
termed passenger mutations [8, 9]. Studies show that passenger
mutations far outnumber driver mutations [9]. For example, across 11
cancer types, only 2 to 6 of 200 somatic mutations were regarded as
driver mutations, where the somatic mutations include missense,
nonsense, silent, non-coding, splice-site and non-stop mutations,
frameshift insertions and deletions (indels) and in-frame indels
[9–12]. Moreover, these important alterations are not uniformly
distributed across the genome but target specific genes associated
with important cellular functions such as cell survival and cell fate
[4, 13–15]. For example, the well-known tumor suppressor TP53
participates in defense mechanisms against cancer, and its
inactivation by alteration can increase the selective growth advantage
of the cell [16]. Alterations of ERBB2 [17] and KRAS [18] can lead to
the acquisition of new properties that provide a selective growth
advantage or the ability to spread to remote organs. Hence, the
biggest challenge in overcoming cancer is to precisely discriminate
driver genes, which harbor driver mutations and can promote cancer
development, from irrelevant passenger genes [11]. This
discrimination is essential for understanding tumor biology and
designing precision therapies [4, 19].
Traditional methods for identifying cancer driver genes are based on
the assumption that driver mutations confer a selective advantage on
tumor growth and therefore occur more frequently than expected by
chance [20]. Methods of this kind, such as MutSig [21] and MuSiC [22],
successfully pinpoint some recurrently mutated genes. In fact,
however, only a small number of genes are altered in a high percentage
of patients, while a much larger number of genes are altered
infrequently [11]. Moreover, because of the heterogeneity of cancer,
it is difficult to estimate the background mutation rate properly,
which may introduce many errors [23].
A promising alternative is to identify cancer driver genes from
networks, since it is acknowledged that cancer genes are closely
related to each other within a group that performs a certain function
[24]. HotNet [25] and HotNet2 [26] apply a propagation process that
diffuses mutation-frequency scores through the whole gene-gene
interaction network and extract significantly mutated subnetworks to
identify driver genes. NBS [27] detects driver genes with a strategy
similar to that of HotNet, but it detects the mutated subnetworks of
each patient and uses a consensus clustering framework to merge
subnetworks across all patients. Unlike these methods, which use
global network information, MUFFINN [28] prioritizes cancer driver
genes by measuring the impact of all neighbors of mutated genes in the
functional network. Although these network-based methods brought a new
focus on the interactions among cancer driver genes, most of them
consider only the patient-gene mutation profiles and the topology of
the network. In addition, they rely heavily on the known network,
which may introduce false positives [23].
To overcome these limitations, some researchers combine the functional
interactions of cancer genes with other biological properties to
improve the precision of detecting cancer driver genes. For example,
DriverNet [29] identifies cancer driver genes by estimating their
effect on mRNA expression. Based on the rationale that cancer driver
genes may be determined by their impact on the expression of
downstream genes, DriverNet first identifies the downstream genes with
significantly differential expression (called outlying genes) and then
constructs a bipartite graph in which one side contains the mutated
genes and the other side contains the outlying genes. It selects the
driver genes that connect to the most nodes on the outlying-gene side.
Shi et al. [30] further improved DriverNet by introducing a diffusion
process on the bipartite graph. DawnRank [31] ranks potential cancer
driver genes based on both their own differential expression and their
impact on the overall differential expression of downstream genes in
the molecular interaction network. LNDriver [24] is also built on the
bipartite graph but additionally uses DNA length to filter mutated
genes in the first step.
The bipartite-graph-based methods mentioned above improve, to some
degree, the accuracy of identifying cancer driver genes by adding
biological profiles of the genes themselves. However, the reliability
of the network still needs further improvement, since most known
networks are built from large-scale computational data, experimental
data, or a mix of both. This may directly affect the efficiency and
precision of detecting novel driver genes [23]. Hence, the fundamental
problem is to establish a model that improves the reliability of the
network and thereby the power of prediction. To achieve this, some
researchers incorporate specific biological profiles, such as
differential expression information, to assign a weight to each
interaction [32]. However, few of them considered the facts that the
majority of cancer genes interact with each other to form functional
modules and that those modules tend to be expressed similarly in the
same tissue. Ganegoda et al. [33] used tissue-specific data to predict
new disease-gene associations by measuring gene expression in
disease-related tissues and achieved higher performance. In addition,
previous studies found that, for a given disease, genetic disorders
tend to manifest in only a single tissue or a few tissues [34].
Motivated by these findings, we refine the gene functional interaction
network by considering the expression similarity between each pair of
mutated genes in the one or two tissues most related to the cancer.
Moreover, previous research has shown that cancer driver genes are
more likely to be frequently mutated across a cohort of patients and
to dysregulate the expression of downstream genes.
Based on the facts mentioned above, we propose a model called
DyTidriver that predicts cancer driver genes by integrating
dysregulated expression profiles, tissue-specific expression profiles,
the modularity of mutated genes and variation frequency into the gene
functional interaction network. In DyTidriver, because cancer driver
genes are likely to dysregulate the expression of downstream genes,
mutated genes are first filtered according to their impact on the
expression of downstream genes. Next, because the majority of cancer
driver genes interact with each other to form functional modules and
those modules tend to be expressed similarly in the same tissue, the
interaction network of the mutated genes is weighted by considering
gene-gene co-expression in the specific tissues of each query disease
and the relationships between mutated genes. Finally, because driver
genes are more likely to be frequently mutated across a cohort of
patients and to interact with each other to form functional modules,
the mutated genes are ranked by summing the scores of each gene's
edges in the weighted graph and multiplying the sum by the gene's
variation frequency. We applied our method to detect cancer driver
genes in lung cancer, breast cancer and prostate cancer. The results
show that our method significantly outperforms five existing methods
[28–31] in terms of Fscore, Precision and Recall. In addition, the
cociter analysis illustrates that our method not only identifies
well-known cancer driver genes but also detects unknown cancer driver
genes with a high co-occurrence ratio in publications. Furthermore,
the identified cancer driver genes are enriched in significant
pathways and biological functions.
Methods
Our method consists of four steps (see Fig. 1). First, we filtered the
mutated genes of each patient according to whether they influenced the
expression of downstream genes. Only the mutated genes that
dysregulate the expression of downstream genes were included in our
study. Second, the remaining mutated genes of all patients were mapped
to the human functional interaction network (human FIN) to construct
the Mut-Mut matrix. Third, the tissue-specific Pearson correlation
coefficient (PCC) matrix was constructed by calculating the
co-expression values of the mutated genes from tissue expression data,
which were downloaded after looking up the related tissues in the
disease-tissue matrix. Finally, we calculated the edge clustering
coefficient (ECC) of each interaction in the network built in the
previous step and assigned each mutated gene a score by first summing
the ECC values of its connected edges and then multiplying the sum by
the gene's variation frequency. The mutated genes were ranked in
descending order of score, and those at the top of the list were
considered potential cancer driver genes.
Fig. 1.
The workflow of DyTidriver. The process of cancer driver gene
identification is divided into four steps, marked 'a', 'b', 'c' and
'd'. In step 'a', we filtered the mutated genes of each patient
according to whether they influenced the expression of downstream
genes. Only the mutated genes connected to at least one outlying gene
were included in our study. The filtered mutated genes of all patients
were then mapped to the human functional interaction network to
construct the Mut-Mut matrix. Step 'b' generates the tissue-specific
PCC matrix. For each cancer, we chose the one or two tissues with the
highest association scores in the disease-tissue matrix as the
cancer-related tissues, such as tissue 1 and tissue 2 for disease D1.
For each tissue, we calculated the gene-gene Pearson correlation
values across all patients and then generated the gene-gene PCC matrix
by keeping the entries whose absolute PCC value exceeds 0.3 and
setting the rest to 0. If more than one tissue is related to a cancer,
the final tissue-specific PCC matrix is constructed by averaging the
values of the gene-gene PCC matrices of the tissues. In step 'c', we
constructed the ECC mutated matrix using the ECC equation. In the
final step 'd', we assigned each mutated gene in the network a score
by summing the ECC values of its connecting edges and multiplying the
sum by the gene's variation frequency. The mutated genes were ranked
in descending order of score, and those at the top of the list were
considered potential driver genes.
Experimental data
The datasets in this study came from three sources. The first part
comprises the somatic mutation data and the corresponding
transcriptional expression data of each patient. Both datasets were
downloaded from the TCGA website using the TCGA2STAT R package. For
our analysis, we focused on the somatic mutation and gene
transcriptional expression data of 230 lung cancer patients, 974
breast cancer patients and 331 prostate cancer patients. The
downloaded TCGA datasets include both tumor and normal samples: 58 of
the 230 lung, 110 of the 974 breast and 52 of the 331 prostate samples
are normal.
The second part of the dataset comprises the tissue-specific
expression profiles. To find the tissues most related to each cancer
type, we searched the disease-tissue matrix, which can be downloaded
from reference [34]. Each entry of the matrix represents the
association of a disease with a tissue, obtained by counting the
number of publications in which the disease and the tissue co-appear
relative to the number of publications that mention the disease or the
tissue alone. It is acknowledged that, for a given disease, genetic
disorders tend to manifest in only a single tissue or a few tissues
[34]. Hence, we chose the one or two most relevant tissues for each
cancer type. The directly related tissue can be found for most cancer
types, e.g. lung tissue for lung cancer and prostate tissue for
prostate cancer. However, breast tissue is absent from the
disease-tissue matrix, so for breast cancer we instead chose the two
tissues with the highest association scores (prostate and ovary). To
obtain the tissue-specific expression profiles, we used the Gene
Expression Omnibus (GEO) database, which is currently the largest and
best-known expression data platform and stores relatively complete
expression data. For the tissues identified as most related to each
cancer type, we downloaded the gene expression data of each tissue
sample from the GEO website by querying dataset GSE7307. This dataset
lists the transcriptional profiles of both normal and diseased human
tissues, representing over 90 distinct tissue types, measured with the
Affymetrix Human U133 Plus 2.0 array. We used the R package GEOquery
to download the corresponding tissue expression data from platform
GPL570. The downloaded data form an expression profile matrix with
genes as columns and patients as rows.
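The tissue-selection rule described in this paragraph can be sketched
as follows. This is a minimal illustration, not the authors' code: the
function name top_tissues, the nested-dictionary layout of the
disease-tissue matrix and the disease and tissue labels are all
assumptions made for the example, and no real association scores are
used.

```python
def top_tissues(association, disease, k=2):
    """Return up to k tissues with the highest association score.

    association: {disease: {tissue: score}}, a hypothetical in-memory
    form of the disease-tissue matrix (layout assumed for this sketch).
    """
    scores = association.get(disease, {})
    # Sort tissues by score, highest first, and keep the first k.
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [tissue for tissue, _ in ranked[:k]]


if __name__ == "__main__":
    # Made-up scores, labelled as in the Fig. 1 caption (disease D1).
    toy = {"D1": {"tissue_1": 0.9, "tissue_2": 0.7, "tissue_3": 0.1}}
    print(top_tissues(toy, "D1", k=2))
```

With k = 1 the function returns only the single best-matching tissue,
which corresponds to the case where a directly related tissue exists.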
The last part of the dataset is the current release (2016) of the
human functional interaction network (human FIN), which contains
12,275 genes and 460,434 edges [35]. This network is constructed by
extending curated pathways with non-curated sources of information,
including protein-protein interactions, gene co-expression, protein
domain interactions, Gene Ontology (GO) annotations and text-mined
protein interactions, and covers close to 50% of the human proteome.
The benchmark set of driver genes was downloaded from NCG 4.0, which
includes 537 known cancer genes from the Cancer Gene Census [36] and
1463 candidate cancer genes derived from the manual curation of 77
whole-genome or whole-exome cancer resequencing screens [37].
Filtering mutated genes and constructing Mut-Mut matrix
The somatic mutation data downloaded from the TCGA website record
which genes are mutated in which patients. Genes that were mutated in
at least one patient were kept and regarded as mutated genes. Previous
studies have pointed out that driver genes are likely to regulate the
expression of downstream genes [29–31]. Genes whose expression is
significantly affected are called outlying genes. To obtain the
outlying genes, we downloaded the transcriptional expression data from
the TCGA website and calculated z-scores. Specifically, a gene was
regarded as an outlying gene for a patient if its z-score in that
patient was greater than 2.0 or less than −2.0. The threshold of ±2.0
follows DriverNet [29]. We then kept the mutated genes that have at
least one connection to an outlying gene in the human FIN and filtered
out those with no such connection. Finally, the remaining mutated
genes were mapped to the human FIN to generate the binary Mut-Mut
matrix, in which the rows and columns are the remaining mutated genes
and an element is 1 if the two mutated genes are connected in the
human FIN and 0 otherwise.
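The filtering step above can be sketched in a few lines. This is a
minimal illustration under stated assumptions, not the authors'
implementation: the z-score is assumed to be computed per gene across
patients with the population standard deviation, the outlying genes
are matched per patient, and the function names and dictionary layouts
are hypothetical.

```python
from statistics import mean, pstdev


def outlying_genes(expr, threshold=2.0):
    """expr: {gene: {patient: expression}} -> {patient: set of outlying genes}.

    Assumption: z-scores are taken per gene across all patients.
    """
    outliers = {}
    for gene, by_patient in expr.items():
        values = list(by_patient.values())
        mu, sigma = mean(values), pstdev(values)
        if sigma == 0:
            continue  # constant expression: z-score undefined, never outlying
        for patient, value in by_patient.items():
            if abs((value - mu) / sigma) > threshold:
                outliers.setdefault(patient, set()).add(gene)
    return outliers


def filter_mutated(mutations, outliers, fin):
    """Keep mutated genes linked in the FIN to an outlying gene of the same patient.

    mutations: {patient: set of mutated genes}; fin: {gene: set of neighbours}.
    """
    kept = set()
    for patient, genes in mutations.items():
        patient_outliers = outliers.get(patient, set())
        for gene in genes:
            if fin.get(gene, set()) & patient_outliers:
                kept.add(gene)
    return kept


def mut_mut_matrix(kept, fin):
    """Binary matrix over the kept genes: 1 if the pair is connected in the FIN."""
    genes = sorted(kept)
    matrix = [[1 if b in fin.get(a, set()) else 0 for b in genes] for a in genes]
    return genes, matrix
```

The dictionary fin is treated as an undirected adjacency list, so each
edge should appear under both of its endpoints if a symmetric matrix is
wanted.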
Assigning weight to Mut-Mut matrix by PCC values
Since the majority of disease genes that form a common functional
module tend to be expressed similarly in the same tissue, and since
gene networks contain many false-positive connections, we use
tissue-specific expression profiles to assign weights to the gene
interactions in order to improve the reliability of the gene
interaction network. For each cancer type, we first chose the most
related tissues according to their association scores in the
disease-tissue matrix [34]. If at least one tissue is related to a
cancer in the disease-tissue matrix, the corresponding tissue
expression data across a cohort of patients can be downloaded from the
GEO website. We then calculated the gene-gene PCC values of the
downloaded tissue expression matrix across all patients and generated
the PCC matrix by keeping the entries whose absolute PCC value exceeds
0.3 and setting the rest to 0. The threshold follows previous research
[34]. Finally, the average of the PCC matrices of the related tissues
was taken as the final tissue-specific PCC matrix of the cancer type.
We weighted the entries of the Mut-Mut matrix with the tissue-specific
PCC matrix. Specifically, if mutated gene i is connected to mutated
gene j in the Mut-Mut matrix (i.e. the entry for (i, j) is 1), the PCC
value of genes i and j is assigned to the corresponding entry of the
Mut-Mut matrix; otherwise the entry is set to 0. The result is a
weighted mutated PCC matrix, denoted by W.
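The weighting step can be illustrated with the sketch below. It is a
simplified reading of the description, with several assumptions: the
stored weight is the absolute PCC value when it exceeds the cutoff
(the text does not say whether the sign is kept), genes missing from a
tissue's expression data get a weight of 0, and all function names and
data layouts are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length lists; 0.0 if either is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / sqrt(vx * vy)


def tissue_pcc(expr, genes, cutoff=0.3):
    """expr: {gene: [expression per sample]} for one tissue.

    Returns {(gene_a, gene_b): |PCC| if |PCC| > cutoff else 0.0} for a before b.
    """
    pcc = {}
    for i, a in enumerate(genes):
        for b in genes[i + 1:]:
            r = pearson(expr[a], expr[b]) if a in expr and b in expr else 0.0
            pcc[(a, b)] = abs(r) if abs(r) > cutoff else 0.0
    return pcc


def weighted_matrix(genes, mut_mut, tissue_exprs, cutoff=0.3):
    """Average the thresholded PCC over tissues and keep it only on FIN edges.

    tissue_exprs must contain at least one tissue.
    """
    per_tissue = [tissue_pcc(expr, genes, cutoff) for expr in tissue_exprs]
    n = len(genes)
    W = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if mut_mut[i][j] == 1:
                pair = (genes[i], genes[j])
                avg = sum(p[pair] for p in per_tissue) / len(per_tissue)
                W[i][j] = W[j][i] = avg
    return W
```

Passing a single tissue reproduces the one-tissue case; passing two
tissues averages the two thresholded matrices entry by entry, as
described for cancers with two related tissues.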
Calculating the mutated gene score
Previous studies have found that, in cancer, genes act together in
various signaling pathways and protein complexes [25]. Hence, to
highlight the modularity of cancer driver genes, we calculated the ECC
value of each pair of mutated genes in the weighted mutated PCC matrix
(denoted by W in Eq. 1). The ECC is commonly used to measure the
closeness of two nodes in a network and has been widely applied to
detect network modules [38–40]. A higher ECC value means that two
genes are more likely to act together in a common module. The ECC is
defined in Eq. 1. After calculating the ECC of each pair of mutated
genes, we assigned each mutated gene a score ($M_i$) by summing the
ECC values of its connecting edges (see Eq. 2). Cancer driver genes
are known to be more likely to be frequently mutated in many patients.
Hence, the final ranking score of each mutated gene was calculated by
multiplying its variation frequency by its summed score (see Eq. 3).
All mutated genes were then ranked in descending order of ranking
score; genes with a higher rank are more likely to be cancer driver
genes.
\[ ECC_{ij} = \frac{\sum_{k \in N_i \cap N_j} \left( W_{ik} + W_{jk} \right)}{\min(d_i, d_j)} \tag{1} \]
\[ M_i = \sum_{j \in N_i} ECC_{ij} \tag{2} \]
\[ F_i = V_i \cdot M_i \tag{3} \]
where W denotes the weighted mutated PCC matrix, k ranges over the
common neighbors of mutated genes i and j in W, $W_{ik}$ is the weight
between mutated genes i and k, and $d_i$ and $d_j$ are the degrees of
nodes i and j, respectively. $\min(d_i, d_j)$ represents the maximal
possible number of triangles that might include the edge (i, j).
$N_i$ is the set of all neighbors of mutated gene i. $V_i$ denotes the
variation frequency of gene i, measured as the number of times gene i
is mutated divided by the total number of patients.
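Equations 1 to 3 can be traced on a small graph with the sketch below.
It is an illustration under assumptions rather than the authors' code:
W is taken to be symmetric, a gene's neighbors and degree are taken
from the nonzero entries of W, and the function names are
hypothetical.

```python
def ecc_matrix(W):
    """Eq. 1: ECC of every edge of a symmetric weighted matrix W (list of lists)."""
    n = len(W)
    # Neighbours of i are the nodes joined to i by a nonzero weight.
    neighbours = [{j for j in range(n) if j != i and W[i][j] != 0} for i in range(n)]
    ecc = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in neighbours[i]:
            common = neighbours[i] & neighbours[j]
            shared = sum(W[i][k] + W[j][k] for k in common)
            # Both degrees are at least 1 here because i and j are adjacent.
            ecc[i][j] = shared / min(len(neighbours[i]), len(neighbours[j]))
    return ecc


def ranking_scores(W, V):
    """Eq. 2 and Eq. 3: F_i = V_i * sum of ECC over the edges of gene i."""
    ecc = ecc_matrix(W)
    return [V[i] * sum(ecc[i]) for i in range(len(W))]


def rank_genes(genes, scores):
    """Gene names sorted by descending ranking score."""
    order = sorted(range(len(genes)), key=lambda i: scores[i], reverse=True)
    return [genes[i] for i in order]
```

In a toy graph where genes 0, 1 and 2 form a triangle and gene 3 hangs
off gene 0, the pendant edge has no common neighbor, so its ECC is 0
and gene 3 receives a ranking score of 0 regardless of its variation
frequency. This shows how the score favors genes that sit inside
modules.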
Statistic evaluation metrics
To evaluate the performance of our method, the top N ranked genes were
selected as potential cancer driver genes. Prediction accuracy depends
on how well the predicted cancer driver genes match the known ones,
which was measured by three widely used metrics: Precision, Recall and
Fscore.
\[ \mathrm{Precision} = \frac{TP}{TP + FP} \]
\[ \mathrm{Recall} = \frac{TP}{TP + FN} \]
\[ \mathrm{Fscore} = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \]
where TP (true positive) is the number of predicted driver genes that
match known driver genes in the benchmark dataset, TN (true negative)
is the number of genes that are neither predicted nor known driver
genes, FP (false positive) is the number of predicted driver genes
that do not match known driver genes, and FN (false negative) is the
number of known driver genes that are not among the predicted ones.
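The three metrics follow directly from the set overlap between the top
N predictions and the benchmark. The sketch below is a minimal
illustration; the function name evaluate_top_n and the convention of
returning 0.0 when a denominator is zero are assumptions made for the
example.

```python
def evaluate_top_n(ranked_genes, known_drivers, n):
    """Precision, Recall and Fscore of the top-n genes against a benchmark set."""
    predicted = set(ranked_genes[:n])
    tp = len(predicted & known_drivers)   # predicted and known
    fp = len(predicted) - tp              # predicted but not known
    fn = len(known_drivers) - tp          # known but not predicted
    precision = tp / (tp + fp) if predicted else 0.0
    recall = tp / (tp + fn) if known_drivers else 0.0
    total = precision + recall
    fscore = 2 * precision * recall / total if total > 0 else 0.0
    return precision, recall, fscore
```

For example, if one of the top two predictions is among four benchmark
genes, Precision is 1/2, Recall is 1/4 and Fscore is 1/3.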
Enrichment analysis
We also used pathway and GO enrichment analysis to evaluate whether
the predicted cancer driver genes share common biological functions.
It is widely known that cancer is a disease of pathways and that
somatic mutations target cancer genes in a group of regulatory and
signaling networks [25]. In addition, cancer-related driver mutations
recurrently occur in the functional regions of proteins (such as
kinase domains and binding domains) and disrupt major biological
functions [41]. In this study, we used the DAVID database to perform
KEGG pathway enrichment analysis and GO enrichment analysis [42].
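To make concrete what an over-representation test measures, the
sketch below computes a plain one-sided hypergeometric p-value. It is
not a reproduction of DAVID, which reports its own modified Fisher
exact statistic; the function name and parameter names are
hypothetical and the numbers in the usage note are a toy example.

```python
from math import comb


def enrichment_p_value(hits, list_size, term_size, background_size):
    """P(X >= hits) when list_size genes are drawn without replacement from a
    background of background_size genes, term_size of which carry the term."""
    total = comb(background_size, list_size)
    upper = min(list_size, term_size)
    tail = sum(
        comb(term_size, x) * comb(background_size - term_size, list_size - x)
        for x in range(hits, upper + 1)
    )
    return tail / total
```

With a background of 10 genes, 4 of which are annotated with a term,
drawing 3 genes and finding at least 2 annotated ones has probability
40/120 = 1/3, so such an overlap would not count as enriched.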
Results
To test the effectiveness of our method, we applied it and four other
models, DriverNet [29], DawnRank [31], the Diffusion algorithm [30]
and MUFFINN [28], to breast, prostate and lung cancer to identify
their driver genes. Among them, DriverNet, DawnRank and Shi's
Diffusion algorithm use dysregulated gene expression to identify
outlying genes and construct a bipartite graph. These methods rank
mutated genes according to their connections with the outlying genes.
MUFFINN uses both the variation frequency of mutated genes and the
impact of their neighbors to compute the ranking scores. It has two
variants, Muf_max and Muf_sum, which consider the impact of the most
frequently mutated neighbor and of all direct neighbors, respectively
[28]; counting both variants gives five methods for comparison. Unlike
DriverNet, DawnRank and Shi's Diffusion algorithm, which use
dysregulated gene expression to construct a bipartite graph, our
method uses the dysregulated expression profile only to filter the
mutated genes. Like MUFFINN, we also consider the variation frequency
of mutated genes and the impact of their direct neighbors. Compared
with the other methods, however, our method not only integrates
dysregulated expression, variation frequency and the human FIN but
also considers the modularity of mutated genes and their co-expression
in the same tissue. DawnRank requires expression data from both normal
and tumor samples. From the three cancer datasets, we could download
only 110, 58 and 52 tumor samples that also have normal gene
expression profiles for breast, lung and prostate, respectively. We
set the free parameter of DawnRank to three, as recommended by the
DawnRank authors [31].
Comparing performance
All mutated genes were ranked in descending order of the scores
assigned by each method. The top K genes of each list were then
selected as candidate driver genes. Using the benchmark dataset, the
Fscore, Recall and Precision values were calculated to evaluate the
performance of each method. With K ranging from 1 to 200, the Fscore,
Recall and Precision curves were drawn. The results are shown in
Fig. 2. In general, our results are superior to those of the other
five methods on the lung, prostate and breast cancer datasets, and our
model identifies the largest number of known drivers from NCG 4.0. For
lung cancer, the curves of DyTidriver and the other methods are
intertwined when a small number of potential driver genes is
predicted, and DyTidriver is markedly better than the other methods as
the number of predicted driver genes increases from 40 to 200. For
prostate and breast cancer, our model shows the best performance over
the whole range. Like MUFFINN, our method considers the variation
frequency and the functional impact of direct neighbors; it
additionally exploits tissue-specific co-expression and modularity,
which further improves the precision of detecting driver genes. In
addition, Muf_max performs worse than Muf_sum, which suggests that it
is inappropriate to judge a driver based only on the impact of a
single gene. DawnRank performed poorly compared with the other
methods. The reason might be that only a limited number of cancer
patients have both normal and tumor expression data, as DawnRank
requires.
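The Precision, Recall and Fscore curves over K described in this
section can be computed in a single pass over a ranked list. The
sketch below is an illustration only: the function name metric_curves
is hypothetical, the ranked list is assumed to contain no duplicate
genes, and a metric with a zero denominator is reported as 0.0.

```python
def metric_curves(ranked_genes, known_drivers, k_max=200):
    """Return [(K, precision, recall, fscore)] for K = 1 .. min(k_max, len(list)).

    Assumes ranked_genes has no duplicates.
    """
    curves = []
    tp = 0  # running count of benchmark genes seen so far
    for k, gene in enumerate(ranked_genes[:k_max], start=1):
        if gene in known_drivers:
            tp += 1
        precision = tp / k
        recall = tp / len(known_drivers) if known_drivers else 0.0
        total = precision + recall
        fscore = 2 * precision * recall / total if total > 0 else 0.0
        curves.append((k, precision, recall, fscore))
    return curves
```

Running the function once per method on the same benchmark set yields
the data for one curve per method and metric, which can then be
plotted against K.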
Fig. 2.
A comparison of the Precision, Recall, and Fscore for top ranking genes
in the six methods. The X-axis represents the number of top-ranking
genes. The Y-axis represents the score of the given metric
Enrichment analysis
We selected the top 200 predicted cancer driver genes for GO and
pathway enrichment analysis. For lung cancer, in the biological
process category, the genes detected by our method are enriched in
signal transduction, intracellular signaling cascade, transcription,
metabolic process, and regulation of cell death and apoptosis, among
others. In the cellular component category, they are enriched in
plasma membrane, organelle, cytoskeleton, lumen and cell fraction,
among others. In the molecular function category, they are enriched in
ion binding, nucleotide binding, ATP binding and transcription
regulator activity, among others. In the pathway analysis, the
identified cancer driver genes are enriched in important cancer
pathways such as the calcium signaling pathway, the PI3K-Akt signaling
pathway and the mTOR signaling pathway.
For breast cancer, in the biological process category, our results are
enriched in intracellular signaling cascade, signal transduction,
regulation of transcription, metabolic process, regulation of cell
death, phosphorylation, transcription and cell proliferation. In the
cellular component category, they are enriched in plasma membrane,
organelle, lumen and cell fraction. In the molecular function
category, they are mainly enriched in nucleotide binding, ATP binding,
DNA binding, transcription regulator activity and kinase activity. In
the pathway analysis, they are enriched in the calcium signaling
pathway, the MAPK signaling pathway, the PI3K signaling pathway and
the p53 signaling pathway, among others.
For prostate cancer, in the biological process category, our results
are enriched in regulation of transcription, signal transduction,
adhesion molecules and regulation of GTPase activity, among others. In
the cellular component category, they are enriched in nucleus, plasma
membrane, cytosol, intracellular and protein complex, among others. In
the molecular function category, they are enriched in protein binding,
ATP binding, DNA binding and protein kinase activity, among others. In
the pathway analysis, they are enriched in the calcium signaling
pathway, the PI3K signaling pathway, the cAMP signaling pathway and
the mTOR signaling pathway.
Cociter analysis
Because the benchmark set of cancer driver genes is incomplete, we
further assessed the ability of our method to identify potential
cancer driver genes by literature mining: we used the cociter website
[25] to count how often each predicted driver gene is co-cited with
the keywords 'cancer', 'driver' and the cancer type (i.e. breast,
prostate or lung). The more often a gene co-appears with a keyword,
the stronger the association between them. Tables 1, 2 and 3 show the
cociter analysis of the top 30 genes identified by our method for each
cancer type. To illustrate the ability of our method to prioritize
well-known cancer driver genes, we also list the ranking position of
each gene in the other five methods.
Table 1.
Cociter analysis of top 30 lung cancer driver genes identified by our
method
Genes Cancer Lung Driver Is_driver DyTidriver Diffusion DriverNet DawnRank Muf_max Muf_sum
TP53 6772 999 110 1 1 20 1 1 5 6
ZNF536 4 0 1 1 2 5015 NA 2689 849 79
EGFR 4748 2849 166 1 3 1 3 4 7 26
TSHZ3 4 1 1 0 4 2748 1295 2463 1268 188
PRUNE2 12 1 1 0 5 5211 NA 2623 2018 332
RYR2 4 3 2 0 6 757 20 558 128 25
SPTA1 3 2 1 0 7 221 6 15 12 36
ATP10D 1 0 0 0 8 1836 NA 2825 2667 873
ANKIB1 2 1 0 0 9 1607 NA 2572 4107 2080
ZNF521 2 0 1 1 10 5025 NA 3058 1906 302
NES 192 31 5 0 11 1483 NA 1461 3094 1138
PIK3CA 1199 183 54 1 12 2 5 112 430 81
TLR4 417 591 9 1 13 71 45 3 672 138
NF1 165 16 11 1 14 34 56 21 389 139
FAT4 45 7 2 0 15 3106 839 1961 970 119
ASH1L 4 1 1 0 16 1506 NA 2289 2549 761
PRKCB 41 11 1 1 17 5 12 NA 442 92
SLC12A1 2 2 1 0 18 1647 NA 3038 4006 1750
CTNNB1 2517 340 44 1 19 6 21 NA 51 27
PLCB1 9 7 1 0 20 25 22 27 745 91
APOB 27 4 2 0 21 117 7 8 664 42
MET 1045 348 40 0 22 21 37 7 427 186
GRIN2B 13 3 2 0 23 18 39 120 397 135
UBC 134 17 2 0 24 3 4 NA 137 1
SASH1 13 3 1 0 25 1537 NA 1325 5100 3080
HGF 393 174 7 0 26 47 84 40 398 1192
BRAF 2175 270 126 1 27 70 75 155 392 150
UBA6 1 1 1 0 28 5263 NA NA 2957 980
PTPRZ1 12 1 1 0 29 3366 NA 2402 894 289
TAF1L 2 1 1 0 30 557 57 547 10 130
The second to fourth columns show the number of times each of the top
30 identified genes co-appears with 'cancer', 'lung' and 'driver'
(from left to right). Is_driver indicates whether the gene is a driver
gene in the benchmark dataset. The remaining columns give the ranking
position of each gene in DyTidriver, Diffusion, DriverNet, DawnRank,
Muf_max and Muf_sum, respectively
Table 2.
Cociter analysis of top 30 prostate cancer driver genes identified by
our method
Genes Cancer Prostate Driver Is_driver DyTidriver Diffusion DriverNet DawnRank Muf_max Muf_sum
TP53 6772 298 110 1 1 1 1 1 38 4
CTNNB1 2517 170 44 1 2 2 2 21 40 9
ASH1L 4 0 1 0 3 1703 NA NA 653 78
SPOP 43 24 4 1 4 1721 3 169 8 3
ATM 1377 61 5 0 5 13 11 12 36 14
PTEN 3047 642 64 1 6 700 94 NA 39 37
TTN 10 0 2 0 7 1724 22 14 2 2
FOXA1 182 69 10 0 8 17 5 3 37 10
KMT2D 25 2 2 0 9 855 54 NA NA NA
PIK3CA 1199 34 54 1 10 7 10 NA 282 36
DYNC1H1 9 1 2 0 11 66 19 51 219 72
CDH12 4 0 0 0 12 1511 NA 755 349 296
BRAF 2175 33 126 1 13 326 63 36 348 34
AKT1 2152 317 23 1 14 20 23 NA 52 33
FAT3 1 1 1 0 15 19 26 75 NA NA
LRP4 7 0 2 0 16 1440 NA NA 1426 541
GRIN2B 13 0 2 0 17 74 33 NA 220 90
KMT2C 23 2 4 0 18 613 27 NA NA NA
NCOR1 109 27 3 1 19 59 77 58 41 60
HSPA8 96 9 1 0 20 10 8 NA 438 67
OBSCN 7 0 0 0 21 1714 168 408 1 24
GRIN2A 5 0 1 0 22 285 92 85 374 73
PCDHA12 1 0 0 0 23 1453 271 197 324 65
MED12 19 4 4 0 24 376 162 157 317 84
STAT3 1824 147 27 0 25 16 15 5 58 8
PCDH18 2 1 1 0 26 1656 93 66 262 39
CDH23 5 0 1 0 27 457 97 NA 295 63
SPTA1 3 0 1 0 28 1719 16 9 221 15
UFL1 7 0 1 0 29 NA NA NA 1238 1265
SP1 393 38 3 1 30 8 9 NA 86 5
The second to fourth columns show the number of times each of the top
30 identified genes co-appeared with ‘cancer’, ‘prostate’ and ‘driver’
(from left to right). Is_driver indicates whether the given gene is a
driver gene in the benchmark dataset. The remaining columns give the
ranking positions of the identified genes in DyTidriver, Diffusion,
DriverNet, DawnRank, Muf_max and Muf_sum, respectively
Table 3.
Co-citer analysis of top 30 breast cancer driver genes identified by
our method
Genes Cancer Breast Driver Is_driver DyTidriver Diffusion DriverNet DawnRank Muf_max Muf_sum
TP53 6772 1356 110 1 1 233 1 2 7 2
PIK3CA 1199 334 54 1 2 156 2 1 2 3
MAP3K1 135 62 2 1 3 128 18 4 899 28
GATA3 154 122 8 1 4 85 13 6 888 17
CDH1 1410 358 19 1 5 42 4 10 1 6
ERBB2 5335 4332 78 1 6 72 64 90 8 73
UBC 134 30 2 0 7 240 3 122 22 1
NCOR1 109 45 3 1 8 139 12 48 6 68
ASH1L 4 0 1 0 9 1097 NA 1986 1846 729
PIK3R1 131 21 7 1 10 160 10 26 13 45
EP300 269 86 4 1 11 68 5 178 367 4
DYNC1H1 9 2 2 0 12 63 8 17 1017 107
HUWE1 29 4 3 0 13 251 28 45 9 112
PTEN 3047 672 64 1 14 185 98 193 3 79
MAP3K13 2 0 1 1 15 6189 NA 3303 2654 2045
NF1 165 24 11 1 16 141 41 19 4 144
TTN 10 1 2 0 17 2581 6 5 717 5
TPP2 4 0 2 0 18 1041 NA 2674 3172 2926
UFL1 7 1 1 0 19 802 NA NA 3493 3129
BRCA1 4652 4017 22 1 20 25 11 NA 361 27
BACH2 8 1 2 0 21 810 1182 2366 2298 1079
JAK2 382 92 19 1 22 118 32 NA 73 119
ERBB3 354 178 4 1 23 73 29 8 10 207
ERBB4 350 220 4 1 24 74 56 276 18 410
MAP2K4 70 10 2 1 25 127 34 23 898 86
CTCF 63 21 3 1 26 55 20 211 1027 29
PRKCB 41 9 1 1 27 174 59 31 80 151
SASH1 13 8 1 0 28 1011 NA NA 3706 4179
TAF1 10 3 1 1 29 225 86 33 359 19
SPTA1 3 0 1 0 30 212 17 25 1018 109
The second to fourth columns show the number of times each of the top
30 identified genes co-appeared with ‘cancer’, ‘breast’ and ‘driver’
(from left to right). Is_driver indicates whether the given gene is a
driver gene in the benchmark dataset. The remaining columns give the
ranking positions of the identified genes in DyTidriver, Diffusion,
DriverNet, DawnRank, Muf_max and Muf_sum, respectively
For lung cancer, Table 1 shows that some well-studied cancer driver
genes were ranked in the top 30 by our method but placed much lower by
other methods. For example, phosphatidylinositol 3-kinases (PI3Ks) are
well-known regulators of cellular growth and proliferation. PIK3CA was
ranked 12th by our method but 112th by DawnRank, 430th by Muf_max and
81st by Muf_sum. Toll-like receptor-4 (TLR4) in human tumors often
correlates with chemoresistance and metastasis [43]; it was ranked 13th
by our method, 71st by the Diffusion algorithm, 672nd by Muf_max and
138th by Muf_sum. The oncogenic BRAF(V600E) mutation results in an
active structural conformation characterized by greatly elevated ERK
activity [44]. BRAF is a known cancer driver gene, yet it was ranked
70th, 75th, 155th, 392nd and 150th by Diffusion, DriverNet, DawnRank,
Muf_max and Muf_sum, respectively, compared with 27th by our method.
Our method not only prioritizes significant cancer driver genes but
also identifies potential cancer driver genes that are absent from NCG
4.0, such as NES, MET and HGF. For MET in particular, some researchers
found that high MET gene copy number leads to shorter survival in
patients with non-small cell lung cancer. MET co-appeared with the
keywords ‘cancer’, ‘lung’ and ‘driver’ 1045, 348 and 40 times,
respectively.
For prostate cancer, as shown in Table 2, our method also ranked
several significant driver genes highly, including TP53, CTNNB1, PTEN
and PIK3CA. A notable case is the well-known tumor suppressor PTEN,
which is frequently inactivated in human prostate cancer [45]. It was
ranked 6th by our method but 700th by the Diffusion algorithm and 94th
by DriverNet, and it was not ranked at all by DawnRank. Furthermore,
the results show that DawnRank missed several significant cancer driver
genes, including PTEN, PIK3CA and AKT1. BRAF, which is involved in the
prostate-related RAS/RAF/ERK signaling pathway [28], was ranked 13th by
our method but 326th by the Diffusion algorithm, 63rd by DriverNet,
36th by DawnRank, 348th by Muf_max and 34th by Muf_sum. In addition,
some highly cancer-associated genes that are absent from NCG 4.0 also
rank near the top of our list. The ATM (ataxia telangiectasia mutated)
kinase plays an essential role in maintaining genome integrity by
coordinating cell cycle arrest, apoptosis and DNA damage repair [46].
It is absent from NCG 4.0 but co-appeared with ‘cancer’ 1377 times,
with ‘prostate’ 61 times and with ‘driver’ 5 times. Forkhead box
protein A1 (FOXA1) modulates the transactivation of steroid hormone
receptors and thus may influence tumor growth and hormone
responsiveness in prostate cancer [47]. It was ranked 8th by our method
although it is absent from NCG 4.0. In addition, the transcription
factor SP1 has also been missed by NCG 4.0.
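The screening of non-benchmark genes by literature support described
above can be sketched in a few lines of Python. This is only an
illustration, not the authors' pipeline: the rows are copied from the
Is_driver = 0 entries of Table 2, while the cutoff
MIN_CANCER_COCITATIONS and the function name are hypothetical choices
made for this example.

```python
# Rows copied from Table 2 for genes with Is_driver = 0:
# gene, co-citations with 'cancer', 'prostate', 'driver'.
TABLE2_NON_BENCHMARK = """\
ASH1L 4 0 1
ATM 1377 61 5
TTN 10 0 2
FOXA1 182 69 10
KMT2D 25 2 2
DYNC1H1 9 1 2
CDH12 4 0 0
FAT3 1 1 1
LRP4 7 0 2
GRIN2B 13 0 2
KMT2C 23 2 4
HSPA8 96 9 1
OBSCN 7 0 0
GRIN2A 5 0 1
PCDHA12 1 0 0
MED12 19 4 4
STAT3 1824 147 27
PCDH18 2 1 1
CDH23 5 0 1
SPTA1 3 0 1
UFL1 7 0 1
"""

# Hypothetical cutoff, chosen only for illustration (not from the paper).
MIN_CANCER_COCITATIONS = 50


def literature_supported(rows_text, min_cancer):
    """Return (gene, cancer_count) pairs at or above the cutoff,
    sorted by descending co-citation count with 'cancer'."""
    hits = []
    for line in rows_text.strip().splitlines():
        gene, cancer, _prostate, _driver = line.split()
        if int(cancer) >= min_cancer:
            hits.append((gene, int(cancer)))
    return sorted(hits, key=lambda pair: pair[1], reverse=True)


candidates = literature_supported(TABLE2_NON_BENCHMARK, MIN_CANCER_COCITATIONS)
for gene, count in candidates:
    print(gene, count)
```

With this cutoff the sketch lists STAT3, ATM, FOXA1 and HSPA8; a
different threshold would change the list, so the cutoff should be
treated as a tunable assumption.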
For breast cancer (Table 3), our method achieved high precision among
its top 10 genes: 8 of the 10 are known cancer driver genes.
Well-studied breast cancer driver genes including TP53, PIK3CA, MAP3K1,
CDH1, ERBB2 and PTEN were also placed near the top of our list. Among
these known breast cancer driver genes, the top three identified by our
method (TP53, PIK3CA and MAP3K1) were ranked 233rd, 156th and 128th,
respectively, by the Diffusion algorithm. The HER2 gene (official name
ERBB2) encodes a membrane receptor in the epidermal growth factor
receptor family that is amplified and overexpressed in adenocarcinoma
[48]. It is regarded as an important cancer driver gene by many
researchers and was ranked 6th by our method but 72nd, 64th, 90th and
73rd by the Diffusion algorithm, DriverNet, DawnRank and Muf_sum,
respectively. The breast cancer suppressor gene PTEN was ranked 14th by
our method but 185th, 98th, 193rd and 79th by Diffusion, DriverNet,
DawnRank and Muf_sum, respectively. In addition, BRCA1 and JAK2, which
are frequently co-cited with ‘cancer’ and ‘breast’, were missed by
DawnRank.
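The top-10 precision quoted for breast cancer can be reproduced from
the Is_driver column of Table 3. The snippet below is a minimal sketch:
the gene order and flags are copied from the first ten rows of Table 3,
and the helper name precision_at_k is an illustrative choice rather
than something defined in the paper.

```python
# First ten rows of Table 3 in DyTidriver rank order: (gene, Is_driver flag).
TOP10_BREAST = [
    ("TP53", 1),
    ("PIK3CA", 1),
    ("MAP3K1", 1),
    ("GATA3", 1),
    ("CDH1", 1),
    ("ERBB2", 1),
    ("UBC", 0),
    ("NCOR1", 1),
    ("ASH1L", 0),
    ("PIK3R1", 1),
]


def precision_at_k(ranked, k):
    """Fraction of the first k ranked genes flagged as benchmark drivers."""
    if k <= 0 or k > len(ranked):
        raise ValueError("k must be between 1 and the number of ranked genes")
    return sum(flag for _, flag in ranked[:k]) / k


p10 = precision_at_k(TOP10_BREAST, 10)
print(p10)  # 0.8, i.e. 8 of the top 10 genes are benchmark drivers
```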
Discussion
The core step in overcoming cancer is to identify the cancer driver
genes that promote cancer evolution and development. However, this is a
hard task because cancer is heterogeneous and there are too many
irrelevant passenger genes. Many methods have recently been proposed to
address this problem, but they still have limitations. For example,
they ignore many driver genes with low variation frequency and depend
heavily on error-prone networks. Inspired by the fact that cancer genes
forming functional modules tend to be expressed similarly in the same
tissue, we sought to improve the reliability of the gene functional
interaction network by incorporating the expression similarity between
mutated gene pairs in the cancer-related tissues. To obtain
tissue-specific expression profiles, we used the GEO database, because
it is currently the largest and best-known expression data platform and
stores relatively complete expression data. The GEO dataset used in
this work consisted of a total of 677 patients, including cancer and
normal patients, covered over 90 distinct tissue types and was created
by the same organization using the same experimental technology.
Although our model outperforms the other methods, it still has some
limitations. For example, the datasets used in this work come from
different projects, TCGA and GEO. Although we use the GEO dataset only
to calculate the co-expression values of mutated genes in a specific
tissue, heterogeneity among patients is likely to introduce some
ambiguity. To address this concern, we plan to unify the datasets as
far as possible in future work.
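To make the idea of expression similarity between a gene pair in one
tissue concrete, the sketch below computes a Pearson correlation over
the samples of a single tissue. This is an assumption for illustration
only: this section does not state which similarity measure is used, and
the expression values, gene labels and function name are invented toy
inputs, not data from GEO or TCGA.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length expression vectors."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("vectors must have the same length (at least 2)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant profile carries no co-expression signal
    return cov / sqrt(var_x * var_y)


# Toy expression profiles across four samples of one tissue (invented values).
gene_a = [1.0, 2.0, 3.0, 4.0]
gene_b = [2.0, 4.0, 6.0, 8.0]   # rises with gene_a
gene_c = [4.0, 3.0, 2.0, 1.0]   # falls as gene_a rises

r_ab = pearson(gene_a, gene_b)
r_ac = pearson(gene_a, gene_c)
print(r_ab, r_ac)
```

A score of this kind could then be attached to the corresponding edge
of the interaction network; how the scores are combined with the other
evidence is left to the Methods and is not implied by this sketch.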
Conclusion
In this work, we proposed a new method to identify cancer driver genes
by integrating gene dysregulated expression, tissue-specific expression
and variation frequency into the functional interaction network.
Compared with other network-based methods, our method not only
considers that driver genes affect the expression of downstream genes,
but also takes advantage of the modularity of driver genes, their
co-expression in specific tissues and their own variation frequency. We
compared our results with four other similar methods and performed
cociter and enrichment analyses. The results indicate that our method
can identify cancer driver genes with high precision while also
detecting some potential, previously unknown cancer driver genes. In
addition, the enrichment analysis shows that the top-ranking cancer
driver genes in our list are enriched in significant cancer-related
pathways and carry out important functions [48].
Acknowledgements