Abstract
Background
Embedding techniques for converting high-dimensional sparse data into
low-dimensional distributed representations have been gaining
popularity in various fields of research. In deep learning models,
embedding is commonly used and has proven to be more effective than naive
binary representation. However, no attempt has yet been made to embed
highly sparse mutation profiles into densely distributed
representations. Since binary representation does not capture
biological context, its use is limited in many applications such as
discovering novel driver mutations. Additionally, training distributed
representations of mutations is challenging due to a relatively small
amount of available biological data compared with the large amount of
text corpus data in text mining fields.
Methods
We introduce Mut2Vec, a novel computational pipeline that can be used
to create a distributed representation of cancerous mutations. Mut2Vec
is trained on cancer profiles using Skip-Gram since cancer can be
characterized by a series of co-occurring mutations. We also augmented
our pipeline with existing information in the biomedical literature and
protein-protein interaction networks to compensate for the data
insufficiency.
Results
To evaluate our models, we conducted two experiments that involved the
following tasks: a) visualizing driver and passenger mutations, b)
identifying novel driver mutations using a clustering method. Our
visualization showed a clear distinction between passenger mutations
and driver mutations. We also found driver mutation candidates and
confirmed, through a literature survey, that these were true driver
mutations. The pre-trained mutation vectors and the candidate driver
mutations are publicly available at http://infos.korea.ac.kr/mut2vec.
Conclusions
We introduce Mut2Vec, which can be utilized to generate distributed
representations of mutations, and we experimentally validate the efficacy
of the generated mutation representations. Mut2Vec can be used in
various deep learning applications such as cancer classification and
drug sensitivity prediction.
Electronic supplementary material
The online version of this article (10.1186/s12920-018-0349-7) contains
supplementary material, which is available to authorized users.
Keywords: Mut2Vec, Distributed representation, Deep learning, Mutation
embedding, Cancer
Background
Mutation representation by simple binary values (e.g., a mutation is
given a value of 1 if it exists and 0 otherwise) has been commonly used
in various machine learning models designed for cancer analysis.
However, since binary
representation does not capture mutational context (e.g., mutations
that frequently co-occur, distinction between driver mutations and
passenger mutations), it provides insufficient information for cancer
analysis such as cancer subtype classification, patient clustering, or
drug sensitivity prediction. Although a significant number of mutations
have been discovered owing to advances in sequencing techniques, most of
them are passenger mutations, which are generally known to play no role
in cancer progression. In contrast, driver mutations directly affect
cancer progression, and they tend to be observed frequently in the
cancer profiles of patients. Applying these important mutational properties to
mutation representation is critical for improving cancer analysis.
Furthermore, if a mutation representation captures the characteristics
of driver mutations, it is possible to discover novel driver mutations
by calculating the similarity between a candidate mutation and each of
the driver mutations. Based on this motivation, we aim to address the
problem by developing continuous and distributed representations of
mutations using deep learning techniques.
Recently, deep learning, a family of artificial neural network-based
machine learning techniques, has achieved remarkable improvements in
various applications such as text mining [1], speech recognition [2],
and image classification [3], and even in biomedical prediction tasks
such as protein secondary structure prediction [4] and DNA-protein
binding prediction [5]. Various continuous distributed representations
have been introduced to be used jointly with deep learning models.
Word2Vec [6] is one of the best-known models trained to represent words
in a continuous space. This model is a multi-layered neural network
consisting of an input layer, an embedding lookup layer, and a
prediction layer. For the representation of documents in a continuous
space, Doc2Vec [7], an extension of Word2Vec, adds document vectors to
the embedding lookup layer. Since distributed word representations
capture semantic relationships, such as the similarity between two
words, they contain richer information than binary representations,
which encode only the presence or absence of words.
Similar attempts to represent data in a continuous vector space have
been made in the biomedical domain. ProtVec [8] applies Word2Vec to
protein sequences to obtain distributed representations of 3-gram amino
acid sequences. Each protein sequence is first split into 3-grams, each
of which carries biological significance and is regarded as a “word”;
the Word2Vec algorithm is then run using Skip-Gram. Seq2Vec [9], which
extends the approach of ProtVec, applies Doc2Vec to represent a sequence
not by combining all the sequential 3-gram elements of ProtVec, but by
directly embedding the sequence itself. Finally, Dna2Vec [10]
generalizes the 3-gram structure of ProtVec and Seq2Vec to a k-gram
structure. Another approach is SNP2Vec [11], which embeds individual
SNPs into a continuous space using a denoising autoencoder [12] and Diet
Networks.
Nevertheless, since Skip-Gram relies on co-occurrence information
between data units (words or k-grams), it is difficult to guarantee the
quality of the vectors if the input data lacks co-occurrence
information. To address this issue, several studies that apply existing
structured or graph knowledge to the embedding process have been
introduced. RC-NET [13] adds two regularization functions to the
Skip-Gram objective function, which capture the relational distances
between words based on their categorical information. Faruqui et
al. [14] proposed a method that applies synonym-based graph knowledge to
existing word vectors. Using a simple mathematical process, graph
information is added to each word vector while information on its
previous state is preserved.
In this work, we propose a novel pipeline, Mut2Vec, to generate
distributed representations of mutations for the characterization of
cancer cells. Because our vector space captures the characteristics of
driver mutations and distinguishes driver mutations from passenger
mutations, it has the potential to improve performance in other
applications. Our mutation vectors can help identify driver mutations
by investigating the vector space. We hypothesized that when an
unidentified mutation is near many driver mutations in the vector
space, the mutation could be a candidate driver mutation. Our mutation
vectors can also help machine learning applications capture important
biological information and yield better results than conventional
binary representation. We assume that mutations are critical to the
development of cancer when they co-occur in many cancer samples. Our
assumption is similar to the text mining assumption that words are
semantically meaningful when the words co-occur in many sentences. Word
embedding algorithms such as Skip-Gram utilize co-occurrence
information to embed words in a semantically meaningful distributed
continuous space that places words with similar meanings close to each
other. In this work, we attempt to leverage such word embedding
techniques to embed gene-level mutations in a continuous distributed
space that captures the semantic relations among the cancerous
gene-level mutations.
To produce precise mutation vectors, a sufficient amount of information
on co-occurring mutations is needed. However, the number of cancer
samples with co-occurring mutations is limited. In the case of the
Google News corpus, which is a standard text corpus for training word
vectors, there are more than 100 billion tokens for three million
words. In comparison, the database of the International Cancer Genome
Consortium (ICGC) [15] contains only about 13,000 cancer samples
covering more than 20,000 mutated genes. Because of this limitation, it
is difficult to make reliable observations of co-occurring mutations,
which are essential to producing high-quality embeddings. As a result,
rare mutations lack sufficient co-occurrence information, so proper
vectors cannot be learned for them; such vectors end up in the wrong
locations in the vector space and act as noise in distance-based
analyses such as clustering. To resolve this problem, we utilized the
biomedical literature and a protein-protein interaction (PPI) network to
enhance the quality of the mutation vectors.
To evaluate our embedding process, we visualized driver mutations and
passenger mutations using our vectors. We confirmed that the two
mutation groups were clearly separated from each other. The experimental
results demonstrate that our mutation vectors can help determine whether
a mutation is a driver or a passenger mutation. We also identified
driver mutation candidates using a clustering method. To evaluate the
candidates and confirm their validity, we referenced recent biomedical
literature in which true driver mutations are reported.
Method
Cancer cells do not arise from random combinations of mutations; rather,
they result from mutations accumulated during their evolutionary
process [16]. Though mutations are abnormal in terms of their origin,
their occurrence is inevitable. From this perspective, we treated the
co-occurring gene mutations in a sample as “context.” Among them, we
exclusively selected protein-altering mutations. Using the Skip-Gram
model, we constructed the basic Mut2Vec model and obtained basic Mut2Vec
vectors, where each vector is a 300-dimensional distributed
representation of a mutation and contains co-occurrence information of
gene mutations from the ICGC dataset.
However, the amount of available data in the biomedical domain is still
insufficient compared with other domains such as the Natural Language
Processing (NLP) domain. In the biomedical literature, gene names are
mentioned in their biological context. By extracting contexts from the
literature and adding them to our vectors, we overcome the limitations
of data insufficiency and enhance the vectors to capture more precise
gene-level mutational properties. We used the Skip-Gram model to train
word representations on PubMed abstracts. Based on the learned word
representations, we initialized the weight matrix of the embedding
lookup layer with the word vector of each gene when training mutation
representations on the ICGC dataset. Our Mut2Vec+PI (PubMed Initialized)
model initializes mutation vectors using PubMed word vectors and trains
the Skip-Gram model on the ICGC dataset using the initialized vectors.
Furthermore, we added structured biological knowledge using the PPI
network BioGRID [17]. Assuming that similar proteins are involved in
similar cellular processes and that the effects of their alterations are
alike, we utilized a retrofitting process to post-process the output
vectors [14]. Our Mut2Vec+R (Retrofitted) model applies retrofitting to
the basic Mut2Vec output. Our Mut2Vec+PI+R model employs both PubMed
initialization and retrofitting.
Our Mut2Vec pipeline is summarized as follows. First, we initialize the
weight matrix in the embedding lookup layer of the Skip-Gram model using
word vectors pre-trained on PubMed abstracts. Because we needed initial
gene vectors, we selected only the word vectors of genes from the
pre-trained word vectors. Next, we trained the gene-level mutation
vectors on the ICGC mutation profiles using the initialized Skip-Gram
model. We considered co-occurring gene mutations in a sample as
contexts, just as words co-occurring in a sentence are considered
contexts in the NLP domain. Finally, we retrofitted the trained mutation
vectors on the protein-protein interaction network data of BioGRID. The
whole pipeline is described in Fig. 1.
Fig. 1.
Overview of the Mut2Vec pipeline. Our pipeline is composed of two
modules: an embedding module based on Skip-Gram and a vector
post-processing module equipped with retrofitting. In our pipeline, we
build four mutation embedding models. The first model uses only the
Skip-Gram module on mutation profiles, and we call this model basic
Mut2Vec. In our Mut2Vec+PI model, the weight matrix in the Skip-Gram
model is initialized with PubMed word vectors. In our Mut2Vec+R model,
the output vectors of the basic Mut2Vec model are post-processed in the
retrofitting module. In our Mut2Vec+PI+R model, both the initialization
with PubMed word vectors and the post-processing are applied
Skip-Gram model
The Skip-Gram model is a multi-layered neural network, as shown in
Fig. 2. The objective of this model is to correctly predict the
surrounding entities given an input entity. To achieve this objective,
we train the model using each “entity” and its contextual “entities”. In
our case, the input entities are mutated genes, while the output, or
contextual, entities are the co-occurring mutated genes. Thus, we train
the Skip-Gram model iteratively by using mutated genes as input and
minimizing the prediction error between the output and the co-occurring
mutations.
Fig. 2.
An overview of the Skip-Gram model. The Skip-Gram model consists of an
input layer, an embedding lookup (hidden) layer, and a prediction layer.
The result of the embedding lookup layer is the distributed
representation of the target word
By using the Skip-Gram model, we maximize the probability as follows,
$$p(C_i \mid e_i) = p(c_1, c_2, \ldots, c_k, \ldots, c_{l-1}, c_l \mid e_i) \approx \prod_{e_j \in C_i} p(e_j \mid e_i), \qquad e_i \in E \ \text{and} \ e_j, c_k \in C_i \subset E$$
where C[i]={c[1],c[2],…,c[k],…,c[l−1],c[l]} is the context set of an
entity e[i], context size l is |C[i]|, and E is a set of entities to be
embedded. When embedding words in text, context size l is fixed.
However, in our case, it is difficult to fix the context size because
the number of mutations in each sample varies. Some samples have less
than 10 gene mutations, while others have more than 1000 gene
mutations. In addition, since mutations included in a single patient
sample are not sorted according to a certain biological order, drawing
a mutation vector by shifting the context window is illogical, unlike
the case of NLP.
To assign various co-occurring contexts to a mutation, we performed
random sampling without replacement on each patient sample 10 times.
The size of the random samples was 10. We assumed that patient samples
with an excessive number of mutations tend to be highly noisy. Also, we
found the information extracted from patient samples with small
quantities of mutations was critical for embedding each mutation
vector. As we conducted the same random sampling procedure regardless
of the mutation quantity of each patient sample, noisy samples with an
excessive number of mutations were used less in vector embedding
processes. On the other hand, patient samples with small mutation
quantities were used frequently.
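A minimal Python sketch of this sampling scheme is given below; the
handling of samples with 10 or fewer mutations and the random seed are
our assumptions, as they are not specified above.

```python
import random

def build_contexts(patient_samples, n_draws=10, draw_size=10, seed=42):
    """Draw fixed-size mutation contexts from each patient sample.

    Each patient sample is a list of mutated genes. We make `n_draws`
    random draws of size `draw_size` without replacement (within a
    draw), so samples with thousands of mutations contribute no more
    contexts than samples with only a few mutations.
    """
    rng = random.Random(seed)
    contexts = []
    for sample in patient_samples:
        for _ in range(n_draws):
            if len(sample) <= draw_size:
                contexts.append(list(sample))  # keep small samples whole
            else:
                contexts.append(rng.sample(sample, draw_size))
    return contexts

# Example: two patient samples with different mutation counts
samples = [["TP53", "KRAS", "EGFR"],
           ["BRAF", "NRAS", "PTEN", "RB1", "ALK", "MYC", "KIT",
            "RET", "MET", "FLT3", "JAK2", "WT1"]]
contexts = build_contexts(samples)
```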
The conditional probability mentioned above can be expressed with
latent parameters of a neural network and a softmax function as below,
$$p(e_j \mid e_i) = \frac{\exp\left(u_i^T v_j\right)}{\sum_{k=1}^{|E|} \exp\left(u_i^T v_k\right)}, \qquad J(U,V) = \frac{1}{N} \sum_{i}^{N} \sum_{e_j \in C_i} \log p(e_j \mid e_i)$$
where U is the weight matrix of the embedding lookup layer, u[i]^T is
the distributed representation of the i-th entity, and N is the number
of all training entities, each defined with its contexts as
(e[i], C[i]). V is the output weight matrix, and v[j] is the j-th row of
V. Our goal is to maximize the objective function J(U,V) above.
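As a concrete illustration of the formulas above, the full-softmax
conditional probability and one entity's log-likelihood contribution can
be written directly in NumPy; a toy sketch with random parameters, not
the actual training code.

```python
import numpy as np

rng = np.random.default_rng(0)
num_entities, dim = 5, 4                   # toy vocabulary of 5 mutations
U = rng.normal(size=(num_entities, dim))   # embedding lookup matrix
V = rng.normal(size=(num_entities, dim))   # output weight matrix

def p_softmax(j, i):
    """p(e_j | e_i) with a full softmax over all entities."""
    scores = V @ U[i]                      # u_i^T v_k for every entity k
    return np.exp(scores[j]) / np.exp(scores).sum()

# Log-likelihood contribution of one training pair (e_i, C_i)
i, C_i = 0, [1, 3]
log_lik = sum(np.log(p_softmax(j, i)) for j in C_i)
```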
However, the basic Skip-Gram model described above suffers from high
computational cost. Due to the summation in the denominator of
p(e[j]|e[i]), computing J(U,V) is expensive, especially for large
vocabularies (entity sets). To address this issue, Mikolov et
al. [18] proposed a Skip-Gram variant with an additional feature called
negative sampling. Instead of using the softmax function, negative
sampling directly uses the sigmoid function σ(x) to represent each
entity's conditional probability.
$$\sigma(x) = \frac{1}{1 + \exp(-x)}, \qquad p(e_j \mid e_i) = \sigma\left(u_i^T v_j\right), \qquad p(\bar{e}_j \mid e_i) = 1 - \sigma\left(u_i^T v_j\right) \tag{1}$$
Using the re-defined conditional probability above, negative sampling
maximizes the objective function J[NEG](U,V) as below
$$J_{NEG}(U,V)_i = \sum_{j \in C_i} \log p(e_j \mid e_i) + \sum_{l \in D_i} \log p(\bar{e}_l \mid e_i)$$

$$J_{NEG}(U,V) = \frac{1}{N} \sum_{i=1}^{N} J_{NEG}(U,V)_i, \qquad D_i \subset C_i^C, \quad C_i^C = E - C_i$$
where D[i] is a sampled subset of C[i]^C, the complement of C[i], and
|D[i]| is fixed. The sampling is done using the distribution of entities
raised to the 3/4 power. In conclusion, the Skip-Gram model equipped
with negative sampling maximizes the occurrence probability of
contextual entities and minimizes the occurrence probability of
non-contextual entities, conditioned on the occurrence of the input
entity. We used Gensim [19], a Python library, for the implementation.
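A minimal Gensim sketch of this setup, including the PubMed
initialization described earlier, might look as follows. The
hyperparameters other than the 300-dimensional Skip-Gram with negative
sampling, the file name pubmed_word_vectors.kv, and the direct
vector-copy initialization are illustrative assumptions, not the exact
settings used here.

```python
from gensim.models import Word2Vec, KeyedVectors

# `contexts` is the list of sampled 10-gene contexts built earlier;
# each context plays the role of a "sentence" of co-occurring mutations.
model = Word2Vec(
    vector_size=300,  # 300-dimensional mutation vectors
    sg=1,             # Skip-Gram architecture
    negative=5,       # negative sampling: |D_i| negative draws per pair
    window=10,        # spans a whole sampled context (order is arbitrary)
    min_count=1,
)
model.build_vocab(contexts)

# Mut2Vec+PI: seed the embedding lookup matrix with PubMed word vectors.
# "pubmed_word_vectors.kv" is a hypothetical file of Skip-Gram vectors
# trained on PubMed abstracts; only gene words are copied over.
pubmed = KeyedVectors.load("pubmed_word_vectors.kv")
for gene in model.wv.index_to_key:
    if gene in pubmed:
        model.wv.vectors[model.wv.key_to_index[gene]] = pubmed[gene]

model.train(contexts, total_examples=len(contexts), epochs=model.epochs)
vector = model.wv["TP53"]  # a 300-dimensional mutation representation
```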
Retrofitting
As words have synonyms and phrases have paraphrases, Faruqui et
al. [14] utilized structured lexical semantic networks, namely
WordNet [20], FrameNet [21], and the Paraphrase Database (PPDB) [22], to
post-process vectors of entities. The purpose of this post-processing is
to ensure that vectors have similar representations if their entities
are synonyms or paraphrases. The processing is done by
minimizing the function J(Q,Q̃), defined in terms of J[i](Q,Q̃) and
J[(i,j)](Q) as follows,

$$J_i(Q, \tilde{Q}) = \alpha_i \left\| q_i - \tilde{q}_i \right\|^2$$

$$J_{(i,j)}(Q) = \beta_{ij} \left\| q_i - q_j \right\|^2, \qquad j \in S_i, \quad q_i, q_j \in Q \ \text{and} \ \tilde{q}_i \in \tilde{Q}$$
where q̃[i] is a trained vector, q[i] is the corresponding
post-processed vector, S[i] is the set of entities similar to entity i,
and q[j] (j ∈ S[i]) is the vector of an entity similar to entity
i. α[i] and β[ij] are hyperparameters for each entity and each pair of
entities, where α[i] = 1 and β[ij] = |S[i]|^−1; these were used as the
default values for the model. Finally, the overall objective function is
given by the formula below.
$$J(Q, \tilde{Q}) = \sum_{i}^{n} \left( J_i(Q, \tilde{Q}) + \sum_{j \in S_i} J_{(i,j)}(Q) \right)$$
Likewise, we assumed that if two gene mutations are involved in the same
cellular process, they have similar effects on biological processes,
such as malfunctions or abnormal activations. From the BioGRID network,
we selected the genes one hop away from a given gene as similarly
functioning genes and made their vectors similar to each other using the
retrofitting process described above. The retrofitting code is available
at https://github.com/mfaruqui/retrofitting.
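With α[i] = 1 and β[ij] = |S[i]|^−1, the iterative update of Faruqui et
al. reduces to averaging each trained vector with the mean of its
neighbors' current vectors. A simplified sketch, assuming `neighbors`
maps each gene to its one-hop BioGRID neighbors:

```python
import numpy as np

def retrofit(vectors, neighbors, n_iters=10):
    """Retrofit trained vectors to a similarity graph (Faruqui et al.).

    vectors:   dict gene -> np.ndarray (trained Mut2Vec vector, q~_i)
    neighbors: dict gene -> set of one-hop BioGRID neighbors (S_i)
    With alpha_i = 1 and beta_ij = 1/|S_i|, each update moves q_i to
    the average of its original vector and its neighbors' mean.
    """
    q = {gene: vec.copy() for gene, vec in vectors.items()}
    for _ in range(n_iters):
        for gene, original in vectors.items():
            nbrs = [n for n in neighbors.get(gene, ()) if n in q]
            if not nbrs:
                continue  # no graph evidence: keep the trained vector
            neighbor_mean = np.mean([q[n] for n in nbrs], axis=0)
            q[gene] = (original + neighbor_mean) / 2.0
    return q
```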
Results
Driver/Passenger mutation visualization
Many of the mutations in a single cancer sample are not directly related
to cancer. A driver mutation directly affects the progression of the
cancer, while a passenger mutation does not play any particular role. In
fact, driver mutations are common in many cancer cells of patients,
while passenger mutations are not [23].
We performed data visualization to see whether our mutation vectors
reflect the distinction between driver and passenger mutations in the
vector space. The driver/passenger mutation information was obtained
from the driver mutation database IntOGen [24]. We also conducted
k-means clustering on the driver and passenger mutation vectors before
reducing their dimensions by Principal Component Analysis, and
calculated the Normalized Mutual Information (NMI) to assess the
clustering result. The NMI is defined as
$$NMI(\Omega, C) = \frac{MI(\Omega, C)}{\left[H(\Omega) + H(C)\right] / 2}$$
where Ω={ω[1],ω[2],...ω[I]} is the set of cluster labels and
C={c[1],c[2],…c[J]} is the set of class labels. In our case,
C={driver,passenger}. MI is mutual information defined as
$$MI(\Omega, C) = \sum_i \sum_j p(\omega_i, c_j) \log \frac{p(\omega_i, c_j)}{p(\omega_i)\, p(c_j)}$$
where p(ω[i]), p(c[j]), and p(ω[i],c[j]) are the probabilities of a
mutation occurring in cluster ω[i], class c[j], and the intersection of
ω[i] and c[j], respectively.
H is entropy defined as
$$H(\Omega) = -\sum_i p(\omega_i) \log p(\omega_i)$$
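The evaluation itself can be sketched with scikit-learn, whose default
NMI normalization matches the arithmetic mean above; the placeholder
data and the choice of k = 2 clusters (matching the two classes) are
illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.metrics import normalized_mutual_info_score

X = np.random.rand(100, 300)           # placeholder mutation vectors
labels = np.random.randint(0, 2, 100)  # 1 = driver, 0 = passenger

# Cluster in the original 300-dim space, before dimension reduction
clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Default 'arithmetic' averaging matches [H(Omega) + H(C)] / 2 above
nmi = normalized_mutual_info_score(labels, clusters)

# PCA to two components is used only for the scatter-plot visualization
coords = PCA(n_components=2).fit_transform(X)
```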
Experiments were performed on the three cancer types (CM, BRCA, LUAD)
with the highest numbers of “known” driver mutations among the 29 cancer
types. As shown in Table 1, the driver mutation data contain far more
predicted mutations than known mutations, and the database contains only
“predicted” passenger mutations. Since known driver mutations are more
reliable than predicted ones, we used only “known” driver mutations for
a more accurate comparison. Since the number of passenger mutations is
much larger than the number of driver mutations, randomly sampled
passenger mutations were selected for the visualization process.
Table 1.
IntOGen data description for three cancer types
Type   Drivers (Known)   Drivers (Predicted)   Passengers (Predicted)
BRCA   22                473                   13702
CM     29                607                   16863
LUAD   23                505                   13929
We obtained interesting results using our mutation vectors. As shown in
Fig. 3, the vectors trained on the ICGC dataset perform slightly better
than randomly generated vectors, and the distinction between driver
mutations and passenger mutations in the vector space became clearer
when PPI network knowledge was added. After adding PubMed information,
we confirmed that both driver and passenger mutations were properly
classified. Furthermore, the improvements measured by NMI support our
visualization results. Compared with randomly generated vectors, our
mutation vectors improved the NMI scores for all three cancer types. We
observed dramatic performance improvements when both literature
information and PPI network information were applied together.
Fig. 3.
Driver/passenger mutation visualization. Visualization with Principal
Component Analysis shows the clear difference between the driver and
passenger mutation classes when PubMed information is applied. Red dots
represent known driver mutations and blue dots represent sampled
predicted passenger mutations. Normalized Mutual Information (NMI) is
also calculated based on the results of k-means clustering
We also conducted this visualization experiment with binary mutation
representation. From the ICGC patient-mutation profile, we defined
binary mutation vectors whose dimension equals the number of samples in
the ICGC dataset. Each dimension of a vector encodes the existence of
the mutation in the corresponding sample: the binary vector of a
mutation has a value of 1 in the dimensions corresponding to the samples
that carry the mutation, and 0 in all other dimensions. Figure 4 shows
the visualization of the binary mutation vectors. The distinction
between passenger mutations (blue dots) and driver mutations (red dots)
misleadingly appears accurate since the blue dots are notably more
clustered than the red dots. In fact, driver mutations are observed in
patients more frequently than passenger mutations, so the binary vectors
of some driver mutations contain many 1s and are positioned far from the
area where most mutations are clustered. However, when we expanded the
area where most passenger mutations existed, we observed that the
drivers and passengers were actually scrambled. The NMI scores of the
binary representation were also lower than those of Mut2Vec+PI+R: the
binary representation obtained NMI scores of 0.031, 0.039, and 0.071 for
CM, BRCA, and LUAD, respectively, whereas Mut2Vec+PI+R obtained 0.650,
0.866, and 0.713. The mutation vectors from the Mut2Vec+PI+R model thus
represent the information on driver mutations better than the binary
vector representation.
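The binary baseline just described can be constructed directly from the
patient-mutation profile; `profiles` below is a toy stand-in for the
ICGC data.

```python
import numpy as np

# profiles: sample_id -> set of mutated genes (toy stand-in for ICGC)
profiles = {"S1": {"TP53", "KRAS"}, "S2": {"TP53"}, "S3": {"BRAF"}}
samples = sorted(profiles)
genes = sorted(set().union(*profiles.values()))

# One dimension per sample: 1 if that sample carries the mutation
binary = np.zeros((len(genes), len(samples)), dtype=np.int8)
for j, sample_id in enumerate(samples):
    for gene in profiles[sample_id]:
        binary[genes.index(gene), j] = 1
```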
Fig. 4.
Driver/passenger mutation visualization of binary vectors. Visualization
with Principal Component Analysis of binary mutation vectors. The binary
vectors are made by selecting column vectors of the patient-mutation
profile. Because driver mutations tend to appear frequently in the
cancer profiles of patients, driver mutation vectors contain many 1s and
have large norms. Therefore, some driver mutation vectors lie far from
most mutation vectors. The “Boxed Area” is the expanded visualization of
the boxed area of the “Binary” visualization where most of the passenger
mutation vectors exist. We found that the drivers and passengers were
actually scrambled although they seemed well-separated in a broader
scope
For comparison, we trained 300-dimensional mutation vectors using an
autoencoder [25] and a denoising autoencoder [26], and conducted the
same visualization experiment. The autoencoder obtained NMI scores of
0.007, 0.074, and 0.071 for CM, BRCA, and LUAD, respectively; the
denoising autoencoder obtained 0.031, 0.040, and 0.038. We also found
that the vectors trained with the autoencoders could not effectively
distinguish driver mutations from passenger mutations. The visualization
results are provided in Additional file 1.
Driver mutation identification
Based on our previous visualization, we can infer that driver mutation
vectors have their own properties that help distinguish driver mutations
from other mutations. From this inference, we clustered the entire set
of mutation vectors from Mut2Vec+PI+R and examined whether a cluster
contained many driver mutations.
The k-means algorithm with 200 clusters was applied in the clustering
process. Next, we selected the most enriched cluster, i.e., the cluster
containing the most driver mutations, and built a contingency table,
which is shown in Table 2. We estimated the statistical significance
using the hypergeometric test [27] over the entire set of driver gene
mutations in the IntOGen database. In the database, there were 67 unique
“known” drivers and 594 “predicted” drivers, all of which intersect with
the embedded mutations. In the most enriched cluster, we found 21 known
drivers with a p-value of 3.74e-37. Considering both known and predicted
driver mutations, we found 45 drivers with a p-value of 2.04e-51. The
remaining 18 mutations were not referred to as drivers in the
database. As the cluster had high statistical significance, we concluded
that the mutation vectors from our Mut2Vec+PI+R model captured the
characteristics of driver mutations. Contingency tables for other
clustering methods (agglomerative hierarchical clustering, BIRCH,
spectral clustering, affinity propagation, and Gaussian mixture) run
with different numbers of clusters (50, 100, 300, and 500) are listed in
Additional file 2.
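The reported significance can be checked with SciPy's hypergeometric
survival function, plugging in the counts from Table 2 below; this is an
illustration of the test, not the original analysis code.

```python
from scipy.stats import hypergeom

M = 18584  # all embedded mutations (population)
n = 67     # known drivers in the population
N = 63     # size of the most enriched cluster
k = 21     # known drivers found in the cluster

# P(X >= k): probability of drawing at least k drivers by chance
p_value = hypergeom.sf(k - 1, M, n, N)
```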
Table 2.
The most enriched cluster characterized using mutation vectors from
Mut2Vec+PI+R
Labels            In cluster   In population   p-value
Known             21           67              3.74e-37
Known+predicted   45           661             2.04e-51
Candidates        18           17923
All               63           18584
We also performed an enrichment analysis on the KEGG PATHWAY database
[28] using Fisher’s exact test. We used Enrichr [29], a publicly
available enrichment analysis platform. The p-values were adjusted using
the Benjamini-Hochberg method to correct for multiple hypothesis
testing. As Table 3 shows, “Pathways in cancer" was the most enriched
pathway, with an adjusted p-value of 9.98e-24 and 25 overlapping
genes. All five pathways were related to cancer or to general
characteristics of cancer such as metabolism and misregulation, since
various types of driver mutations were grouped in the most enriched
cluster.
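The same style of analysis can be reproduced locally with SciPy and
statsmodels; the 2x2 tables below are built from the cluster size (63),
the overlaps in Table 3, and the embedded-mutation total (18,584), and
are meant as an illustration rather than Enrichr's exact background sets.

```python
from scipy.stats import fisher_exact
from statsmodels.stats.multitest import multipletests

# For each pathway: [[in cluster & in pathway, in cluster & not],
#                    [not in cluster & in pathway, not in cluster & not]]
tables = [
    [[25, 38], [372, 18149]],  # "Pathways in cancer": 25/397 overlap
    [[15, 48], [52, 18469]],   # "Central carbon metabolism": 15/67
]
raw_p = [fisher_exact(t, alternative="greater")[1] for t in tables]

# Benjamini-Hochberg adjustment for multiple hypothesis testing
rejected, adj_p, _, _ = multipletests(raw_p, method="fdr_bh")
```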
Table 3.
Pathway enrichment analysis of the most enriched cluster characterized
using mutation vectors from Mut2Vec+PI+R
KEGG PATHWAY                              Adjusted p-value   Overlap
Pathways in cancer                        9.98e-24           25/397
Central carbon metabolism in cancer       2.51e-22           15/67
Transcriptional misregulation in cancer   6.76e-19           17/180
MicroRNAs in cancer                       6.27e-14           16/297
Prostate cancer                           1.93e-13           11/89
Based on the above observations, we carried out a further experiment. We
hypothesized that if most of the mutations gathered in a cluster are
drivers, the unidentified mutations in the cluster are likely to be
driver mutations as well. To test this hypothesis, we investigated the
mutations in the most enriched cluster, shown in Table 4. We focused
mainly on retrieving information on the 18 candidate mutations that were
not identified as drivers in the IntOGen database.
Table 4.
Genes in the most enriched cluster characterized using mutation vectors
from Mut2Vec+PI+R
Known     Predicted   Candidate
ABL1      ASXL1       BCL2
ALK       BAP1        CISH
DNMT3A    BCL6        CRLF2
EGFR      CALR        DUSP22
ERBB2     CCND1       ERG
FGFR2     CEBPA       EWSR1
FGFR3     ETV6        FHIT
FLT3      FGFR1       MAML2
GNAQ      H3F3A       MYBL1
HRAS      IKZF1       MYCL
IDH2      MET         PDGFRB
JAK2      MYCN        PLAG1
KIT       NF2         PRKACA
MYC       NOTCH1      SPINK1
MYD88     NTRK1       SS18
NPM1      PAX5        TERT
NRAS      PDGFRA      TFE3
RB1       PPM1D       TP63
RUNX1     RET
SDHB      RHOA
SMO       ROS1
SMARCB1
TET2
WT1
By searching recently published biomedical papers (since 2015), we found
that 11 of the candidate mutations have been reported as driver
mutations. Table 5 shows the literature search results. BCL2 is an
important driver of leukemia and is referred to as a driver
mutation [30]. ERG is reported to be a driver of carcinogenesis in
prostate cancer [31, 32]. The loss of fragile histidine triad protein
(FHIT) is strongly related to pancreatic ductal
adenocarcinomas [33]. The MAML2-MECT1 fusion is a driver in salivary
gland and bronchial gland mucoepidermoid carcinoma [34, 35]. MYBL1 is a
driver of adenoid cystic carcinoma when associated with MYB [36, 37]. As
a member of the myelocytomatosis oncogene family, MYCL is a driver
oncogene of lung carcinoma [38, 39]. PDGFRB was recently reported as a
driver of the majority of sporadic infantile and adult solitary
myofibromas [40]. PRKACA was identified by recent sequencing as a driver
of cortisol-producing adenomas and perihilar
cholangiocarcinoma [41, 42]. The SS18-SSX fusion has been reported as a
driver of synovial sarcoma in many studies [43–47]. TERT has also been
reported as a cancer driver in various tissues including thyroid and
liver [48–57]. Variations in TP63 are associated with drivers of
squamous cell lung cancer [58, 59].
Table 5.
Literature search results on driver candidates
Gene Tissue References