Abstract
Background
Embedding techniques for converting high-dimensional sparse data into
low-dimensional distributed representations have been gaining
popularity in various fields of research. In deep learning models,
embedding is commonly used and has proven to be more effective than naive
binary representation. However, no attempt has yet been made to embed
highly sparse mutation profiles into densely distributed
representations. Since binary representation does not capture
biological context, its use is limited in many applications such as
discovering novel driver mutations. Additionally, training distributed
representations of mutations is challenging due to a relatively small
amount of available biological data compared with the large amount of
text corpus data in text mining fields.
Methods
We introduce Mut2Vec, a novel computational pipeline that can be used
to create a distributed representation of cancerous mutations. Mut2Vec
is trained on cancer profiles using Skip-Gram since cancer can be
characterized by a series of co-occurring mutations. We also augmented
our pipeline with existing information in the biomedical literature and
protein-protein interaction networks to compensate for the data
insufficiency.
Results
To evaluate our models, we conducted two experiments that involved the
following tasks: a) visualizing driver and passenger mutations, b)
identifying novel driver mutations using a clustering method. Our
visualization showed a clear distinction between passenger mutations
and driver mutations. We also found driver mutation candidates and
confirmed, through a literature survey, that these were true driver
mutations. The pre-trained mutation vectors and the candidate driver
mutations are publicly available at http://infos.korea.ac.kr/mut2vec.
Conclusions
We introduce Mut2Vec, which can be utilized to generate distributed
representations of mutations, and we experimentally validate the efficacy
of the generated mutation representations. Mut2Vec can be used in
various deep learning applications such as cancer classification and
drug sensitivity prediction.
Electronic supplementary material
The online version of this article (10.1186/s12920-018-0349-7) contains
supplementary material, which is available to authorized users.
Keywords: Mut2Vec, Distributed representation, Deep learning, Mutation
embedding, Cancer
Background
Mutation representation by simple binary values (e.g., a mutation is
given a value of 1 if it exists and 0 otherwise) has been commonly used
in various machine learning models designed for cancer analysis.
However, since binary
representation does not capture mutational context (e.g., mutations
that frequently co-occur, distinction between driver mutations and
passenger mutations), it provides insufficient information for cancer
analysis such as cancer subtype classification, patient clustering, or
drug sensitivity prediction. Although a significant number of mutations
have been discovered owing to advances in sequencing techniques, most of
them are passenger mutations, which are generally known to play no role
in cancer progression. In contrast, driver mutations directly affect
cancer progression, and they tend to be observed frequently in the
cancer profiles of patients. Applying these important mutational properties to
mutation representation is critical for improving cancer analysis.
Furthermore, if a mutation representation captures the characteristics
of driver mutations, it is possible to discover novel driver mutations
by calculating the similarity between a candidate mutation and each of
the driver mutations. Based on this motivation, we aim to address the
problem by developing continuous and distributed representations of
mutations using deep learning techniques.
Recently, deep learning, a family of artificial neural network-based
machine learning techniques, has achieved remarkable improvements in
various applications such as text mining [1], speech recognition [2],
and image classification [3], and even in biomedical prediction tasks
such as protein secondary structure prediction [4] and DNA-protein
binding prediction [5]. Various continuous distributed representations
have been introduced to be used jointly with deep learning models.
Word2Vec [6] is one of the best-known models trained to represent words
in a continuous space. This model is a multi-layered neural network
consisting of an input layer, an embedding lookup layer, and a
prediction layer. For the representation of documents in a continuous
space, Doc2Vec [7], an extension of Word2Vec, adds document vectors to
the embedding lookup layer. Since distributed word representations
capture semantic relationships, such as the similarity between two
words, they contain richer information than binary representations,
which encode only the presence or absence of words.
Similar attempts to represent data in a continuous vector space have
been made in the biomedical domain. ProtVec [8] applies Word2Vec to
protein sequences to obtain distributed representations of 3-gram amino
acid sequences. Each protein sequence is first split into 3-grams, each
of which carries biological significance and is regarded as a “word”;
the Word2Vec algorithm is then run using Skip-Gram. Seq2Vec [9], which
extends the approach of ProtVec, applies Doc2Vec to represent a sequence
not by combining all the sequential 3-gram elements of ProtVec, but by
directly embedding the sequence itself. Finally, Dna2Vec [10]
generalizes the 3-gram structure of ProtVec and Seq2Vec to a k-gram
structure. Another approach is SNP2Vec [11], which embeds individual
SNPs into a continuous space using a denoising autoencoder [12] and Diet
Networks.
Nevertheless, since Skip-Gram relies on co-occurrence information
between data units (words or k-grams), it is difficult to guarantee the
quality of the vectors if the input data lacks co-occurrence
information. To address this issue, several studies that apply existing
structured or graph knowledge to the embedding process have been
introduced. RC-NET [13] adds two regularization functions to the
Skip-Gram objective function, which capture the relational distances
between words based on their categorical information. Faruqui et
al. [14] proposed a method that applies synonym-based graph knowledge to
existing word vectors. Using a simple mathematical process, graph
information is added to each word vector while information on its
previous state is preserved.
In this work, we propose a novel pipeline, Mut2Vec, to generate
distributed representations of mutations for the characterization of
cancer cells. Because our vector space captures the characteristics of
driver mutations and distinguishes driver mutations from passenger
mutations, it has the potential to improve performance in other
applications. Our mutation vectors can help identify driver mutations
by investigating the vector space. We hypothesized that when an
unidentified mutation is near many driver mutations in the vector
space, the mutation could be a candidate driver mutation. Our mutation
vectors can also help machine learning applications capture important
biological information and yield better results than conventional
binary representation. We assume that mutations are critical to the
development of cancer when they co-occur in many cancer samples. Our
assumption is similar to the text mining assumption that words are
semantically meaningful when the words co-occur in many sentences. Word
embedding algorithms such as Skip-Gram utilize co-occurrence
information to embed words in a semantically meaningful distributed
continuous space that places words with similar meanings close to each
other. In this work, we attempt to leverage such word embedding
techniques to embed gene-level mutations in a continuous distributed
space that captures the semantic relations among the cancerous
gene-level mutations.
To produce precise mutation vectors, a sufficient amount of information
on co-occurring mutations is needed. However, the number of cancer
samples with co-occurring mutations is limited. In the case of the
Google News corpus, which is a standard text corpus for training word
vectors, there are more than 100 billion tokens for three million
words. In comparison, the database of the International Cancer Genome
Consortium (ICGC) [15] contains only about 13,000 cancer samples
covering more than 20,000 mutated genes. Because of this limitation, it
is difficult to make reliable observations of co-occurring mutations,
which are essential to producing high-quality embeddings. As a result,
rare mutations lack sufficient co-occurrence information, so proper
vectors cannot be learned for them; such vectors end up in the wrong
locations in the vector space and act as noise in distance-based
analyses such as clustering. To resolve this problem, we utilized the
biomedical literature and a protein-protein interaction (PPI) network to
enhance the quality of the mutation vectors.
To evaluate our embedding process, we visualized driver mutations and
passenger mutations using our vectors. We confirmed that the two
mutation groups were clearly separated from each other. The experimental
results demonstrate that our mutation vectors can help determine whether
a mutation is a driver or a passenger mutation. We also identified
driver mutation candidates using a clustering method. To evaluate the
candidates and confirm their validity, we referenced recent biomedical
literature in which true driver mutations are reported.
Method
Cancer cells do not arise from random combinations of mutations; rather,
they result from mutations accumulated during their evolutionary
process [16]. Though mutations are abnormal in terms of their origin,
their occurrence is inevitable. From this perspective, we treated the
co-occurring gene mutations in a sample as “context.” Among them, we
exclusively selected protein-altering mutations. Using the Skip-Gram
model, we constructed the basic Mut2Vec model and obtained basic Mut2Vec
vectors, where each vector is a 300-dimensional distributed
representation of a mutation and contains co-occurrence information of
gene mutations from the ICGC dataset.
However, the amount of available data in the biomedical domain is still
insufficient compared with other domains such as the Natural Language
Processing (NLP) domain. In the biomedical literature, gene names are
mentioned in their biological context. By extracting contexts from the
literature and adding them to our vectors, we overcome the limitations
of data insufficiency and enhance the vectors to capture more precise
gene-level mutational properties. We used the Skip-Gram model to train
word representations on PubMed abstracts. Based on the learned word
representations, we initialized the weight matrix of the embedding
lookup layer with the word vector of each gene when training mutation
representations on the ICGC dataset. Our Mut2Vec+PI (PubMed Initialized)
model initializes mutation vectors using PubMed word vectors and trains
the Skip-Gram model on the ICGC dataset using the initialized vectors.
Furthermore, we added structured biological knowledge using the PPI
network BioGRID [17]. Assuming that similar proteins are involved in
similar cellular processes and that the effects of their alterations are
alike, we utilized a retrofitting process to post-process the output
vectors [14]. Our Mut2Vec+R (Retrofitted) model applies retrofitting to
the basic Mut2Vec output. Our Mut2Vec+PI+R model employs both PubMed
initialization and retrofitting.
Our Mut2Vec pipeline is summarized as follows. First, we initialize the
weight matrix in the embedding lookup layer of the Skip-Gram model using
word vectors pre-trained on PubMed abstracts. Because we needed initial
gene vectors, we selected only the word vectors of genes from the
pre-trained word vectors. Next, we trained the gene-level mutation
vectors on the ICGC mutation profiles using the initialized Skip-Gram
model. We considered co-occurring gene mutations in a sample as
contexts, just as words co-occurring in a sentence are considered
contexts in the NLP domain. Finally, we retrofitted the trained mutation
vectors on the protein-protein interaction network data of BioGRID. The
whole pipeline is described in Fig. 1.
Fig. 1.
Overview of the Mut2Vec pipeline. Our pipeline is composed of two
modules: an embedding module based on Skip-Gram and a vector
post-processing module equipped with retrofitting. In our pipeline, we
build four mutation embedding models. The first model uses only the
Skip-Gram module on mutation profiles, and we call this model basic
Mut2Vec. In our Mut2Vec+PI model, the weight matrix in the Skip-Gram
model is initialized with PubMed word vectors. In our Mut2Vec+R model,
the output vectors of the basic Mut2Vec model are post-processed in the
retrofitting module. In our Mut2Vec+PI+R model, both the initialization
with PubMed word vectors and the post-processing are applied
Skip-Gram model
The Skip-Gram model is a multi-layered neural network, as shown in
Fig. 2. The objective of this model is to correctly predict the
surrounding entities given an input entity. To achieve this objective,
we train the model using each “entity” and its contextual “entities”. In
our case, the input entities are mutated genes, while the output, or
contextual, entities are the co-occurring mutated genes. Thus, we train
the Skip-Gram model iteratively by using mutated genes as input and
minimizing the prediction error between the output and the co-occurring
mutations.
Fig. 2.
An overview of the Skip-Gram model. The Skip-Gram model consists of an
input layer, an embedding lookup (hidden) layer, and a prediction layer.
The result of the embedding lookup layer is the distributed
representation of the target word
By using the Skip-Gram model, we maximize the probability as follows,
$$p(C_i \mid e_i) = p(c_1, c_2, \ldots, c_k, \ldots, c_{l-1}, c_l \mid e_i) \approx \prod_{e_j \in C_i} p(e_j \mid e_i), \qquad e_i \in E \ \text{and} \ e_j, c_k \in C_i \subset E$$
where C[i]={c[1],c[2],…,c[k],…,c[l−1],c[l]} is the context set of an
entity e[i], context size l is |C[i]|, and E is a set of entities to be
embedded. When embedding words in text, context size l is fixed.
However, in our case, it is difficult to fix the context size because
the number of mutations in each sample varies. Some samples have less
than 10 gene mutations, while others have more than 1000 gene
mutations. In addition, since mutations included in a single patient
sample are not sorted according to a certain biological order, drawing
a mutation vector by shifting the context window is illogical, unlike
the case of NLP.
To assign various co-occurring contexts to a mutation, we performed
random sampling without replacement on each patient sample 10 times.
The size of the random samples was 10. We assumed that patient samples
with an excessive number of mutations tend to be highly noisy. Also, we
found the information extracted from patient samples with small
quantities of mutations was critical for embedding each mutation
vector. As we conducted the same random sampling procedure regardless
of the mutation quantity of each patient sample, noisy samples with an
excessive number of mutations were used less in vector embedding
processes. On the other hand, patient samples with small mutation
quantities were used frequently.
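A minimal Python sketch of this sampling scheme is given below; the
handling of samples with 10 or fewer mutations and the random seed are
our assumptions, as they are not specified above.

```python
import random

def build_contexts(patient_samples, n_draws=10, draw_size=10, seed=42):
    """Draw fixed-size mutation contexts from each patient sample.

    Each patient sample is a list of mutated genes. We make `n_draws`
    random draws of size `draw_size` without replacement (within a
    draw), so samples with thousands of mutations contribute no more
    contexts than samples with only a few mutations.
    """
    rng = random.Random(seed)
    contexts = []
    for sample in patient_samples:
        for _ in range(n_draws):
            if len(sample) <= draw_size:
                contexts.append(list(sample))  # keep small samples whole
            else:
                contexts.append(rng.sample(sample, draw_size))
    return contexts

# Example: two patient samples with different mutation counts
samples = [["TP53", "KRAS", "EGFR"],
           ["BRAF", "NRAS", "PTEN", "RB1", "ALK", "MYC", "KIT",
            "RET", "MET", "FLT3", "JAK2", "WT1"]]
contexts = build_contexts(samples)
```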
The conditional probability mentioned above can be expressed with
latent parameters of a neural network and a softmax function as below,
$$p(e_j \mid e_i) = \frac{\exp\left(u_i^T v_j\right)}{\sum_{k=1}^{|E|} \exp\left(u_i^T v_k\right)}, \qquad J(U,V) = \frac{1}{N} \sum_{i}^{N} \sum_{e_j \in C_i} \log p(e_j \mid e_i)$$
where U is the weight matrix of the embedding lookup layer, u[i]^T is
the distributed representation of the i-th entity, and N is the number
of all training entities, each defined with its contexts as
(e[i], C[i]). V is the output weight matrix, and v[j] is the j-th row of
V. Our goal is to maximize the objective function J(U,V) above.
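As a concrete illustration of the formulas above, the full-softmax
conditional probability and one entity's log-likelihood contribution can
be written directly in NumPy; a toy sketch with random parameters, not
the actual training code.

```python
import numpy as np

rng = np.random.default_rng(0)
num_entities, dim = 5, 4                   # toy vocabulary of 5 mutations
U = rng.normal(size=(num_entities, dim))   # embedding lookup matrix
V = rng.normal(size=(num_entities, dim))   # output weight matrix

def p_softmax(j, i):
    """p(e_j | e_i) with a full softmax over all entities."""
    scores = V @ U[i]                      # u_i^T v_k for every entity k
    return np.exp(scores[j]) / np.exp(scores).sum()

# Log-likelihood contribution of one training pair (e_i, C_i)
i, C_i = 0, [1, 3]
log_lik = sum(np.log(p_softmax(j, i)) for j in C_i)
```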
However, the basic Skip-Gram model described above suffers from high
computational cost. Due to the summation in the denominator of
p(e[j]|e[i]), computing J(U,V) is expensive, especially for large
vocabularies (entity sets). To address this issue, Mikolov et
al. [18] proposed a Skip-Gram variant with an additional feature called
negative sampling. Instead of using the softmax function, negative
sampling directly uses the sigmoid function σ(x) to represent each
entity's conditional probability.
$$\sigma(x) = \frac{1}{1 + \exp(-x)}, \qquad p(e_j \mid e_i) = \sigma\left(u_i^T v_j\right), \qquad p(\bar{e}_j \mid e_i) = 1 - \sigma\left(u_i^T v_j\right) \tag{1}$$
Using the re-defined conditional probability above, negative sampling
maximizes the objective function J[NEG](U,V) as below
$$J_{NEG}(U,V)_i = \sum_{j \in C_i} \log p(e_j \mid e_i) + \sum_{l \in D_i} \log p(\bar{e}_l \mid e_i)$$

$$J_{NEG}(U,V) = \frac{1}{N} \sum_{i=1}^{N} J_{NEG}(U,V)_i, \qquad D_i \subset C_i^C, \quad C_i^C = E - C_i$$
where D[i] is a sampled subset of C[i]^C, the complement of C[i], and
|D[i]| is fixed. The sampling is done using the distribution of entities
raised to the 3/4 power. In conclusion, the Skip-Gram model equipped
with negative sampling maximizes the occurrence probability of
contextual entities and minimizes the occurrence probability of
non-contextual entities, conditioned on the occurrence of the input
entity. We used Gensim [19], a Python library, for the implementation.
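A minimal Gensim sketch of this setup, including the PubMed
initialization described earlier, might look as follows. The
hyperparameters other than the 300-dimensional Skip-Gram with negative
sampling, the file name pubmed_word_vectors.kv, and the direct
vector-copy initialization are illustrative assumptions, not the exact
settings used here.

```python
from gensim.models import Word2Vec, KeyedVectors

# `contexts` is the list of sampled 10-gene contexts built earlier;
# each context plays the role of a "sentence" of co-occurring mutations.
model = Word2Vec(
    vector_size=300,  # 300-dimensional mutation vectors
    sg=1,             # Skip-Gram architecture
    negative=5,       # negative sampling: |D_i| negative draws per pair
    window=10,        # spans a whole sampled context (order is arbitrary)
    min_count=1,
)
model.build_vocab(contexts)

# Mut2Vec+PI: seed the embedding lookup matrix with PubMed word vectors.
# "pubmed_word_vectors.kv" is a hypothetical file of Skip-Gram vectors
# trained on PubMed abstracts; only gene words are copied over.
pubmed = KeyedVectors.load("pubmed_word_vectors.kv")
for gene in model.wv.index_to_key:
    if gene in pubmed:
        model.wv.vectors[model.wv.key_to_index[gene]] = pubmed[gene]

model.train(contexts, total_examples=len(contexts), epochs=model.epochs)
vector = model.wv["TP53"]  # a 300-dimensional mutation representation
```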
Retrofitting
As words have synonyms and phrases have paraphrases, Faruqui et
al. [14] utilized structured lexical semantic networks, namely
WordNet [20], FrameNet [21], and the Paraphrase Database (PPDB) [22], to
post-process vectors of entities. The purpose of this post-processing is
to ensure that vectors have similar representations if their entities
are synonyms or paraphrases. The processing is done by
minimizing the function J(Q,Q̃), defined in terms of J[i](Q,Q̃) and
J[(i,j)](Q) as follows,

$$J_i(Q, \tilde{Q}) = \alpha_i \left\| q_i - \tilde{q}_i \right\|^2$$

$$J_{(i,j)}(Q) = \beta_{ij} \left\| q_i - q_j \right\|^2, \qquad j \in S_i, \quad q_i, q_j \in Q \ \text{and} \ \tilde{q}_i \in \tilde{Q}$$
where q̃[i] is a trained vector, q[i] is the corresponding
post-processed vector, S[i] is the set of entities similar to entity i,
and q[j] (j ∈ S[i]) is the vector of an entity similar to entity
i. α[i] and β[ij] are hyperparameters for each entity and each pair of
entities, where α[i] = 1 and β[ij] = |S[i]|^−1; these were used as the
default values for the model. Finally, the overall objective function is
given by the formula below.
$$J(Q, \tilde{Q}) = \sum_{i}^{n} \left( J_i(Q, \tilde{Q}) + \sum_{j \in S_i} J_{(i,j)}(Q) \right)$$
Likewise, we assumed that if two gene mutations are involved in the same
cellular process, they have similar effects on biological processes,
such as malfunctions or abnormal activations. From the BioGRID network,
we selected the genes one hop away from a given gene as similarly
functioning genes and made their vectors similar to each other using the
retrofitting process described above. The retrofitting code is available
at https://github.com/mfaruqui/retrofitting.
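With α[i] = 1 and β[ij] = |S[i]|^−1, the iterative update of Faruqui et
al. reduces to averaging each trained vector with the mean of its
neighbors' current vectors. A simplified sketch, assuming `neighbors`
maps each gene to its one-hop BioGRID neighbors:

```python
import numpy as np

def retrofit(vectors, neighbors, n_iters=10):
    """Retrofit trained vectors to a similarity graph (Faruqui et al.).

    vectors:   dict gene -> np.ndarray (trained Mut2Vec vector, q~_i)
    neighbors: dict gene -> set of one-hop BioGRID neighbors (S_i)
    With alpha_i = 1 and beta_ij = 1/|S_i|, each update moves q_i to
    the average of its original vector and its neighbors' mean.
    """
    q = {gene: vec.copy() for gene, vec in vectors.items()}
    for _ in range(n_iters):
        for gene, original in vectors.items():
            nbrs = [n for n in neighbors.get(gene, ()) if n in q]
            if not nbrs:
                continue  # no graph evidence: keep the trained vector
            neighbor_mean = np.mean([q[n] for n in nbrs], axis=0)
            q[gene] = (original + neighbor_mean) / 2.0
    return q
```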
Results
Driver/Passenger mutation visualization
Many of the mutations in a single cancer sample are not directly related
to cancer. A driver mutation directly affects the progression of the
cancer, while a passenger mutation does not play any particular role. In
fact, driver mutations are common in many cancer cells of patients,
while passenger mutations are not [23].
We performed data visualization to see whether our mutation vectors
reflect the distinction between driver and passenger mutations in the
vector space. The driver/passenger mutation information was obtained
from the driver mutation database IntOGen [24]. We also conducted
k-means clustering on the driver and passenger mutation vectors before
reducing their dimensions by Principal Component Analysis, and
calculated the Normalized Mutual Information (NMI) to assess the
clustering result. The NMI is defined as
$$NMI(\Omega, C) = \frac{MI(\Omega, C)}{\left[H(\Omega) + H(C)\right] / 2}$$
where Ω={ω[1],ω[2],...ω[I]} is the set of cluster labels and
C={c[1],c[2],…c[J]} is the set of class labels. In our case,
C={driver,passenger}. MI is mutual information defined as
$$MI(\Omega, C) = \sum_i \sum_j p(\omega_i, c_j) \log \frac{p(\omega_i, c_j)}{p(\omega_i)\, p(c_j)}$$
where p(ω[i]), p(c[j]), and p(ω[i],c[j]) are the probabilities of a
mutation occurring in cluster ω[i], class c[j], and the intersection of
ω[i] and c[j], respectively.
H is entropy defined as
$$H(\Omega) = -\sum_i p(\omega_i) \log p(\omega_i)$$
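The evaluation itself can be sketched with scikit-learn, whose default
NMI normalization matches the arithmetic mean above; the placeholder
data and the choice of k = 2 clusters (matching the two classes) are
illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.metrics import normalized_mutual_info_score

X = np.random.rand(100, 300)           # placeholder mutation vectors
labels = np.random.randint(0, 2, 100)  # 1 = driver, 0 = passenger

# Cluster in the original 300-dim space, before dimension reduction
clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Default 'arithmetic' averaging matches [H(Omega) + H(C)] / 2 above
nmi = normalized_mutual_info_score(labels, clusters)

# PCA to two components is used only for the scatter-plot visualization
coords = PCA(n_components=2).fit_transform(X)
```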
Experiments were performed on the three cancer types (CM, BRCA, LUAD)
with the highest numbers of “known” driver mutations among the 29 cancer
types. As shown in Table 1, the driver mutation data contain far more
predicted mutations than known mutations, and the database contains only
“predicted” passenger mutations. Since known driver mutations are more
reliable than predicted ones, we used only “known” driver mutations for
a more accurate comparison. Since the number of passenger mutations is
much larger than the number of driver mutations, randomly sampled
passenger mutations were selected for the visualization process.
Table 1.
IntOGen data description for three cancer types
Type   Drivers (Known)   Drivers (Predicted)   Passengers (Predicted)
BRCA   22                473                   13702
CM     29                607                   16863
LUAD   23                505                   13929
We obtained interesting results using our mutation vectors. As shown in
Fig. 3, the vectors trained on the ICGC dataset perform slightly better
than randomly generated vectors, and the distinction between driver
mutations and passenger mutations in the vector space became clearer
when PPI network knowledge was added. After adding PubMed information,
we confirmed that both driver and passenger mutations were properly
classified. Furthermore, the improvements measured by NMI support our
visualization results. Compared with randomly generated vectors, our
mutation vectors improved the NMI scores for all three cancer types. We
observed dramatic performance improvements when both literature
information and PPI network information were applied together.
Fig. 3.
Driver/passenger mutation visualization. Visualization with Principal
Component Analysis shows the clear difference between the driver and
passenger mutation classes when PubMed information is applied. Red dots
represent known driver mutations and blue dots represent sampled
predicted passenger mutations. Normalized Mutual Information (NMI) is
also calculated based on the results of k-means clustering
We also conducted this visualization experiment with binary mutation
representation. From the ICGC patient-mutation profile, we defined
binary mutation vectors whose dimension equals the number of samples in
the ICGC dataset. Each dimension of a vector encodes the existence of
the mutation in the corresponding sample: the binary vector of a
mutation has a value of 1 in the dimensions corresponding to the samples
that carry the mutation, and 0 in all other dimensions. Figure 4 shows
the visualization of the binary mutation vectors. The distinction
between passenger mutations (blue dots) and driver mutations (red dots)
misleadingly appears accurate since the blue dots are notably more
clustered than the red dots. In fact, driver mutations are observed in
patients more frequently than passenger mutations, so the binary vectors
of some driver mutations contain many 1s and are positioned far from the
area where most mutations are clustered. However, when we expanded the
area where most passenger mutations existed, we observed that the
drivers and passengers were actually scrambled. The NMI scores of the
binary representation were also lower than those of Mut2Vec+PI+R: the
binary representation obtained NMI scores of 0.031, 0.039, and 0.071 for
CM, BRCA, and LUAD, respectively, whereas Mut2Vec+PI+R obtained 0.650,
0.866, and 0.713. The mutation vectors from the Mut2Vec+PI+R model thus
represent the information on driver mutations better than the binary
vector representation.
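The binary baseline just described can be constructed directly from the
patient-mutation profile; `profiles` below is a toy stand-in for the
ICGC data.

```python
import numpy as np

# profiles: sample_id -> set of mutated genes (toy stand-in for ICGC)
profiles = {"S1": {"TP53", "KRAS"}, "S2": {"TP53"}, "S3": {"BRAF"}}
samples = sorted(profiles)
genes = sorted(set().union(*profiles.values()))

# One dimension per sample: 1 if that sample carries the mutation
binary = np.zeros((len(genes), len(samples)), dtype=np.int8)
for j, sample_id in enumerate(samples):
    for gene in profiles[sample_id]:
        binary[genes.index(gene), j] = 1
```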
Fig. 4.
Driver/passenger mutation visualization of binary vectors. Visualization
with Principal Component Analysis of binary mutation vectors. The binary
vectors are made by selecting column vectors of the patient-mutation
profile. Because driver mutations tend to appear frequently in the
cancer profiles of patients, driver mutation vectors contain many 1s and
have large norms. Therefore, some driver mutation vectors lie far from
most mutation vectors. The “Boxed Area” is the expanded visualization of
the boxed area of the “Binary” visualization where most of the passenger
mutation vectors exist. We found that the drivers and passengers were
actually scrambled although they seemed well-separated in a broader
scope
For comparison, we trained 300-dimensional mutation vectors using an
autoencoder [25] and a denoising autoencoder [26], and conducted the
same visualization experiment. The autoencoder obtained NMI scores of
0.007, 0.074, and 0.071 for CM, BRCA, and LUAD, respectively; the
denoising autoencoder obtained 0.031, 0.040, and 0.038. We also found
that the vectors trained with the autoencoders could not effectively
distinguish driver mutations from passenger mutations. The visualization
results are provided in Additional file 1.
Driver mutation identification
Based on our previous visualization, we can infer that driver mutation
vectors have their own properties that help distinguish driver mutations
from other mutations. From this inference, we clustered the entire set
of mutation vectors from Mut2Vec+PI+R and examined whether a cluster
contained many driver mutations.
The k-means algorithm with 200 clusters was applied in the clustering
process. Next, we selected the most enriched cluster, i.e., the cluster
containing the most driver mutations, and built a contingency table,
which is shown in Table 2. We estimated the statistical significance
using the hypergeometric test [27] over the entire set of driver gene
mutations in the IntOGen database. In the database, there were 67 unique
“known” drivers and 594 “predicted” drivers, all of which intersect with
the embedded mutations. In the most enriched cluster, we found 21 known
drivers with a p-value of 3.74e-37. Considering both known and predicted
driver mutations, we found 45 drivers with a p-value of 2.04e-51. The
remaining 18 mutations were not referred to as drivers in the
database. As the cluster had high statistical significance, we concluded
that the mutation vectors from our Mut2Vec+PI+R model captured the
characteristics of driver mutations. Contingency tables for other
clustering methods (agglomerative hierarchical clustering, BIRCH,
spectral clustering, affinity propagation, and Gaussian mixture) run
with different numbers of clusters (50, 100, 300, and 500) are listed in
Additional file 2.
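The reported significance can be checked with SciPy's hypergeometric
survival function, plugging in the counts from Table 2 below; this is an
illustration of the test, not the original analysis code.

```python
from scipy.stats import hypergeom

M = 18584  # all embedded mutations (population)
n = 67     # known drivers in the population
N = 63     # size of the most enriched cluster
k = 21     # known drivers found in the cluster

# P(X >= k): probability of drawing at least k drivers by chance
p_value = hypergeom.sf(k - 1, M, n, N)
```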
Table 2.
The most enriched cluster characterized using mutation vectors from
Mut2Vec+PI+R
Labels            In cluster   In population   p-value
Known             21           67              3.74e-37
Known+predicted   45           661             2.04e-51
Candidates        18           17923
All               63           18584
We also performed an enrichment analysis on the KEGG PATHWAY database
[28] using Fisher’s exact test. We used Enrichr [29], a publicly
available enrichment analysis platform. The p-values were adjusted using
the Benjamini-Hochberg method to correct for multiple hypothesis
testing. As Table 3 shows, “Pathways in cancer" was the most enriched
pathway, with an adjusted p-value of 9.98e-24 and 25 overlapping
genes. All five pathways were related to cancer or to general
characteristics of cancer such as metabolism and misregulation, since
various types of driver mutations were grouped in the most enriched
cluster.
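The same style of analysis can be reproduced locally with SciPy and
statsmodels; the 2x2 tables below are built from the cluster size (63),
the overlaps in Table 3, and the embedded-mutation total (18,584), and
are meant as an illustration rather than Enrichr's exact background sets.

```python
from scipy.stats import fisher_exact
from statsmodels.stats.multitest import multipletests

# For each pathway: [[in cluster & in pathway, in cluster & not],
#                    [not in cluster & in pathway, not in cluster & not]]
tables = [
    [[25, 38], [372, 18149]],  # "Pathways in cancer": 25/397 overlap
    [[15, 48], [52, 18469]],   # "Central carbon metabolism": 15/67
]
raw_p = [fisher_exact(t, alternative="greater")[1] for t in tables]

# Benjamini-Hochberg adjustment for multiple hypothesis testing
rejected, adj_p, _, _ = multipletests(raw_p, method="fdr_bh")
```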
Table 3.
Pathway enrichment analysis of the most enriched cluster characterized
using mutation vectors from Mut2Vec+PI+R
KEGG PATHWAY                              Adjusted p-value   Overlap
Pathways in cancer                        9.98e-24           25/397
Central carbon metabolism in cancer       2.51e-22           15/67
Transcriptional misregulation in cancer   6.76e-19           17/180
MicroRNAs in cancer                       6.27e-14           16/297
Prostate cancer                           1.93e-13           11/89
Based on the above observations, we carried out a further experiment. We
hypothesized that if most of the mutations gathered in a cluster are
drivers, the unidentified mutations in the cluster are likely to be
driver mutations as well. To test this hypothesis, we investigated the
mutations in the most enriched cluster, shown in Table 4. We focused
mainly on retrieving information on the 18 candidate mutations that were
not identified as drivers in the IntOGen database.
Table 4.
Genes in the most enriched cluster characterized using mutation vectors
from Mut2Vec+PI+R
Known     Predicted   Candidate
ABL1      ASXL1       BCL2
ALK       BAP1        CISH
DNMT3A    BCL6        CRLF2
EGFR      CALR        DUSP22
ERBB2     CCND1       ERG
FGFR2     CEBPA       EWSR1
FGFR3     ETV6        FHIT
FLT3      FGFR1       MAML2
GNAQ      H3F3A       MYBL1
HRAS      IKZF1       MYCL
IDH2      MET         PDGFRB
JAK2      MYCN        PLAG1
KIT       NF2         PRKACA
MYC       NOTCH1      SPINK1
MYD88     NTRK1       SS18
NPM1      PAX5        TERT
NRAS      PDGFRA      TFE3
RB1       PPM1D       TP63
RUNX1     RET
SDHB      RHOA
SMO       ROS1
SMARCB1
TET2
WT1
By searching recently published biomedical papers (since 2015), we found
that 11 of the candidate mutations have been reported as driver
mutations. Table 5 shows the literature search results. BCL2 is an
important driver of leukemia and is referred to as a driver
mutation [30]. ERG is reported to be a driver of carcinogenesis in
prostate cancer [31, 32]. The loss of fragile histidine triad protein
(FHIT) is strongly related to pancreatic ductal
adenocarcinomas [33]. The MAML2-MECT1 fusion is a driver in salivary
gland and bronchial gland mucoepidermoid carcinoma [34, 35]. MYBL1 is a
driver of adenoid cystic carcinoma when associated with MYB [36, 37]. As
a member of the myelocytomatosis oncogene family, MYCL is a driver
oncogene of lung carcinoma [38, 39]. PDGFRB was recently reported as a
driver of the majority of sporadic infantile and adult solitary
myofibromas [40]. PRKACA was identified by recent sequencing as a driver
of cortisol-producing adenomas and perihilar
cholangiocarcinoma [41, 42]. The SS18-SSX fusion has been reported as a
driver of synovial sarcoma in many studies [43–47]. TERT has also been
reported as a cancer driver in various tissues including thyroid and
liver [48–57]. Variations in TP63 are associated with drivers of
squamous cell lung cancer [58, 59].
Table 5.
Literature search results on driver candidates
Gene Tissue References