Abstract

   In recent years it has been shown that silent mutations, in and out of
   the coding region, can affect gene expression and may be related to
   tumorigenesis and cancer cell fitness. However, the predictive ability
   of these mutations for cancer type diagnosis and prognosis has not been
   evaluated yet. In the current study, based on the analysis of 9,915
   cancer genomes and approximately three million mutations, we provide a
   comprehensive quantitative evaluation of the predictive power of
   various types of silent and non-silent mutations over cancer
   classification and prognosis. The results indicate that silent-mutation
   models outperform the equivalent null models in classifying all
   examined cancer types and in estimating the probability of survival 10
   years after the initial diagnosis. Additionally, combining both
   non-silent and silent mutations achieved the best classification
   results for 68% of the cancer types and the best survival estimation
   results for up to nine years after the diagnosis. Thus, silent
   mutations hold considerable predictive power over both cancer
   classification and prognosis, most likely due to their effect on gene
   expression. It is highly advised that silent mutations are integrated
   in cancer research in order to unravel the full genomic landscape of
   cancer and its ramifications on cancer fitness.

   Subject terms: Genome informatics, Predictive markers, Systems biology,
   Gene expression

Introduction

   The rapid developments of New Generation Sequencing (NGS) technologies
   and acceleration of computational abilities over the past few years
   have led to the availability of extensive genomic
   information^[30]1–[31]5. Multiple research utilizing these
   high-dimensional data establish cancer as a group of highly
   heterogeneous genomic diseases, characterized by large inter-tumor and
   intra-tumor diversities^[32]6–[33]8. Moreover, common genetic features
   were repeatedly identified among patients of different cancer types and
   significant diversities were found among patients diagnosed with the
   same cancer type^[34]9,[35]10. These findings highlight the need for
   personalized, gene-targeted cancer treatments.

   By now, hundreds of genes had been recognized as cancer drivers^[36]11
   and many more are currently researched. Some, like TP53^[37]12,
   BRAF^[38]13, EGFR^[39]14, or IDH1^[40]15 have already been targeted for
   gene therapy. Nonetheless, there are still numerous obstacles to
   overcome in order to fully unravel the cancer genomic landscape.
   Currently, most contemporary research is based on data derived by Whole
   Exome Sequencing (WES)^[41]2. In addition, most studies focus
   exclusively or predominantly on non-silent mutations; alterations in
   the coding regions that cause a change in the amino-acid sequence of
   the produced protein. Silent mutations, such as modifications in the
   introns, the untranslated regions (UTR’5 and UTR’3), or even synonymous
   mutations in the coding region itself are by and large excluded from
   the analyses^[42]16.

   Yet, cancerous silent mutations could have detrimental effects on gene
   expression^[43]16–[44]19, which in some cases could even lead to
   consequences more significant than non-silent mutations. Mutations in
   regulatory regions, such as promoters or enhancers, can destruct or
   form new transcription-factor binding sites and cause changes in
   transcription regulation^[45]20–[46]23. Mutations in the untranslated
   regions can affect translation regulation or modify microRNA binding
   sites and thus impact mRNA stability^[47]24. Synonymous mutations can
   alter all aspects of gene expression^[48]25, impacting translation
   rates^[49]26,[50]27, protein-folding^[51]28,
   transcription^[52]29–[53]31, mRNA stability^[54]32, and
   splicing^[55]33,[56]34. Overall, silent mutations could modify all
   phases of the gene expression process, causing amplification or
   reduction in protein quantities. Hence, even though most silent
   mutations do not cause a change in protein functionality, they could
   dramatically change protein abundance and could therefore influence
   cancer fitness.

   We believe that including these mutations in cancer research is
   imperative for acquiring a broader understanding of the genomic
   landscape profoundly linked with cancer development and progression.
   Specifically, we believe that silent mutations should be incorporated
   when building predictive models.

   The incredible heterogeneity of cancerous genomes, even for patients
   who presumably possess the same cancer type, highly complicates
   predictive tasks. When examining only non-silent mutations we miss a
   large part of the complex mutational patterns of these cancerous
   genomes; considering the full patterns could improve predictions.
   Additionally, silent driver mutations, even though considered today as
   infrequent compared to non-silent drivers, could be highly
   influential^[57]35 and thus also beneficial for predictive models.
   Indeed, there are previous studies that have demonstrated that silent
   mutations or non-silent mutations that modulate gene expression can
   significantly affect the phenotype of the cancer cell and its
   survival^[58]33,[59]36–[60]42. Additionally, some contemporary studies
   identified silent mutations that are recurrent for specific cancer
   types and are possible drivers of cancer^[61]20,[62]23,[63]35. However,
   to the best of our knowledge, no previous study has performed a broad,
   quantitative comparison between the predictive abilities of various
   mutation types on cancer classification and progression. In this study,
   we explore silent and non-silent mutations, aiming to quantify the
   predictive ability of various types of silent mutations to perform
   cancer diagnosis and to estimate patients’ survival probabilities over
   time, while comparing it to the performance of non-silent mutations.

Results

Data processing and feature engineering

   Genomic and clinical data of 9915 patients across 33 cancer types were
   obtained from The Cancer Genome Atlas (TCGA)^[64]43. Data
   characteristics are described in Fig. [65]1. The genomic data consisted
   of detailed information about the patients’ DNA mutations while the
   clinical data held personal information such as patients’ vital status.
   These data were used to perform two tasks- patients’ cancer type
   classification and survival estimation. The full flow chart of the
   study is depicted in Fig. [66]2.

Fig. 1. TCGA data characteristics.

   [67]Fig. 1
   [68]Open in a new tab

   Description of the data retrieved from TCGA after initial preprocessing
   (discarding patients with missing genomic or clinical data and patients
   with multiple genomic samples). Overall, 9915 patients across 33 cancer
   types are included in the study. a Patient distribution across cancer
   types. b Sorting TCGA mutations to five categories for the study. The
   x-axis depicts the mutation classification according to TCGA*. The
   y-axis depicts the number of mutations in the TCGA mutation categories.
   The legend depicts the five categories to which the mutations are
   sorted for this study. *Note: In TCGA, Synonymous mutations are
   referred to as “Silent”. As the terms are in fact not interchangeable
   (synonymous mutations are a subcategory of silent mutations) we replace
   the term “Silent” with “Synonymous” where needed. c Mutation type
   distribution. The distribution includes all mutations of the 9915
   patients. d Polymorphism type distribution. Mutations could be either
   Single Nucleotide Polymorphisms (SNP), Deletions (DEL) or Insertions
   (INS). The distribution includes all mutations of the 9915 patients, e
   Names and abbreviations of the 33 cancer types.

Fig. 2. The flow chart of the study.

   [69]Fig. 2
   [70]Open in a new tab

   Yellow boxes denote preprocessing steps performed for both tasks. Blue
   boxes denote steps performed for the cancer type classification task
   and green boxes denote steps performed for the survival probability
   estimation task.

   As Fig. [71]2 indicates, the genomic data was split into five
   categories. One category holds all non-silent mutations
   (amino-acid-altering exonic mutations). The other four categories
   consist of silent mutations from different regions within and adjacent
   to the genes; synonymous mutations (exonic mutations that do not
   directly affect the amino acids), mutations in introns, UTRs or
   flanking regions. It is important to note that a genomic position is
   considered mutated for a patient only if its nucleic acid content
   differs between the patient’s cancerous and healthy tissue samples.

   In the next preprocessing step, for each category, the initial data
   were used to create three kinds of features (Fig. [72]3), representing
   different resolutions:
    1. Low-resolution features—indicating the number of mutations each
       patient had in an entire gene.
    2. Medium-resolution features—indicating the number of mutations each
       patient had in a 50-nucleotide-long gene segment.
    3. High-resolution features—binary features indicating whether a
       specific mutation occurred or not, for each patient.

Fig. 3. A simplified illustration of the feature extraction process.

   [73]Fig. 3
   [74]Open in a new tab

   a A representation of the initial genomic information. The X’s denote
   mutations that two patients have in the same gene. The red rectangular
   frames represent the 50-nucleotide-long segments used for the
   medium-resolution features. b An example of the features that would
   have been extracted for the intron dataset and the UTR dataset
   according to the initial information shown in a.

   Analyzing features from multiple resolution levels improves the models’
   results (Fig. [75]4a, Supplementary Table [76]1) and could also
   identify specific mutations, regulatory regions, and entire genes that
   are related to cancer fitness.

Fig. 4. Classification task results.

   [77]Fig. 4
   [78]Open in a new tab

   a The F1 scores achieved in the cancer type classification task when
   using only high-resolution features, high and medium-resolution
   features and all resolutions combined. The scores shown are the average
   F1 scores achieved by all features models across all cancer types. b
   The F1 scores achieved by the OVA models per cancer type, using
   features from all levels of resolution. The x-axis depicts the cancer
   types, the y-axis depicts the F1 scores achieved by the models. Each
   bar color denotes a different dataset. Cancer types for which the all
   features model outperformed the non-silent model are denoted in red.
   See Fig. [79]1e for the unabbreviated names of the cancer types. c
   Feature-type distribution of the all features dataset and of the top
   ranked features chosen in the classification task. Feature-type
   distribution of the all features dataset* (top row), top ranked 100
   features (middle row) and top ranked 10 features (bottom row). The
   feature rankings were obtained from all features models classifying the
   19 cancer types and were averaged across them. The legend (below the
   image) indicates the enrichment in the amount of each feature-type in
   the top 10 features compared to its original amount in the all features
   dataset (ratio between bottom and top row). *Note: The distribution
   depicted in the top row is the distribution of the all features dataset
   after it underwent preprocessing relevant for the classification task.

   The features created for each of the five categories were used as five
   separate datasets (referred to as single-mutation-type datasets). A
   sixth dataset that combines features of all mutation types (referred to
   as all features dataset) was also created. The six datasets were used
   to perform cancer type diagnosis and patient survival estimation.
   Evaluating the performance of models trained on the six datasets
   enables us to compare the predictive ability of features derived from
   silent and non-silent mutations (referred to as silent features and
   non-silent features).

For all cancer types, the silent features improved cancer classification in
comparison to the null model

   In the cancer type classification task, only cancer types with more
   than 200 patients were included (a total of 19 types). A one-vs-all
   (OVA), supervised learning model was created for every pair of cancer
   type and dataset (see Methods). Specifically, each model deployed the
   features in the dataset in order to predict whether patients suffered
   from the specific cancer type (classified as “Positive”) or suffered
   from any of the other types (classified as “Negative”, since the model
   predicts only the existence of the specific cancer). This section
   presents the results of this analysis.

   As mentioned above, combining features from three levels of resolutions
   led to the best performance of cancer type classification. Figure
   [80]4b depicts the F1 scores (see Eq. ([81]1) for the definition of the
   F1 score) obtained by the OVA models by using features from all levels
   of resolutions. The worst performing model, which used flanking-region
   features in order to diagnose Glioblastoma (GBM), was 1.9 folds better
   than the comparable null model (see Methods for details about the null
   models). The best performing model that used silent features was the
   intron model for diagnosing Ovarian Serous Cystadenocarcinoma (OV), and
   its F1 score was 20 folds higher than the comparable null model. Even
   though the non-silent models generally achieved better results than
   silent models, for several cancer types the performances were
   substantially similar. For example, for detection of Breast Invasive
   Carcinoma (BRCA), Liver Hepatocellular Carcinoma (LIHC) and OV the
   performance difference between the non-silent model and the intron
   model was less than 10%. For Sarcoma (SARC) diagnosis, the non-silent
   model outperformed the UTR model by a mere 2%, and the flank model was
   exceeded by only 12%. In addition, the all features models, which used
   both silent and non-silent features, obtained higher F1 scores than the
   non-silent models for 13 out of the 19 cancer types (denoted in red in
   Fig. [82]4b) and for the other cancer types, the performances were very
   similar.

   To control for the number of features, the same analysis was conducted
   using balanced datasets as well (see Methods) and the results, shown in
   Supplementary Figure [83]1, accentuate the high diagnostic ability of
   silent mutations; In the balanced version, the Intron model
   outperformed the non-silent model for six cancer types and the UTR and
   flank models were superior to the non-silent model for two cancer
   types. Quite similarly to the unbalanced datasets, combining silent and
   non-silent mutations rather than solely using the latter improved
   classification results for 12 out of 19 cancer types (keeping in mind
   that the all features dataset had the same number of features as the
   non-silent dataset in this analysis). All these findings support the
   hypothesis that silent mutations do affect cancer mechanisms and hold
   additional predictive information that could not be obtained from
   non-silent mutations alone. Another confounder that could have
   influenced the classification results is the total mutational burden.
   To ensure that the improvement gained from adding silent features to
   non-silent features is not mainly due to the increase in the total
   mutational burden that occurs because of the addition, we examined how
   the increase in total mutational burden is correlated with the
   improvement in the F1 scores of the different cancer types
   (Supplementary Fig. [84]2). Results demonstrate a Pearson correlation
   of R = 0.38 (p = 0.1), indicating that only 14% of the change in the F1
   score could be explained by the increase in mutational burden. So, even
   though the mutational burden does impact the results of classification,
   it is not the leading factor.

   Another interesting phenomenon demonstrated in Fig. [85]4b is the
   considerable differences in the models’ ability to diagnose different
   cancer types. While the majority of the BRCA, LGG (Lower Grade Glioma)
   or COAD (Colon Adenocarcinoma) patients were correctly diagnosed (by at
   least one model), KIRP (Kidney Renal Papillary Cell Carcinoma) and STAD
   (Stomach Adenocarcinoma) patients were often poorly diagnosed. To
   explore the origin of this difference, we examined the similarity
   between genetic profiles of the different cancer types and assessed
   whether cancers with higher genetic similarity have higher
   misclassification rates: For every pair of cancer types, the
   correlation between their Jaccard similarity score and their
   misclassification rate was inspected (see Methods). The results
   (Supplementary Fig. [86]3) indicate a Spearman correlation coefficient
   of 0.72 (p-value <10^−28), suggesting the similarity between genetic
   profiles of patients of different cancers is indeed a major cause for
   misclassifications. However, this is not the only cause as it only
   explains ~52% of the variance in their misclassification rate. Another
   factor that could lead to misclassifications is high mutation
   heterogeneity among patients of the same cancer type.

Silent features comprise 32% of the 10 most predictive features for cancer
classification, on average across cancer types

   Each OVA model provides an importance ranking for all its features.
   Examining the ranking of silent features among all features is another
   way to evaluate their predictive power. Reviewing the feature
   importance ranking produced by the all features models, silent features
   comprised nearly half of the top ranked 100 features and a third of the
   top ranked 10 features (chosen from hundreds of thousands of features),
   when averaged across cancer types (Fig. [87]4c). However, the ranking
   of silent features varied substantially between cancer types
   (Supplementary Tables [88]2,[89]3); while there were only non-silent
   features in the top 10 features of Lung Adenocarcinoma (LUAD), silent
   features constituted eight out of the top 10 features of Cervical
   Squamous Cell Carcinoma (CESC). Altogether, 18 out of the 19 cancer
   types had at least one silent feature in their top 10 features list,
   demonstrating their high significance. The analysis was repeated with
   balanced datasets and the results were similar (Supplementary Fig.
   [90]4).

   When evaluating the influence of the polymorphism type (whether a
   mutation is an insertion, a deletion, or an SNP) on the importance
   ranking, it was seen that the presence of deletions in the highly
   ranked features was notably higher than their presence in the initial
   datasets (Supplementary Figure [91]5). In fact, their prevalence in the
   top 10 features was 2.9–6.8 folds higher than their prevalence in the
   initial datasets (varying between the different models). The presence
   of SNPs and insertions in the highly ranked features was lower than
   their presence in the initial datasets, with the exception of the UTR
   dataset, for which the insertions were 1.3 folds more common in the top
   10 features lists than in the initial datasets, on average across
   cancer types.

A gene’s predictive power for cancer type classification varies drastically
when mutated by different types of mutations

   Table [92]1 lists the 10 most predictive features of three of the 19
   cancer types, as chosen by the all features models (Supplementary Data
   [93]1 holds the full feature importance rankings for classifying all
   cancer types). As seen in Table [94]1, some genes appeared in the top
   10 ranked genes for multiple cancer types. MUC4 was in the top 10 list
   for 16 out of the 19 cancer types and TP53 was on 11 lists, suggesting
   these genes could play an essential role in cancer mechanisms.
   Interestingly, MUC4 was predictive of many cancer types when it had
   either non-silent mutations or synonymous mutations. This last finding
   raises the following fundamental question: is the mutation type a
   determining factor in a gene’s ability to predict a cancer type? Or
   perhaps different kinds of alterations in various regions of the same
   gene would cause a similar loss or gain of function, leading to the
   same outcomes on cancer development?

Table 1.

   Examples of the top 10 ranked features for classifying various cancer
   types.
   Cancer Type Rank Feature Feature Type Importance Gene
   CESC 0 MUC4 Non_Silent Non-Silent 0.16 MUC4
   1 TP53 Non_Silent Non-Silent 0.06 TP53
   2 PABPC1 UTR UTR 0.04 PABPC1
   3 BCR UTR UTR 0.03 BCR
   4 NF1 5602.0 UTR UTR 0.01 NF1
   5 RGPD3 0.0 Flank Flank 0.01 RGPD3
   6 CSF1 UTR UTR 0.01 CSF1
   7 MUC4 Synonymous Synonymous 0.01 MUC4
   8 SRGAP3 UTR UTR 0.01 SRGAP3
   9 CARD11 793.0 Intron Intron 0.01 CARD11
   LIHC 0 MUC4 Non_Silent Non-Silent 0.25 MUC4
   1 SET 210.0 Intron Intron 0.08 SET
   2 PIK3CA Non_Silent Non-Silent 0.03 PIK3CA
   3 ALB Intron Intron 0.02 ALB
   4 240343-240343-chr5-Intron-DEL-T-T– Intron 0.02 SDHA
   5 APC Non_Silent Non-Silent 0.01 APC
   6 FAM46C UTR UTR 0.01 FAM46C
   7 SRGAP3 UTR UTR 0.01 SRGAP3
   8 MUC4 Synonymous Synonymous 0.01 MUC4
   9 SEPT9 3283.0 Intron Intron 0.01 SEPT1
   THCA 0 140753336-140753336-chr7-Missense_Mutation-SNP-A-A-T Non-Silent
   0.18 BRAF
   1 BRAF 378.0 Non_Silent Non-Silent 0.13 BRAF
   2 TP53 Non_Silent Non-Silent 0.07 TP53
   3 MUC4 Non_Silent Non-Silent 0.06 MUC4
   4 NRAS 189.0 Non_Silent Non-Silent 0.02 NRAS
   5 MUC4 Silent Synonymous 0.02 MUC4
   6 533874-533874-chr11-Missense_Mutation-SNP-T-T-C Non-Silent 0.01 HRAS
   7 BRAF Non_Silent Non-Silent 0.01 BRAF
   8 TP53 26.0 Non_Silent Non-Silent 0.01 TP53
   9 LRP1B Intron Intron 0.01 LRP1B
   [95]Open in a new tab

   The top 10 feature rankings for CESC, LIHC, and THCA are shown. For
   each feature, the table holds its name, mutation type, its importance
   for classifying the specific cancer type and the gene to which it is
   related. The rankings were obtained from all features models.

   To try and answer this question, the top 10 features list from every
   single-mutation-type OVA model was examined (all features models were
   excluded from this analysis). For each cancer, a top 10 genes list was
   derived from the top 10 features list (see Methods). Figure [96]5
   depicts a heatmap, presenting the number of top 10 genes lists a gene
   has appeared in (19 meaning the gene appeared in the top 10 genes lists
   of all cancer types, and zero meaning it had appeared in none). As seen
   in Fig. [97]5, the number of appearances a gene has in the top 10 lists
   changes dramatically when it is mutated by mutations of different
   types. For example, the aforementioned MUC4 gene appears in all 19
   lists when it is mutated by non-silent mutations or synonymous
   mutations, but when it is mutated in the UTR, introns or flanks it
   loses its predictive significance and does not appear in any of the
   lists. In fact, it is evident that most genes are highly predictive of
   multiple cancer types only when mutated by a specific mutation type.
   For example, MUC16 is highly predictive of 15 cancer types, but only if
   its mutations are synonymous. Altogether, it is evident that the
   mutation type does influence the predicative power a gene has on cancer
   diagnosis. Nonetheless, it can also be seen that for some genes, such
   as AK2 or KTM2C, more than a single-mutation type leads to high
   predictivity of multiple cancers. So, even though it has been
   established that not all mutations cause the same effect, perhaps some
   lead to more similar consequences than others.

Fig. 5. The number of top 10 ranked genes lists a gene had appeared in when
it was mutated by a specific mutation type.

   [98]Fig. 5
   [99]Open in a new tab

   The figure is constructed of four panels for readability purposes and
   is equivalent to a single long panel. Each row in a panel refers to a
   gene and every column in a panel refers to a mutation type. The results
   depicted in this figure were obtained from the five
   single-mutation-type models. Every gene in TCGA that is ranked in the
   top 10 genes list for at least one cancer type is presented in the
   figure (the figure includes a total of 216 genes). A lighter shade
   indicates that the gene was in the top 10 lists of a few cancer types
   and a darker shade indicates that the gene was in the top 10 lists of
   many cancer types. The minimum value possible is zero (the gene is not
   included in the top 10 genes list of any cancer type for that
   particular model) and the maximum is 19 (the gene is included in the
   top 10 genes lists for all examined cancers for that particular model).

Synonymous, non-silent and intronic mutations affect a gene’s predictive
power on cancer type classification in a positively correlated manner

   To assess whether some mutation types lead to similar consequences,
   every cancer type was separately examined. It was assumed that if two
   different mutation types have similar effects on a gene, then the
   predictive power of that gene for a specific cancer type would be
   similar when mutated by either one of them. Therefore, the gene’s
   importance in both models should be similar as well. Inferring to all
   genes, the gene importance ranking of both models should be correlated.

   For every cancer type, a Spearman correlation was performed between
   every pair of gene ranking lists obtained from the five
   single-mutation-type models (see “Methods”). The correlation
   coefficients were then averaged across all cancer types (Supplementary
   Fig. [100]6 depicts the correlations obtained for each cancer type).
   The results (Fig. [101]6) indicate a significant 0.4 correlation
   between the gene ranking lists of the non-silent and synonymous models,
   a 0.32 correlation between the lists of the non-silent and intron
   models and a correlation of 0.3 between the lists of the synonymous and
   intron models. These three correlations obtained a p-value smaller than
   8.5×10^−9. Correlations between all other pairs of models were neither
   high nor significant. A possible reason for these results is a common
   mechanism shared by the different mutation types. For example, both
   synonymous and non-silent mutations may affect co-translational
   folding, and both synonymous and intronic mutations may influence
   splicing. Thus, it is conceivable that these mutations could have
   similar consequences over the gene’s expression or functionality.

Fig. 6. The average Spearman correlation of every pair of gene ranking lists
of two models.

   Fig. 6
   [102]Open in a new tab

   For every cancer type, the correlation between the gene ranking lists
   of every pair of models was calculated. The average value across cancer
   types is shown. The respective average p-values are denoted in
   parentheses. The colors represent the correlation coefficient. A darker
   color indicates a higher correlation.

Combining both silent and non-silent features enables the detection of Gene
Ontology terms that are not detected by non-silent features alone

   Enrichment analysis was performed in order to examine whether genes
   that were considered important by the models are related to specific
   biological functions and processes. The affiliation of these genes to
   biological pathways could illuminate their contribution to the
   development and progression of the disease. The GOrilla^[103]44,[104]45
   and REVIGO^[105]46 tools were used to find non-redundant Gene Ontology
   terms (GO terms) that are enriched for any of the 19 cancer types. To
   find the terms, a gene ranking list was used as input for the GOrilla
   tool (see Methods). As demonstrated in Figs. [106]5, [107]6, different
   mutation types dramatically change the predictive power of genes and
   thus inputting gene rankings of the different models could illuminate
   different biological pathways.

   Figure [108]7 lists the GO terms that were enriched for the 19 cancer
   types when using the gene rankings from all features models. Examining
   these results, it can be seen that most GO terms that are repeatedly
   enriched across cancer types are related to DNA-protein bindings, to
   protein–protein bindings and to phosphorylation. As expected, these
   terms are associated with various regulation mechanisms of the gene
   expression process, such as transcription (interactions between
   transcription factors and RNA Polymerase, histone phosphorylation) or
   translation (attachment of ribosomes to the DNA sequence).

Fig. 7. GO terms enrichment for the 19 cancer types.

   [109]Fig. 7
   [110]Open in a new tab

   Received by using the gene rankings of all features models. The figure
   is constructed of two panels for readability purposes and is equivalent
   to a single long panel with 113 GO terms. Each row in a panel refers to
   a GO term and every column in a panel refers to a cancer type. Yellow
   positions indicate non-redundant enriched GO terms with a p-value
   smaller than 0.001 and a q-value (FDR correction) smaller than 0.05.
   Blue positions indicate GO terms that are not enriched under these
   requirements.

   As most research today encompasses mainly non-silent mutations, it is
   interesting to test whether the GO terms that were detected with the
   all features gene rankings are also detected with gene rankings
   obtained from non-silent models. Figure [111]8 depicts the number of
   cancer types for which a GO term was found significantly enriched when
   using the gene rankings from both models. It can be seen that most GO
   terms detected by the all features models across various cancer types
   are considerably less detected by the non-silent models. That is to
   say, adding silent features to non-silent features caused the gene
   ranking to encompass a broader biological significance and thus led to
   a more comprehensive detection of GO terms. Nonetheless, widening our
   prism involves a trade-off; 10 GO terms that were found significant by
   the non-silent model were missed by the all features model (in fact,
   eight of them were missed by all other models, making them unique to
   the non-silent model. See Supplementary Data [112]2). Among these terms
   are “endothelial cell migration” which is related to
   angiogenesis^[113]47 (a known cancer hallmark^[114]48), “negative
   regulation of morphogenesis of an epithelium” which is indeed effected
   in carcinoma development^[115]49 and “regulation of canonical Wnt
   signaling pathway” which is known to be profoundly related to cell
   tumorigenesis^[116]50. These terms were found significant only by the
   non-silent model and neither they nor semantically similar terms were
   detected by any other model. Even though the all features model missed
   these 10 terms, it did detect the other 21 terms that were found
   significant by the non-silent model, meaning that the majority of the
   information was preserved. Additionally, it detected 90 other
   significant GO terms that were not detected by the non-silent model.
   These include terms related to histone modifications (“histone
   binding”, “histone methyltransferase activity”, “histone
   acetyltransferase activity”), terms related to phosphorylation
   (“transmembrane receptor protein phosphatase activity”, “transmembrane
   receptor protein kinase activity”) and terms related to the binding of
   nucleic acids (“ATP binding”, “GDP binding”, “GTPase activator
   activity”). These biological functions and processes are known to have
   implications on tumorigenesis in various ways^[117]51–[118]54 and none
   of them (or terms with similar semantic meanings) were detected by the
   non-silent model. We also performed pathway enrichment analysis using
   REACTOME^[119]55 (see Methods) and the results indicate that all
   features highly ranked genes are associated with multiple pathways
   related to the regulation of DNA damage. Pathways such as “Cell cycle
   checkpoints” (and specifically “G1/S DNA Damage Checkpoints”, “G2/M DNA
   damage checkpoint” and “p53-Dependent G1 DNA Damage Response”), “DNA
   double-strand break repair”, “SUMOylation of DNA damage response and
   repair proteins” and “TP53 Regulates Transcription of DNA Repair Genes”
   were enriched. These pathways, or any semantically similar pathways
   were not found enriched in the highly ranked genes of the non-silent
   models and are known to be profoundly related to
   tumorigenesis^[120]56,[121]57. This further demonstrates the
   contribution of silent mutations to tumorigenesis and highlights the
   need to combine them in cancer research.

Fig. 8. The number of cancer types for which a GO term was enriched using
gene rankings from the non-silent models and all features models.

   [122]Fig. 8
   [123]Open in a new tab

   The figure is constructed of two panels for readability purposes and is
   equivalent to a single long panel with 123 GO terms. Each row in a
   panel refers to a GO term and every column in a panel refers to a model
   from which the gene ranking list was used as input for the GOrilla
   tool.

   Examining the single-feature-type silent models (Supplementary Data
   [124]2), we can detect more GO terms that were unique to a specific
   model. For example, the term “poly(A) binding” was found significant
   only by the UTR model. This may suggest that poly(A) binding genes tend
   to undergo regulation and thus also cancer evolution through mutations
   in their 3’UTR which affect regulation via the changes in the poly(A)
   tail. The poly(A) tail is related to mRNA stability and translation
   regulation^[125]58 and alternative polyadenylation processes are known
   to be related to tumorigenesis^[126]59. Another example for a term that
   is unique for a specific model only is “O-glycan processing” which was
   found significant only by the synonymous model. The O-glycans are
   oligosaccharides that are a major component of mucins. The mucins
   function as a protective layer of the epithelium and changes in their
   O-glycans are related to tumorigenesis^[127]60,[128]61.

   The intron model also detected many significant GO terms for the
   various cancer types (80), only three of which (“cell adhesion”,
   “biological adhesion” and “integral component of plasma membrane”) are
   common with the non-silent model. Exactly half of the terms (40) were
   also detected by the all features model. To conclude, there is a
   trade-off in examining gene rankings obtained from single-feature-type
   models and models that combine several feature types. The all features
   model allows for a broader view of biological pathways but also misses
   terms that are highly specific of a certain mutation type. However,
   this analysis strongly indicates that searching for biological
   significance by only analyzing non-silent mutations is insufficient.

   When examining the results depicted in Fig. [129]8, one must consider
   the uneven number of features in both models; all features models have
   almost seven times as many features as the non-silent models. Because
   the gene ranking is derived from the feature ranking it is bound to
   have some effect over the enrichment results. However, it is not the
   only determinant; if the silent features were unimportant for the
   model, adding them (even many of them) would not cause such a
   difference in the enrichment results. As the rank of a gene is derived
   from the rank of its most important feature (see Methods), unimportant
   silent features would have made a small impact on the gene ranking,
   leading to similar gene rankings of all features and non-silent models
   and thus to similar enrichment results. The fact that many more GO
   terms were found enriched by all features models demonstrates once
   again the importance of the silent features and the importance of
   examining the whole picture.

All silent features models outperformed the null model in predicting survival
probabilities for more than 10 years after an initial cancer diagnosis

   The purpose of this analysis was to assess whether the survival
   probabilities of patients could be estimated solely based on their
   silent mutations, and to compare the estimations of the silent features
   models to the estimations of the non-silent and all features models.
   Similarly to the cancer type classification task, no additional
   information, such as patient’s age, sex, race, or treatment history was
   used. In this analysis, patients across all 33 cancer types were
   included and a Random Survival Forest (RSF)^[130]62 algorithm was
   utilized (see Methods). Due to the high computational requirements of
   the algorithm, only a subset of the features was chosen from each of
   the six initial datasets. The models were trained to predict patients’
   survival probability at any time after an initial cancer diagnosis.
   Then, the models were used to estimate the survival probabilities of
   patients at 10 different time points. The estimations were evaluated
   using the Area Under the Curve (AUC)^[131]63 score and the results are
   presented in the following section.

   All the silent features models outperformed the null model for more
   than 10 years after the initial diagnosis (Fig. [132]9a). Additionally,
   the all features model achieved the highest AUC score for more than
   nine years (3500 days) after the diagnosis. This demonstrates that the
   addition of silent features to non-silent features is superior to the
   use of non-silent features alone for survivability prediction.

Fig. 9. Survival estimation results.

   [133]Fig. 9
   [134]Open in a new tab

   a AUC scores achieved by the six RSF models for various times after the
   initial cancer diagnosis. The x-axis depicts the days passed since the
   diagnosis and the y-axis depicts the AUC score achieved by the models.
   Each colored curve denotes a different dataset. The horizontal line
   depicts the AUC score of a null model. b Feature-type distribution of
   the all features dataset and of the top ranked features chosen in the
   survival probability estimation task. Feature-type distribution of the
   all features dataset* (top row), top ranked 100 features (middle row)
   and top ranked 10 features (bottom row). The feature rankings were
   obtained from the all features model. The legend indicates the
   enrichment in the amount of each feature-type in the top 10 features
   compared to its original amount in the all features dataset (ratio
   between bottom and top row). *Note: The distribution depicted in the
   top row is the distribution of the all features dataset after it
   underwent preprocessing relevant for the survival estimation task.

Silent features comprise 30% of the 10 most predictive features for survival
estimation

   Reviewing the feature importance ranking produced by the all features
   model for survival estimation, silent features comprised more than half
   of the top ranked 100 features and a third of the top ranked 10
   features (Fig. [135]9a). Table [136]2 holds the 10 most predictive
   features for survival estimation (and the full feature importance list
   is available at Supplementary Data [137]3). Note that due to technical
   reasons (see “Methods”) all patients are treated as a single cohort for
   the survival estimation (the cancer type of each patient is not
   considered by the model, only the patients’ genomic features and vital
   status at the last examination). If we were to perform a separate
   survival analysis for each cancer type as we did in the classification
   task, it is probable that the number of highly ranked silent mutations
   would vary significantly among the cancer types as seen in the previous
   task (Supplementary Tables [138]2,[139]3). However, the fact that three
   of the 10 features that are most predictive of the survivability of the
   entire cohort are silent (even though thousands of non-silent features
   were available for the model’s usage), is another indicator of the
   strong predictive ability of silent mutations.

Table 2.

   The top 10 ranked features for estimating patients’ survival
   probability.
   Rank Feature Feature type Importance Gene
   0 TP53 Non_Silent Non_Silent 0.0142 TP53
   1 MUC4 Non_Silent Non_Silent 0.0050 MUC4
   2 57466291-57466292-chr12-Frame_Shift_Ins-INS—G Non_Silent 0.0049 GLI1
   3 143147221-143147222-chr5-Intron-INS—C Intron 0.0035 ARHGAP26
   4 140753336-140753336-chr7-Missense_Mutation-SNP-A-A-T Non_Silent
   0.0032 BRAF
   5 NKX2-1 Flank Flank 0.0031 NKX2-1
   6 25743660-25743660-chr2-Missense_Mutation-SNP-T-T-G Non_Silent 0.0027
   ASXL2
   7 EGFR Non_Silent Non_Silent 0.0026 EGFR
   8 92956452-92956452-chr15-Splice_Region-SNP-A-A-T Intron 0.0026 CHD2
   9 35457937-35457938-chr6-Frame_Shift_Ins-INS—C Non_Silent 0.0026 FANCE
   [140]Open in a new tab

   For each feature, the table holds its name, mutation type, its
   importance ranking, the gene to which it is related to and the gene’s
   product description. The ranking was obtained from the all features
   model.

Discussion

   It has been suggested that silent mutations could affect tumorigenesis
   and cancer cell fitness through changes in gene expression
   regulation^[141]33,[142]36–[143]42. However, to the best of our
   knowledge, this study provides the first quantitative assessment of the
   predictive power of silent mutations over cancer classification and
   prognosis in comparison to non-silent mutations.

   The results demonstrate the predictive ability of silent mutations to
   perform both the classification and survival estimation tasks; we
   specifically show that for some cancer types, it is comparable to the
   performances of non-silent mutations. Moreover, combining both
   non-silent and silent mutations achieved the best classification
   results for 68% of the cancer types. When using the same number of
   features, a combination of silent and non-silent features was still
   superior to using only non-silent features for 63% of cancer types.
   Even though the survival estimation was not as comprehensive and
   precise as the classification task (as the patients were treated as a
   single cohort), the same conclusions are drawn from it; all silent
   feature models surpassed the null model for over ten years after an
   initial diagnosis and combining both silent and non-silent features led
   to the best survival estimations for more than 9 years. Additionally,
   silent features were highly ranked in both tasks, surpassing thousands
   of non-silent features. In fact, considering that numerous silent
   mutations (which affect gene expression regulation) were found highly
   predictive by the models and since protein functionality is quite
   robust to point mutations^[144]64, it is probable that some of the
   highly predictive non-silent mutations are such due to their impact on
   gene expression regulation rather than their impact on protein
   functionality. A recent study that has found similarities between the
   recurrency and distribution of synonymous and missense mutations also
   supports this claim^[145]65.

   As shown in Fig. [146]4b, the predictive power of silent mutations
   varies significantly between cancer types. This could suggest that some
   cancers are more affected by changes in genes’ functionality caused
   mostly by non-silent mutations, while others are more affected by
   changes in gene expression levels, caused by both silent and non-silent
   mutations. The importance of different mutation types also varies when
   examining specific genes and pathways; the predictive power of a gene
   changes dramatically when it is mutated by different types of
   mutations. This suggests that a mutation that causes high predictivity
   changes the gene’s functionality or regulation in a way that is optimal
   for the fitness of the cancer.

   Observing the feature rankings obtained by the different models, it can
   be seen that low-resolution features are generally ranked higher than
   high-resolution features (Supplementary Table [147]4), meaning that the
   number of mutations in an entire functional region of a gene was
   usually a better predictor than a single specific mutation. This
   phenomenon is noticed for both silent and non-silent features. A
   comprehensive understanding of the specific effect of all these
   mutations is a topic for future studies. However, here we provide few
   initial clues (see “Methods” for technical details regarding the
   analysis):

   When examining the few silent high-resolution features that were highly
   ranked, we did not find that they significantly impact mRNA expression
   levels, splicing, or have other regulatory effects. However, when
   examining the low-resolution silent features that were highly ranked,
   we found that some contain genomic positions that are assumed to cause
   a disruption of regulation if mutated (Supplementary Table [148]5). For
   example, the amount of intronic mutations in the TP53 gene was the
   second most important feature in the all features model for detection
   of LUSC. We found an SNP mutation in the intronic region, 17: 7673610:
   T -> C, which annuls a splice site; this mutation was not highly ranked
   by itself, possibly due to its infrequency (present in only 0.7% of
   LUSC patients). A recent study showed that possible driver mutations
   could be missed if they are uncommon, even if they have a significant
   effect^[149]35. The TP53 gene is maybe the most known tumor
   suppressor^[150]66 and annulling of one of its splice sites could
   affect tumorigenesis. The number of mutations in the 3′UTR of the
   SRGAP3 gene was the fourth most important feature in the all features
   model for diagnosing SARC. We found two deletions, 3: 8985094–8985095:
   AT and 3: 8985094–8985097: ATAT, that both cause the formation of a new
   miRNA binding site. The first mutation is considerably more common than
   the second (present in 23.1% and 1.2% of SARC patients respectively)
   and was in fact the most important mutation in the entire SRGAP3 gene
   according to the model. The second mutation alone is ranked appreciably
   lower, unsurprisingly given its low prevalence. The SRGAP3 gene was
   also reported as a tumor suppressor gene^[151]67 and an addition of a
   new miRNA binding site could be related to tumorigenesis. The number of
   intronic mutations in the EGFR gene was ranked the fourth most
   important feature by the all features model diagnosing GBM. We found an
   insertion in the intronic region, 7: 55020559–55020560: ACACACAC, which
   causes a small but significant decrease of mRNA expression levels
   (0.7%). This mutation is also uncommon as it is present in only 0.7% of
   GBM patients. The mutations presented above affect different aspects of
   the regulation process of known tumor suppressors (TP53, SRGAP3) and
   oncogenes (EGFR), and could thus influence tumorigenesis. Generally, it
   seems like there could be many uncommon silent mutations with
   regulatory affects that are missed for lack of statistical power. With
   the accumulation of genomic data and improvement in computational
   methods, we expect that more uncommon, silent mutations that affect
   regulation and function will be identified. For the non-silent highly
   ranked features, we also did not find high-resolution features that
   directly affect gene expression regulation. We found only two mutations
   in highly ranked low-resolution features that form and revoke splice
   sites in the KRAS and the IDH1 genes (Supplementary Table [152]6).

   When examining the results of this study, one should keep in mind some
   inherent biases of the data. For example, non-silent mutations are
   naturally about 20 times more frequent than synonymous mutations. Thus,
   even if the effect of a single mutation is similar for both types,
   non-silent mutations are expected to make a larger impact. Another bias
   originates from the source of the data; the genomic data in this study
   is derived using WES, which is highly biased towards exonic mutations.
   WES sequences the genome’s coding regions, ignoring most non-coding
   regions internal and external to genes^[153]68. In fact, an astonishing
   98% of the genome is overlooked when performing WES, resulting in a
   narrow prism, heavily biased in favor of exonic mutations. Great
   efforts are made these days in order to provide data of whole genomes;
   The International Cancer Genome Consortium (ICGC) and The Cancer Genome
   Atlas (TCGA) has collaborated in the creation of the Pan-Cancer
   Analysis of Whole Genomes (PCAWG) and offer the ability to perform
   meta-analyses that includes silent mutations^[154]35,[155]69–[156]74.
   While it currently contains significantly smaller amounts of data and
   therefore a weaker statistical power compared to WES databases, it will
   undoubtably become a significant milestone in deciphering the
   contribution of silent mutations to cancer. An additional source of
   bias in our analyses is the varying quantity of mutations in different
   genes: The importance of a gene for the models is greatly influenced by
   the number of mutations it has in TCGA. Specifically, there is an
   average 0.72 Spearman correlation between the number of mutations that
   genes have in TCGA and the gene rankings obtained for the 19 cancer
   types (Supplementary Figure [157]7). Nonetheless, even though this
   correlation is high and significant, it also indicates that 52% of the
   variation in gene ranking could not be explained by the amount of
   mutations per gene in TCGA. In fact, some genes, such as HRAS, YOD1,
   VHL, and CEBPA, were among the most important genes for several cancer
   types even though their number of mutations in TCGA is very small
   compared to other genes (ranging from the 4^th to 16^th percentile). We
   expect that without these biases the significance of silent mutations
   in cancer diagnosis and survival prediction will be even higher than
   the results reported here.

   Finally, this study provides a broad, statistical analysis of the
   predictive abilities of silent and non-silent mutations of various
   kinds. The results suggest that models based on silent mutations could
   be very useful in practice. For example, for analyzing liquid biopsy
   samples^[158]75,[159]76 in order to perform cancer diagnosis or track
   cancer prognosis. Nevertheless, extensive work is required in order to
   expand and deepen our understanding of silent mutations and their
   ramifications on cancer development. For example, specific silent
   mutations that were chosen predictive by the models should be further
   investigated in order to ascertain which regulatory regions and
   mechanisms they impact. Novel databases containing information of
   silent mutations such as PCAWG and SynMICdb^[160]65 should be used to
   validate the conclusions of this study. Driver silent mutations should
   be distinguished from passenger silent mutations by assessing their
   impact on protein expression and estimating their time of occurrence.
   Classification should be performed on both healthy individuals and
   cancer patients to understand the full diagnostic ability of silent
   mutations. Classification should also be performed using genomic
   information obtained from blood samples to see whether the diagnostic
   ability is similar under these circumstances. Once sufficient amounts
   of data are available, the survival analysis should be performed again,
   separately for each cancer type. This is expected to improve the
   survival estimations and to provide greater comprehension of the silent
   and non-silent mutations that affect survivability. Finally, it will
   make sense to validate some of the mutations experimentally. All these
   research suggestions form the tip of the iceberg in an understudied
   field, full of clinical potential that is yet to be revealed.

Methods

Data extraction

   The genomic and clinical data of patients across 33 cancer types were
   obtained from The Cancer Genome Atlas (TCGA)^[161]43. Patients with
   multiple genomic samples and patients with no genomic samples or
   clinical records were excluded, leaving a total of 9915 patients. The
   genomic data consists of the patients’ mutation information. A genomic
   position is considered mutated for a patient only if its nucleic acid
   content differs between the patient’s cancerous and healthy tissue
   samples.

Feature engineering

   Five categories of mutations were established:
    1. Non-silent mutations (coding sequence mutations that cause a change
       in the protein’s amino-acid sequence).
    2. Synonymous mutations (coding sequence mutations that do not cause a
       direct change in the protein’s amino-acid sequence).
    3. Intronic mutations.
    4. UTR mutations.
    5. Flank mutations.

   For each category, the genomic data obtained from TCGA was used to
   create three kinds of features, representing three levels of resolution
   (Fig. [162]3): low-resolution features, medium-resolution features, and
   high-resolution features. Low-resolution features count the number of
   mutations that appear in an entire gene. Medium-resolution features
   count the number of mutations that appear in a specific segment of a
   gene. Each gene is assembled from the 5′UTR, introns, exons and the
   3′UTR. The flanking regions are adjacent to the gene from both ends. A
   gene is split to 50-nucleotide long segments and the medium-resolution
   features count the number of mutations in each segment. Two additional
   features count the number of mutations in the 5′ flanking regions
   (upstream to the gene) and in the 3′ flanking region (downstream to the
   gene). High-resolution features indicate whether a specific mutation
   occurred in a specific location in the gene (For example, an A to G SNP
   would be considered a different mutation than an A to C SNP, even if it
   had occurred in the same position). If the specific mutation occurred
   only for a single patient in the TCGA database, its respective feature
   was discarded. The features of each category were used as a separate
   dataset and they were also combined in order to create the sixth
   dataset- the all features dataset.

One vs. all classifiers

   One vs. all classifiers were chosen to perform the classification task.
   As our aim was to conduct a broad, quantitative comparison between
   various types of mutations, we chose a classic, robust, measurable, and
   interpretable supervised model, to lay the grounds for a fair
   comparison. Choosing multiple OVA classifiers, as opposed to a single
   multiclass classifier, enables us to easily explore which features are
   more closely related to which cancer type. Additionally, OVA
   classifiers are expected to perform better than a single multiclass
   classifier (as predicting a positive or negative verdict for a single
   cancer type is an easier task than predicting one cancer type out of 19
   possibilities). Thus, if a doctor already suspects a certain cancer
   type, the suspicions could be validated by the relevant model with
   greater certainty.

   To ensure enough training examples, only cancer types with more than
   200 patients were included in the analysis, resulting in 8,364 patients
   spanning 19 cancer types. 114 OVA classifiers were generated and
   trained, one for each possible combination of cancer type (19) and
   dataset (6). The objective of each classifier was to distinguish a
   single cancer type from the rest. Specifically, predicting a “Positive”
   or “Negative” label for a particular cancer type. The OVA classifiers
   were constructed using the LightGBM^[163]77 python package. For each
   classifier, the patients were randomly split into stratified training
   and testing sets (0.7/0.3 respectively) for 10 times. A null classifier
   was also generated using scikit-learn’s Dummy Classifier^[164]78 for
   each cancer type; the null classifier randomly assigned labels to the
   test-set patients, only considering the label distribution of the
   training-set patients. The classifiers’ performance was evaluated with
   Accuracy, Recall, Precision, and F1 scores (Fig. [165]4b, Supplementary
   Table [166]7). Performances were averaged across the 10 splits.
   Precision is the fraction of correctly identified positive patients out
   of all patients that were identified as positive by the model. The
   recall is the fraction of correctly identified positive patients out of
   all the patients that are truly positive for the disease. The F1 score
   is a harmonic mean of precision and recall, taking both measures into
   account:
   [MATH: <mrow><mrow><mrow><mrow><mi
   mathvariant="normal">F</mi></mrow></mrow></mrow></mrow><mn>1</mn><mo>=<
   /mo><mn>2</mn><mo>∗</mo><mfrac><mrow><mi>P</mi><mo>∗</mo><mi>R</mi></mr
   ow><mrow><mi>P</mi><mo>+</mo><mi>R</mi></mrow></mfrac> :MATH]
   1

   where P is Precision and R is Recall. The F1 score ranges from zero to
   one, one indicating perfect Precision and Recall scores and zero
   indicating that either the Precision or Recall are also zero.

Gene ranking

   Each classifier provides a feature ranking. First, features with zero
   importance were discarded. Then, a gene ranking was obtained by
   assigning the features (that can be mutations, segments, or entire
   genes) to the gene they are related to while keeping the original
   order. Finally, only the highest rank of each gene was kept. The most
   important gene is ranked “0” and as the numbers increase the importance
   decreases.

Spearman correlation between gene rankings

   Spearman correlations were conducted between gene rankings of pairs of
   classifiers detecting the same cancer type (Fig. [167]6). For every
   cancer type:
    1. The all features classifier was excluded.
    2. For each of the single-mutation-type classifiers, a gene ranking
       list was created as described above.
    3. Every combination of two classifiers was examined; genes that were
       not in the intersection of both gene ranking lists were discarded.
       Spearman correlation was calculated between the revised gene
       ranking lists.

   The results were averaged across the 19 cancer types.

Gene Ontology enrichment

   Enriched GO terms (molecular functions, biological processes and
   cellular components) were detected for the 19 cancer types using the
   gene rankings obtained from the different models. For every combination
   of cancer type and model:
    1. The gene ranking list was created as described above.
    2. The gene ranking list was used as input to the GOrilla
       tool^[168]44,[169]45. The tool used maximum Hyper Geometric (mHG)
       statistics in order to report GO terms that are enriched in the top
       of the list compared to the rest of the list. The threshold for
       splitting the genes list to “top” and “rest” is dynamic and was
       chosen for each GO term individually by the tool.
    3. The yielded terms are enriched with a p-value smaller than 0.001
       and have passed an FDR correction of 0.05.
    4. The yielded terms were used as input to the REVIGO^[170]46 tool,
       which removed terms with a semantic similarity score higher than
       0.7. The similarity measure used was “SimRel”.

   The enriched GO terms detected for the 19 cancer types when using the
   all features gene ranking are detailed in Fig. [171]7. A comparison
   between the GO terms that are detected when using the all features gene
   ranking or the non-silent gene ranking is seen in Fig. [172]8.

Pathway enrichment

   Enriched pathways were detected for the 19 cancer types using the gene
   rankings obtained from the different models. For every combination of
   cancer type and model:
    1. The gene ranking list was created as described above.
    2. The highest ranked 50 genes in the list were used as input to the
       REACTOME pathway enrichment analysis tool^[173]55. The number of
       genes was chosen considering both statistical power and the total
       length of the gene list.
    3. The REACTOME yielded enriched pathways. An enriched pathway is a
       pathway for which the number of genes in the provided list that is
       associated to it is larger than expected by chance, considering
       both the total amount of genes known to be associated with the
       pathway and the number of gene in our list. The yielded pathways
       obtained an FDR value that is smaller than 0.01.

Mutational burden

   The analysis presented in Supplementary Figure [174]2 was conducted to
   evaluate whether the improvement in classification that was gained from
   adding silent features to non-silent features was obtained because of
   the additional mutational burden. For each cancer type:
    1. The percent of improvement gained from adding silent features was
       calculated as shown in Eq. ([175]2):

   [MATH:
   <mrow><mi>F</mi><msub><mrow><mn>1</mn></mrow><mrow><mi>i</mi><mi>m</mi>
   <mi>p</mi><mi>r</mi><mi>o</mi><mi>v</mi><mi>e</mi><mi>m</mi><mi>e</mi><
   mi>n</mi><mi>t</mi></mrow></msub><mo>=</mo><mfrac><mrow><mi>F</mi><msub
   ><mrow><mn>1</mn></mrow><mrow><mi>a</mi><mi>l</mi><mi>l</mi><mo>−</mo><
   mi>f</mi><mi>e</mi><mi>a</mi><mi>t</mi><mi>u</mi><mi>r</mi><mi>e</mi><m
   i>s</mi></mrow></msub><mo>−</mo><mi>F</mi><msub><mrow><mn>1</mn></mrow>
   <mrow><mi>n</mi><mi>o</mi><mi>n</mi><mo>−</mo><mi>s</mi><mi>i</mi><mi>l
   </mi><mi>e</mi><mi>n</mi><mi>t</mi></mrow></msub></mrow><mrow><mi>F</mi
   ><msub><mrow><mn>1</mn></mrow><mrow><mi>n</mi><mi>o</mi><mi>n</mi><mo>−
   </mo><mi>s</mi><mi>i</mi><mi>l</mi><mi>e</mi><mi>n</mi><mi>t</mi></mrow
   ></msub></mrow></mfrac><mo>*</mo><mn>100</mn></mrow> :MATH]
       2
       where
       [MATH:
       <mrow><mi>F</mi><msub><mrow><mn>1</mn></mrow><mrow><mi>a</mi><mi>l<
       /mi><mi>l</mi><mo>−</mo><mi>f</mi><mi>e</mi><mi>a</mi><mi>t</mi><mi
       >u</mi><mi>r</mi><mi>e</mi><mi>s</mi></mrow></msub></mrow> :MATH]
       is the F1 score of the all features model of the current cancer
       type.
    2. The percent of mutational burden gained from adding silent features
       (an average across patients) was calculated as shown in Eq.
       ([176]3):

   [MATH:
   <mrow><mi>M</mi><msub><mrow><mi>B</mi></mrow><mrow><mi>i</mi><mi>n</mi>
   <mi>c</mi><mi>r</mi><mi>e</mi><mi>a</mi><mi>s</mi><mi>e</mi></mrow></ms
   ub><mo>=</mo><mfrac><mrow><msubsup><mrow><mo>∑</mo></mrow><mrow><mi>i</
   mi><mo>=</mo><mn>1</mn></mrow><mrow><mi>n</mi></mrow></msubsup><mfenced
   close=")"
   open="("><mrow><mfrac><mrow><mi>M</mi><msub><mrow><mi>B</mi></mrow><mro
   w><mi>i</mi><mo>,</mo><mi>a</mi><mi>l</mi><mi>l</mi><mo>−</mo><mi>f</mi
   ><mi>e</mi><mi>a</mi><mi>t</mi><mi>u</mi><mi>r</mi><mi>e</mi><mi>s</mi>
   </mrow></msub><mo>−</mo><mi>M</mi><msub><mrow><mi>B</mi></mrow><mrow><m
   i>i</mi><mo>,</mo><mi>n</mi><mi>o</mi><mi>n</mi><mo>−</mo><mi>s</mi><mi
   >i</mi><mi>l</mi><mi>e</mi><mi>n</mi><mi>t</mi></mrow></msub></mrow><mr
   ow><mi>M</mi><msub><mrow><mi>B</mi></mrow><mrow><mi>i</mi><mo>,</mo><mi
   >n</mi><mi>o</mi><mi>n</mi><mo>−</mo><mi>s</mi><mi>i</mi><mi>l</mi><mi>
   e</mi><mi>n</mi><mi>t</mi></mrow></msub></mrow></mfrac><mo>*</mo><mn>10
   0</mn></mrow></mfenced></mrow><mrow><mi>n</mi></mrow></mfrac></mrow>
   :MATH]
   3

   Where
   [MATH:
   <mrow><mi>M</mi><msub><mrow><mi>B</mi></mrow><mrow><mi>i</mi><mo>,</mo>
   <mi>a</mi><mi>l</mi><mi>l</mi><mo>−</mo><mi>f</mi><mi>e</mi><mi>a</mi><
   mi>t</mi><mi>u</mi><mi>r</mi><mi>e</mi><mi>s</mi></mrow></msub></mrow>
   :MATH]
   is the mutational burden (number of mutations) that the
   [MATH:
   <msup><mrow><mi>i</mi></mrow><mrow><mi>t</mi><mi>h</mi></mrow></msup>
   :MATH]
   patient in the all features dataset has and n is the number of patients
   of the current cancer type.

   We then examined the correlation between
   [MATH:
   <mrow><mi>F</mi><msub><mrow><mn>1</mn></mrow><mrow><mi>i</mi><mi>m</mi>
   <mi>p</mi><mi>r</mi><mi>o</mi><mi>v</mi><mi>e</mi><mi>m</mi><mi>e</mi><
   mi>n</mi><mi>t</mi></mrow></msub></mrow> :MATH]
   and
   [MATH:
   <mrow><mi>M</mi><msub><mrow><mi>B</mi></mrow><mrow><mi>i</mi><mi>n</mi>
   <mi>c</mi><mi>r</mi><mi>e</mi><mi>a</mi><mi>s</mi><mi>e</mi></mrow></ms
   ub></mrow> :MATH]
   among the cancer types.

Spearman correlation between Jaccard similarity scores and misclassification
rates

   A Spearman correlation was conducted in order to evaluate the influence
   of genetic profile similarity on misclassification rates among pairs of
   cancer types. For this analysis binary versions of the features were
   used, meaning that rather than indicating how many mutations occur in
   genes and segments the features indicate whether any mutations had
   occurred or not (high-resolution features were originally binary and
   thus do not change). Calculating the Jaccard similarity scores for
   every pair of cancer types was performed in the following manner:
    1. 100 patients were randomly selected from each type, forming two
       equally sized groups of patients (groups A and B).
    2. A Jaccard score was calculated for every patient in the group A
       with every patient in group B. The average score was considered the
       Jaccard score between the groups. The calculation was performed as
       shown in Eq. ([177]4):

   [MATH:
   <mrow><msub><mrow><mi>J</mi></mrow><mrow><mi>A</mi><mo>,</mo><mi>B</mi>
   </mrow></msub><mo>=</mo><mfrac><mrow><msubsup><mrow><mo>∑</mo></mrow><m
   row><mi>a</mi><mo>=</mo><mn>1</mn></mrow><mrow><mn>100</mn></mrow></msu
   bsup><msubsup><mrow><mo>∑</mo></mrow><mrow><mi>b</mi><mo>=</mo><mn>1</m
   n></mrow><mrow><mn>100</mn></mrow></msubsup><mfrac><mrow><mfenced
   close="∣"
   open="∣"><mrow><msub><mrow><mi>F</mi></mrow><mrow><mi>a</mi></mrow></ms
   ub><mspace
   width="0.25em"></mspace><msubsup><mrow><mo>∩</mo></mrow><mrow><mi>b</mi
   ></mrow><mrow><mi>F</mi></mrow></msubsup></mrow></mfenced></mrow><mrow>
   <mfenced close="∣"
   open="∣"><mrow><msub><mrow><mi>F</mi></mrow><mrow><mi>a</mi></mrow></ms
   ub></mrow></mfenced><mspace width="0.25em"></mspace><mo>+</mo><mspace
   width="0.25em"></mspace><mfenced close="∣"
   open="∣"><mrow><msub><mrow><mi>F</mi></mrow><mrow><mi>b</mi></mrow></ms
   ub></mrow></mfenced><mspace width="0.25em"></mspace><mo>−</mo><mspace
   width="0.25em"></mspace><mfenced close="∣"
   open="∣"><mrow><msub><mrow><mi>F</mi></mrow><mrow><mi>a</mi></mrow></ms
   ub><mspace
   width="0.25em"></mspace><msubsup><mrow><mo>∩</mo></mrow><mrow><mi>b</mi
   ></mrow><mrow><mi>F</mi></mrow></msubsup></mrow></mfenced></mrow></mfra
   c></mrow><mrow><mn>100</mn><mo>*</mo><mn>100</mn></mrow></mfrac></mrow>
   :MATH]
       4
       Where F[a] is the binary feature set of patient a from group A and
       F[b] is the binary feature set of patient b from group B.
       [MATH: <mfenced close="∣"
       open="∣"><mrow><msub><mrow><mi>F</mi></mrow><mrow><mi>a</mi></mrow>
       </msub></mrow></mfenced> :MATH]
       is the number of features equal to “1” for patient a from group A
       (indicating all positions, segments and entire genes that were
       mutated).
       [MATH:
       <msub><mrow><mi>J</mi></mrow><mrow><mi>A</mi><mo>,</mo><mi>B</mi></
       mrow></msub> :MATH]
       is the average Jaccard similarity score between group A and group
       B.
    3. The random sampling process was repeated 5 times. The final Jaccard
       score for a pair of cancer types was the average of the five
       repetitions.

   Calculating the mistake rate for every pair of cancer types was
   performed in the following manner:
    1. 250 patients were randomly selected from each type (groups A and
       B).
    2. The patients were stratified split to train and test sets (the
       training-set contained 70% of patients from each cancer types).
    3. An OVA model was fit on the training-set patients.
    4. The model was used to classify the test-set patients to one of the
       two cancer types.
    5. The misclassification rate between the groups was calculated as
       shown in Eq. ([178]5):

   [MATH:
   <mrow><msub><mrow><mi>M</mi></mrow><mrow><mi>A</mi><mo>,</mo><mi>B</mi>
   </mrow></msub><mo>=</mo><mfrac><mrow><mfenced close="∣"
   open="∣"><mrow><mi>A</mi><mi>B</mi></mrow></mfenced><mo>+</mo><mfenced
   close="∣"
   open="∣"><mrow><mi>B</mi><mi>A</mi></mrow></mfenced></mrow><mrow><mfenc
   ed close="∣"
   open="∣"><mrow><mi>A</mi><mi>A</mi></mrow></mfenced><mo>+</mo><mfenced
   close="∣"
   open="∣"><mrow><mi>B</mi><mi>B</mi></mrow></mfenced><mo>+</mo><mfenced
   close="∣"
   open="∣"><mrow><mi>A</mi><mi>B</mi></mrow></mfenced><mo>+</mo><mfenced
   close="∣"
   open="∣"><mrow><mi>B</mi><mi>A</mi></mrow></mfenced></mrow></mfrac></mr
   ow> :MATH]
       5
       Where
       [MATH: <mfenced close="∣"
       open="∣"><mrow><mi>A</mi><mi>B</mi></mrow></mfenced> :MATH]
       is the number of group-A-patients that were classified as
       group-B-patients. M[A,B] is the misclassification rate between
       groups A and B.
    6. The random sampling process was repeated 10 times. The
       misclassification rate between the pair of cancer types was the
       average of the 10 repetitions.

Balanced datasets

   To evaluate whether the results are significantly influenced by the
   imbalance between the mutation categories, balanced datasets were
   created for the two analyses depicted in Fig. [179]4b and c. To
   maintain the balance, only high-resolution features were used in these
   datasets. Six same-size datasets were needed for the balanced version
   of Fig. [180]4b. For every cancer type:
    1. The patients were split to two equally sized groups. The first for
       feature selection and creation of the balanced datasets and the
       second for training models on the balanced datasets and evaluating
       the results.
    2. For creating the balanced datasets six OVA models (one per dataset)
       were trained using the first group of patients and all their
       features were ranked. For every model, the highest ranked 8,296
       features were chosen as the new dataset. This step resulted in six
       balanced datasets per cancer type, each containing 8,296 features.
       (The number of features was derived from the number of features in
       the smallest category, the flanking region mutations).
    3. The six OVA models (one per dataset) were trained using the second
       group of patients and the balanced datasets. The models were
       trained for 10 rounds, whereby on each round a stratified random
       0.7/0.3 split was performed. The performance was evaluated using
       the same measures as the imbalanced version of this analysis.

   For the balanced version of Fig. [181]4c an all features dataset with
   an internal balance between mutation types was needed. For every cancer
   type, the 8,296 features that were chosen from each of the five
   mutation categories were combined in order to create the internally
   balanced all features dataset. Then, an OVA model was trained using the
   balanced dataset and the second group of patients. The model was
   trained for 10 rounds, whereby on each round a stratified random
   0.7/0.3 split was performed. The mutation-types distribution among the
   top 10 and top 100 features chosen by the classifiers were averaged
   across cancer types.

Random survival forest models

   A random survival forest model is an adaptation of the random forest
   model, modified to perform survival estimations^[182]62. Its
   performance is comparable and sometimes better than classic survival
   models such as Cox regression^[183]79–[184]82. The RSF is a
   non-parametric data-driven approach that is independent of model
   assumptions. It was chosen for our survival estimation task because it
   is known to perform well specifically with high-dimensional datasets,
   compared to traditional approaches (for example, Cox regression relies
   on several assumptions that are usually violated in high-dimensional
   datasets)^[185]83.

   Patients spanning all 33 cancer types were included in this analysis
   (as this is not a classification task and there was no need to remove
   small cohorts). Patients with no available information after the date
   of diagnosis and patients who passed away less than 20 days after their
   diagnosis were not included. Overall, 9,551 patients were incorporated
   in the analysis. The patients are treated as a single cohort and the
   model is oblivious of their cancer type. Unlike the classification
   task, this analysis is not performed separately for each cancer type
   because it requires more data (e.g. while the OVA model that diagnose
   BRCA trains on both BRCA-positive and BRCA-negative patients, the RSF
   model that estimates the survival of BRCA patients only trains on
   BRCA-positive patients while aiming at estimating an entire survival
   curve, and thus has a much smaller patient cohort to train on). The
   vital status (alive or deceased) and appropriate time stamp were
   extracted from the clinical data and used as labels. A subset of
   features was chosen for each mutation category- all low-resolution
   features and 5,000 high-resolution features. The high-resolution
   features were selected based on mutation prevalence in TCGA; the
   features corresponding to the 5,000 most prevalent mutations were
   selected.

   A model was generated and trained for each one of the six datasets
   (non-silent, UTR, intron, synonymous, flank and all features). The
   objective of a model was to predict the probability of a patient to
   survive on a given time after its initial cancer diagnosis. The models
   were constructed using the Pysurvival^[186]84 Python package. 60 trees
   were grown with a maximal depth of 32 splits. At each split,
   Kaplan–Meier estimators and the log-rank test were used to find the
   feature that is the best separator. For each model, the patients were
   randomly split into training and testing sets (0.7/0.3 respectively).
   The model was trained using the training-set patients and then tested
   on the patients of the test set, which the model has never encountered
   before. To avoid biases introduced by a specific split, the process was
   repeated five times and the survival probability estimation is the
   average of the 5 repetitions.

   The models’ performances on the test set patients were evaluated using
   the Area Under the Curve (AUC) score for various times (100, 500, 1000,
   1500, 2000, 2500, 3000, 3500, 4000, and 4500 days) after the initial
   cancer diagnosis. After 4500 days the data is scarce, as most patients
   have stopped attending follow-ups or have passed away. Thus, the
   analysis was terminated at this point.

Predicting the regulatory effects of highly ranked features

   Predictive models were used to assess the influence of mutations
   spanned by the top ten ranked features of each cancer type (whether
   they are of low, medium or high resolution) on splice sites (using
   SpliceAI^[187]85), miRNA binding sites (using cnnMirTarget^[188]86),
   mRNA expression levels (using Xpresso^[189]87), polyadenylation (using
   SANPolyA^[190]88), 3D folding (using Akita^[191]89) and several
   protein-mRNA binding sites (using DeepCLIP^[192]90).

Approval for study of human subjects

   The need for Institutional Review Board Approval at our institution
   (Tel Aviv University) was waived for this study as all data used for
   this project had previously been generated as part of The Cancer Genome
   Atlas Project and none of the results reported in this manuscript can
   be used to identify individual patients.

Reporting summary

   Further information on research design is available in the [193]Nature
   Research Reporting Summary linked to this article.

Supplementary information

   [194]Supplementary Information^ (14.3MB, pdf)
   [195]Supplementary Data 1^ (10.9MB, xlsx)
   [196]Supplementary Data 2^ (64.6KB, xlsx)
   [197]Supplementary Data 3^ (209.9KB, xlsx)
   [198]Reporting Summary^ (68KB, pdf)

Acknowledgements