Abstract

Background

   Cancer is characterized by the accumulation of large numbers of genetic
   variations and alterations of multiple biological phenomena. Cancer
   genomics has largely focused on the identification of such genetic
   alterations and the genes containing them, known as ‘cancer genes’.
   However, the non-functional somatic variations out-number functional
   variations and remain as a major challenge. Recurrent somatic
   variations are thought to be cancer drivers but they are present in
   only a small fraction of patients.

Methods

   We performed an extensive analysis of amino acid substitutions (AASs)
   from 6,861 cancer samples (whole genome or exome sequences) classified
   into 30 cancer types and performed pathway enrichment analysis. We also
   studied the overlap between the cancers based on proteins containing
   harmful AASs and pathways affected by them.

Results

   We found that only a fraction of AASs (39.88 %) are harmful even in
   known cancer genes. In addition, we found that proteins containing
   harmful AASs in cancers are often centrally located in protein
   interaction networks. Based on the proteins containing harmful AASs, we
   identified significantly affected pathways in 28 cancer types and
   indicate that proteins containing harmful AASs can affect pathways
   despite the frequency of AASs in them. Our cross-cancer overlap
   analysis showed that it would be more beneficial to identify affected
   pathways in cancers rather than individual genes and variations.

Conclusion

   Pathways affected by harmful AASs reveal key processes involved in
   cancer development. Our approach filters out the putative benign AASs
   thus reducing the list of cancer variations allowing reliable
   identification of affected pathways. The pathways identified in
   individual cancer and overlap between cancer types open avenues for
   further experimental research and for developing targeted therapies and
   interventions.

Electronic supplementary material

   The online version of this article (doi:10.1186/s12920-015-0125-x)
   contains supplementary material, which is available to authorized
   users.

   Keywords: Cancer genomes, Somatic mutations, Cancer pathways, Cancer
   relationship

Background

   Cancer is characterized by the accumulation of large numbers of genetic
   variations and alterations of multiple biological phenomena [[27]1,
   [28]2]. These alterations contribute directly or indirectly to
   increased ratio of cell birth to cell death [[29]3]. During recent
   years, cancer genomics has largely focused on the identification of
   such genetic alterations and the genes containing them, known as
   ‘cancer genes’. Variations that confer growth advantage and are
   positively selected during cancer development are known as drivers and
   other variations carried along during cancer progression are called for
   passengers [[30]4]. Recurrent somatic variations are thought to be
   drivers but they are present in only a small fraction of patients. On
   the other hand, previous studies showed that less frequent variations
   can have similar effects as recurrent variations [[31]5, [32]6].

   Large amounts of cancer genomic data are available by joint efforts of
   various genomic projects. These include the Cancer Genome Project (CGP)
   ([33]https://www.sanger.ac.uk/research/projects/cancergenome/), The
   Cancer Genome Atlas (TCGA) ([34]http://cancergenome.nih.gov/) and
   International Cancer Genome Consortium (ICGC) [[35]7]. Massive datasets
   provide unprecedented possibilities for data analysis. Various
   approaches have already been taken to understand the mechanisms of
   tumorigenesis [[36]8]. However, the vast majority of non-functional
   somatic variations remain the major challenge [[37]9].

   Here, we exploited the impacts of somatic amino acid substitutions
   (AASs) to prioritize relevant variations in cancers and identified
   pathways affected by them. We utilized PON-P2 [[38]10], a machine
   learning-based tool to identify harmful AASs. It classifies the AASs
   into three categories: pathogenic, neutral and unknown. Those AASs that
   are predicted with confidence level 0.95 are classified either as
   pathogenic or neutral and the remaining as unknown. Here, all AASs that
   were classified as pathogenic by PON-P2 were considered to be likely
   harmful. As cancer is a multigenic disease, single variants cannot be
   called pathogenic and therefore, we name them as ‘harmful’. PON-P2 does
   not predict mechanisms of AASs, instead it identifies deviations from
   normal amino acids in the sequence positions. This means that harmful
   AASs can be of either loss or gain of function type.

   In our analysis, we found that only a small fraction of AASs are
   harmful even in known cancer proteins. Proteins containing harmful AASs
   in cancer are often centrally located in protein interaction networks
   and they affect key pathways. Even proteins with low AAS frequency can
   affect key cancer pathways. We performed cross-cancer comparison based
   on prioritized proteins and affected pathways. Our analysis showed that
   cancers have higher similarities at pathway level than at protein
   level. Hence, it would be more beneficial to identify affected pathways
   in cancers than individual genes/proteins and variations.

Results

   We obtained 5,023,574 somatic variations in 7,042 cancer samples in 30
   cancer types [[39]11]. We mapped the variations to human reference
   sequence and identified 824,336 single nucleotide variations (SNVs)
   leading to AASs in altogether 6,861 samples (Fig. [40]1). The numbers
   of variations leading to synonymous alterations (308,896) and
   introducing stop codons in mRNA (63,866) are much smaller compared to
   the number of AASs. The ratio of non-synonymous to synonymous mRNA
   variations varies among cancer types. It ranges from 1.8 in melanoma to
   6.7 in lung small cell cancer while the overall ratio is 2.7. Even
   minor genetic changes can provide advantage for cancer cells. However,
   as practically all cancers contain harmful AASs, it is highly relevant
   to study proteins containing them and their biological processes. We
   utilized PON-P2 ([41]http://structure.bmc.lu.se/PON-P2/) [[42]10], a
   highly reliable tool, for identification of harmful AASs. Here, all
   AASs classified as pathogenic by PON-P2 are considered to be harmful.
   In total, 14.24 % of AASs were predicted as harmful in 91.88 % of the
   samples. AASs are common in cancers except in pilocytic astrocytoma and
   liver cancer, which contained AASs in only 32.67 % and 55.68 % of
   samples, respectively (Additional file [43]1: Table S1). The
   frequencies of AASs vary between and within the cancer classes
   (Additional file [44]2: Figure S1). Several factors including the age
   of patient at the time of sequencing, exposure to mutagens,
   microsatellite instability, etc. contribute to the frequency of
   variations.

Fig. 1.

   Fig. 1
   [45]Open in a new tab

   Variations in cancer. Venn diagram illustrates the numbers of samples,
   genes and variations at different levels during data filtration. The
   figures in the brackets are numbers of samples and numbers of genes,
   respectively. The values outside brackets are numbers of variations

Proportions of harmful variations are higher in cancer genes

   Cancer Gene Census (CGC) [[46]12] catalogues 138 genes in which somatic
   variations leading to AASs are causally implicated in cancer. Only
   56.46 % of the samples contained AASs in proteins translated from
   altogether 130 CGC genes (Additional file [47]1: Table S1) and 36.51 %
   of the samples contained harmful variations in totally 118 CGC genes
   (Fig. [48]1). The proportion of harmful variations is higher in CGC
   genes (39.88 %) compared to the whole dataset (14.24 %). In total,
   4.76 % of the harmful variations were present in CGC genes.

   Catalogue of Somatic Mutations in Cancer (COSMIC) [[49]13] release 68
   contains 1,646,844 (1,293,087 unique) somatic variations. Using PON-P2,
   14.71 % of the unique variations leading to AASs were predicted to be
   harmful in 9,140 genes in 43.89 % of the samples (available at
   [50]http://structure.bmc.lu.se/PON-P2/cancer30.html/). In total,
   96.55 % of the samples contain 15,176 AASs in the translated protein
   sequences of 124 CGC genes, 40.98 % of them are predicted as harmful
   (Additional file [51]1: Table S2). These results confirm that
   variations in the cancer genes are more often harmful than variations
   on average, however, the far majority of the variations are benign or
   have only minor effect even in the cancer genes. We analyzed the most
   frequent variants present in more than 25 samples in COSMIC, altogether
   327 AASs in 61 proteins. These frequent AASs show high predicted
   probability of harmfulness (mean = 0.76 and median = 0.83) (Additional
   file [52]3: Figure. S2a). There are large numbers of less frequent AASs
   with similar probabilities of harmfulness. Thus, frequent variations
   are often harmful, but less frequent variations can be equally harmful.

Evaluation of PON-P2 on cancer variant datasets

   The performance of PON-P2 has been extensively validated and compared
   to different tools [[53]10]. To assess the performance of PON-P2 for
   cancer variants, we used three somatic variation datasets. We collected
   the pathogenic somatic variations from ClinVar [[54]14], Database of
   Curated Mutations (DoCM) ([55]http://docm.genome.wustl.edu/) and TP53
   mutation database [[56]15]. In total, there were 1,058 AASs in 82
   proteins. The distribution of the probability of harmfulness is similar
   for all three datasets and the probabilities are concentrated near 1
   (Additional file [57]3: Figure S2b). In total, 733 (69.3 %) AASs were
   predicted as harmful, 4 as benign and the remaining 321 were
   unclassified. To estimate the false positive rate of PON-P2, we took
   the AASs that were annotated as not showing significant difference in
   protein activity from the TP53 mutation database. Among 454 AASs, only
   87 were predicted as harmful thus showing a low false positive rate
   (19.2 %). However, this number may be an overestimate as there are
   results only for one single protein and there is a possibility of
   random effects. True test would require a much larger dataset for AASs
   in several proteins. In the PON-P2 benchmark test data for 1,605 benign
   AASs the corresponding false positive rate is only 8.97 %. As the
   distributions of predicted probabilities were similar for the
   cancer datasets, we investigated the overlap between them. There is
   very little overlap between them, however the individual datasets
   overlap to some extent with the frequent AASs in COSMIC (Additional
   file [58]3: Figure S2c-d).

Landscape of somatic AASs

   The landscape for variations leading to harmful AASs was compared to
   that for the entire dataset. All possible base substitutions are
   represented by six classes of substitutions, C > T, C > A, C > G,
   T > C, T > A and T > G. C > T substitution is the most prevalent base
   alteration in most cancers and even more prominent among harmful
   substitutions leading to amino acid alterations (Additional file [59]2:
   Figure. S1). There are differences between cancer types: C > A
   substitution is common in the three types of lung cancer and
   neuroblastoma while C > G substitutions are enriched in the cancers of
   bladder and cervix. The landscape was investigated also based on the
   base substitutions and their immediate 5’ and 3’ nucleotides. The
   majority of the variations are C > T substitutions in CpG and TpC
   dinucleotides. In some of the cancers, C > G and C > A substitutions
   are enriched in TpC sites (Additional file [60]4: Figure S3). Among the
   harmful variations, the C > G substitutions are less frequent in most
   of the cancers. C > A substitutions remain prevalent in the three types
   of lung cancer and T > C substitutions in the liver cancer (Additional
   file [61]4: Figure S4).

   We studied the patterns of AASs in each cancer type. Arginine is the
   most frequently substituted residue in both datasets (AASs and harmful
   AASs) while the substitutions from alanine and glutamate are less
   frequent among harmful AASs (Additional file [62]5: Figures S5 and S6).
   The most common harmful substitutions are R > H, R > W, R > C and
   E > K. The high frequency of arginine may be explained by its six
   codons, four of which have CpG dinucleotide, a well-known mutation
   hotspot [[63]16]. On the other hand, glutamic acid is coded by only two
   codons and neither of them contains CpG. There are cancer type specific
   differences in the AAS distribution. For example in the lung cancers
   G > V substitutions are prevalent and in liver and thyroid cancers
   Y > C substitutions are prevalent (Additional file [64]5: Figure S6).

   We studied also the distribution of AASs in protein domains. The p53
   DNA binding domain has the highest AAS frequency in multiple cancers
   (Fig. [65]2a and Additional file [66]1: Table S3). Zinc finger domains
   have the second highest frequency among AASs and harmful AASs. We
   compared the protein domains with the most frequent AASs and harmful
   AASs in each cancer. We selected top 20 domains containing the highest
   AAS frequencies in each cancer. In total, 93 and 147 domains were
   selected for all AASs and harmful AASs, respectively (Fig. [67]2a,
   Additional file [68]6: Figure S7a). Among them, 70 domains overlap
   between the two sets. We compared also the frequency of AASs in domains
   in all the cancers together. The p53 DNA binding domain contained the
   highest frequency of harmful AASs in altogether 24 cancer types
   (Fig. [69]2b). In the data for all AASs, some domains contain AASs in
   more than 24 cancer types, however with a low frequency (Additional
   file [70]6: Figure. S7b). Epidermal growth factor receptor (EGFR)
   illustrates the distribution of AASs in protein domains. There are 233
   AASs (202 unique) in 21 cancer types. 73.3 % (148) of the unique AASs
   were predicted as harmful altogether in 18 cancer types. The AASs are
   scattered along the protein chain with slight enrichment in the kinase
   domain (Additional file [71]7: Figure. S8) while the harmful AASs are
   concentrated in the kinase domain. 56.6 % (86) of the harmful AASs
   appear in the kinase domain that represents one-fourth of the entire
   protein sequence. The harmful AASs appear frequently in secondary
   structural elements and likely affect the protein fold (Additional file
   [72]7: Figure. S8c). Overall, the benign AASs are located mainly on
   surface loops or in the termini of the α- and β-structures.

Fig. 2.

   Fig. 2
   [73]Open in a new tab

   Distribution of harmful AASs in protein domains. a Frequency of harmful
   AASs in InterPro domains. The top 20 domains containing the highest
   numbers of harmful AASs in each cancer typeare shown. b Prevalence of
   AASs in domains in multiple cancers and frequency of harmful AAS

Prioritizing most relevant proteins

   Next we identified proteins containing the highest numbers of harmful
   AASs in each cancer. As tumor cells multiply rapidly, the number of
   random variations also increases rapidly however many proteins
   containing harmful AASs may not have any implications in cancer.
   Therefore, we eliminated proteins that did not contain harmful AASs in
   at least two samples in each cancer type. Then we selected proteins
   that contained harmful AASs in the largest numbers of samples. In
   addition, proteins containing at least one harmful AAS in at least 2 %
   of the samples in a cancer type were selected. The latter step was
   introduced to include proteins with frequent harmful AASs even when the
   number of affected samples was less than the threshold. The number of
   selected proteins varied from 2 to 251 depending on cancer type
   (Additional file [74]1: Table S4). Several of the genes corresponding
   to the selected proteins are from CGC but there are numerous novel
   candidate genes (Additional file [75]8: Figure. S9).

   Since some of the selected proteins have very long sequences (TTN,
   SYNE1, RYR2, RYR3 etc.), we normalized the frequencies of harmful AASs
   by the lengths of the reference protein sequences. Proteins with higher
   normalized frequencies of harmful AASs (Additional file [76]1: Table
   S4) are likely implicated in cancer. Further studies could be
   prioritized based on the frequencies of the variations causing harmful
   AASs in these selected proteins.

Gene Ontology and pathway enrichment

   About half of the cancers have only a small number of selected proteins
   (<20) (Additional file [77]1: Table S4). In these cancers, genes
   corresponding to proteins containing at least one harmful AAS were
   further analyzed. For the other cancer types, we used the genes
   corresponding to selected proteins and performed Gene Ontology (GO) and
   pathway enrichment analysis in each cancer type. GO terms associated
   with biological processes like cell differentiation, cell death, cell
   cycle and more specific terms are significantly enriched in many cancer
   types (Additional file [78]1: Table S5). Significantly affected
   pathways include cell cycle, apoptosis, signaling by NOTCH, PI3K, mTOR,
   MAPK, Wnt, EGFR, PDGF, and others (Fig. [79]3, Additional file [80]1:
   Table S6 and Additional file [81]9: Figures S10-S37). Examples in head
   and neck cancer (HNC) and acute lymphocytic leukemia (ALL) are
   discussed to highlight the observations.

Fig. 3.

   Fig. 3
   [82]Open in a new tab

   Significantly enriched pathways in cancers. Each row represents a
   pathway. They are clustered based on the tree structure in Reactome
   pathway database. Pathway enrichment was performed using all proteins
   containing harmful AASs (left) or using selected proteins (right). The
   cancer types are clustered by complete linkage hierarchical clustering
   method. Medulloblastoma and pilocytic astrocytoma are excluded as no
   pathways were significantly enriched

Head and neck cancer (HNC)

   In HNC, we selected 56 proteins that contain at least one harmful AAS
   totaling 62.11 % of the samples. The corresponding genes for 10 of
   these proteins (TP53, EP300, EGFR, CREBBP, NFE2L2, FBXW7, NOTCH1,
   PIK3CA, RAC1 and STAG2) are catalogued in CGC. In addition, FAT1, SYNE1
   and TP63 have been reported as significantly affected genes in HNC
   [[83]17]. Our study revealed 43 additional candidate genes (Additional
   file [84]1: Table S4). Enrichment analysis of GO terms pinpointed
   biological processes including cell differentiation and multicellular
   organization (False Discovery Rate (FDR) < 0.001) (Additional file
   [85]1: Table S5). In the functional interaction network extracted from
   ReactomeFI, the selected proteins are highly connected (Fig. [86]4a).
   In a network, the degree of a node is the number of direct links of the
   node in the network. The nodes for selected proteins have higher
   average degree (95.2) compared to the nodes representing other proteins
   containing harmful AASs (49.0) and the overall degree of the nodes in
   the complete network (32.0). The proteins frequently containing harmful
   AASs are thus centrally located in the functional interaction networks.
   The selected proteins are distributed in several functional modules
   (Fig. [87]4a). Pathway enrichment analysis of the modules shows
   distinct pathways enriched in the modules. Pathways involved in
   transcription and its regulation are enriched in module 0. Cell surface
   interaction and muscle contraction pathways are enriched in module 1.
   Signaling pathways are enriched in modules 2, 3 and 4. NOTCH signaling,
   DNA replication and DNA replication are enriched in modules 5 and 7.
   Pathways of cell division are enriched in module 6. Hence, several
   pathways are affected by the prioritized proteins in HNC.

Fig. 4.

   Fig. 4
   [88]Open in a new tab

   Networks of proteins and pathways affected in HNC. a The selected
   proteins and their first neighbors in Reactome functional interaction
   network are highly connected. The nodes were clustered into modules
   (indicated by colors) using ReactomeFI plugin in cytoscape. b
   Statistically significantly enriched pathways (Reactome pathway
   database) affected in HNC. Nodes represent pathways and edges represent
   common proteins having roles in the connected pathways. The edge line
   thickness represents the number of common proteins in the two pathways.
   Only the selected proteins in HNC are included. c Pre-NOTCH expression
   and processing pathway from Reactome pathway database. This is the most
   frequently affected pathway in HNC samples

   We identified the significantly enriched pathways (FDR <0.05)
   (Additional file [89]1: Table S6). To unveil overlap between pathways,
   we generated a network of significant pathways (Fig. [90]4b). Factors
   involved in megakaryocyte development and platelet production have the
   highest degree i.e. proteins involved in this pathway are shared with
   many other pathways. NOTCH signaling pathway is the most frequently
   affected pathway in HNC. It includes NOTCH1-NOTCH4 signaling and
   pre-NOTCH expression and processing pathways. These pathways are
   affected in altogether 78 samples (20.53 % of all the HNC samples) with
   harmful AASs in at least one of the proteins corresponding to the 5
   genes (EP300, CREBBP, FBXW7, TP53, and NOTCH1) all of which are in the
   CGC. When we consider also variations that lead to substitution by stop
   codon, insertions and deletions in these 5 genes, the number of
   affected samples increases to 160 (42.1 %). These additional variation
   types are very likely harmful due to large alterations to genes and
   coded proteins. Other proteins involved in the pathway are the products
   of NOTCH2, NOTCH3, SEL1L and ATP2A2, all of which contain variations
   leading to harmful AASs in more than one sample (Fig. [91]4c). These
   proteins contained harmful AASs in 11 additional samples. Similar
   results are obtained in other cancer types where cancer related central
   pathways are affected by proteins with harmful AASs at different
   frequencies (Additional file [92]9: Figures S10-S37). Thus, harmful
   AASs in a cancer type impair proteins involved in different functions
   within certain pathway. Previously, idiosyncratic variations were found
   to have similar effects as recurrent variations [[93]5, [94]6]. Hence,
   it is essential to investigate also variations occurring at low
   frequency and explore the pathways affected by them.

   NOTCH signaling pathway (Fig. [95]4c) is highly conserved in most
   multicellular organisms and regulates cell differentiation,
   proliferation, and cell-fate determination. It has been reported to be
   affected in various cancers including HNC [[96]17–[97]19] and is
   emerging as a new therapeutic target. Another significantly enriched
   pathway is SCF-KIT pathway, which is affected in several cancer types
   including HNC. The pathway contains stem cell factor (SCF) and its
   receptor KIT. SCF homodimer binds to KIT activating the tyrosine kinase
   domain. Then, KIT stimulation activates several signaling pathways
   including RAF/MAP kinase, AKT and JAK/STAT pathways.

Acute lymphocytic leukemia (ALL)

   In ALL, only 2 proteins (encoded by PHF6 and NOTCH1) were selected,
   therefore we included all genes containing one or more variations
   leading to harmful AASs. GO enrichment analysis indicates that
   biological processes including cell differentiation, cell
   proliferation, and developmental processes are significantly enriched
   (FDR < 0.01) (Additional file [98]1: Table S5). Similar to HNC,
   proteins containing harmful AASs in more than one sample have higher
   connectivity with an average degree of 196.6 compared to other proteins
   containing harmful AASs (110.7) and the overall degree of the nodes in
   the functional interaction network (32.0) (Additional file [99]10:
   Figure S38). Pathway enrichment analysis identified significant
   pathways (FDR <0.05) (Additional file [100]10: Figure S39). There are
   53 proteins with harmful AASs involved in significantly enriched
   pathways, 16 of which have their corresponding genes in CGC. In the
   network of pathways, immune system and signaling pathways are highly
   connected (Additional file [101]10: Figure S39). Factors involved in
   megakaryocyte development are affected in 13 samples (13.3 %)
   containing harmful AASs in proteins corresponding to GATA2, GATA3,
   DOCK2, MYB, CREBBP, TP53 and EP300 genes. Including the insertions,
   deletions and nonsense substitutions, these 7 proteins contain AASs in
   23 samples (23.5 %). Pre-NOTCH expression and processing is also
   affected in 13 samples. Other significantly affected pathways include
   transcription regulation of white adipocytes and SCF-KIT signaling
   pathway.

Cancer network

   Large scale genomic studies have revealed the heterogeneous nature of
   cancers. Variation patterns are diverse even in tumors originating from
   the same tissue or organ [[102]11, [103]20] while similar patterns of
   genomic alterations are observed in cancers from different tissues of
   origin [[104]21]. We evaluated the similarities between cancer types
   based on the affected pathways. We generated a network for cancers
   which have more than 20 selected proteins and another network for the
   remaining cancers (Fig. [105]5). The nodes are highly connected to each
   other in both networks indicating that cancers share several pathways
   that contain harmful AASs even when they share fewer proteins.
   Variations can affect pathways at any step and therefore pathways are
   more relevant for cancer than individual genes and proteins. In
   Mendelian diseases, several examples are known of related diseases
   originating due to variations in proteins in the same signaling or
   metabolic pathways [[106]22, [107]23]. Also in cancers, it may not be
   that relevant which protein in a pathway is affected since they all
   would impair the function of the system and contribute to cancer.

Fig. 5.

   Fig. 5
   [108]Open in a new tab

   Cancer network. Nodes represent cancer types and edges indicate shared
   significantly enriched pathways between them. Node size represents the
   number of significant pathways affected in the cancer type and node
   color represents the number of proteins. Edge line width represents the
   number of common pathways between the cancers and edge color represents
   the number of common proteins between the cancers. GB, Glioblastoma;
   GLG, Glioma low grade; HN, Head and neck; KC, Kidney chromophobe; KCC,
   Kidney clear cell; KP, Kidney papillary; LA, Lung adeno; LS, Lung
   squamous; LSC, Lung small cell; NB, Neuroblastoma

Discussion

   We analyzed somatic AASs in 6,861 cancer samples (whole genome or exome
   sequences) classified into 30 cancer types. Several methods including
   MutSigCV [[109]20], MuSic [[110]24], InVEx [[111]25], Oncodrive
   [[112]26, [113]27] and HotNet2 [[114]28] have been developed to analyze
   the cancer genomes and identify cancer variants, genes, networks and
   pathways. Although some highly relevant cancer genes have been
   identified based on the assumption that genes with higher variation
   frequency than the background mutation rate are putative drivers, large
   numbers of tumors do not have any variations in these genes. Several
   variations leading to AASs in well-known driver genes do not have
   functional impact, for example in TP53 [[115]6]. Thus, the numbers of
   tumors with harmful variations in driver genes is even lower than
   previously presented (Tables S1 and S2). Here, we took a novel approach
   to identify harmful somatic AASs and to reveal pathways affected by
   them. Due to lack of benchmark datasets, it is not possible to compare
   the performance of PON-P2 to the methods based on substitution
   frequencies. We evaluated the applicability of PON-P2 on three cancer
   variation datasets. The validated cancer variants obtained high
   predicted probabilities of harmfulness (Additional file [116]3: Figure.
   S2). PON-P2 reliably identified 69.3 % of AASs as harmful and 0.4 % as
   benign. The remaining AASs were predicted as unknown.

   Our study revealed that many variations in known cancer genes are
   highly likely benign although the numbers of harmful variations in
   these proteins are higher than on average in proteins (Additional file
   [117]1: Tables S1 and S2). The relevance of harmfulness of AASs in
   cancer is further evidenced by our analysis of AASs in COSMIC. Frequent
   somatic AASs are highly likely harmful with high predicted
   probabilities (Additional file [118]3: Figure S2). The distribution of
   the probabilities of harmfulness for frequent AASs in COSMIC is very
   similar to those for the three additional somatic variation datasets.
   Also, large numbers of less frequent AASs are harmful many of which may
   have been introduced by random chance. The harmful AASs appear in
   proteins that have higher degree of connectivity in the functional
   interaction networks. The proteins that contain harmful AASs frequently
   are even more connected which is similar to cancer proteins in a
   previous study [[119]29] (Fig. [120]4a and Additional file [121]10:
   Figure S38). Pathway enrichment analysis revealed proteins with varying
   numbers of harmful AASs involved in key cancer-related pathways
   (Additional file [122]9: Figures S10-S37) confirming that cancer is an
   outcome of large numbers of accumulated harmful effects in a number of
   proteins and pathways. In addition, although cancer types have common
   affected pathways (Fig. [123]5), there are different pathways specific
   for individual cancer types (Additional file [124]1: Table S6 and
   Additional file [125]9: Figures S10-S37). Only a fraction of AASs in
   any protein are highly deleterious. Recent analysis of predicted
   harmful AASs in the kinase domain of Bruton tyrosine kinase revealed
   that 67 % of the single nucleotide change caused AASs are harmful
   [[126]30]. This number is higher than in most previous studies due to
   the importance of the kinase domain ([[127]30] and the references