Abstract

   Genes involved in cancer are under constant evolutionary pressure,
   potentially resulting in diverse molecular properties. In this study,
   we explore 23 omic features from publicly available databases to define
   the molecular profile of different classes of cancer genes. Cancer
   genes were grouped according to mutational landscape (germline and
   somatically mutated genes), role in cancer initiation (cancer driver
   genes) or cancer survival (survival genes), as well as being implicated
   by genome-wide association studies (GWAS genes). For each gene, we also
   computed feature scores based on all omic features, effectively
   summarizing how closely a gene resembles cancer genes of the respective
   class. In general, cancer genes are longer, have a lower GC content,
   have more isoforms with shorter exons, are expressed in more tissues
   and have more transcription factor binding sites than non-cancer genes.
   We found that germline genes more closely resemble single tissue GWAS
   genes while somatic genes are more similar to pleiotropic cancer GWAS
   genes. As a proof-of-principle, we utilized aggregated feature scores
   to prioritize genes in breast cancer GWAS loci and found that top
   ranking genes were enriched in cancer related pathways. In conclusion,
   we have identified multiple omic features associated with different
   classes of cancer genes, which can assist prioritization of genes in
   cancer gene discovery.

   Subject terms: Cancer genomics, Data integration, Genome informatics,
   Cancer genomics, Functional genomics, Gene expression, Evolutionary
   biology

Introduction

   One of the main challenges in cancer genetics is the identification of
   genes involved in cancer risk and prognosis and to characterise their
   molecular function in health and disease. While our knowledge on the
   function of a small number cancer genes is quite substantial, little is
   known about the role of most cancer genes due to the identification of
   an ever-increasing number of cancer genes, which outpaces functional
   characterization^[28]1,[29]2. However, the presently large number of
   known or suspected cancer genes also has the potential to enable the
   broad and comprehensive characterization of those genes to identify the
   molecular and evolutionary patterns underlying either all cancer genes
   or specific classes thereof.

   Generally, cancer genes can be divided into several distinct classes
   according to their involvement in cancer risk or prognosis as well as
   according to their molecular function, mode of inheritance and
   mutational landscape^[30]3,[31]4. Recent large-scale sequencing efforts
   in tumour cells and tissue provided a comprehensive assessment of
   aberrations in somatic cancer genes and their influence on different
   hallmarks of cancer^[32]5. Further advances in bioinformatic and
   experimental approaches allowed to leverage those large datasets which
   led to the description of new cancer driver genes^[33]3,[34]6. While
   somatically acquired mutations in cancer genes are important for the
   initiation and progression of tumours, they also seem to play an
   important role in cancer specific survival^[35]7,[36]8. Furthermore,
   the expression levels of thousands of genes are involved in survival
   from cancer, often with pleiotropic and contrasting effects on
   different types of cancer^[37]9,[38]10.

   In addition to genes mainly involved in cancer prognosis and
   aggressiveness, an ever-expanding number of genes are identified which
   play a role in cancer risk and thus are mostly responsible in disease
   processes before diagnosis. The occurrence of tumours is heritable,
   with heritability estimates ranging from 9 to 57%^[39]11. A sizable
   portion of familial cancer risk can be attributed to rare mutations in
   germline cancer genes^[40]4, although only around a hundred germline
   cancer genes have been identified so far^[41]12 most of which are
   tumour suppressors^[42]13. Conversely, a significant part of the
   remaining heritability is likely attributable to common cancer risk
   increasing variants^[43]14, which are routinely identified by
   genome-wide association studies (GWAS). In contrast to germline
   mutations which are usually located within the coding region of a gene
   and thus directly implicate the responsible gene, common cancer
   variants do not directly implicate a particular gene target or even
   molecular pathway due to extensive linkage between variants.
   Consequently, an association signal may span a large region in the
   genome covering many potentially relevant genes, thus posing an
   obstacle for future in-depth functional characterisation of the casual
   gene(s). Therefore, in order to prioritize the most likely cancer gene
   within a region, multiple algorithms have been developed which
   implicate genes according to mutational load, molecular function,
   involvement in specific pathways or expression^[44]15–[45]20.
   Alternatively, understanding the molecular characteristics of typical
   cancer genes promises to allow the prioritization of genes within those
   regions by implicating those genes which most closely resemble other
   typical cancer genes.

   Therefore, to further characterize the molecular properties of cancer
   genes, we systematically investigate the multiple omic features of
   different classes of cancer genes. We then aggregate the effect of
   those features to rank genes within breast cancer GWAS regions and
   perform pathway enrichment on genes to illustrate the utility of our
   findings.

Methods

Cancer gene definition

   658 genes frequently mutated in tumours (somatic genes) as well as 107
   genes that often harbour rare mutations causing hereditary cancer
   (germline genes) located were extracted from the Catalogue of Somatic
   Mutations in Cancer (COSMIC, URL:
   [46]https://cancer.sanger.ac.uk/cosmic) Cancer Gene Census (CGC)^[47]5.
   We considered both, Tier 1 and Tier 2 genes (i.e. genes with strong
   indications of a role in cancer) in our analyses as cancer genes in
   order to maximize the number of observations (i.e. genes) in each
   cancer class. Furthermore, we included 294 genes recently described to
   harbour cancer driver mutations (driver genes)^[48]2. Several germline,
   somatic and driver genes were identified to be part of multiple gene
   sets and we made the following exclusions: 64 genes which were listed
   as both germline and somatic were excluded from both lists to create a
   clean somatic and germline gene set, respectively. In addition, we
   excluded 188 genes from the driver gene list since they are also
   considered either somatic or germline genes as well as 137 genes from
   the somatic gene list which are driver genes.

   The associations of gene expression with cancer survival in The Cancer
   Genome Atlas (TCGA, URL: [49]https://portal.gdc.cancer.gov/) were
   extracted from GEPIA (Gene Expression Profiling Interactive Analysis,
   URL: [50]https://gepia.cancer-pku.cn/)10. We considered 1719 genes
   which are significantly associated with survival (false discovery
   rate < 0.01) in at least two datasets to be cancer survival genes to
   reduce the number of false positives and to include genes with a strong
   evidence for their involvement in cancer survival. We also separately
   investigated cancer survival genes whose expression is significantly
   positively (i.e. hazard ratio greater than one) or negatively (hazard
   ratio smaller than one) correlated to survival from cancer. 451
   survival genes which are also deemed germline, somatic or driver cancer
   genes or which are located in cancer GWAS loci (see below) were
   excluded from their respective analyses to create a clean set of genes
   involved in cancer survival.

   Finally, we included genes within cancer loci identified by genome-wide
   association studies. In order to account for the large number of
   associations tested in GWAS, the accepted threshold for a significant
   association with a trait (i.e. genome-wide significance) is
   5.00 × 10^−08. Therefore, we extracted all genome-wide significant
   variants associated with any cancer from the GWAS catalogue^[51]21
   (accession date: 2018-06-25, URL: [52]https://www.ebi.ac.uk/gwas/) and
   removed variants associated with cancer severity/aggressiveness,
   survival and other non-disease-risk associated outcomes. Furthermore,
   we only considered association signals primarily identified in
   Europeans.

   We grouped the cancer association signals into 20 clusters according to
   their respective tissue: breast, skin, colorectal, ovarian, uterus,
   testicular, thyroid, oesophageal, lung, renal, prostate, pancreas,
   oral, bladder, cervix, haematological and central nervous cancer as
   well as meningioma, sarcoma, and uveal melanoma. We separately
   investigated genes in GWAS regions with cancer association signals for
   only one tissues (GWAS genes) or with multiple signals from cancers of
   different tissues or types of cancer (pleiotropic GWAS genes).

   The lower and upper boundaries of a cancer GWAS locus were defined by
   the most distant variants in moderate linkage disequilibrium to the
   index GWAS variant (D’ < 0.5). We then merged GWAS loci with genomic
   overlap and considered a total of 4075 genes within those loci as
   cancer genes in case the transcription start site (TSS) or the
   transcription terminator of the longest transcript is located within
   the locus boundaries. In the analysis of GWAS cancer genes, we excluded
   1702 genes which were considered germline, somatic, driver or survival
   cancer genes.

Extraction of features

   All genomic coordinates mentioned in the manuscript are based on hg19,
   since various databases reported their features based on the hg19
   genome assembly at the time of analysis. Liftover from other builds was
   performed with the rtracklayer library (version 1.42.1) as implemented
   in R (version 3.5.1, URL: [53]https://www.R-project.org)^[54]22. The
   necessary liftover chain files were downloaded from UCSC Genome browser
   ([55]https://hgdownload.cse.ucsc.edu/goldenpath/hg19/liftOver/). We
   used biomaRt^[56]23 (version 2.38.0) implemented in R to extract
   genomic features from the ENSEMBL database. In particular, we extracted
   the size of the gene body (i.e. genomic distance between transcription
   start site and transcription terminator of the longest transcript) and
   the number of isoforms and exons of each protein coding gene.
   Furthermore, we documented the percent GC content of each gene within
   the gene body and recorded the mean length of the 3′ and 5′ UTR as well
   as the average size of all exons of all isoforms.

   In addition, we calculated the number of orthologues within the family
   Hominidae to address recent evolutionary conservation as well as the
   number of paralogues within the human genome to investigate the
   presence of potentially redundant gene copies.

   In order to investigate the linkage disequilibrium structure within
   cancer genes, we computed the average and standard deviation of the LD
   Scores derived from European populations^[57]10 of all variants within
   the gene body.

   Genes which have fewer mutations than expected are considered essential
   and mutation intolerant. The degree of intolerance can be expressed as
   a Z Score of intolerance. For each coding gene, we extracted the Z
   scores for synonymous, non-synonymous as well the loss of function
   mutations from gnomAD^[58]10.

   Next, we computed multiple measures to summarize gene expression and
   regulation of cancer genes across all tissues included in GTEx^[59]24.
   We computed the mean and standard deviation of gene expression values
   (expressed as Transcripts Per Kilobase Million, TPM) across all 44
   tissues and also counted the number of tissues a gene is expressed in
   (i.e. TPM > 1). Furthermore, we extracted the number of unique variants
   across all tissues that are significantly correlated to gene expression
   (expression quantitative trait loci, eQTL) for each gene.

   We also calculated the number of transcription factor (TF) binding
   sites (identified by ChIPSeq experiments) reported in the Gene
   Transcription Regulation Database (GTRD)^[60]25 either within the gene
   body, within the promotor (i.e. 1000 bp up- and downstream of the TSS
   of a gene) or within the distal region of a gene (within 100,000 bp up-
   and downstream of the TSS or transcription terminator, respectively).
   The number of TF sites in the gene body as well as in the distal region
   was then normalized/divided by the gene length. Transcription factors
   were also grouped into canonical classes according to Lambert et
   al.^[61]26 and we estimated the association of TF binding sites in the
   promotor from individual classes with cancer genes.

   Finally, for each gene, we computed the total number of common
   post-translational protein modifications reported by dbPTM^[62]27 as
   well as the occurrence of individual modifications reported in at least
   10 genes.

Computation of scores from cancer features and statistical analyses

   All analyses were restricted to protein coding genes outside of the
   highly pleiotropic MHC region on chromosome 6 (hg19 coordinates:
   25,477,797 bp–33,448,354 bp). All features were scaled to have a mean
   of zero and a standard deviation of one in order to account for
   differences in the scale of the underlying data. The association of the
   features with the binary cancer gene status was evaluated with logistic
   regression as implemented in R, since logistic regression does not
   assume a specific distribution of the predictor variables, many of
   which are potentially not normally distributed. The outcome (dependent
   variable) of those models was the binary assignment to either the
   respective cancer class or to the background gene set. The exposures in
   the models were the different omic features and the effect sizes were
   visualized as a correlation plot with the corrplot function from the
   corrplot package (version 0.84). The association of genomic features
   with GWAS cancer genes as adjusted for the number of genotyping probes
   from genotyping chips within each gene body to account for potential
   confounding effects due to coverage.

   Similar to the computation of genetic risk scores^[63]28,[64]29, we
   computed genomic cancer feature scores of all cancer gene classes for
   each gene. We multiplied (weighted) the value of each feature by the
   respective effect size (log odds ratio) derived from the association of
   the feature with the respective cancer gene class (see Formula [65]1).
   Missing continuous features were imputed to the median to facilitate
   the computation of cancer feature scores for all genes. The correlation
   between the scores were visualized as a correlation plot (see above).
   [MATH:
   <mrow><mi>S</mi><mi>c</mi><mi>o</mi><mi>r</mi><mi>e</mi><mo>=</mo><mund
   erover><mo
   movablelimits="false">∑</mo><mrow><mi>i</mi><mo>=</mo><mn>1</mn></mrow>
   <mi>n</mi></munderover><msub><mi>β</mi><mi>i</mi></msub><mo>×</mo><msub
   ><mi>x</mi><mi>i</mi></msub></mrow> :MATH]

   where
   [MATH: <msub><mi>β</mi><mi>i</mi></msub> :MATH]
   is the log odds ratio of association of the nth feature with the
   respective cancer class and
   [MATH: <msub><mi>x</mi><mi>i</mi></msub> :MATH]
   is the numeric value of the respective feature. The variation explained
   by each score was estimated by fitting a logistic regression model with
   the respective cancer class as the outcome and the score as the
   exposure. From those models, we report the Nagelkerke pseudo R^2
   measures which denotes the proportion of the variability in the outcome
   that is explained by model.

Gene prioritization and pathway enrichment analyses

   As a proof-of-principle, we extracted 1250 genes located within 156
   loci with genome-wide significant association signals for breast cancer
   and ranked those genes according to the GWAS cancer score. We recorded
   the top two ranked genes (high GWAS cancer score set,) as well as all
   remaining genes (low GWAS cancer score set). We then performed pathway
   enrichment analyses on both gene sets separately with the gprofiler2
   package (version 0.1.4), as implemented in R using standard settings.
   Similarly, we performed pathway enrichment analyses in 1640 high- or
   low-ranking genes in 147 loci associated with coronary artery disease
   (CAD). We only considered pathways with a maximum term size of 1000
   genes and a precision of at least 2.5% (i.e. at least 2.5% of all genes
   in either gene set need to map to the respective pathway).

   In addition, we investigated the enrichment of pathways using a gene
   set enrichment algorithm implemented in Webgestalt 2019
   ([66]www.webgestalt.org^[67]30. We submitted the list of breast cancer
   genes ranked by the GWAS cancer score to Webgestalt. The gene set
   enrichment analysis was performed with standard settings but we only
   considered pathways with a maximum term size of 1000 and a minimum term
   size of 20.

   The pathway definitions were downloaded from MSigDB (URL:
   [68]https://www.gsea-msigdb.org/gsea/msigdb)^[69]31. In total, we
   included 15,922 Gene Ontology Biological Process (GO:BP) pathways, 4582
   Gene Ontology Molecular Function (GO:MF) pathways, 50 Hallmark
   pathways, 186 Kyoto Encyclopedia of Genes and Genomes (KEGG) pathways,
   2186 Reactome pathways (REAC) and 521 Wiki Pathways (WP).

Results

   First, we established a set of 23 distinct omic features from publicly
   available databases and evaluated the pairwise correlation between
   those features. The features were broadly clustered into four major
   groups (Fig. [70]1) according to their correlation coefficients. As
   expected, measures of the structure of genes such as size and number of
   isoforms/exons was highly significantly correlated with another.
   Similarly, genes which are intolerant towards deleterious mutations are
   also sensitive to synonymous and missense mutations and generally have
   fewer paralogs in the genome.

Figure 1.

   [71]Figure 1
   [72]Open in a new tab

   Correlation between omic features. 23 Omic features of 19,271 protein
   coding genes were extracted from multiple databases and the pairwise
   correlation was computed. The features were ordered and four distinct
   clusters were highlighted according to the hierarchical clustering as
   implemented in the corrplot function. The colour and the size of the
   circles represent the strength of the correlation (i.e. correlation
   coefficient) and statistical significance is indicated with asterisks
   in the upper triangle. Significant correlations (P < 0.05) are
   indicated by a black border in the lower triangle. TF transcription
   factor; LOF loss of function; TPM transcripts per kilobase million; S.D
   standard deviation *P < 0.05; **P < 0.01; ***P < 0.001. Generated with
   the corrplot function from the corrplot package (version 0.84).

   Next, we defined 4247 genes to be cancer genes according to expert
   curated databases as well as published large scale sequencing efforts
   and genome-wide association study results and grouped them into six
   cancer gene classes. In total, we included 43 genes which are often
   harbouring rare cancer predisposition mutations (germline genes), 457
   genes frequently mutated in tumours (somatic genes), 106 cancer driver
   genes as identified by Bailey et al. 2018^[73]2 (driver genes), 1268
   genes whose expression levels are associated with cancer mortality
   (survival genes) as well as 2373 genes located in cancer GWAS loci
   (GWAS cancer genes, 901 pleiotropic and 1472 non-pleiotropic). We also
   established a non-cancer background gene set consisting of 14,110
   genes. We evaluated the association of 23 omic features (Fig. [74]1)
   with cancer gene status and found multiple statistically significant
   correlations (Fig. [75]2 and Supplementary Table [76]1). In general,
   compared to the non-cancer genes, cancer genes are longer and have more
   isoforms and therefore are characterized by more numerous and shorter
   exons. While the GC content of the gene body of cancer genes seems to
   be lower than in the GC content observed in control genes, we observed
   the opposite effect for cancer genes found within GWAS loci.
   Interestingly, cancer genes seem to have fewer paralogs within the
   genome and as such are expected to be less tolerant towards deleterious
   mutations. Indeed, our results confirm that most types of cancer genes
   are indeed more intolerant towards missense and loss-of-function
   mutations. However, we found no such effect for germline genes
   (Fig. [77]2).

Figure 2.

   [78]Figure 2
   [79]Open in a new tab

   Omic features of cancer genes. The size and colour of the circle
   represent the log odds ratio of the association of 23 omic features
   with different classes of cancer genes compared to non-cancer genes.
   Associations which were statistically significant (P < 0.05) are marked
   with a black border around the circle. TF transcription factor; LOF
   loss of function; TPM transcripts per kilobase million. Generated with
   the corrplot function from the corrplot package (version 0.84).

   Furthermore, we observed that cancer genes are likely not located on
   longer haplotypes since variants in the gene body do not have a higher
   average LD Score than control genes. However, we observed an increased
   variation of the LD Score of variants within the gene body of cancer
   genes, indicating that those regions harbour a larger diversity of
   differently sized haplotypes.

   The average expression as well as the number of genes expressed across
   52 tissues from the GTEx project was significantly increased in cancer
   genes, highlighting their importance for homeostasis in multiple
   tissues. As such, cancer genes also harbour more transcription factor
   binding sites within the promotor across most transcription factor
   classes (Fig. [80]2). However, the spread/variation of expression
   across all GTEx tissues was lower for germline genes, indicating a more
   uniform expression throughout the body. Notably, we observed a reduced
   number of transcription factor binding sites in the gene body and in
   the distal regions of a gene as well as a reduced number of unique eQTL
   variants in somatic genes. Therefore, somatic cancer genes seem to be
   preferentially regulated by their promotor, which may indicate more
   direct and immediate expression control and less influence of distal
   regulatory processes. Finally, we also observed that cancer genes in
   general have more post-translational modification.

   The identification of features which are able to distinguish between
   cancer and non-cancer genes as well as between different classes of
   cancer genes allowed us to compute cancer feature scores for each gene
   by multiplying the observed effect size (log odds ratio, Fig. [81]2 and
   Supplementary Table [82]1) with the value of the respective feature and
   calculating the weighted sum. By computing those scores, we effectively
   create an omic feature profile for each gene which summarizes how
   strongly a gene resembles a typical cancer gene for a given cancer gene
   class. We found a generally positive pairwise correlation between all
   score with the strongest correlation observed between somatic and
   driver cancer genes. In contrast, the correlation was weakest between
   the scores for germline and somatic as well as driver cancer genes
   (Fig. [83]3). Of note, the correlation between the pleiotropic and
   non-pleiotropic GWAS cancer scores to other scores were similar,
   although the non-pleiotropic GWAS score was more strongly correlated to
   the germline score and the pleiotropic GWAS score showed stronger
   correlation with the somatic and driver score. Thus, genes in
   pleiotropic GWAS regions seem to be more similar to somatic and driver
   genes while genes in cancer GWAS regions with an association for a
   single type of tumour more closely resemble germline genes. The scores
   explained between 2 and 14% of the observed variation (Nagelkerke
   pseudo R^2), implicating that the scores are not completely capturing
   all molecular properties of cancer genes or that many cancer genes have
   yet to be identified.

Figure 3.

   [84]Figure 3
   [85]Open in a new tab

   Pairwise correlation between cancer feature scores. For each gene,
   seven different cancer feature scores were computed by calculating the
   sum of 23 omic features, weighted by the respective log odds ratio
   estimated from the association with the cancer gene class (see
   Fig. [86]2 and Supplementary Table [87]1). The size and colour as well
   as the number in the lower triangle represent the correlation
   coefficient. Statistical significance is indicated with asterisks.
   *** = P < 0.001. Generated with the corrplot function from the corrplot
   package (version 0.84).

   Our GWAS Cancer Score is derived from a pan-cancer approach and thus
   can potentially be applied to rank candidate genes in a variety of
   different types of tumours. However, as a proof-of-principle, we chose
   to prioritize genes in breast cancer loci for potential functional
   studies, since the number of independent, genome-wide significant
   signals identified to be associated with breast cancer is among the
   largest compared to other types of cancer. Therefore, we ranked 1250
   genes located within 156 loci with genome-wide significant association
   signals for breast cancer according to the (non-pleiotropic) GWAS
   cancer score. We created two gene sets, one including the top two
   highest-ranking genes within each region (i.e. genes which most closely
   resemble typical GWAS cancer genes) and another including the remaining
   genes, which share less similarities with other GWAS cancer genes
   according to the 23 omic features (Supplementary Figure [88]1). We then
   performed a pathway enrichment analysis on both, 254 high-ranking and
   834 low-ranking genes which mapped to any given pathway (Fig. [89]4).
   Importantly, we found no significant enrichment of any investigated
   pathway in the low cancer score gene set but instead multiple
   significantly enriched pathways for high ranking genes (Fig. [90]4). In
   particular, we found an enrichment for multiple pathways which play a
   role in cancer initiation and propagation such as pathways related to
   development, proliferation, cell cycle control, sex hormones as well as
   transcription factor binding (Supplementary Table [91]2). To contrast
   those results to similarly sized gene sets, we investigated the pathway
   enrichment of 1640 genes (284 high ranking and 1356 low ranking) within
   147 loci associated with coronary artery disease^[92]1. While only the
   high-ranking gene set of breast cancer GWAS genes showed significantly
   enriched pathways, we found statistically significant enrichment in
   both, high- and low-ranking gene sets for coronary artery disease
   (Supplementary Figure [93]2). The most significant pathways in both
   gene sets were related to epithelial cell migration and angiogenesis,
   transcription factor binding, apoptosis as well as immune response,
   enzyme inhibition, steroid metabolism and coagulation. Finally, we used
   a different pathway enrichment algorithm agnostic to our stratification
   method. To this end, we performed a gene set enrichment analysis on the
   full breast cancer gene list ranked by the GWAS cancer score
   (Supplementary Figure [94]3 and Supplementary Table [95]3). Similar to
   the results above, we observed a statistically significant enrichment
   of pathways relevant to breast cancer in high ranking genes and,
   conversely, and enrichment of less relevant pathways such as olfactory
   perception in low ranking genes.

Figure 4.

   [96]Figure 4
   [97]Open in a new tab

   Pathway enrichment analyses for high- and low-ranking breast cancer
   genes. 1250 genes within 156 breast cancer loci were ranked according
   to the GWAS Cancer Score (see Fig. [98]3). Within each locus, the two
   highest ranking genes as well as the remaining lower ranked genes were
   extracted. Pathway enrichment for (A) high- and (B) low-ranking genes
   were conducted with gprofiler2 and visualized as a Manhattan plot.
   Pathways with a Q-value smaller than 0.05 are plotted above the dashed
   horizontal line and are deemed statistically significantly enriched.
   GO:BP gene ontology biological process; GO:MF gene ontology molecular
   function; HALLMARK MSigDB hallmark gene set; KEGG KEGG pathways; REAC
   reactome pathways; WP wikiPathways; TF transcription factor; Panc.
   pancreatic; diff. differentiation.

Discussion

   In this study, we have shown that multiple omic features are associated
   with different classes of cancer genes and that cancer genes, in
   general, occupy a larger region in the genome with a lower GC content,
   have more isoforms and thus more and shorter exons and are higher
   expressed across all tissues with more transcription factor binding
   sites than non-cancer genes. Investigating those features also revealed
   that germline and somatic as well as driver genes share the least
   similarities among cancer gene classes and that single tissue cancer
   GWAS genes more closely resemble germline genes while pleiotropic
   cancer GWAS genes are more similar to somatic genes. Thus, our approach
   not only provides novel insights into the molecular properties and
   differences between cancer gene classes but also allows the
   prioritization of cancer genes according to their feature profile.

   The investigated omic features are not completely independent of each
   other and as such show a varying degree of correlation. While we were
   able to cluster those features into broadly four groups, the underlying
   reason for the grouping is not immediately obvious. For instance, in
   the first large cluster we found that larger genes have more isoforms,
   are expressed in more tissues with higher levels, are regulated by more
   eQTL variants and are located on larger haplotypes with more extensive
   LD. Having more isoforms would allow a gene to have more functions in
   different tissues and therefore is expected to be expressed at higher
   levels overall and that this expression is influenced by more eQTL
   variants, potentially targeting specific isoforms. At the same time,
   genes with a higher isoform count need to occupy a larger region in the
   genome in order to fit more alternatively spliced exons within the gene
   body. Those genes also tend to be under stronger recent positive
   selection as is evident by their larger extend of LD across the gene
   body^[99]32,[100]33 and are more essential due to increased mutation
   intolerance.

   In contrast, genes in the second large cluster are characterized by a
   higher variation in expression across tissues with more paralogs, a
   higher GC content in their gene body as well as larger exons and longer
   3′ and 5′ UTRs and are more tolerant towards deleterious mutations.
   Since there are multiple potentially redundant genes present, those
   genes do not need to rely on multiple isoforms to perform a broader
   range of functions. As such, they are more likely expressed at high
   levels in fewer tissues and therefore their variation of expression is
   larger. Furthermore, their regulation is less guided by eQTL variants
   but rather by factors binding to the longer UTR regions such as
   microRNAs or regulatory proteins. Those observations point towards
   genes which have long been established in the genome and are thus under
   less positive selection, as is evident by their diminished extend of LD
   in the gene body.

   Multiple omic features showed a high degree of correlation and as such
   there will be redundancy in the resulting score, potentially amplifying
   the effect of some features and leading to decreased accuracy or
   stability of the estimates^[101]34. However, the germline cancer gene
   class has only 43 genes and thus approaches such as lasso regression or
   multivariate logistic regression to extract the most informative
   features would not be feasible. Nevertheless, we observed that the
   condition number (the ratio of the largest to the smallest singular
   values) of the gene × 23 omic feature matrix was 12.01 and thus way
   below 30^[102]35,[103]36, indicating that the degree of collinearity is
   not too high to fit multivariate logistic regression models for
   outcomes (i.e. cancer classes) with more observations. Therefore, we
   fit a multivariate logistic regression model for GWAS cancer genes with
   all 23 omic features as exposure variables. When we used the resulting
   effect sizes in the score calculation, we found that this approach only
   marginally increased the explained variance from 2.4 to 2.9%. This
   increase is also likely inflated due to the effect sizes being
   estimated in the same dataset the score was evaluated. Furthermore, the
   resulting score was highly correlated (R^2 0.79) to the score computed
   from a linear combination of predictors, implicating that our current
   approach sufficiently captures the underlying molecular properties of
   cancer genes according to the 23 omics features.

   Among the most significant finding is the observation that the features
   of cancer genes are generally similar but also exhibit noteworthy
   differences. Although all cancer feature scores were positively and
   significantly correlated with each other, we found the lowest
   correlation coefficient between germline and somatic as well as driver
   genes, which can be attributed to multiple observations: while both,
   germline and somatic genes are expressed at higher levels across all
   investigated tissues, the variation in expression was markedly reduced
   for germline and not somatic genes. Therefore, germline genes are
   expressed at more stable levels since their function in development and
   maintenance is of importance in almost all tissue types, as previously
   described^[104]37. In general, we observed that genes which are more
   essential (i.e. have a higher LOF intolerance and are thus more
   intolerant towards mutations) have fewer paralogs thus fewer potential
   redundant gene copies which would be able to compensate for a potential
   loss of function in case those paralogues are actually functional and
   expressed. Whereas somatic genes are indeed less tolerant towards
   mutations and have fewer paralogs than background genes, we observed an
   opposite effect in germline genes, highlighting their unique
   evolutionary path in humans.

   By ranking genes within breast cancer GWAS regions, we effectively
   identified and prioritized a set of genes which share more features
   with typical GWAS cancer genes. Indeed, pathway enrichment analyses
   confirmed that the high GWAS gene score set is enriched for multiple
   processes known or suspected to be involved in breast cancer. Our
   prioritization approach selected the two top ranking genes within each
   locus (if present) in order to significantly reduce the number of
   potential cancer gene candidates for future in-depth characterisation.
   Alternatively selecting the top 10% ranked genes within each locus
   resulted in similar enrichment across the regions (data not shown),
   although the total number of implicated genes within each region varied
   greatly. We also observed a significant enrichment of breast cancer
   relevant pathways in a gene set enrichment analysis of the breast
   cancer genes ranked by the GWAS cancer score, which should be agnostic
   to specific gene score cut-offs. Those results indicate that different
   selection and ranking procedures or pathway enrichment procedures
   should not greatly influence the observed results. Since genes with
   similar function are often aggregated in the same genomic region it is
   possible that more than two genes within a given locus are influenced
   by the associated variants and thus play a role in disease risk.
   Therefore, for a given locus, the overall distribution of the cancer
   score should be considered when selecting candidates for further
   fine-mapping, functional annotation or experimental approaches.

   In this study, we used the GWAS cancer score to prioritize breast
   cancer genes as a proof-of-principle since the number of independent
   loci is among the largest for breast cancer compared to other types of
   tumours. While our GWAS cancer score is based on features associated
   with GWAS genes in any type of cancer, we cannot state with certainty
   that our approach will work in a similar fashion for other cancer types
   or for the other classes of cancer genes. Notably, we found that the
   somatic score is able to explain around 10.5% of the variance of the
   driver gene class status, probably because they share very similar
   feature profiles. The driver gene set is based on a pan-cancer approach
   and includes driver genes from all cancer types included in TCGA,
   implicating that our algorithm may indeed be applicable in a pan-cancer
   setting as well.

   Our current approach can potentially be augmented by including
   additional databases with additional and/or more specific features. For
   instance, aggregating codon level information to gene level from
   SNVBox^[105]38 or by downloading bulk gene information via
   BioThings^[106]39 may provide increase discriminatory power to
   distinguish cancer from non-cancer gene. Importantly, our methodology
   can be applied to other gene sets as well, which can be constructed
   from genes involved in various diseases or disease groups, from genes
   located in certain regions of the genome or from genes within specific
   pathways or processes.

Conclusion

   In conclusion, we have identified multiple omic features of different
   classes of cancer genes, which reveal novel insights into the molecular
   properties of cancer genes. Those features are generally similar
   between all investigated classes of cancer genes, although germline and
   somatic cancer genes share the fewest features, thus potentially
   reflecting different evolutionary pressure on those two classes of
   cancer genes. Importantly, the features can be utilized to prioritize
   candidate genes for future functional studies and may potentially be
   useful to support (pan-)cancer gene discovery in large-scale sequencing
   efforts of cancer patients. Our approach can possibly be applied to
   other gene sets as well to establish unique and shared genomic feature
   profiles.

Supplementary information

   [107]Supplementary Information 1.^ (1.6MB, docx)
   [108]Supplementary Information 2.^ (12.1KB, xlsx)
   [109]Supplementary Information 3.^ (19.6KB, xlsx)
   [110]Supplementary Information 4.^ (16.7KB, xlsx)

Author contributions

   Conception and design: F.G.; financial support: K.C.; collection and
   assembly of data: F.G., Y.P.; data analysis and interpretation: F.G.,
   Y.P.; Manuscript writing: F.G., Y.P., K.C.; Final approval of
   manuscript: All authors.

Funding

   Open Access funding provided by Karolinska Institute. This work was
   financed by the Swedish Research Council (Grant 2018-02547 to KC), the
   Swedish Cancer Society (Grant CAN 19 0266 to KC), the Stockholm County
   Council (Grant No. 20170088 to KC) and the Karolinska Institutet’s
   Research Foundation (Grant 2018-02146 to FG). FG was a Leopoldina
   Postdoctoral Fellow (Grant No. LPDS 2018-06) funded by the Academy of
   Sciences Leopoldina. The funders had no role in study design, data
   collection and analysis, decision to publish, or preparation of the
   manuscript.

Data availability

   The data was exclusively retrieved from public repositories and can be
   accessed from the sources as mentioned in the Methods section. The
   source code for the computation of the aggregate data will be made
   available at [111]https://github.com/GrassmannLab/CancerGeneFeatures.

Competing interests

   The authors declare no competing interests.

Footnotes

   Publisher's note

   Springer Nature remains neutral with regard to jurisdictional claims in
   published maps and institutional affiliations.

Supplementary information

   is available for this paper at 10.1038/s41598-020-75400-2.

References