Abstract

   The catalog of cancer driver mutations in protein-coding genes has
   greatly expanded in the past decade. However, non-coding cancer driver
   mutations are less well-characterized and only a handful of recurrent
   non-coding mutations, most notably TERT promoter mutations, have been
   reported. Here, as part of the ICGC/TCGA Pan-Cancer Analysis of Whole
   Genomes (PCAWG) Consortium, which aggregated whole genome sequencing
   data from 2658 cancer across 38 tumor types, we perform multi-faceted
   pathway and network analyses of non-coding mutations across 2583 whole
   cancer genomes from 27 tumor types compiled by the ICGC/TCGA PCAWG
   project that was motivated by the success of pathway and network
   analyses in prioritizing rare mutations in protein-coding genes. While
   few non-coding genomic elements are recurrently mutated in this cohort,
   we identify 93 genes harboring non-coding mutations that cluster into
   several modules of interacting proteins. Among these are promoter
   mutations associated with reduced mRNA expression in TP53, TLE4, and
   TCF4. We find that biological processes had variable proportions of
   coding and non-coding mutations, with chromatin remodeling and
   proliferation pathways altered primarily by coding mutations, while
   developmental pathways, including Wnt and Notch, altered by both coding
   and non-coding mutations. RNA splicing is primarily altered by
   non-coding mutations in this cohort, and samples containing non-coding
   mutations in well-known RNA splicing factors exhibit similar gene
   expression signatures as samples with coding mutations in these genes.
   These analyses contribute a new repertoire of possible cancer genes and
   mechanisms that are altered by non-coding mutations and offer insights
   into additional cancer vulnerabilities that can be investigated for
   potential therapeutic treatments.

   Subject terms: Cancer genomics, Cellular signalling networks
     __________________________________________________________________

   Understanding deregulation of biological pathways in cancer can provide
   insight into disease etiology and potential therapies. Here, as part of
   the PanCancer Analysis of Whole Genomes (PCAWG) consortium, the authors
   present pathway and network analysis of 2583 whole cancer genomes from
   27 tumour types.

Introduction

   Over the past decade, cancer genome sequencing efforts such as The
   Cancer Genome Atlas (TCGA) have identified millions of somatic
   aberrations; however, the annotation and interpretation of these
   aberrations remain a major challenge^[75]1. Specifically, while some
   somatic aberrations occur frequently in specific cancer types, there is
   a “long tail” of rare aberrations that are difficult to distinguish
   from random passenger aberrations in modestly sized patient
   cohorts^[76]2,[77]3. In many cancers, a significant proportion of
   patients do not have known driver mutations in protein-coding
   regions^[78]4, suggesting that additional driver mutations remain
   undiscovered. The vast majority of known driver mutations affect
   protein-coding regions. Only a few recurrent non-coding driver
   mutations, most notably mutations in the TERT promoter^[79]5,[80]6,
   have been identified. In other studies, a genome-wide analysis has
   identified recurrent mutations in several regulatory elements, and
   expression quantitative trait loci (eQTLs) analysis has identified
   non-coding somatic mutations that correlate with gene expression
   changes in some cancer types^[81]7.

   Cancer driver mutations unlock oncogenic properties of cells by
   altering the activity of hallmark pathways^[82]8. Accordingly, cancer
   genes have been shown to cluster in a small number of cellular pathways
   and interacting subnetworks^[83]3,[84]9. Consequently, pathway and
   network analysis has proven useful for implicating infrequently mutated
   genes as cancer genes based on their pathway membership and
   physical or regulatory interactions with recurrently mutated
   genes^[85]10–[86]14. However, the interactions between coding and
   non-coding driver mutations in known or novel pathways have not yet
   been systematically explored.

   As part of the Pan-Cancer Analysis of Whole Genomes (PCAWG) project of
   the International Cancer Genome Consortium (ICGC)^[87]15, we performed
   pathway and network analysis of coding and non-coding somatic mutations
   from 2583 tumors from 27 tumor types. The PCAWG consortium curated
   whole-genome sequencing data from a total of 2658 cancers across 38
   tumor types. In the marker paper^[88]15, this work provided the largest
   collection of uniformly processed cancer whole genomes to date with
   germline and somatic variants from reanalyzed sequencing data aligned
   to the human genome (reference build hs37d5) using standardized and
   highly accurate pipelines. Recent work from the PCAWG project of the
   ICGC reveals few recurrent non-coding drivers in analyses of individual
   genes and regulatory regions^[89]16. Here, we employ seven distinct
   pathways and network analysis methods and derive consensus sets of
   pathway-implicated driver (PID) genes from the predictions of these
   methods. Specifically, we identify a consensus set of 93
   high-confidence pathway-implicated driver genes using non-coding
   variants (PID-N) and a consensus set of 87 pathway-implicated driver
   genes using coding variants (PID-C). Both sets of PID genes,
   particularly the PID-N set, contain rarely mutated genes that interact
   with well-known cancer genes, but were not identified as significantly
   mutated by single gene tests^[90]16. In total, 121 novel PID-N and
   PID-C genes are revealed as promising candidates, expanding the
   landscape of driver mutations in cancer.

   We examined the relative contributions of coding and non-coding
   mutations in altering biological processes, finding that while
   chromatin remodeling and some well-known signaling and proliferation
   pathways are altered primarily by coding mutations, other important
   cancer pathways, including developmental pathways, such as Wnt and
   Notch, are altered by both coding and non-coding mutations in PID
   genes. Intriguingly, we find many non-coding mutations in PID-N genes
   with roles in RNA splicing, and samples with these non-coding mutations
   exhibit similar gene expression signatures as samples with well-known
   coding mutations in RNA splicing factors. Our analysis demonstrates
   that somatic non-coding mutations in untranslated and in cis regulatory
   regions constitute a complementary set of genetic perturbations with
   respect to coding mutations, affect several biological pathways and
   molecular interaction networks, and should be further investigated for
   their role in the onset and progression of cancer.

Results

The long tail of coding and non-coding driver mutations

   We analyzed the genes targeted by single-nucleotide variants (SNVs) and
   short insertions and deletions (indels) identified by whole-genome
   sequencing in the 2538 ICGC PCAWG tumor samples from 27 tumor types.
   Our pathway and network analyses focused on a subset of 2252 tumors
   that excluded melanomas and lymphomas due to their atypical
   distributions of mutations in regulatory regions^[91]17. We utilized
   the pan-cancer driver p-values of single protein-coding and non-coding
   elements from the PCAWG Drivers and Functional Interpretation Working
   Group analysis^[92]16, including exons, promoters, untranslated regions
   (5′ UTR and 3′ UTR), and enhancers. This analysis integrates
   predictions from 16 driver discovery methods, resulting in consensus
   driver p-values for coding and non-coding elements; see ref. ^[93]16
   for further details. The p-values of individual genes and non-coding
   elements indicate their statistical significance as drivers, according
   to diverse methods that account for positive selection, functional
   impact of mutations, regional mutation rates, and mutational processes
   and signatures^[94]16. Among protein-coding driver p-values of the
   pan-cancer cohort, 75 genes were statistically significant (FDR < 0.1;
   Supplementary Fig. [95]1) and an additional 7 genes were observed at
   near-significant levels (0.1 ≤ FDR < 0.25). These numbers are
   consistent with previous reports of a “long tail” of driver genes with
   few highly mutated genes and many genes with infrequent mutations
   across cancer types^[96]2,[97]18. Non-coding mutations exhibit a
   similar long-tail distribution with even fewer significant genes (eight
   genes at FDR < 0.1 and two genes at 0.1 ≤ FDR < 0.25). No single gene
   has both significant or near-significant coding and non-coding driver
   p-values (FDR < 0.25), suggesting that non-coding mutations target a
   complementary set of genes as coding mutations.

   Earlier studies have demonstrated that proteins harboring coding driver
   mutations interact with each other in molecular pathways and networks
   significantly more frequently than expected by
   chance^[98]2,[99]3,[100]9–[101]11,[102]13. We observed significant
   numbers of interactions between both significantly mutated coding
   and/or non-coding elements, suggesting that the pathway and network
   methods may be useful in prioritizing rare driver events that are not
   significant by single-element analyses (Supplementary Fig. [103]2;
   Coding and non-coding mutations cluster on networks in Supplementary
   file).

Pathway and network analysis of potential driver mutations

   We performed a comprehensive pathway and network analysis of cancer
   drivers using the single-element driver p-values computed by the PCAWG
   Drivers and Functional Interpretation Working Group^[104]16 as input.
   We applied seven distinct pathways and network methods
   (ActivePathways^[105]19, CanIsoNet^[106]20, Hierarchical
   HotNet^[107]21, a hypergeometric analysis (Vazquez), an induced
   subnetwork analysis (Reyna and Raphael, in preparation), NBDI^[108]22,
   and SSA-ME^[109]23) that each leverage information from molecular
   pathways or protein interaction networks (Fig. [110]1, Methods section)
   to amplify weak signals in the single-element analysis. All methods
   were calibrated on randomized data (Pathway and network methods in
   Supplementary file).

Fig. 1. Overview of the pathway and network analysis approach.

   [111]Fig. 1
   [112]Open in a new tab

   Coding, non-coding, and combined gene scores were derived for each gene
   by aggregating driver p-values from the PCAWG driver predictions in
   individual elements, including annotated coding and non-coding elements
   (promoter, 5′ UTR, 3′ UTR, and enhancer). These gene scores were input
   to five network analysis algorithms (CanIsoNet^[113]20, Hierarchical
   HotNet^[114]21, an induced subnetwork analysis (Reyna and Raphael, in
   preparation), NBDI^[115]22, and SSA-ME^[116]23), which utilize multiple
   protein–protein interaction networks, and to two pathway analysis
   algorithms (ActivePathways^[117]19 and a hypergeometric analysis
   (Vazquez)), which utilize multiple pathway/gene-set databases. We
   defined a non-coding value-added (NCVA) procedure to determine genes
   whose non-coding scores contribute significantly to the results of the
   combined coding and non-coding analysis, where NCVA results for a
   method augment its results on non-coding data. We defined a consensus
   procedure to combine significant pathways and networks identified by
   these seven algorithms. The 87 pathway-implicated driver genes with
   coding variants (PID-C) are the set of genes reported by a majority
   (≥4/7) of methods on the coding data. The 93 pathway-implicated driver
   genes with non-coding variants (PID-N) are the set of genes reported by
   a majority of methods on non-coding data or in their NCVA results. Only
   five genes (CTNNB1, DDX3X, SF3B1, TGFBR2, and TP53) are both PID-C and
   PID-N genes.

   Since the prioritization of non-coding somatic mutations in cancer is
   not yet a solved problem, it was difficult to know in advance which
   analysis methodologies, if any, would be best suited to distinguish
   drivers from passengers by aggregating weak signals across pathways or
   networks. Thus, we formed a consensus of multiple methods, following
   the “wisdom of crowds” ensemble approach of machine learning^[118]24 to
   improve the specificity of the results. We included methods that used
   different sources of pathway or network information and different
   prioritization criteria (see Supplementary Data [119]1 for a complete
   list). Each method nominated genes, and consensus sets of genes with
   possible coding and non-coding driver mutations were defined as the
   genes found by at least four of the seven methods (Supplementary
   Data [120]2–[121]5). We use the term pathway-implicated driver (PID)
   genes to describe these candidate driver genes.

   One potential concern with a consensus procedure is that the results
   may be dominated by a few highly correlated methods. Our pathway or
   network analysis methods use varied sources of prior knowledge (i.e.,
   pathway databases or interaction networks), and input data (e.g.,
   driver p-values, point mutations, and/or gene expression), and rely on
   different techniques to integrate these data sources. We found only a
   modest overlap between the output of the seven methods (Method results
   comparison and Consensus procedure in Supplementary file; Supplementary
   Data [122]6–[123]8), suggesting a non-uniform weighting of the
   consensus to mitigate the influence of redundant methods was not
   necessary.

   Using coding mutations alone, we identify a set of 87
   pathway-implicated driver genes with coding variants (PID-C genes). The
   87 PID-C genes (Supplementary Data [124]2, Supplementary Fig. [125]6a)
   include 68 previously identified cancer genes as catalogued by the
   COSMIC Cancer Gene Census (CGC) database (v83, 699 genes from Tier 1
   and Tier 2)^[126]25 (2.98 genes expected; Fisher′s exact test
   p = 3.57 × 10^−83; Fig. [127]2a, c; Supplementary Fig. [128]7a). The
   PID-C genes have significantly higher coding gene scores than non-PID-C
   genes (rank-sum test p = 1.72 × 10^−58; median rank 48 of PID-C genes),
   and each of the 87 PID-C genes improves the score of its network
   neighborhood (19.7 genes expected; p < 10^−6; Supplementary
   Data [129]9). This network neighborhood analysis shows that PID-C genes
   are not implicated solely by their network neighbors^[130]14, but
   themselves contribute significantly to their discovery by the pathway
   and network methods. The 87 PID-C genes also include 31 genes that are
   not statistically significant (FDR > 0.1) in the PCAWG Drivers and
   Functional Interpretation Working Group analysis; Fig. [131]2a, c;
   Supplementary Figs. [132]8a, [133]9), illustrating that the network
   neighborhoods can nominate genes with infrequent mutations, i.e., those
   in the “long tail”, as possible driver genes. Interestingly, 13 of
   these 31 genes with FDR > 0.1 are also known drivers according to the
   CGC database (3.0 genes expected; Fisher’s exact test
   p = 2.1 × 10^−14). Thus, the consensus pathway and network analysis
   recovers many known protein-coding driver mutations and identifies
   additional possible drivers that are infrequently mutated and thus
   remain below the statistical significance threshold of gene-specific
   driver analyses.

Fig. 2. Pathway and driver analysis identifies driver genes in the long
tail of the driver p-values for coding and non-coding mutations.

   [134]Fig. 2
   [135]Open in a new tab

   a Pathway and network methods identify significant coding driver
   mutations. Driver p-values on protein-coding elements for 250 genes
   with most significant coding driver p-values; dashed and dotted lines
   indicate FDR = 0.1 and 0.25, respectively. Dark green bars are PID-C
   genes, and light green bars are non-PID-C genes. Blue squares below the
   x-axis indicate COSMIC Cancer Gene (CGC) Census genes. In total, 31 of
   87 PID-C genes have coding driver p-values with FDR > 0.1. Several
   PID-C genes are labeled, including all CGC genes with coding FDR > 0.1.
   Overlap between PID-C and PID-N genes is indicated with asterisks.
   Source data are provided as a Source Data file. b Pathway and network
   methods identify rare non-coding driver mutations. Driver p-values on
   non-coding elements (promoter, 5′ UTR, and 3′ UTR of gene) for 250
   genes with most significant non-coding driver p-values; dashed and
   dotted lines indicate FDR = 0.1 and 0.25, respectively. Dark yellow
   bars are PID-N genes, and light yellow bars are non-PID-N genes. Blue
   squares are as above. In total, 3 (TERT, HES1, and TOB1) of 93 PID-N
   genes have non-coding driver p-values with FDR ≤ 0.1, while 90 have FDR
   > 0.1 . Several PID-N genes are labeled, including PID-N genes with
   significant in cis gene expression changes (see Fig. [136]3) and all
   PID-N genes with non-coding FDR > 0.25. Overlap between PID-C and PID-N
   genes is indicated with asterisks. Source data are provided as a Source
   Data file. c Statistical significance of overlap between top-ranked
   genes according to coding driver p-values and PID-C genes with CGC
   genes. Fisher’s exact test p-values and driver FDR thresholds of 0.1
   and 0.25 are highlighted. Green squares indicate overlap between PID-C
   genes and CGC genes. Source data are provided as a Source Data file. d
   Statistical significance of overlap of genes ranked by driver p-values
   on non-coding (promoter, 5′ UTR, 3′ UTR) elements and CGC genes. Driver
   FDR thresholds of 0.1 and 0.25 are highlighted. Yellow square indicates
   overlap between PID-N genes and CGC genes. Source data are provided as
   a Source Data file.

   Using non-coding mutations alone, we identify a set of 62 genes using
   our consensus pathway and network analysis, resulting in fewer genes
   than our analysis with coding mutations. However, when we performed a
   joint analysis of coding and non-coding mutations, we found that the
   much stronger signal in coding mutations dominated the combined signal
   in coding and non-coding mutations. To increase sensitivity to detect
   contributions of non-coding mutations, we devised a “non-coding
   value-added” (NCVA) procedure (Fig. [137]1; Supplementary Fig. [138]3;
   Non-coding value-added (NCVA) procedure in the Methods section). Our
   NCVA procedure asks if the coding mutations enhance the discovery of
   potential non-coding driver genes beyond what is found with only the
   non-coding mutations. This procedure identified an additional set of 31
   genes that, when merged with the 62 genes found with non-coding
   mutations alone, resulted in a set of 93 pathway-implicated driver
   genes with non-coding variants (PID-N) (Supplementary Fig. [139]4,
   Consensus results in the Methods section). PID-N genes appear as a
   robust and biologically relevant set, unbiased by any particular
   mutational process reflecting a particular carcinogen or DNA damage
   processes (Supplementary Fig. [140]5, Mutational signatures in the
   Methods section).

   The 93 PID-N genes (Supplementary Data [141]3, Supplementary
   Fig. [142]6b) include 19 previously identified cancer genes according
   to the COSMIC Cancer Gene Census (CGC) database a significant
   enrichment over the 3.2 genes expected (p = 5.3 × 10^−11; Fisher’s
   exact test, Fig. [143]2b, d; Supplementary Figs. [144]7b, c). Excluding
   the eight genes with individually significant non-coding elements in
   the PCAWG Drivers and Functional Interpretation Working Group
   analysis^[145]16, 19 genes are both PID-N genes and CGC genes, a
   significant enrichment over the 3.1 genes expected (p = 5.3 × 10^−11;
   Fisher’s exact test), suggesting that non-coding mutations may alter
   genes with recurrent coding or structural variants in some samples. The
   PID-N genes have significantly higher non-coding gene scores than
   non-PID-N genes (rank-sum test p = 1.47 × 10^−58; median rank 165 of
   PID-N genes), and 92/93 PID-N genes (except for HIST1H2BO) improve the
   scores of their network neighborhoods (28.5 genes expected; p < 10^−6;
   Supplementary Data [146]10). This shows that PID-N genes are not
   implicated solely by their network neighbors^[147]14. The vast majority
   of PID-N genes (90/93, including the 19 CGC genes) are distinct from
   the PCAWG Drivers and Functional Interpretation Working Group analysis
   (Fig. [148]2b; Supplementary Figs. [149]8b, [150]9), with only three
   genes in common: TERT, HES1, and TOB1. Of these three, only TERT is
   recorded as a known cancer gene in the CGC database. Moreover, the 93
   PID-N genes are more strongly enriched (p = 5.3 × 10^−11; Fisher’s
   exact test) for COSMIC CGC genes than the 93 genes with the smallest
   non-coding driver p-values of promoters, 5′ UTRs, or 3′ UTRs
   (p = 4.8 × 10^−3; Fisher’s exact test). Thus, our consensus procedure
   of the pathway and network analyses appreciably augments the
   significantly mutated elements in the PCAWG Drivers and Functional
   Interpretation Working Group results^[151]16.

   Taken together, the PID-C and PID-N genes contain an additional 121
   genes over what was found in the PCAWG Drivers and Functional
   Interpretation Working Group analysis, including 90 new possible
   non-coding drivers (Consensus Results in the Methods section). In
   total, non-coding mutations in PID-N genes cover an additional 151
   samples (9.1% of samples) than PID-C genes. We found that most coding
   mutations in PID-C genes and most non-coding mutations in PID-N genes
   are clonal (median > 80% for both PID-C and PID-N genes^[152]26). In
   addition, the overwhelming majority of the PID-N genes were distinct
   from PID-C genes (Supplementary Fig. [153]4) with only five genes in
   common: CTNNB1, DDX3X, SF3B1, TGFBR2, and TP53. While this suggests
   that coding and non-coding driver mutations occur in largely distinct
   sets of cancer genes, we show below that both types of mutations affect
   genes underlying many of the same hallmark cancer processes.

Impact of non-coding mutations on gene expression

   Non-coding mutations may act by altering transcription factor-binding
   sites or other types of regulatory sites. Thus, we evaluated whether
   non-coding mutations in PID-N genes were associated with in cis
   expression changes in the same gene. We found that five PID-N genes
   (FDR < 0.3) showed significant in cis expression correlations out of
   the 90 that could be tested using RNA-Seq data (Fig. [154]3;
   Supplementary Fig. [155]10; Supplementary Data [156]11–[157]16). In
   contrast, 34 out of 87 PID-C genes exhibited significant or
   near-significant in cis expression correlations (FDR < 0.3)
   (Supplementary Data [158]17, [159]18).

Fig. 3. Gene expression changes are correlated with mutations in PID-N genes.

   [160]Fig. 3
   [161]Open in a new tab

   Evolutionary conservation of genomic elements estimated with PhyloP are
   shown as gray features. H3 histone lysine 4 tri-methylation sites
   (H3K4me3) measured in GM12878 HapMap B-lymphocytes cell lines are
   highlighted in the green track, indicating active promoter regions near
   transcription start sites^[162]49. Boxplot center lines show the
   median, boxplot bounds show the first quartile Q1 and the third
   quartile Q3, and whiskers show 1.5 (Q3–Q1) below and above Q1 and Q3,
   respectively. a TP53 promoter. TP53 coding and non-coding genomic loci
   with zoomed-in view of the TP53 promoter region. TP53 promoter
   mutations (six mutations in Biliary-AdenoCA, ColoRect-AdenoCA,
   Kidney-ChRCC, Lung-SCC, Ovary-AdenoCA, and Panc-AdenoCA cancer types)
   correlate significantly (Wilcoxon rank-sum test p = 0.001, FDR = 0.087)
   with reduced TP53 gene expression, where FPKM-UQ is upper quartile
   normalized fragments per kilobase million. Samples with copy-number
   gains and losses in the TP53 promoter region are annotated in red and
   blue, respectively. Two of the six TP53 promoter mutations overlap with
   transcription factor-binding sites (with one mutation matching three
   motifs). Source data are provided as a Source Data file. b TLE4
   promoter. TLE4 coding and non-coding genomic loci with zoomed-in view
   of TLE4 promoter region. TLE4 promoter mutations in Liver-HCC samples
   (three mutations) correlate (Wilcoxon rank-sum test p = 0.02,
   FDR = 0.2) with lower TLE4 gene expression. Samples with copy-number
   gains and losses annotated in red and blue, respectively. One of the
   three TLE4 promoter mutations has a transcription factor-binding site
   for ZNF263. Source data are provided as a Source Data file. c TCF4
   promoter. TCF4 coding and non-coding genomic loci with zoomed-in view
   of TCF4 promoter region. TCF4 promoter mutations in Lung-SCC samples
   (three mutations) correlate (Wilcoxon rank-sum test p = 0.03,
   FDR = 0.27) with lower TCF4 gene expression. Samples with copy-number
   gains and losses annotated in red and blue, respectively. One of the
   three TCF4 promoter mutations has a transcription factor-binding site
   for ZEB1. Source data are provided as a Source Data file.

   Unsurprisingly, the most significant in cis expression correlation for
   a PID-N gene is the correlation between TERT promoter mutations and
   increased expression, which we find in 11 Thy-AdenoCA tumors
   (p = 1.3 × 10^−10, FDR = 3.2 × 10^−9; Wilcoxon rank-sum test), 11
   CNS-Oligo tumors (p = 6.8 × 10^−3, FDR = 9.7 × 10^−2; Wilcoxon rank-sum
   test), and 22 CNS-GBM tumors (Wilcoxon rank-sum test p = 2.3 × 10^−2,
   FDR = 0.19; Wilcoxon rank-sum test) (Supplementary Fig. [163]8),
   consistent with previous reports^[164]5,[165]6,[166]27. Note that these
   associations were limited by the unavailability of RNA expression data
   for some samples with TERT mutations as well as the low-sequencing
   coverage in promoter regions that limited the detection of TERT
   promoter mutations. The PCAWG Drivers and Functional Interpretation
   Working Group investigated the latter issue for two mutation hotspot
   sites in the TERT promoter, and estimated that 216 mutations in these
   sites were likely not called^[167]16, a large underrepresentation
   relative to the total of 97 samples with TERT promoter mutations (71
   samples with expression data) in our analyses.

   We found significant in cis expression correlations in four other PID-N
   genes: TP53, TLE4, TCF4, and DUSP22 (Fig. [168]3, Supplementary
   Fig. [169]10). TP53 shows significantly reduced expression
   (p = 1.0 × 10^−3; FDR = 8.7 × 10^−2; Wilcoxon rank-sum test) across six
   tumors with TP53 promoter mutations from six different tumor types
   (Fig. [170]3a; Supplementary Fig. [171]10). The reduced expression of
   mutated samples is consistent with TP53’s well-known role as a tumor
   suppressor gene, and links between TP53 promoter methylation and
   expression have been previously investigated^[172]28. This expression
   change was also described in the study by the PCAWG Drivers and
   Functional Interpretation Working Group^[173]16. TLE4 shows
   significantly reduced expression (p = 1.7 × 10^−2; FDR = 0.20; Wilcoxon
   rank-sum test) in three Liver-HCC tumors with TLE4 promoter mutations
   (Fig. [174]3b; Supplementary Fig. [175]10). TLE4 is a transcriptional
   co-repressor that binds to several transcription factors^[176]29, and
   TLE4 functions as a tumor suppressor gene in acute myeloid lymphoma
   through its interactions with Wnt signaling^[177]30. Furthermore, in an
   acute myeloid lymphoma cell line, TLE4 knockdown increased cell
   division rates, while forced TLE4 expression induced apoptosis^[178]31.
   However, the role of TLE4 in solid tumors is not well understood. TCF4
   shows significantly reduced expression (p = 3.4 × 10^−2; FDR = 0.27;
   Wilcoxon rank-sum test) in three Lung-SCC tumors with TCF4 promoter
   mutations (Fig. [179]3c; Supplementary Fig. [180]10). TCF4 is part of
   the TCF4/β-catenin complex and encodes a transcription factor that is
   downstream of the Wnt signaling pathway. Low TCF4 expression has been
   observed in Lung-SCC tumors^[181]32. Finally, DUSP22 has significantly
   reduced expressed (p = 6.3 × 10^−3; FDR = 0.024; Wilcoxon rank-sum
   test) in five Lung-AdenoCA patients with DUSP22 3′ UTR mutations and
   significantly over-expressed (p = 7.8 × 10^−4; FDR = 0.075; Wilcoxon
   rank-sum test) in three Lung-AdenoCA patients with DUSP22 5′ UTR
   mutations. These UTR mutations were mutually exclusive. DUSP22 encodes
   a phosphatase signaling protein, and was recently proposed to be a
   tumor suppressor in lymphoma^[182]33.

   While these gene expression correlations provide additional support for
   a subset of PID-N genes, the variant allele frequency of a mutation and
   the copy number of the gene are additional covariates for gene
   expression. We found that these covariates did not play a role in of
   the correlations that we identified: the majority of mutations in each
   PID gene were clonal (Supplementary Fig. [183]11) and copy-number
   changes did not affect the expression correlations for the five PID-N
   genes described above (Fig. [184]3; Supplementary Fig. [185]10). In
   addition, the low number of PID-N genes exhibiting associated gene
   expression changes is explained by the low number of samples with
   mutations in PID-N genes, the uneven availability of expression data
   across the tumor types, and decreased sequence coverage in promoter
   regions^[186]16. These issues further reduced the number of samples
   with non-coding mutations and RNA expression, limiting the power of in
   cis gene expression correlation analysis.

Modular organization of coding and non-coding mutations

   We identified specific protein–protein interaction subnetworks and
   biological pathways that were altered by coding mutations, non-coding
   mutations, or a combination of both types of mutations. We found
   significantly more interactions between PID-C genes that expected by
   chance using a node-degree preserving permutation test (64 interactions
   observed vs. 40 interactions expected, p < 10^−6), a near-significant
   number of interactions between PID-N genes (18 vs. 12 expected,
   p = 6.8 × 10^−2), and significantly more interactions between both
   PID-C and PID-N genes (67 vs. 40 expected, p = 6 × 10^−4),
   demonstrating an interplay between coding and non-coding mutations on
   physical protein–protein interaction networks (Network annotation in
   the Methods section). We organized the interacting subnetworks
   involving PID-C and PID-N genes into five biological processes: core
   drivers, chromatin organization, cell proliferation, development, and
   RNA splicing (Fig. [187]4a). While the high frequency of molecular
   interactions between PID-C and PID-N genes is expected since such
   interactions were used as a signal in pathway and network methods, the
   organization of these interactions illustrates the relative
   contributions of coding and non-coding mutations in individual
   subnetworks.

Fig. 4. Pathway and network modules containing PID-C and PID-N genes.

   [188]Fig. 4

   [189]Fig. 4
   [190]Open in a new tab

   a Network of functional interactions between PID-C and PID-N genes.
   Nodes represent PID-C and PID-N genes and edges show functional
   interactions from the ReactomeFI network (gray), physical
   protein–protein interactions from the BioGRID network (blue), or
   interactions recorded in both networks (purple). Node color indicates
   PID-C genes (green), PID-N genes (yellow), or both PID-C and PID-N
   genes (orange); node size is proportional to the score of the gene; and
   the pie chart diagram in each node represents the relative proportions
   of coding and non-coding mutations associated with the corresponding
   gene. Dotted outlines indicate clusters of genes with roles in
   chromatin organization and cell proliferation, which predominantly
   contain PID-C genes; development, which includes comparable amounts of
   PID-C and PID-N genes; and RNA splicing, which contains PID-N genes. A
   core cluster of genes with many known drivers is also indicated. b
   Pathway modules containing PID-C and PID-N genes. Each row in the
   matrix corresponds to a PID-C or PID-N gene, and each column in the
   matrix corresponds to a pathway module enriched in PID-C and/or PID-N
   genes (see the Methods section). A filled entry indicates a gene (row)
   that belongs to one or more pathways (column) colored according to gene
   membership in PID-C genes (green), PID-N genes (yellow), or both PID-C
   and PID-N genes (orange). A dark colored entry indicates that a PID-C
   or PID-N gene belongs to a pathway that is significantly enriched for
   PID-C or PID-N genes, respectively. A lightly colored entry indicates
   that a PID-C or PID-N gene belongs to a pathway that is significantly
   enriched for the union of PID-C and PID-N genes, but not for PID-C or
   PID-N genes separately. Enrichments are summarized by circles adjacent
   each pathway module name and PID gene name. Boxed circles indicate that
   a pathway module contains a pathway that is significantly more enriched
   for the union of the PID-C and PID-N genes than the PID-C and PID-N
   results separately. The enriched modules and PID genes are clustered
   into four biological processes: chromatin, development, proliferation,
   and RNA splicing as indicated.

   We further characterized the molecular pathways enriched among our
   PID-C and PID-N using the g:Profiler web server^[191]34 (Fig. [192]4b;
   Supplementary Fig. [193]12, Supplementary Data [194]19–[195]24, Pathway
   annotation in the Methods section). Overall, 63 pathways were enriched
   for PID-C genes and 13 pathways were enriched for PID-N genes
   (FDR < 10^−6). Since our gene-prioritization methods use pathway
   databases and interaction networks as prior knowledge, it is not
   surprising that PID-C and PID-N genes are enriched in multiple
   molecular pathways. However, the enrichment results provide clues about
   the modular organization of the pathways and allow us to assess the
   relative contributions of coding and non-coding mutations in each
   pathway.

   We further grouped these molecular pathways into 29 modules using
   overlaps between annotated pathways recorded in the pathway enrichment
   map (Supplementary Fig. [196]12). For each enriched module, we examined
   whether PID-C, PID-N, or both types of genes were responsible for the
   observed enrichment. This produced a clustering of modules and PID
   genes into four biological processes: chromatin organization, cell
   proliferation, development, and RNA splicing (Fig. [197]4b).

   We found that pathways in the chromatin and cell proliferation
   processes—including chromatin remodeling and organization, histone
   modification, apoptotic signaling, signal transduction, Ras signaling,
   and cell growth—were altered primarily by coding mutations in PID-C
   genes. This is not surprising as these pathways contain many well-known
   cancer genes, such as TP53, KRAS, BRAF, cyclin-dependent kinase
   inhibitors, EGFR, PTEN, and RB1.

   At the same time, we found that multiple signaling pathways include
   significant numbers of both PID-C and PID-N genes, suggesting that
   non-coding mutations provide an alternative to coding mutations in
   disrupting these pathways. In particular, the Wnt signaling pathway
   (FDR = 6.8 × 10^−13), which was predominantly targeted by coding
   mutations, was also targeted by non-coding mutations in several PID-N
   genes, including TERT (103 mutations), HNF1A/B (24 mutations), TLE4 (32
   mutations), TCF4 (93 mutations), and CTNNB1 (17 mutations)
   (Supplementary Fig. [198]13a). The Notch signaling pathway
   (FDR = 6.8 × 10^−7) was associated with comparable numbers of PID-C and
   PID-N genes, including the PID-N genes JAG1 and MIB1 that encode
   ligands and the PID-N transcription factors ACL1, HES1, and HNF1B (66
   non-coding mutations in total) (Supplementary Fig. [199]13b). The TGF-β
   signaling pathway (FDR = 3.2 × 10^−7) also contained both PID-C and
   PID-N genes, including the PID-N genes HES1, HNF1A/B, HSPA5, MEF2C, and
   the genes TGFBR2 and CTNNB1, which are both PID-C and PID-N genes (214
   coding mutations and 166 non-coding mutations).

   We found that several developmental processes were altered by
   significant numbers of both PID-C and PID-N genes. Cell fate
   determination (FDR = 2.0 × 10^−7) was predominantly affected by
   non-coding mutations in the PID-N genes DUSP6, MEF2C, JAG1, SOX2, HES1,
   ACL1, ID2, SUFU, and KLF4 (total 191 non-coding mutations), but also by
   coding mutations in PID-C genes BRAF, GATA3, and NOTCH1/2. Pathways
   related to nervous system development (FDR = 5.8 × 10^−8) were enriched
   for the PID-N genes ASCL1, CTNNB1, ID2, SUFU, and TERT that have known
   roles in cancer^[200]35,[201]36, complementing the PID-C genes NOTCH1,
   PTEN, and RHOA that also have known cancer roles. The pattern
   specification process (FDR = 8.8 × 10^−8) was also affected by both
   coding and non-coding mutations, including the PID-N genes ASCL1, SUFU,
   and RELN and the PID-C genes ATM and SMAD4. In these cases, non-coding
   mutations complement the coding mutations that disrupt these pathways,
   covering additional patients.

   Intriguingly, we find that RNA splicing pathways were affected
   primarily by non-coding mutations (FDR = 7.6 × 10^−9). A total of 17
   PID-N genes involved in splicing-related pathways (Supplementary
   Fig. [202]13c), including several heterogeneous nuclear
   ribonucleoproteins (hnNRP) and serine- and arginine-rich splicing
   factors (SRSFs). None of these PID-N genes were significantly mutated
   according to single-element tests used in the PCAWG Drivers and
   Functional Interpretation Working Group analysis^[203]16.

   As we did not find any significant (FDR < 0.1) in cis associations
   between non-coding mutations and altered expression in splicing-related
   PID-N genes, we explored potential in trans effects between non-coding
   mutations in these genes and expression of other genes. We found that
   non-coding mutations in splicing-related PID-N genes largely
   recapitulate a recently published association from a TCGA PanCanAtlas
   analysis^[204]37 between coding mutations in several splicing factors
   and differential expression of 47 pathways (Fig. [205]5). In
   particular, we identified three clusters of mutations in RNA splicing
   genes (C1, C2, and C3; Fig. [206]5a, b) using hierarchical clustering
   of differential expression patterns across these pathways. A highly
   overlapping set of clusters was found using t-distributed stochastic
   neighbor embedding (top annotation bar in Fig. [207]5a) showing that
   the clusters were robust to the choice of the clustering method.
   Further support for robustness of clusters was found via silhouette
   scores and bootstrapping (Supplementary Fig. [208]14). Each of these
   clusters contained at least one coding mutation in the splicing genes
   SF3B1, FUBP1, and RBM10, as reported previously^[209]37, along with
   non-coding mutations in splicing-related PID-N genes, demonstrating
   that both types of mutations resulted in similar gene expression
   signatures. The joint analysis of coding and non-coding mutations in
   splicing factors also recovered the two groups of enriched
   pathways^[210]37 (P1 and P2 in Fig. [211]5a; Supplementary
   Fig. [212]15). One group (P1) is characterized by immune cell
   signatures and the other group (P2) reflects mostly cell-autonomous
   gene signatures of cell cycle, DDR, and essential cellular
   machineries^[213]37. This similarity between the gene expression
   signatures for non-coding mutations in several PID-N splicing factors
   and the signatures previously reported for coding mutations in splicing
   factor genes^[214]37 supports a functional role for splicing-related
   PID-N genes in altering similar gene expression programs.

Fig. 5. RNA splicing factors are targeted primarily by non-coding mutations
and alter expression of similar pathways as coding mutations in splicing
factors.

   [215]Fig. 5
   [216]Open in a new tab

   a Heatmap of Gene Set Enrichment Analysis (GSEA) Normalized Enrichment
   Scores (NES). The columns of the matrix indicate non-coding mutations
   in splicing-related PID-N genes and coding mutations in splicing genes
   reported in ref. ^[217]37, and the rows of the matrix indicate 47
   curated gene sets^[218]37. Red heatmap entries represent an
   upregulation of the pathway in the mutated samples with respect to the
   non-mutated samples, and blue heatmap entries represent a
   downregulation. The first column annotation indicates mutation cluster
   membership according to common pathway regulation. The second column
   annotation indicates whether a mutation is a non-coding mutation in a
   PID-N gene or a coding mutation, with the third column annotation
   specifies the aberration type (promoter, 5′ UTR, 3′ UTR, missense, or
   truncating). The fourth column annotation indicates the cancer type for
   coding mutations. The mutations cluster into three groups: C1, C2, and
   C3. The pathways cluster into two groups: P1 and P2, where P1 contains
   an immune signature gene sets and P2 contains cell-autonomous gene
   sets. b t-SNE plot of mutated elements. Gene expression signatures for
   samples with non-coding mutations clusters in splicing-related PID-N
   genes with gene expression signatures for coding mutations in
   previously published splicing factors. The shape of each point denotes
   the mutation cluster assignment (C1, C2, or C3), and the color
   represents whether the corresponding gene is a PID-N gene with
   non-coding mutations or a splicing factor gene with coding mutations.

   In addition to the above modules, we also found that transcription
   factors were well represented among both the PID-C and PID-N genes. In
   total, nine PID-C genes are transcription factors (ARHGAP35, ARID2,
   FOXA1, GATA3, NFE2L2, SMAD4, SOX9, TCF7L2, TP53; FDR = 2.1 × 10^−10),
   while 19 PID-N genes are transcription factors (ASCL1, BHLHE40, ESRRG,
   HES1, HNF1A, HNF1B, HOXA10, HOXB5, KLF4, MEF2C, MYC, NFE2, NR2F1, SOX2,
   SOX4, TCF4, TP53, ZNF521, ZNF595; FDR = 4.1 × 10^−20). This observation
   suggests that non-coding mutations may affect transcriptional
   regulatory networks.

Discussion

   We present an integrative pathway and network analysis that expands the
   list of genes with possible non-coding driver mutations into the “long
   tail” of rarely mutated elements that are not significant by
   single-element analysis. We find that genes harboring both coding or
   non-coding mutations overlap in pathways and networks; thus, pathway
   databases and interaction networks serve as useful sources of prior
   knowledge to implicate additional mutated elements beyond those
   identified by single-element tests. In total, our integrative pathway
   and network analysis identified 87 pathway-implicated driver genes with
   coding variants (PID-C) and 93 pathway-implicated driver genes with
   non-coding variants (PID-N). Importantly, 90 PID-N genes were not
   statistically significant (FDR > 0.1) by single-element tests on
   non-coding mutation data, and these genes are key candidates for future
   experimental characterization. Among them, we find that promoter
   mutations in TP53, TLE4, and TCF4 are associated with reduced
   expression of these genes.

   We find that coding and non-coding driver mutations largely target
   different genes and make varying contributions to pathways and networks
   perturbed in cancer. While some cancer pathways are targeted by both
   coding and non-coding mutations, such as the Wnt and Notch signaling
   pathways, other pathways appear to be predominantly altered by one
   class of mutations. In particular, we find non-coding mutations in
   multiple genes in the RNA splicing pathway, and samples with these
   mutations exhibit gene expression signatures that are concordant with
   gene expression changes observed in samples with coding mutations
   splicing factors SF3B1, FUBP1, and RBM10^[219]37. Together these
   results demonstrate that rare non-coding mutations may result in
   similar perturbations to both common and complementary biological
   processes.

   There are several caveats to the results reported in this study. First,
   there is relatively low power to detect non-coding mutations in the
   cohort, particularly in cancer types with small numbers of patients.
   Second, transcriptomic data were available for only a subset of
   samples, further reducing our ability to validate our predictions using
   gene expression data. Third, our pathway and network analysis relied on
   the driver p-values from the PCAWG Drivers and Functional
   Interpretation Working Group analysis^[220]16. While this analysis
   accounts for regional variations in the background mutation rate across
   the genome, it is possible that these corrections are incomplete.
   Furthermore, if the uncorrected confounding variables are correlated
   with gene membership in pathways and subnetworks, then the false
   positive rates in our analysis may be higher than estimated. All of
   these factors, plus other unknown confounding variables, make it
   difficult to assess the false discovery rate of our predictions,
   particularly for PID-N genes. Further experimental validation of these
   predictions is necessary to determine the true positives from false
   positives in our PID gene lists.

   Because of limited power in individual cancer types, our pathway and
   network analysis focused on associations found across cancer and tissue
   types. Thus, we primarily utilized generic, tissue-independent network
   and pathway information. However, it is well known that gene–gene
   interactions vary across tissues and that cancer cells rewire signaling
   and regulatory networks. Future investigations that consider the
   variable landscape of regulatory and physical interactions across
   tissues may offer a new perspective on the data. Specifically, each
   cell type has a different epigenetic wiring and regulatory machinery,
   and non-coding mutations may target cell type-dependent
   vulnerabilities. Approaches that incorporate tissue-specific,
   cancer-specific, or patient-specific gene–gene regulatory information
   may reveal new classes of drivers unexplored with our current
   approaches.

   Our pathway- and network-driven strategies enable us to interpret the
   coding and non-coding landscape of tumor genomes to discover driver
   mechanisms in interconnected systems of genes. This approach has
   multiple benefits. First, by broadening our mutation analysis from
   single genomic elements to pathways and networks of multiple genes, we
   identify new components of known cancer pathways that are recurrently
   altered by both coding and non-coding mutations, and thus likely to be
   important in cancer. Second, we identify new pathways and subnetworks
   that would remain unseen in an analysis focusing on coding sequences.
   Investigation of the coding and non-coding mutations that perturb these
   pathways and networks will enable more accurate patient-stratification
   strategies, pathway-focused biomarkers, and therapeutic approaches.

Methods

Mutation and pathway data

   We used gene scores derived from the PCAWG Drivers and Functional
   Interpretation Working Group analysis^[221]16 and combined several
   pathways and interaction networks for our pathway and network analyses.
   Here, we use the term “pathway methods” to refer to approaches that
   utilize sets of related genes for their analysis and use the term
   “network methods” to refer to approaches that utilize pairwise
   interactions among genes and/or their products.

Somatic mutation data

   We obtained consensus driver p-values ([222]syn8494939) from the PCAWG
   Drivers and Functional Interpretation Working Group analysis^[223]16
   for coding and non-coding (core promoter, 5′ UTR, 3′ UTR, enhancers)
   genomic elements for the Pancan-no-skin–melanoma–lymph cohort. We
   removed driver p-values for several elements (H3F3A and HIST1H4D
   coding; LEPROTL1, TBC1D12, and WDR74 5′ UTR; and
   chr6:142705600-142706400 enhancer, which targets ADGRG6) that the PCAWG
   Drivers and Functional Interpretation Working Group analysis had
   manually examined and discarded. We included enhancers with ≤ 5 gene
   targets (syn7201027), which covered 89% of enhancers elements from the
   PCAWG Drivers and Functional Interpretation Working Group
   analysis^[224]16. In cases where the PCAWG Drivers and Functional
   Interpretation Working Group analysis reported multiple p-values for
   the same genomic element, we used the smallest reported p-value for
   that element.

Derivation of gene scores

   Pathway databases and gene interaction networks typically record
   information at the level of individual genes. Thus, we formed coding
   and non-coding gene scores by combining PCAWG driver p-values across
   coding and/or non-coding (core promoter, 5′ UTR, 3′ UTR, enhancer)
   genomic elements as follows. Let p[x](g) be the driver p-value for
   element x of gene g from the PCAWG Drivers and Functional
   Interpretation Working Group analysis^[225]16. We combined p-values
   from multiple elements using Fisher’s method, where we selected the
   minimum p-value min(p[promoter](g), p[5’UTR](g)) for overlapping core
   promoter and 5′ UTR elements on gene g and the minimum p-value
   p[enhancer](g) of all enhancers targeting gene g. Using this approach,
   we defined the following gene scores on coding (GS-C), non-coding,
   (GS-N), and combined coding and non-coding (GS-CN) genomic elements:
    1. GS-C: p[C](g) = p[coding](g)
    2. GS-N: p[N](g) = fisher(min(p[promoter](g), p[5′UTR](g)),
       p[3′UTR](g), p[enhancer](g))
    3. GS-CN: p[CN](g) = fisher(p[coding](g), min(p[promoter](g),
       p[5′UTR](g)), p[3′UTR](g), p[enhancer](g)).

   Here,
   [MATH: <mi>p</mi><mo>=</mo><mi
   mathvariant="normal">fisher</mi><mrow><mo>(</mo><mrow><msub><mrow><mi>p
   </mi></mrow><mrow><mn>1</mn></mrow></msub><mo>,</mo><mo>…</mo><mo>,</mo
   ><msub><mrow><mi>p</mi></mrow><mrow><mi>k</mi></mrow></msub></mrow><mo>
   )</mo></mrow><mo>=</mo><mo>−</mo><mn>2</mn><msubsup><mrow><mo>∑</mo></m
   row><mrow><mi>i</mi><mo>=</mo><mn>1</mn></mrow><mrow><mi>k</mi></mrow><
   /msubsup><mi mathvariant="normal">ln</mi><mfenced close=")"
   open="("><mrow><msub><mrow><mi>p</mi></mrow><mrow><mi>i</mi></mrow></ms
   ub></mrow></mfenced><mo>~</mo><msubsup><mrow><mi>χ</mi></mrow><mrow><mn
   >2</mn><mi>k</mi></mrow><mrow><mn>2</mn></mrow></msubsup> :MATH]
   , is Fisher’s method for combining p-values, where 2k is the degrees of
   freedom in the calculation. When the driver p-value for a genomic
   element was undefined, we omitted that element from the calculation and
   reduced the number of degrees of freedom.

   For the pathway and networks methods that analyze individual mutations,
   we used mutations from the PCAWG MAF ([226]syn7364923) on the same
   genomic elements as the PCAWG Drivers and Functional Interpretation
   Working Group analysis, i.e., coding, core promoter, 5′ UTR, 3′ UTR,
   and enhancer. We removed melanoma and lymphoma samples as well as 69
   hypermutated samples with over 30 mutations/MB (syn7894281,
   syn7814911). We also removed mutations in elements that the PCAWG
   Drivers and Functional Interpretation Working Group analysis had
   manually examined and discarded (see above), resulting in lists of
   mutations used for later assessing biological relevance of our results
   ([227]syn8103141, [228]syn9684700).

Pathway and network databases

   Pathway methods used gene sets from six databases: CORUM^[229]38,
   GO^[230]39, InterPro^[231]40, KEGG^[232]41, NCI Nature^[233]42, and
   Reactome^[234]43 ([235]syn3164548, [236]syn11426307), where small (<3
   genes) and large (>1000 genes) pathways were removed.

   Network methods used interactions from three interaction networks: the
   largest connected subnetwork of the ReactomeFI 2015 interaction
   network^[237]44 ([238]syn3254781) with high-confidence (≥ 0.75
   confidence score) interactions, which we treated as undirected; the
   largest connected subnetwork of the iRefIndex14 interaction
   network^[239]45, which we augmented with interactions from the KEGG
   pathway database^[240]41 ([241]syn10903761). The BioGRID interaction
   network^[242]46 ([243]syn3164609) was also used to evaluate and
   annotate results.

Individual pathway and network algorithms

   We applied seven pathway and network methods to the gene scores and
   mutation data. We used two pathway methods: ActivePathways^[244]19 and
   a hypergeometric analysis (Vazquez). We also used five network methods:
   CanIsoNet^[245]20, Hierarchical HotNet^[246]21, an induced subnetwork
   analysis (Reyna and Raphael, in preparation), NBDI^[247]22, and
   SSA-ME^[248]23. Table [249]1 shows pathway databases and interaction
   networks used by each method.

Table 1.

   Summary of pathway database and interaction network data for each
   method.
   Method Pathway databases or interaction networks
   ActivePathways Gene Ontology (GO)^[250]39 biological processes,
   Reactome^[251]43 pathways
   CanIsoNet STRING v10^[252]50, DIMA^[253]51, 3did^[254]52
   Hierarchical HotNet ReactomeFI 2015^[255]43, iRefIndex14+KEGG^[256]41,
   [257]45
   Hypergeometric analysis GO biological processes; CORUM^[258]38,
   KEGG^[259]41, InterPro^[260]41, Nature NCI^[261]42 pathways
   Induced subnetwork analysis ReactomeFI 2015^[262]43,
   iRefIndex14+KEGG^[263]41, [264]45
   NBDI ReactomeFI 2015^[265]43
   SSA-ME ReactomeFI 2015^[266]43
   [267]Open in a new tab

   Using these pathway and network databases, we ran each method on the
   GS-C, GS-N, and GS-CN gene scores to identify three corresponding lists
   of genes. Each method evaluated the statistical significance of its
   results on each data set.

Non-coding value-added (NCVA) procedure

   The GS-CN results leverage both coding and non-coding mutation data,
   improving the detection of weaker pathway and network signals. We
   devised a non-coding value-added (NCVA) procedure to separate the
   coding and non-coding signals in this combined analysis, resulting in a
   set of NCVA genes for which the non-coding mutation data make a
   statistically significant contribution to their discovery in the GS-CN
   results. Specifically, we evaluated the statistical significance of
   genes in the GS-CN results using a permutation test where the driver
   p-values for coding elements were fixed and the driver p-values for
   non-coding elements were permuted. This procedure identified the subset
   of the GS-CN results that were reported infrequently (p < 0.1) on
   permuted data, and thus more likely to be true positives. Each method’s
   NCVA results were added to that method’s overall set of non-coding
   results (GS-N).

Consensus results for pathway and network methods

   We defined a consensus set of genes for each set of results: GS-C
   results, GS-N results, GS-CN results, and GS-N combined with NCVA
   results, across our seven pathway and network methods. Specifically, we
   defined a gene to be a consensus gene if it was found by a majority
   (≥ 4/7) of the pathway and network methods. For our analysis, we
   focused on the consensus GS-C results, which we call the
   pathway-implicated driver genes with coding variants (PID-C), and the
   consensus from the GS-N results combined with NCVA results, which we
   call the pathway-implicated driver genes with non-coding variants
   (PID-N). We defined PID-C genes as the 87 genes in the consensus of the
   GS-C results, and we defined PID-N genes as the 93 genes in the
   consensus of each method’s GS-N results combined with its NCVA results.
   We performed several analyses to assess the biological relevance of
   PID-C and PID-N genes.

Identification of mutational signatures of PID genes

   We performed a permutation-based enrichment test for mutation
   signatures from PCAWG mutation signatures analysis^[268]47. We
   identified the most likely mutation signature for each non-coding
   mutation in PID-N genes and compared them to randomly chosen non-coding
   mutations in non-PID-N genes.

Improved network neighborhood scores of PID genes

   To assess the extent to which gene scores on PID genes contribute to
   their detection by pathway and network methods, we considered the
   contribution of each PID gene’s score to the score of its network
   neighborhood in the BioGRID interaction network.

   For each PID gene g, we used Fisher’s method to combine the gene scores
   of the first-order network neighbors of g both with and without the
   score of g itself. In particular, for gene g, let p(g) be the gene
   score for g and N(g) be the network neighborhood of g. Then
   [MATH:
   <msubsup><mrow><mi>p</mi></mrow><mrow><mi>N</mi><mrow><mo>(</mo><mrow><
   mi>g</mi></mrow><mo>)</mo></mrow></mrow><mrow><mi
   mathvariant="normal">with</mi></mrow></msubsup><mo>=</mo><mi
   mathvariant="normal">fisher</mi><mfenced close=")"
   open="("><mrow><mi>p</mi><mrow><mo>(</mo><mrow><mi>v</mi></mrow><mo>)</
   mo></mrow><mo>:</mo><mi>v</mi><mo>∈</mo><mi>N</mi><mfenced close=")"
   open="("><mrow><mi>g</mi></mrow></mfenced><mo>∪</mo><mfenced close="}"
   open="{"><mrow><mi>g</mi></mrow></mfenced></mrow></mfenced> :MATH]

   is a score for the network neighborhood of g when including gene g and
   [MATH:
   <msubsup><mrow><mi>p</mi></mrow><mrow><mi>N</mi><mrow><mo>(</mo><mrow><
   mi>g</mi></mrow><mo>)</mo></mrow></mrow><mrow><mi
   mathvariant="normal">without</mi></mrow></msubsup><mo>=</mo><mi
   mathvariant="normal">fisher</mi><mfenced close=")"
   open="("><mrow><mi>p</mi><mrow><mo>(</mo><mrow><mi>v</mi></mrow><mo>)</
   mo></mrow><mo>:</mo><mi>v</mi><mo>∈</mo><mi>N</mi><mfenced close=")"
   open="("><mrow><mi>g</mi></mrow></mfenced></mrow></mfenced> :MATH]

   is a score for the network neighborhood of g when excluding gene g.

   If the network neighborhood of g has a smaller p-value with g than
   without g, i.e.,
   [MATH:
   <msubsup><mrow><mi>p</mi></mrow><mrow><mi>N</mi><mrow><mo>(</mo><mrow><
   mi>g</mi></mrow><mo>)</mo></mrow></mrow><mrow><mi
   mathvariant="normal">with</mi></mrow></msubsup><mspace
   width="0.25em"></mspace><mo><</mo><mspace
   width="0.25em"></mspace><msubsup><mrow><mi>p</mi></mrow><mrow><mi>N</mi
   ><mrow><mo>(</mo><mrow><mi>g</mi></mrow><mo>)</mo></mrow></mrow><mrow><
   mi mathvariant="normal">without</mi></mrow></msubsup> :MATH]
   , then gene g improves the score of the network neighborhood,
   suggesting that the gene score of g plays a role in its detection by
   pathway and network methods. Alternatively, if the network neighborhood
   of g has a larger p-value with g than without g, i.e.,
   [MATH:
   <msubsup><mrow><mi>p</mi></mrow><mrow><mi>N</mi><mrow><mo>(</mo><mrow><
   mi>g</mi></mrow><mo>)</mo></mrow></mrow><mrow><mi
   mathvariant="normal">with</mi></mrow></msubsup><mspace
   width="0.25em"></mspace><mo>></mo><mspace
   width="0.25em"></mspace><msubsup><mrow><mi>p</mi></mrow><mrow><mi>N</mi
   ><mrow><mo>(</mo><mrow><mi>g</mi></mrow><mo>)</mo></mrow></mrow><mrow><
   mi mathvariant="normal">without</mi></mrow></msubsup> :MATH]
   , then gene g worsens the score of the network neighborhood, suggesting
   that the gene scores of the network neighbors of g are predominantly
   responsible for the detection of g by pathway and network methods.

   We performed this test for every PID-C gene with GS-C gene scores and
   every PID-N gene with GS-N gene scores. We also sampled genes uniformly
   at random from the network (87 for PID-C genes and 93 for PID-N genes;
   10^6 trials) to ascertain whether significantly more PID genes that
   improved the scores of their network neighborhoods than expected by
   chance.

Expression analysis of PID genes

   We evaluated whether mutation status of each PID gene was correlated
   with RNA expression. We used PCAWG-3 gene expression data
   ([269]syn5553991), which was averaged from TopHat2 and STAR-based
   alignments, with FPKM-UQ normalization. Tumor type and copy-number
   aberrations are known to be covariates for gene expression, so we
   conditioned on tumor types and annotated copy-number aberrations.

   We used the following procedure to assess expression correlations on
   individual tumor types. We only considered cases with at least three
   mutated samples and three non-mutated samples to restrict our analysis
   to cases with sufficient statistical power. For each PID-C gene or each
   non-coding element in a PID-N gene, we partitioned the samples with
   expression data into a set A of samples with mutation(s) in the element
   and a set B of samples without mutations in the element. We performed
   the Wilcoxon rank-sum test for the expression of the gene in sets A and
   B and performed the Benjamini–Hochberg correction on each coding or
   non-coding element to provide FDRs.

   We used the following procedure to assess expression correlations
   across tumor types. We only considered cases with at least one mutated
   sample and one non-mutated sample to restrict our analysis to cases
   with sufficient statistical power. For each PID-C gene and each
   non-coding element in a PID-N gene, we partitioned the samples with
   expression data into sets A[c] of samples in cohort c with mutation(s)
   in the element and sets B[c] of samples in cohort c without mutations
   in the element. We converted the expression values into z-scores using
   the expression from non-mutated samples in cohort c, and we computed
   the Wilcoxon rank-sum test on the expression of the gene in sets from
   A = ⋃[c∊C] A[c] and B = ⋃[c∊C] B[c], where C is the set of all cohorts
   containing samples with mutation(s) in the element. We then performed
   the Benjamini–Hochberg correction on each coding or non-coding element
   to provide FDRs.

Network annotation of PID genes

   We performed a permutation test to evaluate the statistical
   significance of the number of interactions in the BioGRID
   high-confidence interaction network between PID-C genes, the number of
   interactions between PID-N genes, and the number of interactions
   between PID-C and PID-N genes, i.e., when a PID-C gene interacts with a
   PID-N gene. To compute the permutation p-value, we sampled random
   networks uniformly at random from the collection of networks with the
   same degree sequence as the BioGRID network.

   We found connected subnetworks of 46 PID-C genes (31 genes expected,
   p = 9 × 10^−4) and 16 PID-N genes (10 genes expected, p = 6.1 × 10^−2)
   in the high-confidence BioGRID^[270]48 protein–protein interaction
   (PPI) network. The union of the PID-C and PID-N genes formed a larger
   connected subnetwork of 73 genes (Fig. [271]4a). These connected
   subnetworks were significantly larger than expected by chance according
   to this permutation test (57 genes expected, p = 2.2 × 10^−3).
   Furthermore, we observed statistically significant numbers of
   protein–protein interactions between PID-C and PID-N genes (67
   interactions observed vs. 45 expected, p = 6 × 10^−4), suggesting that
   the associated mutations may target an overlapping set of underlying
   pathways. The PID-C genes were connected by significantly more
   interactions than expected (64 vs. 40 expected, p < 10^−4), and the
   PID-N genes were interconnected at a sub-significant level (18 vs 12
   expected, p = 6.8 × 10^−2). Thus, certain pathways are affected by
   either coding or non-coding mutations, but some pathways are affected
   by a complement of both coding and non-coding mutations.

Pathway annotation of PID genes

   Using g:Profiler^[272]34, we performed a pathway enrichment analysis
   for PID genes and 12,061 gene sets representing GO biological processes
   and Reactome pathways. We used the Benjamini–Hochberg correction to
   control the FDR of the results.

Characterization of PID genes in RNA splicing

   GSEA enrichment analysis was performed with the default parameters
   using the curated pathway gene lists^[273]37 for samples harboring
   non-synonymous coding mutations in five genes (FUBP1, RBM10, SF3B1,
   SRSF2, and U2AF1) with confirmed on-target splicing deregulation. Due
   to limited number of samples with RNA-seq data in individual tumor
   types, we restricted our analysis to missense mutations in SF3B1,
   truncating mutations in RBM10, and truncating mutations in FUBP1 for
   tumor types contained at least three samples with these classes of
   mutations. Each tumor type containing such mutations was considered
   separately^[274]37.

   We performed the same GSEA analysis for non-coding mutations in 17
   PID-N genes that were annotated as involved in RNA splicing. Due to
   limited number of samples from individual tumor types containing
   mutations in these genes (often there was only one per tumor type), we
   performed GSEA analysis jointly on all tumor types containing mutations
   in an individual PID-N gene, restricting the non-mutated group to
   samples from the same tumor types as the mutant samples. The GSEA
   Normalized Enrichment Scores (NES) were clustered using hierarchical
   complete linkage clustering on the Euclidean distance between the NES
   scores. Separately, we computed a 2D projection of NES scores using
   t-Distributed Stochastic Neighbor Embedding (t-SNE).

Ethical review

   Sequencing of human subjects' tissue was performed by ICGC and TCGA
   consortium members under approval of local Institutional Review Boards
   (IRBs). Informed consent was obtained from all human participants. All
   data were deidentified for this study, and data access for
   participating researchers was obtained through data access agreements
   between local institutions, the ICGC Data Access Compliance Office
   (DACO), and the NIH dbGaP.

Supplementary information

   [275]Supplementary Information^ (1.7MB, pdf)
   [276]41467_2020_14367_MOESM2_ESM.pdf^ (105.2KB, pdf)

   Description of Additional Supplementary Files
   [277]Supplementary Data 1^ (704B, txt)
   [278]Supplementary Data 2^ (602B, txt)
   [279]Supplementary Data 3^ (666B, txt)
   [280]Supplementary Data 4^ (726B, txt)
   [281]Supplementary Data 5^ (444B, txt)
   [282]Supplementary Data 6^ (557B, txt)
   [283]Supplementary Data 7^ (541B, txt)
   [284]Supplementary Data 8^ (11.2KB, txt)
   [285]Supplementary Data 9^ (30.3KB, txt)
   [286]Supplementary Data 10^ (12KB, txt)
   [287]Supplementary Data 11^ (2.4KB, txt)
   [288]Supplementary Data 12^ (7.1KB, txt)
   [289]Supplementary Data 13^ (1.2KB, txt)
   [290]Supplementary Data 14^ (5.9KB, txt)
   [291]Supplementary Data 15^ (2.8KB, txt)
   [292]Supplementary Data 16^ (6.3KB, txt)
   [293]Supplementary Data 17^ (15.2KB, txt)
   [294]Supplementary Data 18^ (6.6KB, txt)
   [295]Supplementary Data 19^ (73.5KB, txt)
   [296]Supplementary Data 20^ (22.4KB, txt)
   [297]Supplementary Data 21^ (141.8KB, txt)
   [298]Supplementary Data 22^ (63.1KB, txt)
   [299]Supplementary Data 23^ (26.7KB, txt)
   [300]Supplementary Data 24^ (73.9KB, txt)
   [301]Supplementary Data 25^ (31.6KB, txt)
   [302]Supplementary Data 26^ (4.9KB, txt)

Acknowledgements