Abstract
Multi-omics datasets represent distinct aspects of the central dogma of
molecular biology. Such high-dimensional molecular profiles pose
challenges to data interpretation and hypothesis generation.
ActivePathways is an integrative method that discovers significantly
enriched pathways across multiple datasets using statistical data
fusion, rationalizes contributing evidence and highlights associated
genes. As part of the ICGC/TCGA Pan-Cancer Analysis of Whole Genomes
(PCAWG) Consortium, which aggregated whole genome sequencing data from
2658 cancers across 38 tumor types, we integrated genes with coding and
non-coding mutations and revealed frequently mutated pathways and
additional cancer genes with infrequent mutations. We also analyzed
prognostic molecular pathways by integrating genomic and transcriptomic
features of 1780 breast cancers and highlighted associations with
immune response and anti-apoptotic signaling. Integration of ChIP-seq
and RNA-seq data for master regulators of the Hippo pathway across
normal human tissues identified processes of tissue regeneration and
stem cell regulation. ActivePathways is a versatile method that
improves systems-level understanding of cellular organization in health
and disease through integration of multiple molecular datasets and
pathway annotations.
Subject terms: Cancer genomics, Data integration, Gene ontology
__________________________________________________________________
Multi-omics datasets pose major challenges to data interpretation and
hypothesis generation owing to their high-dimensional molecular
profiles. Here, the authors develop ActivePathways method, which uses
data fusion techniques for integrative pathway analysis of multi-omics
data and candidate gene discovery.
Introduction
Pathway enrichment analysis is an essential step for interpreting
high-throughput (omics) data that uses current knowledge of genes and
biological processes. A common application determines statistical
enrichment of molecular pathways, biological processes and other
functional annotations in long lists of candidate genes^[41]1. Genomic,
transcriptomic, proteomic and epigenomic experiments emphasize
complementary aspects of underlying biology and are best analyzed
integratively, as is now routinely done in large-scale projects such as
The Cancer Genome Atlas (TCGA)^[42]2, Clinical Proteome Tumor Analysis
Consortium (CPTAC), International Cancer Genome Consortium
(ICGC)^[43]3,[44]4, Genotype-Tissue Expression (GTEx)^[45]5, and
others^[46]6. Thus, simultaneous analysis of multiple candidate gene
lists for characteristic pathways is increasingly needed.
Numerous approaches are available for interpreting single gene lists.
For example, the GSEA algorithm can detect upregulated and
downregulated pathways in gene expression datasets^[47]7. Web-based
methods such as Panther^[48]8, ToppCluster^[49]9, and g:Profiler^[50]10
detect significantly enriched pathways amongst ranked or unranked gene
lists and are generally applicable to genes and proteins from various
analyses. Some approaches allow analysis of multiple input gene lists
however these primarily rely on visualization rather than data
integration to evaluate the contribution of distinct gene lists towards
each detected pathway^[51]9,[52]10. Finally, no methods are available
for unified pathway analysis of coding and non-coding mutations from
whole-genome sequencing (WGS) data, or integrating these with other
types of DNA aberrations such as copy number changes and balanced
genomic rearrangements.
Cancer genomes are characterized by multiple classes of mutations,
including single nucleotide variants (SNVs), small insertions–deletions
(indels), copy number alterations, and translocations. These affect a
small number of frequently mutated pan-cancer driver genes such as
TP53, less-frequent and tissue-specific genes such as SPOP in prostate
cancer, and numerous infrequently mutated genes. The majority of
currently known driver mutations of SNVs and indels affect
protein-coding sequence^[53]11 and only few high-confidence non-coding
drivers have been found, such as the mutation hotspots in the TERT
promoter^[54]12. Discovery of coding and non-coding driver mutations is
a major goal of large cancer whole genome sequencing efforts. The PCAWG
Consortium aggregated whole genome sequencing data from 2658 cancers
across 38 tumor types generated by the ICGC and TCGA projects. These
sequencing data were re-analysed with standardized, high-accuracy
pipelines to align to the human genome (reference build hs37d5) and
identify germline variants and somatically acquired mutations, as
described in the PCAWG marker paper^[55]4. A consensus analysis of
variant calls in PCAWG tumors generated a high-confidence catalog of
driver mutations in protein-coding driver genes (CDS) and non-coding
regions of 5′ and 3′ untranslated elements (UTRs), promoters and
enhancers^[56]13. A consensus pathway and network analysis of PCAWG
driver mutations used knowledge of molecular pathways and gene
interaction networks as priors to further discover infrequent candidate
driver variants including those in the non-coding genome^[57]14.
Here we report the development of the ActivePathways method that uses
data fusion techniques to address the challenge of integrative pathway
analysis of multi-omics data. It detects significantly enriched
pathways across multiple datasets, including those pathways that are
not apparent in any individual dataset. We present several analyses to
demonstrate this method. First, we integrate cancer driver genes with
coding and non-coding mutations predicted using the PCAWG
dataset^[58]13 and reveal numerous processes and additional genes with
frequent coding and non-coding mutations. Second, we integrate patient
clinical information with transcriptomic and copy number alterations in
breast cancers of the METABRIC project^[59]15 to discover prognostic
pathways and processes in breast cancer subtypes. Third, we integrate
transcriptomic data of normal tissues of the GTEx project^[60]5 with
ChIP-seq data to infer gene regulatory networks and biological
processes downstream of the Hippo pathway of tissue growth control and
regeneration. Thus ActivePathways is a versatile method for combining
diverse multiomics datasets.
Results
Multi-omics pathway enrichment analysis with ActivePathways
ActivePathways is a simple three-step method that extends our earlier
work^[61]10 (Fig. [62]1). It requires two input datasets. The first
input is a table of P-values with genes listed in rows and evidence
from distinct datasets listed in columns. The columns can include
P-values of differential gene expression, gene essentiality, mutation
or copy number alteration burden and many others that are derived using
platform-specific quantification methods. The second input is a
collection of gene sets that represents collective knowledge of gene
function and interactions we refer to as pathways. The most common
analysis utilizes biological processes from gene ontology^[63]16 (GO)
and molecular pathways from the Reactome database^[64]17. Depending on
the hypothesis, these data may also include many other types of gene
sets such as targets of transcription factors or microRNAs.
Fig. 1. Method overview.
[65]Fig. 1
[66]Open in a new tab
a ActivePathways requires as input (1) a matrix of gene P-values for
different omics datasets, and (2) a collection of gene sets
corresponding to biological pathways and processes. b In step (1), gene
P-values are merged using the Brown procedure and filtered to produce
an integrated gene list that combines evidence from datasets and is
ranked by decreasing significance with a lenient threshold. In step
(2), pathway enrichment analysis is conducted on the integrated gene
list using the ranked hypergeometric test that determines the optimal
level of enrichment in the ranked gene sub-list for every pathway. In
step (3), separate gene lists are compiled from individual input
datasets and analyzed for pathway enrichment using the ranked
hypergeometric test, to find supporting evidence for each pathway from
the integrative analysis. c ActivePathways provides a list of enriched
pathways in the integrated gene list, the associated genes with
significant Brown P-values, and annotations of evidence supporting each
pathway. Results of ActivePathways are visualized as enrichment maps
where nodes correspond to pathways and pathways with many shared genes
are connected into networks representing broader biological themes.
In the first step of ActivePathways, we derive an integrated gene list
that for each input gene aggregates significance from multiple omics
datasets. The integrated gene list is compiled by a fusion of gene
significance from different omics datasets (i.e., evidence) using the
Brown’s extension^[67]18 of the Fisher’s combined probability test. The
Brown’s method considers dependencies between datasets and thus
provides more conservative estimates of significance for genes that are
supported by multiple similar omics datasets. The integrated input gene
list is then ranked by decreasing significance and filtered using a
lenient cut-off, designed to capture additional candidate genes with
sub-significant signals while discarding the bulk of insignificant
genes (unadjusted Brown P[gene] < 0.1). In the second step, a pathway
enrichment analysis is conducted on the integrated gene using a ranked
hypergeometric test^[68]10 and a collection of gene sets (i.e.,
biological processes, molecular pathways, and other gene annotations).
The ranked hypergeometric test is designed to capture smaller pathways
tightly associated with few top-ranking genes and also broader
processes associated with larger subsets of input genes. The
family-wise multiple testing correction method by Holm^[69]19 is then
applied across tested pathways to select the pathways significantly
enriched in the integrated gene list (Q[pathway] < 0.05). In the third
step, we perform a similar analysis on the gene lists of individual
omics datasets separately to determine the omics evidence supporting
the integrative pathway analysis results determined in step 2.
Importantly, the third step also highlights pathways that are only
found through data integration and are not apparent in any single omics
dataset separately, providing the added value of integrated analysis.
Finally, the method provides input files for Enrichment Map^[70]20 for
visualizing resulting pathways with the corresponding omics evidence.
Integrating coding and non-coding drivers in 2658 cancer genomes
We performed an integrative pathway analysis of driver genes predicted
in the PCAWG project based on somatic SNVs and indels. This analysis
comprised 29 cancer patients cohorts of histological tumor types and 18
meta-cohorts combining multiple types of tumors, with 47 cohorts in
total (Supplementary Table [71]1). ActivePathways identified at least
one significantly enriched process or pathway in 89% of these cohorts
(42/47, Q[pathway] < 0.05, ranked hypergeometric test) (Fig. [72]2a).
We analyzed the evidence supporting predictions of enriched pathways:
most cohorts showed enrichments in pathways supported by protein-coding
mutations in genes (37/47 or 79%). This serves as a positive control
since the majority of currently-known cancer driver genes have frequent
protein-coding mutations.
Fig. 2. Pathway enrichment analysis of cancer driver genes with
ActivePathways.
[73]Fig. 2
[74]Open in a new tab
a Bar plot shows number of significantly enriched pathways (Q < 0.05)
among predicted driver genes with coding and non-coding mutations in
the PCAWG dataset. The majority of pathways detected by ActivePathways
are supported by protein-coding mutations, as expected (dark green
bars), while non-coding mutations (orange, red) reveal additional
pathways. Pathways shown in dark red are found only in the integrated
gene list of coding and non-coding mutations but not in gene lists of
individual mutation scores. b Enrichment map shows pathways enriched in
frequently mutated genes in the adenocarcinoma cohort of 1773 tumors.
Nodes in the network represent pathways and similar pathways with many
common genes are connected. Groups of similar pathways are indicated.
Nodes are colored by supporting evidence from coding and non-coding
cancer mutations. Arrow indicates kidney developmental processes. c The
group of enriched kidney developmental processes is apparent from
integrated evidence of coding and non-coding mutations but is not found
among coding or non-coding candidate genes separately. d Heatmap shows
P-values of driver genes involved in kidney developmental processes,
including driver genes found in the driver analysis (indicated with #)
and additional genes only found in the pathway analysis. Top row shows
merged P-values from the Brown procedure. Genes listed in the Cancer
Gene Census (CGC) database are indicated in boldface letters. e Pathway
analysis recovers most genes of the driver list from PCAWG (orange
asterisks), as well as additional infrequently mutated genes apparent
due to their pathway associations. Additional known cancer genes
detected in the pathway analysis are listed (green dots) and occur more
frequently than expected from chance alone.
Non-coding mutations in genes also contributed broadly to the discovery
of frequently mutated biological processes and pathways: 24/47 cohorts
(51%) showed significantly enriched pathways that were apparent when
only analyzing non-coding driver scores corresponding to UTRs,
promoters or enhancers. The majority of PCAWG tumor cohorts (41/47 or
87%) revealed some frequently mutated pathways that were apparent when
integrating coding and non-coding mutations however remained undetected
when considering either coding or non-coding mutations separately,
emphasizing the value of our integrative approach. As expected, cohorts
with more patient tumor samples generated more significantly enriched
pathways (Spearman ρ = 0.74, P = 2.3 × 10^−9; Supplementary
Fig. [75]1), suggesting that larger datasets are better powered to
distinguish rarely mutated genes involved in biological pathways and
processes. Discovery of pathways enriched in non-coding mutations
suggests that our integrative pathway analysis is an attractive
strategy for illuminating the dark matter of the non-coding cancer
genome.
We studied the adenocarcinoma cohort of 1773 samples of 16 tumor types.
Integrative pathway analysis highlighted 432 genes that were
significantly enriched in 526 pathways (Q[pathway] < 0.05)
(Fig. [76]2b). As expected, the majority of pathways were only
supported by genes with frequent coding mutations (328/526 or 62%).
However, an additional set of 101 pathways (19%) was supported by both
coding and non-coding gene mutations, 72 pathways (14%) were only
apparent in the integrated analysis of both coding and non-coding
mutations, and 25 (5%) were only enriched in non-coding mutations.
Accumulation of these individually infrequent non-coding mutations into
relevant pathways and processes is apparent in our integrative analysis
and remains undetected in a gene-focused analysis.
The major biological themes with frequent protein-coding mutations
included hallmark cancer processes such as ‘apoptotic signaling’ (24
genes; Q[pathway] = 4.3 × 10^−5) and ‘mitotic cell cycle’ (8 genes;
Q[pathway] = 0.0026), and additional biological processes such as
chromatin modification and RNA splicing that are increasingly
recognized in cancer biology. Thus, ActivePathways captures the
expected cancer pathways enriched in driver genes with protein-coding
mutations as positive controls. In contrast to these solely
protein-coding driver associations, a large group of developmental
processes and signal transduction pathways was detected as enriched in
both coding and non-coding mutations in genes; for example ‘embryo
development process’ was supported by mutations in exons, 3′UTRs and
gene promoters (68 genes; Q[pathway] = 2.9 × 10^−12), while the
Reactome pathway ‘repression of WNT target genes’ was only apparent in
the integrated analysis of coding and non-coding mutations but not in
either dataset alone (5 genes, Q[pathway] = 0.016). In summary, these
data show that ActivePathways is a sensitive approach for integrating
multi-omics signals such as coding and non-coding mutations,
interpreting supporting omics evidence, and finding additional
functional associations that are not apparent in any single input
dataset.
Pathway-based prioritization of rarely mutated cancer genes
Pathway analysis can identify candidate genes that would otherwise
remain undetected in gene-based analyses. ActivePathways enhances such
discovery by integrating signals across multiple datasets. In the
pathway analysis of coding and non-coding mutations in PCAWG, we
focused on a group of processes involved in kidney development that
were exclusively detected through the integration of coding and
non-coding mutations (Fig. [77]2c, d). ActivePathways found 18 genes
involved in these processes, only five of which were predicted as
driver genes in the consensus driver analysis of the PCAWG
project^[78]13. Additional known cancer genes included the oncogene MYC
with 13 patients with 3′UTR mutations (driver P-value
P[UTR3] = 4.8 × 10^−4), the transcription factor SMAD3 of the TGF-β
pathway with 14 patients with protein-coding mutations
(P[CDS] = 4.0 × 10^−4) and the growth inhibitory tumor suppressor gene
TSC1 with 23 patients with protein-coding mutations
(P[CDS] = 1.4 × 10^−4) as well as candidate cancer genes such as IQGAP1
with ten patients with promoter mutations (P[promoter] = 8.2 × 10^−4)
that encodes a signaling protein involved in cell motility and
morphology. The additional genes remained below the FDR-adjusted
significance cut-off in the gene-focused consensus driver analysis
(Q = 0.17–0.62), however were found by ActivePathways due to their
associations with kidney development. Thus ActivePathways can exploit
functional gene annotations and multiple omics signals to find further
candidate genes that remain undetected in gene-focused analyses.
We evaluated the effects of our data integration strategy and examined
all 333 pathway-associated candidate genes detected in the
adenocarcinoma cohort (Fig. [79]2e). As expected, these included a
considerable proportion of known cancer genes. First, as positive
controls we found 60/64 significantly mutated genes that were also
identified in the PCAWG consensus driver analysis^[80]13. In addition,
we found a set of 47/333 known cancer genes annotated in the COSMIC
Cancer Gene Census database^[81]11 that were not detected in the driver
analysis, significantly more than expected by chance alone (seven genes
expected, Fisher’s exact P = 4.0 × 10^−24). Those included well-known
cancer driver genes MYC, IDH1, NF1, and BCL9. ActivePathways was able
to detect these additional genes for several reasons. First, the
integrated gene list was filtered using a lenient statistical cut-off
in ActivePathways (P[gene] < 0.1) that allowed a long tail of 273/333
genes with less-frequent mutations to be detected through pathway
associations. Second, certain genes were upgraded through the data
fusion procedure as a single stronger P-value per gene was derived by
combining multiple weaker P-values corresponding to the coding regions,
promoters, UTRs, enhancers of the gene. This affected 17/333
pathway-associated genes including six known cancer genes (HNRNPA2B1,
STAG2, TCF7L2, SUZ12, CLTC, and ZNF521), Thus, the integration
procedure prioritized specific pathway-related genes compared to their
original rankings in individual mutation datasets. However, the
majority of genes showed reduced significance after the fusion and were
excluded from the pathway analysis: 3543 genes had at least one
significant P-value prior to data fusion (P[gene] < 0.1) while 88% of
these (3112) were considered insignificant following the Brown P-value
combination step. In contrast, the majority of pathway-associated genes
(220/333) showed improved rankings in the integrated gene list compared
to their original rankings in individual input datasets. This formal
combination of P-values across omics datasets is therefore more
conservative than a naïve approach of selecting a top P-value for every
gene. Thus, ActivePathways finds additional candidate genes that remain
undetected in gene-by-gene analyses and are highlighted due to their
multiple omics signals in pathways.
Integrating prognostic CNA and mRNA signals in breast cancer
To demonstrate an integrative analysis of patient clinical information
with multiple types of omics data, we studied the pathways and
processes associated with patient prognosis in breast cancer. We
leveraged the METABRIC dataset^[82]15 of 1780 breast cancer samples
drawn from all four subtypes (HER2-enriched, basal-like, luminal-A,
luminal-B) and evaluated all genes using three types of prognostic
evidence. mRNA abundance profiles were deconvolved between mRNA
abundance levels in tumor cells (TC) and tumor-adjacent cells (TAC)
using the ISOpure algorithm^[83]21,[84]22. mRNA values were associated
with patient survival using median dichotomization and log-rank tests.
Gene copy number alterations (CNA) were included as the third type of
evidence and associated with patient survival using log-rank tests.
ActivePathways identified 192 significantly enriched GO biological
processes and Reactome pathways across the four subtypes of breast
cancer, of which nine pathways were enriched in multiple cancer
subtypes and 33 pathways were only apparent through the integrative
pathway analysis but not in any of the CNA or mRNA datasets alone. The
major findings enriched in prognostic signatures in breast cancer
subtypes involved the processes and pathways of immune response,
apoptosis, ribosome biogenesis and chromosome segregation
(Fig. [85]3a).
Fig. 3. Prognosis-associated pathways in four molecular subtypes of breast
cancer.
[86]Fig. 3
[87]Open in a new tab
a Enrichment maps of prognostic pathways and processes were found in an
integrative analysis of mRNA abundance in tumor cells (TC),
tumor-adjacent cells (TAC), and gene copy number alterations (CNA) of
the METABRIC dataset. Multicolored nodes indicate pathways that were
prognostic according to several types of molecular evidence. Blue nodes
indicate pathways that were only apparent through merging of molecular
signals. b Hazard ratios (HR) of prognostic genes related to immune
system development in basal and HER2-enriched subtypes of breast
cancer. Strongest HR value of TC, TAC is shown. Genes commonly found in
basal and HER2-enriched tumors are shown. Known cancer genes are shown
in boldface. c Heatmap shows genes, corresponding log-rank P-values,
and merged Brown P-values related to the GO process ‘negative
regulation of apoptotic process’ that was found by
integrating prognostic omics data in HER2-enriched breast cancer. d
Kaplan–Meier plots show the strongest prognostic signal related to
apoptotic signaling, the phosphatase DUSP1 that significantly
associates with reduced patient survival through increased
tumor-adjacent mRNA level (left), increased tumor mRNA level (center)
and gene copy number amplification (right). Log-rank P-values are
shown.
Immune activity was associated with prognostic genes in basal-like and
HER2-enriched breast cancers with significant enrichment of GO
processes such as ‘immune system development’ (Q[basal] = 3.0 × 10^−4,
113 genes; Q[HER2] = 0.035, 61 genes; ranked hypergeometric test) and
‘lymphocyte differentiation’ (Q[HER2] = 6.8 × 10^−4, 46 genes;
Q[basal] = 8.4 × 10^−4, 45 genes). The majority of related genes were
associated with improved patient prognosis upon increased mRNA
abundance in tumor cells or tumor-adjacent cells, comprising 50/61
genes in the HER2-enriched subtype and 78/113 genes in the basal
subtype (Fig. [88]3b). Interestingly, only a minority of these genes
(10) were significant in both of the two breast cancer subtypes,
suggesting different modes of immune activity in subtypes and
emphasizing the power of our pathway-based approach. Basal-like breast
cancers were associated with additional 67 terms involving immune
response and blood cells, however no immune-related terms were enriched
for luminal subtypes of breast cancers. Prognostic features of
immune-related genes in HER2-enriched and basal-like breast cancers are
well known^[89]23,[90]24. Our pathway-based findings indicate that
immune activity in breast tumor cells and in the surrounding
microenvironment negatively affects tumor progression and improves
prognosis.
Apoptosis was associated with patient prognosis in HER2-enriched and
luminal-A breast cancers through enriched GO processes such as
‘negative regulation of apoptotic process’ (Q[HER2] = 0.030, 122 genes;
Q[luminalA] = 0.015, 228 genes) and ‘programmed cell death’
(Q[HER2] = 0.015, 125 genes; Q[luminalA] = 0.016, 231 genes)
(Fig. [91]3c). Interestingly, anti-apoptotic pathways were only
detected in the integrative analysis and not in genomic and
transcriptomic gene signatures separately. Among the genes negatively
regulating apoptosis, DUSP1 provided the strongest prognostic signal in
HER2-enriched breast cancers. This was apparent in the molecular
stratification of samples by mRNA of tumor cells (log-rank
P[TC] = 0.019, HR = 1.5) and tumor-adjacent cells
(P[TAC] = 8.3 × 10^−4, HR = 1.8) as well as gene copy number
amplifications (P[CNA] = 9.8 × 10^−4, HR = 2.8) (Fig. [92]3d). DUSP1
encodes a phosphatase signaling protein of the MAPK pathway that is
over-expressed in malignant breast cancer cells and inhibits apoptotic
signaling^[93]25. HER2 over-expression is known to suppress apoptosis
in breast cancer^[94]26. Anti-apoptotic signaling is a hallmark of
cancer and expectedly associated with worse patient prognosis.
ActivePathways also identified prognostic pathway associations that
were only apparent in single subtypes of breast cancer. For example,
the prognostic genes for luminal-B subtype were enriched in processes
related to ‘chromosome segregation’ (Q[luminalB] = 0.017, 41 genes)
that have been associated with worse outcome in breast cancer^[95]27.
As another example, luminal-A breast cancers were associated with
prognosis in ribosomal and RNA processing genes, such as ‘ribosome
biogenesis’ (Q[luminalA] = 6.9 × 10^−10, 60 genes) and ‘rRNA metabolic
process’ (Q[luminalA] = 1.8 × 10^−13, 64 genes). Although not described
specifically in the luminal-A subtype, ribosomal mRNA abundance has
been shown to be prognostic in breast cancer as a marker of cell
proliferation^[96]28,[97]29. In summary, ActivePathways can be used for
integrating clinical data with multiomics information of molecular
alterations. Such analyses can provide leads for functional studies and
biomarker development.
Interpreting co-expressed and DNA-bound targets of Hippo TFs
To demonstrate the use of ActivePathways for studying gene regulation,
we analyzed transcriptomes of non-cancerous human tissues from the GTEx
project^[98]5. We focused on the Hippo signaling pathway involved in
organ size control, tissue homeostasis and cancer^[99]30,[100]31 and
studied regulatory networks downstream of the two transcription factors
(TFs) YAP and TAZ (encoded by YAP1 and WWTR1). YAP and TAZ are
evolutionarily conserved master regulators of Hippo signaling in
mammals that respond to intracellular and extracellular signals of
cell–cell interactions, cell polarity, mechanical cues, G
protein-coupled receptor signaling, and cellular energy
status^[101]30,[102]31.
We performed an integrative pathway enrichment analysis of
transcriptomics and epigenomics data of the two master regulators of
the Hippo pathway. First, we predicted transcriptional target genes for
YAP and TAZ (1898 and 1319 genes, respectively), using co-expression
analysis and robust rank aggregration^[103]32 over 9642 transcriptomes
of 40 tissue types in GTEx (Q < 0.05). Second, we studied the set of
2356 target genes of YAP that have DNA-binding sites at gene promoters
derived from a YAP ChIP-seq study^[104]33 reanalyzed in the ReMap
database^[105]34 (Q < 0.05). The three gene lists with corresponding
significance values were used as input to ActivePathways for the
integrative analysis.
Integrative analysis of transcriptional and DNA-binding target genes of
YAP/TAZ resulted in 225 significantly enriched GO processes and
Reactome pathways (Q[pathway] < 0.05) (Fig. [106]4a). The resulting
pathways are expected in the context of Hippo signaling and included
development and morphogenesis, cell motility, organization of actin
cytoskeleton and cell–cell junctions, signal transduction pathways such
as EGFR, Wnt, Robo, TGF-beta, rho GTPase, and others. ActivePathways
highlighted 2066 genes in these pathways. We examined known members of
the Hippo pathway and found 13 of 32 genes among pathway-associated
genes (WWTR1, LATS1, WWC1, YWHAQ, YAP1, MAPK8, MAPK9, STK3, TNIK,
TEAD3, MAP4K3, MINK1, TEAD1), more than expected from chance alone
(four genes expected, P = 3.6 × 10^−5, Fisher’s exact test). We also
compiled an extended list of 106 Hippo-related
genes^[107]30,[108]31,[109]35 and confirmed that these were enriched in
pathway-associated genes (36/308 genes expected, P = 9.1 × 10^−26,
Fisher’s exact test) (Fig. [110]4b). A large fraction of pathways (55
or 24%) was only identified in the integrated analysis and not in any
input dataset alone, underlining the advantage of detecting
significantly enriched pathways across multiple complementary omics
datasets. Similarly, Hippo-related genes were either supported by
ChIP-seq data alone (54 genes), mRNA data alone (30 genes), or both by
mRNA and ChIP-seq data (22 genes), concordant with the notion that
RNA-seq and ChIP-seq show limited agreement regarding TF target genes
and provide complementary insights into gene regulation. Thus, our
integrative analysis of co-expression networks and TF–DNA interactions
of Hippo master regulators expectedly converges on Hippo-related genes
and pathways.
Fig. 4. Integrative pathway enrichment analysis of Hippo target genes across
human tissues.
[111]Fig. 4
[112]Open in a new tab
a Enrichment map of GO processes and Reactome pathways enriched in the
target genes of transcriptional regulators YAP and TAZ of the Hippo
pathway. Co-expressed target genes of YAP and TAZ across normal human
tissues of the GTEx dataset (pathways are shown in green and yellow,
respectively) and DNA-binding target genes of YAP from ChIP-seq
experiments (shown in blue) were analyzed. Pathways only found through
the integration of ChIP-seq and RNA-seq data are shown in red. b Euler
diagram shows 106 Hippo-related genes that were significantly enriched
in the detected pathways and supported by a combination of signals in
RNA-seq and ChIP-seq datasets. Core Hippo genes detected in the
analysis are listed. c Regulatory network of 17 TFs and 1,426 target
genes detected in the ActivePathways analysis of gene sets representing
transcription factor target genes. Transcription factors with enriched
target genes in the YAP/TAZ regulome are shown in multi-colored
circles. Target genes are colored by increasing statistical
significance (turquoise to red). d Integrative analysis of pathways and
GO processes is complementary to the analysis of transcription factor
targets. Euler diagram shows total number of pathway-associated
identified in the analysis of GO and Reactome terms (left) and TF
target genes from ENCODE (right). Numbers of Hippo-related genes are
shown in brackets.
In addition to GO terms and pathways, ActivePathways can be used to
interpret omics data with other classes of functional gene sets such as
TF target genes. To further elucidate gene regulatory networks
downstream of YAP/TAZ, we considered potential enrichments of
DNA-binding target genes of 161 TFs profiled in ChIP-seq studies of the
ENCODE project^[113]36. We found a regulatory network of 17 TFs and
1426 target genes enriched in the YAP/TAZ regulome (Q < 0.05, ranked
hypergeometric test) (Fig. [114]4c). These included the DNA-binding
targets of the master regulators of pluripotent stem cells^[115]37
NANOG (208/774 target genes, Q = 1.3 × 10^−6) and POU5F1 (107/406
target genes, Q = 7.3 × 10^−11). This finding is in agreement with the
role of Hippo pathway activity in stem cell regulation^[116]38. The
regulatory network was significantly enriched in 50 Hippo-related genes
(25 expected, P = 1.2 × 10^−7, Fisher’s exact test) and six core Hippo
genes (two expected, P = 0.030; WWTR1, VGLL4, TNIK, MAPK8, MOB1A,
LATS1), similarly to the pathway-based analysis above. However, the two
analyses revealed distinct genes: 886 genes were commonly found by both
analyses, 1180 genes were found only in the pathway-based analysis, and
540 genes were found only in the TF-based analysis (Fig. [117]4d).
Thus, integrative enrichment analysis of TF target genes provides
complementary information to GO terms and pathways. In summary, this
analysis highlights genes and pathways related to Hippo signaling in
human tissues and demonstrates the use of ActivePathways for studying
gene regulatory networks across complementary omics datasets and
technology platforms.
Evaluating the robustness and sensitivity of ActivePathways
We carefully benchmarked ActivePathways using the dataset of cancer
driver genes predicted by PCAWG^[118]13. First, we compared the
performance of ActivePathways with six methods used in the PCAWG
pathway and network analysis working group^[119]14 (Hierarchical
HotNet^[120]39,[121]40, SSA−ME^[122]41, NBDI^[123]42, induced
subnetwork analysis^[124]40, CanIsoNet^[125]43, hypergeometric test).
These diverse methods used molecular interaction networks, functional
gene sets and/or transcriptomics data to analyze the PCAWG pan-cancer
dataset of predicted cancer driver genes. A subsequent consensus
analysis defined pathway-implicated driver gene lists with
protein-coding and non-coding mutations, based on a majority vote of
the pathway and network analysis methods^[126]14. ActivePathways
recovered these consensus gene lists with the highest accuracy: 100% of
coding driver genes (87/87) and 85% of non-coding candidate genes
(79/93) were detected (Fig. [127]5a). Thus ActivePathways agreed the
most with the ensemble of several other methods in prioritizing known
and candidate cancer driver genes using pathway and network context.
Fig. 5. Benchmarking of ActivePathways.
[128]Fig. 5
[129]Open in a new tab
a Comparison of ActivePathways with six additional pathway and network
analysis methods used in the PCAWG pathway and network consensus
analysis. ActivePathways best recovers the consensus lists of
pathway-implicated driver genes with coding and non-coding mutations
(indicated by asterisk). The consensus lists are shown in the leftmost
bars of the plot and have been compiled through a majority vote of the
seven methods in the PCAWG pathway and network analysis working group.
b Comparison of ActivePathways (leftmost bars) and common pathway
enrichment analysis using multiple significance cut-offs of PCAWG gene
lists with protein-coding and non-coding mutations. ActivePathways
shows increased sensitivity of pathway analysis even at the most
lenient gene list significance cut-offs and recovers additional
pathways only detected through integration of multiple datasets (dark
red).
We also compared the performance of ActivePathways with a standard
approach of pathway enrichment analysis that considers a single
statistically-filtered gene list using a ranked hypergeometric
test^[130]1,[131]10 (Fig. [132]5b). To this end, we analyzed individual
gene lists of protein-coding and non-coding drivers of the PCAWG
adenocarcinoma cohort using multiple gene selection thresholds
(Q < 0.05, Q < 0.1, Q < 0.25, P < 0.1). ActivePathways showed increased
sensitivity of pathway enrichment analysis compared to the standard
approach, in particular for pathways involving non-coding mutations
that were not promimently represented in any single gene lists. Even
compared to the analysis employing the most lenient gene selection
filter (P < 0.1), ActivePathways identified 72 additional pathways that
were only apparent through the integration of coding and non-coding
mutations and remained undetected in the analysis of individual
datasets. Thus, our method provides additional information to common
approaches that focus on single gene lists.
We evaluated the robustness of ActivePathways to parameter variations
and missing data. We varied the parameter P[gene] that determines the
ranked gene lists used in the pathway enrichment analysis (default
threshold P[gene] < 0.1). The majority of PCAWG cohorts (40/47 or 85%)
retrieved significantly enriched pathways even with a considerably more
stringent threshold (P[gene] < 0.001), however 67% fewer pathways were
found compared to the default threshold in the median cohort
(Supplementary Fig. [133]2). We then evaluated the robustness of
ActivePathways to missing data by randomly removing subsets of driver
scores from the initial dataset. Even when removing 50% of gene driver
P-values with P < 0.001, the majority of cohorts (37/47 or 79%) had at
least one significantly enriched pathway, however 66% fewer pathways
were found on average (Supplementary Fig. [134]3).
We evaluated the expected false positive rates of ActivePathways. We
tested 1,000 simulated datasets for each of 47 patient cohorts and
expectedly found no significant pathways in 92% of these simulations
(Supplementary Fig. [135]4). Simulated data were obtained by randomly
reassigning P-values of driver predictions to different genomic
elements, a conservative approach that disrupts pathway annotations of
genes while retaining the presence of strong P-values observed in the
real data. The median family-wise false discovery rate of
ActivePathways computed across cohorts (7.2%) slightly exceeded the
applied multiple testing correction (Q < 0.05). Higher rates were
observed in cohorts including melanoma tumors, potentially due to
abundant promoter mutations caused by impaired nucleotide excision
repair in protein-bound genomic regions^[136]44. We evaluated
quantile-quantile (QQ) plots of pathway-based ranked hypergeometric
P-values from ActivePathways and found that these often deviated from
the expected uniform distribution and appeared statistically inflated
(Supplementary Fig. [137]5). However, P-values derived from simulated
gene scores showed no inflation in our simulations. Anticipating that
the most significant cancer driver genes associated with protein-coding
mutations, we performed partial simulations. We constructed datasets
with simulated gene P-values for protein-coding mutations and true
P-values for non-coding mutations. As expected, partially simulated
datasets showed a lesser extent of P-value inflation, suggesting that
highly significant known cancer genes are responsible for inflation due
to their involvement in many pathways. Statistical testing of highly
redundant pathways and processes violates the independence assumption
of statistical tests and multiple testing procedures, a known caveat of
pathway enrichment analysis^[138]1, which likely explains the observed
distribution of significance values of our method. Collectively, these
benchmarks show that ActivePathways is a sensitive and robust method
for detecting significantly enriched pathways and processes through
integrative analysis of multi-omics data.
Discussion
Integrative pathway enrichment analysis helps distill thousands of
high-throughput measurements to a smaller number of pathways and
biological themes that are most characteristic of the experimental data
at hand, ideally leading to mechanistic insights and candidate genes
for follow-up studies. In particular, a joint analysis of complementary
datasets often leads to insights that are unavailable in any particular
dataset. ActivePathways provides a generally available framework for
systematically prioritizing genes and pathways across multiple omics
datasets that utilizes fusion of gene significance. This allows us to
identify pathways and processes that stand out when the data are
combined, yet are not apparent in any single analyzed dataset. The use
of ActivePathways is demonstrated in the three case studies above. In
our example of cancer driver discovery, pathway analysis provides
complementary insights to gene-focused driver discovery as it also
focuses on sub-significant genes that are clustered into relevant
biological processes and carry coding and non-coding mutations. In the
integrative analysis of molecular alterations in breast cancer
subtypes, we find a spectrum of genes and pathways whose molecular
signatures in the tumors or the microenvironment have potential
prognostic significance. A subset of these findings, such as
anti-apoptotic signaling, is only apparent through data integration. In
the final example, we use ActivePathways to associate the co-expression
and DNA-binding networks of Hippo master regulators to downstream
pathways and processes with multi-omics evidence.
Our general pathway analysis strategy is applicable to diverse kinds of
datasets where well-calibrated P-values are available for the entire
set of genes or proteins. A multi-omics study may quantify genes using
a series of genomic, transcriptomic and proteomic experiments and
compute corresponding P-values. Data from epigenomic experiments and
genome-wide association studies can be also analyzed after signals have
been appropriately mapped to genes, for example by identifying ChIP-seq
peaks in gene promoters similarly to our GTEx analysis. Clinical and
phenotypic information of patients can be also included through
association and survival statistics, as shown in our analysis of
prognostic signatures of breast cancer subtypes. Our method is expected
to work with raw, unadjusted P-values and also with Q-values adjusted
for multiple testing, however it is primarily intended for unadjusted
P-values for increased sensitivity. ActivePathways conducts multiple
testing correction at the pathway level and reports significantly
enriched pathways at a family-wise error rate cutoff, regardless of the
gene-specific multiple testing correction applied upstream.
Quantification of genes and proteins through P-values is more robust
than quantification through their abundance measures such as counts or
fold-changes. P-values provided to ActivePathways need to be computed
using dedicated methods for individual omics platforms such that
inherent biases in the data are accounted for prior to pathway
analysis. In our example of cancer driver discovery, appropriately
computed P-values of driver gene predictions from the PCAWG
project^[139]13 account for confounding factors of somatic mutations,
such as gene sequence length and nucleotide content, mutation
signatures active in different types of tumors^[140]45 and biological
correlates of mutation frequency such as transcription and replication
timing^[141]46. On the one hand, considering all such variations
directly in the pathway enrichment analysis would require substantially
more complex models. On the other hand, directly analyzing pathways
using simpler metrics (such as mutation counts or frequencies) would
propagate any upstream biases to the pathway enrichment analysis and
cause challenges with false positives and data interpretation. Thius,
given appropriately derived P-values for genes, proteins and other
molecules, ActivePathways can be applied to a wide range of analyses.
Our method comes with important caveats. First, we only evaluate genes
and proteins annotated in pathway databases. Such databases have
variable coverage, rely on frequent data updates^[142]47 and may miss
sparsely annotated candidate genes. The most general type of pathway
enrichment analysis considers biological processes and molecular
pathways however many kinds of gene sets available in resources such as
MSigDB^[143]48 can be used to expand the scope of ActivePathways.
Second, pathway information is highly redundant and analyses of rich
molecular datasets often result in many significant results reflecting
the same underlying pathway. We address this redundancy by visualizing
and summarizing pathway results as enrichment maps^[144]1,[145]20 that
summarize multiple similar pathways and processes into general
biological themes. Statistical inflation of results accompanied by
biological redundancy is addressed by a stringent multiple testing
correction. Third, the analysis treats pathways as gene sets and does
not consider interactions of genes in pathways. This simplified
strategy allows us to consider a wider repertoire of pathways and
processes as reliable mechanistic interactions are often
context-specific and limited to a small subset of well-studied
signaling pathways. Several advanced methods such as HotNet^[146]39,
PARADIGM^[147]49, and GeneMania^[148]50 model pathways through gene and
protein interactions.
Translation of discoveries into improved human health through
actionable mechanistic insights, biomarkers, and molecular therapies is
a long-standing goal of biomedical research. For example,
next-generation cancer genomics projects such as ICGC-ARGO aim to
collect multi-omics datasets with detailed clinical profiles that will
present new challenges for pathway and network analysis techniques. In
summary, ActivePathways is integrative pathway analysis method that
improves systems-level understanding of cellular organization in health
and disease.
Methods
Integrated and evidence-based gene lists
The main input of ActivePathways is a matrix of P-values where rows
include all genes of a genome and columns correspond to evidence from
omics datasets. To interpret multiple omics datasets, a combined
P-value is computed for each gene using a data fusion approach,
resulting in an integrated gene list. The integrated gene list is
computed by merging all P-values of a given gene into one combined
P-value using the Brown’s extension^[149]18 of the Fisher’s combined
probability test that accounts for overall covariation of P-values from
different sources of evidence. The integrated gene list of Brown
P-values is then ranked in order of decreasing significance and
filtered using a lenient threshold (unadjusted P < 0.1 by default).
Evidence-based gene lists representing different omics datasets are
based on ranked P-values from individual columns of the input matrix
and filtered using the same significance threshold.
Statistical enrichment of pathways
Statistical enrichment of pathways in ranked lists of candidate genes
is carried out with the ranked hypergeometric test. The test considers
one pathway gene set at a time and analyzes increasing subsets of input
genes from the top of the ranked gene list. The same procedure is used
for integrated and evidence-based gene lists. At each iteration, the
test computes the hypergeometric enrichment statistic and P-value for
the set of genes shared by the pathway and top sub-list of the input
gene list. For optimal processing speed, only gene lists ending with a
pathway-related gene are considered. The ranked hypergeometric
statistic selected the input gene sub-list that achieves the strongest
enrichment and the smallest P-value as the final result for the given
pathway, as:
[MATH: Ppathway,G=min,argminn
mrow>∑x=kminn,K
munderover>KkN−Kn−kNn
mrow>, :MATH]
where P[pathway] stands for the hypergeometric P-value of the pathway
enrichment at the optimal sub-list of the significance-ranked candidate
genes, G represents the length of the optimal sub-list, i.e., the
number of top genes from the input gene list, N is the number of
protein-coding genes with annotations in the pathway database, i.e., in
Gene Ontology and Reactome, K is the total number of genes in a given
pathway, n is the number of genes in a given gene sub-list considered,
and k is the number of pathway genes in the considered sub-list. For a
conservative estimate of pathway enrichment, we consider as background
N the universe of genes contained in input gene sets (terms from
pathway databases and ontologies) rather than the complete repertoire
of protein-coding genes. To obtain candidate genes involved in the
pathway of interest, we intersect pathway genes with the optimal
sub-list of candidate genes. The ranked hypergeometric P-value is
computed for all pathways and resulting P-values are corrected for
multiple testing using the Holm-Bonferroni method of family-wise error
rate (FWER)^[150]19. Significant pathways are reported by default
(Q < 0.05).
Evaluating omics evidence of enriched pathways
Each evidence-based gene list derived from a single omics dataset is
also analyzed for enriched pathways with the ranked hypergeometric
test. Pathways found in the integrated gene list are then labeled for
supporting evidence in the case they are also found as significant in
any evidence-based gene list. A pathway is considered to be found only
through data integration and labeled as combined-only if it is
identified as enriched in the integrated gene list but not identified
as enriched in any of the evidence-based gene lists at equivalent
significance cut-offs (default Q < 0.05). Each detected pathway is
additionally annotated with pathway genes apparent in the optimal
sub-list of candidate genes separately for the integrated gene list and
each evidence-based gene list.
Pathways and processes
We used gene sets corresponding to biological processes of Gene
Ontology^[151]16 and molecular pathways of the Reactome
database^[152]17 downloaded from the g:Profiler web server^[153]10 on
Jan 26th 2018. Large general gene sets with more than a thousand genes
and small specific gene sets with less than five genes were excluded to
avoid statistical inflation of large gene lists and interpretation
challenges with very small lists.
Enrichment map visualization
ActivePathways creates input files for the EnrichmentMap app^[154]20 of
Cytoscape^[155]51 for network visualization of similar pathways and
their coloring according to supporting evidence. Enrichment maps for
adenocarcinoma driver mutations, prognostic quantification of molecular
alterations of breast cancer, and transcriptional networks downstream
of Hippo signaling were visualized with stringent pathway similarity
scores (Jaccard and overlap combined coefficient 0.6) and manually
curated for the most representative groups of similar pathways and
processes. Singleton pathways that were redundant with larger groups of
pathways were merged with the latter or discarded. Coloring of pathways
in the adenocarcinoma enrichment map was rearranged by merging colors
of pathways supported by non-coding mutation scores of promoters,
enhancers and/or UTRs into one group.
Coding and non-coding mutations of the PCAWG dataset
We used ActivePathways to analyze driver predictions of coding and
non-coding mutations across white-listed 2583 whole cancer genomes of
the ICGC-TCGA PCAWG project. P-values of driver predictions were
computed separately for protein-coding sequences, promoters, enhancers
and untranslated regions (UTR3, UTR5 across multiple subsets of samples
representing histological tumor types and pan-cancer cohorts as
reported in the PCAWG driver discovery study^[156]13. We used
gene-enhancer mapping predictions provided by PCAWG, excluded enhancers
with more than five target genes, and selected the most significant
enhancer for each gene, if any. Unadjusted P-values for coding
sequences, promoters, enhancers and UTRs were compiled as input
matrices and analyzed as described above. Missing P-values were
conservatively interpreted as ones. Results from ActivePathways were
validated with two lists of cancer genes. Predicted drivers from the
PCAWG consensus analysis^[157]13 were selected as statistically
significant findings (Q < 0.05) following a stringent multiple testing
correction spanning all types of elements (exons, UTRs, promoter,
enhancers). The curated list of known cancer genes was retrieved from
the COSMIC Cancer Gene Census (CGC) database^[158]11. One-tailed
Fisher’s exact tests were used to estimate enrichment of these genes in
our results, using all human protein-coding genes as the statistical
background set.
Prognostic CNA and mRNA signals in breast cancer
ActivePathways was used to evaluate prognostic pathways in breast
cancer subtypes. Multiple types of omics data were used for an
integrative analysis: mRNA gene expression data and gene copy number
alteration (CNA) data of the were derived from the METABRIC cohort of
1991 patients with a single primary fresh frozen breast cancer specimen
each^[159]15. Curtis et al.^[160]15 classified the patients into the
intrinsic breast cancer subtypes using the PAM50 mRNA-based
classifier^[161]52 resulting in 330 basal-like breast cancers, 238
HER2-enriched breast cancers, 721 luminal-A breast cancers, 491
luminal-B breast cancers. Using these data, we computationally
deconvolved tumor cell (TC) mRNA and tumor adjacent cell (TAC) mRNA
abundance levels from the bulk profiled specimens. TC mRNA was
deconvolved using the ISOpure^[162]21 method release 2010b in MATLAB.
TAC mRNA abundance profiles were computed using the
ISOpure.calculate.tac function from the R package ISOpureR^[163]22
v1.1.2. The deconvolution analyses were performed independently for
each breast cancer subtype. The mRNA univariate survival analysis was
conducted as follows. For each gene, patients were dichotomized based
on mRNA abundance. Dichotomization was either based on the median mRNA
abundance for that gene or a fixed value of 6.5. Based on the mRNA
abundance distribution of genes on the Y chromosome in female samples,
the value 6.5 was estimated as the threshold for noise for
non-expressed genes. Median dichotomization was used if the median was
above 6.5 or if there were no events in one of the groups when
dichotomizing based on 6.5. The high and low groups based on mRNA
abundance were compared by univariate log-rank tests for overall
patient survival. TC and TAC mRNA abundance values were evaluated
independently. Survival modeling was performed in the R statistical
environment (v3.4.3) using the survival package (v2.42–3). The CNA
univariate survival analysis was conducted as follows. For each gene,
we assessed whether more gains or losses were apparent. The copy number
status with a higher count was subsequently used to separate patients
into two groups: those with the chosen copy number status and the
remaining patients. The two groups were then used for overall survival
modeling with log-rank tests.
Hippo pathway target genes in mRNA and ChIP-seq data
This analysis included two types of omics data, mRNA abundance
measurements from RNA-seq experiments and transcription factor
DNA-binding measurements from chromatin immunoprecipitation sequencing
(ChIP-seq) experiments. The RNA-seq dataset of human tissues was
downloaded from GTEx v7 data portal
([164]https://www.gtexportal.org/home/). The dataset included
transcript abundance values of 21,518 protein-coding genes in 11,688
samples across 53 tissues. Tissues with less than 25 available samples
and low gene expression (mean TPM < 1.0; transcripts per million) were
excluded from further analysis, resulting in 40 tissues and 9672
samples with mRNA abundance profiles of 19,025 genes. Transcriptional
target gene lists for the master transcription factors YAP and TAZ
(encoded by YAP1, WWTR1) were predicted separately in the following two
steps. First, we computed pairwise Pearson correlations between a given
TF and all other genes within a tissue of interest and ranked these by
the significance P-value of positive correlations. Second, the
resulting ranked gene lists were then aggregated into one master target
gene list across the GTEx tissues using the robust rank aggregation
(RRA) method^[165]32 with default parameters (Q[gene] < 0.05).
FDR-adjusted values of genes from RRA for YAP and TAZ were used as the
first and second evidence for input of ActivePathways, respectively.
For DNA-binding targets of YAP, we retrieved ChIP-seq binding sites in
three cell lines (CCLP1, MSTO, and HUCCT1) from an earlier
study^[166]33 that were reprocessed in the ReMap database^[167]34.
Binding sites were filtered by statistical significance (Q < 0.05) and
mapped to gene promoters of the human genome (hg19) using gene
promoters defined in the PCAWG driver analysis^[168]13. If a promoter
had multiple binding sites, the site with the strongest FDR-value was
selected as a representative site for that gene. FDR-adjusted values of
genes with YAP DNA-binding sites were used as the third evidence for
input of ActivePathways. Significantly enriched pathways among the
putative target genes of YAP and TAZ were subsequently detected using
ActivePathways. We compiled a list of 308 Hippo-related genes from the
KEGG pathway database^[169]35 and two recent review
papers^[170]30,[171]31. To validate the analysis, we tested the overall
sets of genes identified by ActivePathways for enrichment of
Hippo-related genes using Fisher’s exact tests. The Hippo analysis was
conducted separately for two collections of functional gene sets.
First, we tested GO biological processes and Reactome pathways
similarly to analyses described above. Second, we used gene sets
corresponding to transcription factor binding sites (TFBS) derived from
the ENCODE project^[172]36.
Method benchmarking
We benchmarked ActivePathways using multiple approaches, including
simulated datasets of P-values, method parameter variations, and
partial replacement of P-values scores with insignificant P-values.
Benchmarking was performed on the PCAWG predictions of cancer driver
genes with coding and non-coding mutations. To evaluate the false
discovery rate of ActivePathways, we created simulated datasets by
reassigning all observed P-values to random genes and their genomic
elements. Simulations were conducted separately for different tumor
cohorts. One thousand simulated datasets were analyzed with
ActivePathways and those with at least one significantly detected
pathway counted towards false discovery rates. A separate set of
simulations maintained the positions of non-coding P-values among genes
and randomly reassigned P-values corresponding to protein-coding
mutations, expectedly leading to a reduction in detected pathways as
the PCAWG datasets primarily included strong P-values for genes with
frequent protein-coding mutations. Quantile-quantile analysis and
QQ-plots were used to compare P-value distributions of pathways
discovered from true P-values, partially shuffled P-values (true
non-coding and shuffled protein-coding P-values), and fully shuffled
P-values. To evaluate robustness of ActivePathways, we randomly
replaced a fraction of significant driver P-values in the true dataset
(P < 0.001) with insignificant P-values (P = 1). We tested different
fractions of missing values (10, 25, and 50%) across a thousand
datasets of randomly selected missing data points. We concluded that
most PCAWG cohorts included significantly enriched pathways even with
large fractions of missing data. To further evaluate robustness, we
tested different values of the Brown P-value threshold used to select
the integrated gene list for pathway enrichment analysis. The default
parameter value (P[gene] < 0.1) was compared to alternative values
(0.001, 0.01, 0.05, and 0.2). We concluded that ActivePathways found
enriched pathways in most tumor cohorts even at more stringent gene
selection levels.
Ethical review
Sequencing of human subjects’ tissue was performed by ICGC and TCGA
consortium members under a series of locally approved Institutional
Review Board (IRB) protocols as described in Hudson et al. Informed
consent was obtained from all human participants. Ethical review of the
current data analysis project was granted by the University of Toronto
Research Ethics Board (REB) under protocol #30278, “Pan-cancer Analysis
of Whole Genomes: PCAWG”.
Reporting summary
Further information on research design is available in the [173]Nature
Research Reporting Summary linked to this article.
Supplementary information
[174]Supplementary Information^ (1MB, pdf)
[175]Reporting Summary^ (93.8KB, pdf)
Acknowledgements