Abstract
The emergence of large-scale multi-omics data warrants method
development for data integration. Genomic studies from cancer patients
have identified epigenetic and genetic regulators – such as methylation
marks, somatic mutations, and somatic copy number alterations (SCNAs),
among others – as predictive features of cancer outcome. However,
identification of “driver genes” associated with a given alteration
remains a challenge. To this end, we developed a computational tool,
iEDGE, to model cis and trans effects of (epi-)DNA alterations and
identify potential cis driver genes, where cis and trans genes denote
those genes falling within and outside the genomic boundaries of a
given (epi-)genetic alteration, respectively. iEDGE first identifies
the cis and trans gene expression signatures associated with the
presence/absence of a particular epi-DNA alteration across samples. It
then applies tests of statistical mediation to determine the cis genes
predictive of the trans gene expression. Finally, cis and trans effects
are annotated by pathway enrichment analysis to gain insights into the
underlying regulatory networks. We used iEDGE to perform integrative
analysis of SCNAs and gene expression data from breast cancer and 18
additional cancer types included in The Cancer Genome Atlas (TCGA).
Notably, cis gene drivers identified by iEDGE were found to be
significantly enriched for known driver genes from multiple compendia
of validated oncogenes and tumor suppressors, suggesting that the
remainder are of equal importance. Furthermore, predicted drivers were
enriched for functionally relevant cancer genes with
amplification-driven dependencies, which are of potential prognostic
and therapeutic value. All the analyses results are accessible at
[32]https://montilab.bu.edu/iEDGE. In summary, integrative analysis of
SCNAs and gene expression using iEDGE successfully identified known
cancer driver genes and putative cancer therapeutic targets across 19
cancer types in the TCGA. The proposed method can easily be applied to
the integration of gene expression profiles with other epi-DNA assays
in a variety of disease contexts.
Subject terms: Computational models, Cancer genomics
Introduction
A central goal of cancer genomics is to identify key genetic and
epigenetic alterations that promote initiation and/or progression of
cancer. These alterations can manifest themselves across multiple
biological levels – DNA, RNA, protein – and can be correspondingly
quantified by multiple high-throughput profiling technologies.
Large-scale cancer genomics data compendia such as The Cancer Genome
Atlas (TCGA) have generated comprehensive multi-omics datasets for tens
of thousands of patients across ~30 types of cancer^[33]1. The
availability of such large-scale datasets provides an opportunity to
develop methods that integrate data from multiple types of profiling
platforms, supporting the discovery of novel candidate driver genes and
eventually diagnostic and prognostic biomarkers and therapeutic
targets.
Past research using integrative approaches have shown success in
discovering novel cancer drivers^[34]2,[35]3. However, many existing
methods often rely on ad-hoc, albeit sophisticated, analysis methods
and scripts not accessible to analysts other than those responsible for
their development. Furthermore, the generated analysis results are
often static, and not accessible in an interactive fashion. The
approach presented here aims to address both of these shortcomings.
The central hypothesis behind integrative approaches is that the
integration of multi-level genomics data allows for prioritization of
putative cancer “drivers” that are potential biomarkers or therapeutic
targets. An important genetic alteration type in cancer is somatic
copy-number alterations (SCNAs). SCNAs harbor many known cancer drivers
(oncogenes or tumor suppressors) and play an important role in cancer
initiation and/or progression through activation of oncogenes and
inactivation of tumor suppressors^[36]4,[37]5. Identification of
unknown SCNA-associated cancer drivers is complicated by the fact that
each SCNA contains many genes, often even a complete chromosome arm,
the majority of which is likely not to confer any selective advantage
(i.e., passengers). One approach to address this problem is to prune
the set of candidate drivers based on their association with paired
omics data, such as gene expression profiles. For example, one can
prioritize genes found in frequent SCNA peaks whose gene expression
changes are associated with corresponding copy number
changes^[38]6,[39]7. Even after this pruning step, the set of remaining
candidate drivers might still yield too many testable hypotheses for
use in functional validation studies. More importantly, association
between SCNA and gene expression alone may not be the best metric for
ranking potential cancer drivers.
To address this problem, we developed a methodology that identifies
SCNA- (or other alteration-) associated genes and performs prediction
of cis gene drivers prioritized by their capability of transactivating
downstream gene expression (trans effects), by leveraging the
trans-gene signature (gene sets outside the SCNA of interest) whose
expression is also associated with a given SCNA event of interest. The
approach is predicated on the hypothesis that SCNA-related drivers of
tumorigenesis will mediate a larger proportion of the downstream effect
observable by trans gene expression than non-drivers. Using this
heuristic, and based on the Sobel test of mediation^[40]8, we identify
putative SCNA-associated cancer drivers as the cis gene mediating the
most trans gene expression, although the method allows users to
customize the set of trans genes considered for the ranking of putative
cis gene drivers.
We developed a corresponding software tool, integration of Epi-DNA and
Gene Expression (iEDGE), available at
[41]https://github.com/montilab/iEDGE ^[42]9 for the prediction of
(epi-)DNA-associated cancer drivers. Some of iEDGE’s methodological
components were previously applied to the cis/trans analysis of diffuse
large B cell lymphoma multi-omics datasets, and have shown success in
uncovering SCNA-associated cis driver genes^[43]6,[44]10. Here, we
present an expanded and optimized method crucially incorporating the
mediation step, and we present the results of its application to the
prediction of SCNA-associated cancer drivers across 19 cancer types
using data from TCGA, with a particular focus on analysis of TCGA
breast cancer. An interactive web portal of pan-cancer TCGA analysis is
available at [45]https://montilab.bu.edu/iEDGE^[46]11. Our list of
candidate drivers is highly enriched for known oncogenes and tumor
suppressors and additionally implicates many suspected drivers as well
as novel candidate genes with potential prognostic or therapeutic
importance in cancer.
Methods
iEDGE overview
An overview of the iEDGE approach is summarized in Fig. [47]1. Briefly,
iEDGE integrates samples quantified from a gene expression profiling
assay paired with one or more genomic or epi-genomic assays capturing
information upstream of gene expression, such as SCNAs, DNA
methylation, or microRNA expression. First, we identify the list of
genes associated in “cis” with the upstream epi-DNA alteration. In the
case of SCNAs, cis genes of each SCNA in this list are defined as genes
within the focal peaks of the SCNA, and trans genes are defined as
genes outside of the focal peaks. Then, iEDGE performs differential
expression analysis to identify cis and trans genes significantly
associated with each SCNA and, optionally, pathway enrichment analysis
of each significant cis and trans gene sets (Fig. [48]1A). Next, iEDGE
predicts cis driver genes using mediation analysis, wherein each
differentially expressed cis gene is ranked by the number of
differentially expressed trans genes it mediates as determined using
the Sobel test of mediation (Fig. [49]1B).
Figure 1.
[50]Figure 1
[51]Open in a new tab
Overview of iEDGE workflow. (A) iEDGE identifies the cis and trans gene
expression signatures of epi-DNA alterations, e.g., Somatic Copy Number
Alterations (SCNAs), through differential expression analysis. It then
performs pathway enrichment analysis to identify pathways or genesets
associated with each SCNA. (B) Secondly, it performs cis-to-trans gene
mediation analysis to identify putative driver cis genes of each
epi-DNA alteration, defined as the cis genes that mediate the most
trans gene expression.
Copy number and gene expression pre-processing
We utilized the dataset of somatic copy number alterations and RNA-seq
from the TCGA breast cancer cohort, which was preprocessed using
Firehose v0.4.13 and downloaded from the Broad Institute TCGA GDAC
repository ([52]http://gdac.broadinstitute.org/runs/)^[53]12.
The SCNA dataset was preprocessed using GISTIC2.0^[54]13 under the
Firehose run release analyses__2015_08_21, which identified 29
significant focal amplifications and 40 significant focal deletions to
be considered for integrative analysis^[55]14. SCNA status by sample
was binarized using the amplitude threshold of 0.1, that is, SCNA
status = 1: t > 0.1 or 0: t ≤ =0.1 for amplifications, and 1: t < −0.1
or 0: t ≥ = −0.1 for deletions. Cis genes were identified by GISTIC2.0
as genes in the wide peak of each significant SCNA with boundaries
selected at the confidence level of 0.99. Trans genes were identified
as genes outside the wide peak of each SCNA.
The gene expression data is a RSEM processed gene expression matrix
(stddata__2015_06_01)^[56]15. Expression values were log[2]-transformed
prior to integrative analysis. The samples were categorized into breast
cancer subtypes using a combination of the pam50 classifier^[57]16 and
the HER2 status. HER2 status was determined using HER2 receptor
activity, labeled positive if tested positive by either FISH or IHC
method. In HER2 negative samples, the pam50 classification was used.
Samples with gene expression-based membership in one of four major
breast cancer subtypes (Luminal A, Luminal B, Her2, Basal) were
retained for integrative analysis. Tumors classified as “Normal-like”
by pam50 were removed from further analysis. A total of 1050 samples
(primary solid tumors only) were found with paired gene expression and
SCNA data by matching the sample barcode identifier (combination of
patient id and sample type).
Determining significantly expressed SCNA-associated cis and trans genes
We performed differential expression of cis and trans genes with
respect to each GISTIC2.0 defined significant focal SCNA peak. The
significance of differential expression was estimated using limma
(Ritchie et al. 2015) for each SCNA with samples split into two groups
(amplified vs. normal for amplification peaks, and deleted vs. normal
for deletion peaks) using FDR < 0.25 and fold change > 1.2 for cis
genes, and FDR < 0.01 and fold change > 1.5 for trans genes. One-sided
significant levels were reported for cis genes with the rationale that
a focal amplification is commonly associated with an increase in gene
expression and a deletion is associated with a decrease in gene
expression. Two-sided significant levels were reported for trans genes,
as indirect downstream effects can occur through either transcriptional
repression or activation.
Pathway enrichment analysis of significant cis and trans gene sets
Significantly differentially expressed cis and trans gene sets were
tested for pathway enrichment using the MSigDB gene set compendia
hallmark (hallmark gene sets), c2.cp (curated gene sets from online
pathway databases), and c3 (motif gene sets), version 5.0^[58]17. The
significance of pathway enrichment was determined using a
hypergeometric distribution-based test and corrected for multiple
hypothesis testing using the False Discovery Rate (FDR) method^[59]18.
For breast cancer-specific pathway enrichment results, pathways with
significant enrichment (FDR < 0.25) in any SCNA were reported. In
addition, each SCNA was labeled according to its over-representation in
a particular breast cancer subtype using a one-sided Fisher exact test
comparing counts of SCNA occurrence within vs. outside each breast
cancer subtype (FDR < 0.05). Subtype-specific SCNAs were subsequently
used in conjunction with pathway enrichment results to determine
subtype-specific pathway enrichments using a one-sided Fisher exact
test (FDR < 0.05).
Mediation testing and prediction of cis drivers
To elucidate which cis genes are likely to mediate the association
between copy number alteration and trans gene expression, we used the
Sobel test to estimate the mediation effect of each cis gene and its
significance^[60]8. Briefly, we model the association for each triplet
of SCNA, cis gene, and trans gene using the linear regression models
specified in Fig. [61]1B. The mediation effect of the cis gene:
Δτ = τ − τ‘, represents the change in the magnitude of the effect of
the SCNA status on the trans gene expression after controlling for the
cis gene expression. The significance of the mediation effect is
calculated from the t statistic: t = Δτ/SE, where SE is the pooled
standard error term, and is compared to the normal distribution to
determine the p-value and FDR^[62]18.
An important simplifying model assumption is that the association
between each SCNA and trans gene is mediated by at most one cis gene,
therefore the cis gene mediator for each SCNA-trans gene pair is chosen
based on the most significant mediation effect (ranked by the FDR
values of the mediation test).
Once the cis genes mediator is determined for each unique trans gene,
the mediation effect of cis gene i on trans gene j can be expressed as
either binary (0 or 1), based on the significance of the mediation, or
as the weight
[MATH:
wij=Δτijsign(τij)×τij :MATH]
, limited to the range of [0, 1]. Using the latter weighted approach,
the total fractional weighted mediation effect of cis gene i across m
significantly expressed trans genes, also referred to as the Weighted
Fraction of Trans Mediation (WFTM), is expressed as
[MATH:
WFTMi<
/mi>=∑1<
/mn>mwij×I
ijm :MATH]
, where I[ij] denotes the indicator variable taking the value 1 if cis
gene i is has the most significant mediation effect on trans gene j
among all cis genes, 0 otherwise.
Next, for each SCNA, we rank each cis gene based on its total mediation
effect M[i]. The cis gene with the highest value of M[i] is denoted as
the “Rank-1 cis gene” for the given SCNA, the candidate driver gene of
the alteration.
Assessing enrichment of predicted cis drivers in databases of known cancer
drivers
To investigate the functional impact of putative drivers identified by
iEDGE, we tested for the enrichment of iEDGE predicted driver genes in
several cancer driver databases. Reference cancer driver genes, denoted
as either “oncogenes” or “tumor suppressors” in the original sources,
were compiled using data from Tuson Explorer (Davoli et al.
2013)^[63]19, Online Mendelian Inheritance in Man (OMIM)^[64]20, Cancer
Gene Census (CGC) from Catalogue of Somatic Mutations In Cancer
(COSMIC)^[65]21,[66]22, and Uniprot^[67]23. We tested for the
overrepresentation among Rank-1 cis genes, compared to non-Rank-1 cis
genes, of known drivers from the reference databases (Table [68]S7).
Enrichment tests were conducted separately for each reference database
and driver type, i.e., oncogene (“_OG”), tumor suppressor (“_TN”), or
both (“_COMBINED”), as well as using the union of the driver genes
across databases (column “ANY” indicates union of drivers across
knowledge bases). Enrichment significance was calculated using a
one-sided Fisher exact test assessing the overrepresentation of Rank-1
vs. non-Rank-1 cis genes with respect to their membership in the
reference driver list, conditional on the direction of change, e.g.
amplified cis genes among oncogenes, deleted cis genes among tumor
suppressors, or direction insensitive. P-values are adjusted with the
FDR procedure to correct for multiple hypothesis testing across 19
tumor types.
Copy number-associated gene dependencies
SCNAs often lead to overexpression of driver oncogenes and confer a
tumor-promoting environment. In other words, driver oncogenes are more
likely to act as essential genes (increased gene dependency) in an
amplified state. To identity such genes, we looked for copy number
associated gene dependencies using data available from DepMap,
specifically, gene dependency data^[69]24 and cancer cell line genomics
data from the Cancer Cell Line Encyclopedia (CCLE)^[70]25,[71]26. In
particular, we mined for genes with associations between gene
dependency (Combined RNAi screens from Broad, Novartis, Marcotte) and
somatic copy number status across cell lines (CCLE). To do this, we
used a linear regression model Y = α X + β where Y is the Gene
Dependency Score and X is the copy number level (log2 relative to
ploidy) across cell lines. Gene dependency scores were calculated using
DEMETER2^[72]24. A negative gene dependency score corresponds to high
gene dependency, e.g., an increased gene essentiality. In contrast, a
high gene dependency score corresponds to non-essential genes. We
looked for genes with significantly negative association between copy
number and Gene Dependency Score, that is, genes in which higher copy
number is associated with higher gene essentiality, using a one-sided
t-test on the coefficient α (alternative hypothesis α < 0).
Additionally, FDR correction was performed across p-values for all
genes. Since gene dependency scores are calculated from only gene
knockdowns, we were only able to test amplification-driven gene
dependencies, whereas overexpression assays would be needed for
detection of deletion-driven gene dependencies.
Finally, to determine if iEDGE was able to uncover an enrichment of
amplification-driven gene dependencies, we tested for enrichment of
genes with amplification-driven gene dependencies among iEDGE-predicted
Rank-1 cis genes using a one-sided Fisher test on the contingency table
of counts (rows: membership in Rank-1 vs. non-Rank-1, columns: presence
vs. absence of amplification driven gene dependencies).
Pan-cancer analysis
TCGA gene expression and copy number (GISTIC2.0) data were retrieved
using Firehose v0.4.13 for 19 cancer types as summarized in
Table [73]S6. Gene expression data (RNASeq) correspond to the latest
release at the time of retrieval (stddata__2015_02_04 for cancer types
ACC, KIRP, THCA and stddata__2016_07_15 for all other cancer types).
SCNA copy number data uses the GISTIC2.0 run corresponding to Firehose
run release analyses__2016_01_28. Gene expression processing and copy
number processing steps for the pan-cancer analysis are consistent with
methods used for the BRCA-only analysis. Of note, the BRCA dataset in
the pan-cancer analysis includes all TCGA BRCA samples with paired copy
number and gene expression data to be consistent with processing of
other TCGA cancer types, contrary to the removal of samples without an
assigned molecular subtype in the BRCA-only analysis.
We tested for enrichment of known cancer driver genes among Rank-1 cis
genes in each of the 19 TCGA cancer types using a one-sided Fisher test
(Supplemental Table [74]S8), consistent with the BRCA-only analysis.
FDR correction was performed on the nominal p-values across all 19
cancer types for each test (unique combination of database origin and
alteration direction, gain or loss). Enrichment tests are direction
sensitive (“OG” tests for enrichment of oncogenes in Rank-1 cis genes
in amplifications, “TN” tests for enrichment of tumor suppressors in
Rank-1 cis genes in deletions, “COMBINED” tests for the union of the
two sets). To determine the significance of the number of expected
subtypes with increased sensitivity of driver predictions compared to
random, we conducted a Kolmogorov-Smirnov test using the p-values of
test of enrichments across subtypes against the uniform distribution in
the range of [0,1].
We also tested for enrichment of amplification-driven gene dependencies
in Rank-1 cis genes across the 19 cancer types (see Methods: Copy
number-associated gene dependencies). Multiple hypothesis correction
using the FDR procedure^[75]18 was used to adjust the significance
values across multiple cancer types. Similarly, the significance of
number of expected subtypes with increased sensitivity of
amplification-driven gene dependencies is determined using a
Kolmogorov-Smirnov test.
Evaluation of the reproducibility of cis driver gene predictions in BRCA
To evaluate the consistency of Rank-1 driver gene predictions, we
generated 100 bootstrapped resamples of the original TCGA breast cancer
dataset using sampling with replacement with the number of samples
equal to the size of the original dataset, derived the predicted Rank-1
cis genes across the 100 bootstrapped datasets, and compared these
predictions against the original list of predicted Rank-1 cis genes
from the full dataset. A reproducibility score was calculated for each
of the original predicted Rank-1 cis gene as the percent of inclusion
of the particular gene as a Rank-1 cis gene among the bootstrapped
results. To explain the variation on reproducibility scores across
genes, we modeled these scores using linear regression models with the
dependent variable being either the Weighted Fraction of Trans Mediated
(WFTM) or the Entropy of WTFM for the alteration of interest,
calculated as the Shannon Entropy of WTFM of all differentially
expressed cis genes within the alteration harboring the Rank-1 cis gene
of interest. A two-sided t-test on the slope, β[1], of the linear
regression, with H[a]: β[1] ≠ 0, was used to estimate significance,
defined as p-value <0.05.
Evaluation of mediation testing from simulated data
The Sobel test of mediation identifies cis genes that mediates SCNA and
trans gene expression. To determine the conditions in which mediation
is correctly identified, we used a forward simulation approach to
generate labeled data of true positives and true negative, and then
applied the mediation test to estimate its sensitivity and specificity.
Of note, for simplicity, these simulated instances represent total, not
partial, mediation.
True positive instances of mediation were generated using the following
linear regression models:
[MATH:
Ya=α1+β1
Xa+N(0,σ1
mn>2) :MATH]
[MATH:
Za=α2+β2
Ya+N(0,σ
22) :MATH]
Here, X[a] denotes the independent variable (SCNA status), Z[a] is a
dependent variable (trans gene expression) and Y[a] is a true mediator
of X[a] and Z[a] (cis gene expression).
True negative instances, representing the lack of a mediation effect,
were generated using the following models:
[MATH:
Yb=α1+β1
Xb+N(0,σ
12) :MATH]
[MATH:
Zb=α2+β2
Xb+N(0,σ
22) :MATH]
Here, X[b] is the independent variable (SCNA status) and Y[b] and Z[b]
are both dependent variables generated based on separate regression
models from X[b].
Variables X, Y, Z are vectors of length n, representing the sample size
of the data. X is a binary vector (0 s and 1 s) corresponding to the
binarized SCNA copy number status. σ is the standard deviation of the
Gaussian noise term. We fixed β[1] and β[2] at 0.7 based on estimation
from real data (TCGA breast cancer). The mediation test was performed
on 1000 simulated true positives and 1000 simulated true negatives, and
performance was measured in terms of AUC, sensitivity and specificity.
For sensitivity and specificity, mediation calls were made based on the
Sobel test p-value of 0.05. Test performance was recorded for simulated
datasets based on a range of values of n (sample size) and σ (standard
deviation of the Gaussian noise in the regression models). The standard
deviation σ is a proxy for the correlation strength between dependent
and independent variables, as higher noise corresponds to weaker
correlation. For interpretability purposes, values for the parameter σ
are converted to the corresponding Pearson correlation estimates using
a Loess model (Local Regression).
Results
Sensitivity and specificity of mediation testing
In order to evaluate the sensitivity and specificity of the mediation
analysis that is part of our approach, as well as to assess the
relationship of those metrics to sample size, we performed an extensive
evaluation based on simulated data.
In particular, evaluation of the Sobel test of mediation was carried
out using simulated data of true positives and true negative examples
of mediation. Test performance, as measured using AUC, sensitivity and
specificity, was recorded for varying combinations of correlation
between the independent variable and mediator (“correlationXY”) and
sample size (“N”) (Fig. [76]2). High specificity (true negative rate)
was consistently achieved for all input parameter ranges. Sensitivity
(true positive rate) drops under conditions of low correlation and low
sample size. Nevertheless, the conditions for lower sensitivity is not
characteristic of real datasets that were used to test and validate
iEDGE. Specifically, high correlation between SCNA and cis genes and
between SCNA and trans genes is expected given that only cis and trans
genes that were significantly expressed with respect to the SCNA were
considered prior to mediation testing. Additionally, sufficient sample
size is achieved in most of TCGA datasets tested (Supplemental
Table [77]S7), with the only exception being the TCGA Adrenocortical
carcinoma dataset (n = 77) in which mediation results should be
interpreted with caution.
Figure 2.
[78]Figure 2
[79]Open in a new tab
Mediation test performance on simulated data. AUC (Area Under the ROC
Curve), sensitivity (true positive rate), specificity (true negative
rate) for various values of correlationXY (correlation between SCNA and
mediator, X-axis) and N (sample size, Y-axis).
iEDGE identifies SCNA-associated cis and trans genes and pathway signatures
in TCGA breast cancer
The first step of the iEDGE-based analysis consists of the
identification of cis and trans signatures, which are the needed input
to the mediation analysis for the prioritization of putative cancer
drivers.
To identify cis and trans gene signatures of SCNAs in breast cancer, we
performed integrative analysis on paired copy number and gene
expression data from TCGA breast cancer primary tumors from four gene
expression-based molecular subtypes (Luminal A, Luminal B, Her2 and
Basal) using the workflow summarized in Fig. [80]1.
Focal SCNAs were identified using GISTIC2.0^[81]13, including 29
amplifications and 40 deletions. Next, we identified sets of cis and
trans genes with significant differential expression with respect to
each SCNA, and performed, pathway enrichments of cis and trans gene
sets (Fig. [82]1A).
A list of cis genes whose expression is significantly associated with
each SCNA is summarized in Supplemental Table [83]S1 (Fold change >1.2,
one-directional FDR <0.25). This list comprises an average of 20 genes
(and a median of 8 genes) per SCNA, with 1330 genes in total (269 in
amplifications, 1061 in deletions) out of the original 2003 genes
across all SCNAs identified using GISTIC2.0, a number clearly still too
large for meaningful consideration for functional validation.
Similarly, a list of significantly differentially expressed trans genes
is summarized in Supplemental Table [84]S2 (Fold change > 1.5,
bi-directional FDR < 0.1). This list contains an average of 865 (a
median of 598) significant trans genes per SCNA. Given the large number
of genes included in the trans signatures, their in-silico annotation
by enrichment approaches is best suited for a summarization of their
composition.
Pathway enrichment analyses of the union of cis and trans genesets
yielded interesting and potentially biological meaningful patterns
(Fig. [85]3). In particular, several pathways were enriched across most
SCNAs. For instance, gene sets HALLMARK_ESTROGEN_RESPONSE_LATE,
HALLMARK_ESTROGEN_RESPONSE_EARLY, HALLMARK_G2M_CHECKPOINT were
significant hits in more than 75% of SCNAs. These gene sets indicate
global patterns of downstream effects related to genomic instability
induced by co-occurring SCNAs across tumor samples (Fig. [86]S1), since
co-occurring SCNAs will tend to have overlapping trans signatures,
hence common pathway enrichments. To elucidate breast cancer subtype
specific pathway enrichments, we categorized each SCNA by their
enrichment in each of four major breast cancer types: Luminal A,
Luminal B, Her2, Basal. In addition, for each pathway, we tested if the
enrichment across SCNAs tended to occur in subtype-specific SCNAs
compared to non-subtype-specific SCNAs (Fig. [87]3). Several
significant pathways were found to be occurring more frequently in
Basal-specific SCNAs, including HALLMARK_SPERMATOGENESIS,
HALLMARK_KRAS_SIGNALING_UP, HALLMARK_E2F_TARGETS,
HALLMARK_BILE_ACID_METABOLISM, HALLMARK_FATTY_ACID_METABOLISM,
HALLMARK_UV_RESPONSE_UP, HALLMARK_MYOGENESIS (FDR < 0.05). The
enrichment of KRAS signaling can be explained by published evidence
supporting KRAS activation in basal-type breast cancer cells compared
to luminal cells^[88]27.
Figure 3.
[89]Figure 3
[90]Open in a new tab
Pathway enrichments of cis and trans genes of TCGA breast cancer
somatic copy number alterations. Heatmap representing the enrichment of
each SCNA-associated cis/trans union signatures (columns) with respect
to the genesets (rows) in the MSigDB “Hallmark” compendium for the TCGA
breast cancer dataset. Color-coded bars (pink/purple) on top of the
heatmap indicate the enrichment of each SCNA in a particular breast
cancer subtype as determined by Fisher exact test of the within- vs.
outside-subtype counts of SCNA occurrences. The directionality of each
SCNA is also represented as amplification (red) or deletion (blue).
An important outcome of the cis/trans analysis is the observation that
the number of genes identified as associated to (hence potentially
driven by) genetic alterations is still prohibitively large to be
realistically considered for functional validation. The mediation step
we describe next is aimed at providing a further prioritization
criterion for the identification of putative cancer drivers.
iEDGE identifies known cancer drivers in TCGA breast cancer
After the identification of the cis and trans gene signatures of SCNAs,
iEDGE can be used to predict cis gene drivers of each SCNA
(Fig. [91]1B). For each SCNA in the TCGA breast cancer dataset, we
ranked the significant cis genes using the Sobel test of mediation
(Sobel 1982) to predict the driver gene of the alteration. Cis genes
were ranked by the weighted fraction of significant trans genes they
mediate (see Methods, and Supplemental Table [92]S3). Rank-1 cis genes
– the genes mediating the largest proportion of trans genes within each
alteration, hence hypothesized to be driver genes for each SCNA – are
summarized in Supplemental Table [93]S4.
To verify the functional relevance of Rank-1 driver gene predictions,
we tested the list of Rank-1 cis genes for enrichment in cancer driver
databases compared to non-Rank-1 cis genes (Supplemental Table [94]S5).
Significance of enrichment was determined using a one-sided Fisher
exact test on the contingency table of counts of Rank-1 vs. non-Rank-1
cis genes against the list of known cancer drivers vs. unknown genes.
Predicted Rank-1 driver genes in amplifications were significantly
enriched for known oncogenes in UNIPROT (P-value = 0.0016), COSMIC
(P-value = 0.0012) and the combined test (union of driver databases)
(P-value = 0.0041), and predicted Rank-1 driver genes in deletions were
significantly enriched for tumor suppressors in TUSON
(P-value = 0.0045), UNIPROT (P-value = 0.016), COSMIC
(P-value = 0.002), and the combined test (P-value = 0.013). Among the
65 Rank-1 cis genes (Supplemental Table [95]S4), known cancer drivers
included MCL1, ACTL6A, MYC, CCND1, FOXA1, ERBB2, CCNE1 (oncogenes in
amplifications) and RPL5, ZMYND10, KMT2C, CSMD1, CDKN2B, PTEN, CREBL2,
FANCA and ARHGAP35 (tumor suppressors in deletions) (Supplemental
Table [96]S5).
Additionally, iEDGE-predicted Rank-1 mediating genes in breast cancer
includes putative novel breast cancer gene drivers and therapeutic
targets. Novel genes include RBM17, a splicing factor with enriched
amplification in Triple-Negative Breast Cancer, a subtype shown to be
selectively addicted to alternative splicing^[97]28, and SIRT3, a
regulator of mitochondrial adaptive response to stress mechanisms that
exhibits tumor suppressive functions in cancer^[98]29 and a promising
therapeutic target as experimental findings show that the knockdown of
SIRT3 enhanced migration and metastasis in ovarian cancer cells^[99]30.
iEDGE identifies amplification-driven gene dependencies in TCGA breast cancer
In addition to the prediction of known cancer drivers, we assessed
whether iEDGE was capable of identifying genes with copy-number driven
cancer dependencies, specifically, amplification-driven gene
dependencies (see Methods). To this end, we identified genes with
increased essentiality in an amplified state using DepMap data of
genetic screens (RNAi screens) paired with copy number data (CCLE)
(Supplemental Table [100]S6), and tested for their enrichment among
iEDGE Rank-1 cis genes. In the TCGA breast cancer dataset, we found a
highly significant enrichment of Rank-1 cis genes among genes with
amplification-driven gene dependencies (Fisher test one-sided
P-value = 3.53e-5). These were ERBB2, CCNE1, CCND1, FOXA1, ANKRD17,
MCL1. All of these genes are linked to breast cancer development in the
overexpressed state. ERBB2, CCND1, CCNE1 are well-characterized
oncogenes present among our curated set of cancer driver databases (see
Methods). FOXA1 has been shown to play an important role in promoting
ER + breast cancer^[101]31. ANKRD17 is a cyclin E/Cdk2 substrate which
positively regulates cell cycle progression by promoting G[1]/S
transition^[102]32. MCL1 high expression is linked to poor prognosis in
triple-negative breast cancer and targeting of MCL1 restricts the
growth of triple negative breast cancer xenografts, suggesting its
potential therapeutic value^[103]33.
In summary, using the cis gene mediation step, we identified known
SCNA-associated breast cancer gene drivers and novel genes with
amplification-driven gene dependency that are of potential prognostic
or therapeutic value.
TCGA pan-cancer analysis
Next, the driver prediction procedure was carried out across 19 cancer
types from TCGA (Supplemental Table [104]S7). We tested for the
enrichment of Rank-1 cis genes with known cancer drivers across the 19
cancer types and found significant enrichment (FDR < 0.05) in 15 out of
19 cancer types (Kolmogorov-Smirnov test of p-values against uniform
[0,1]: p-value = 2.5e-13) (Supplemental Table [105]S8).
Additionally, we tested for the enrichment of Rank-1 cis genes with
genes manifesting amplification-driven dependencies and found
significant enrichment (FDR < 0.05) in 8 out of the 19 cancer types
analyzed, including BLCA, BRCA, CESC, ESCA, HNSC, LUAD, OV, UCEC
(Kolmogorov-Smirnov test of p-values against uniform [0,1]:
p-value = 0.00287) (Supplemental Table [106]S9).
To assess the importance of the cis mediating step for the
identification of known cancer drivers, we ordered Rank-1 cis genes
(Supplemental Table [107]S10) by the number of cancer types they occur
in, and tracked which of these genes were validated cancer drivers. The
derived ordered Rank-1 gene list was then compared with the ordered
list of recurrent top differentially expressed (D.E.) cis genes in each
SCNA, irrespective of their cis-mediating rank (Supplemental
Table [108]S11), as well as with the ordered list of all recurrent
differentially expressed cis genes in each SCNA (Supplemental
Table [109]S12). These comparisons confirmed that the recurrent Rank-1
cis genes were more likely to capture known drivers than both the
recurrent top D.E. cis genes and all cis genes (Fig. [110]4A).
Remarkably, among the top 15 Rank-1 cis genes, 14 genes were known
cancer drivers (PTEN, WWOX, CCNE1, CDKN2A, MAP2K4, EGFR, KAT6A, MYC,
KRAS, ERBB2, WHSC1L1, PARK2, CREBBP, RB1) and only 1 was unverified
(ATP9B) (Fig. [111]4B, left) (Fisher’s exact test P-value = 7.76e-08).
In contrast, in the absence of the mediation step, among the top 15 top
D.E. genes in SCNAs, 9 were confirmed drivers (Fig. [112]4B, middle)
(Fisher’s exact test P-value = 0.0047), and among the top 15 cis genes
in SCNAs, only 4 were confirmed drivers (Fig. [113]4B, right) (Fisher’s
exact test P-value = 0.21). These results demonstrate the usefulness of
the mediation test in further restricting the list of SCNA-associated
cis genes to those of functional relevance across multiple cancer
types. The mediation test provides a substantial improvement over
ranking solely based on the cis gene differential expression and
demonstrates the added value of performing integrative analysis that
models downstream biological effects as captured by trans gene
expression.
Figure 4.
[114]Figure 4
[115]Open in a new tab
Sensitivity of cancer driver predictions across multiple cancer types.
(A) Sensitivity of Driver Predictions (expressed as the fraction of
known drivers in the ranked list of genes) vs. Occurrence of Genes in
Number of Cancer Types. (B) Barplots of top 15 genes ranked by
occurrence in number of cancer types. Left: Rank-1 mediating cis genes;
middle: top differentially expressed (D.E.) cis genes, right: all cis
genes differentially expressed in alteration. Known cancer drivers are
in bold and marked with (*).
In addition to enrichment for known cancer drivers, Rank-1 cis genes
that frequently occur across cancer types (Supplemental Table [116]S10)
include putative or novel genes implicated in cancer initiation or
progression. These include: TRIP13, a mitosis regulator that was shown
to promote tumor growth in colorectal cancer^[117]34 and is a predictor
of poor prognosis in prostate cancer^[118]35; ORAOV1, a gene
overexpressed in many solid tumors that is linked to generation of
reactive oxygen species^[119]36; TPX2, an interactor and substrate of
Aurora-A that is a potent oncogene amplified in many cancers and a
promising therapeutic target^[120]37,[121]38; and DUSP22, which has
been shown to behave as a tumor suppressor gene in peripheral T-cell
lymphomas^[122]39 and regulates ERα dependent transcription in breast
cancer cells^[123]40.
Evaluation of reproducibility of Rank-1 cis genes
To evaluate the reproducibility of cis gene ranking based on mediation
testing, we quantified the consistency of Rank-1 cis gene predictions
across bootstrapped resamples of the original TCGA breast cancer
dataset (Fig. [124]5). The majority of Rank-1 cis genes (69.2%) was
consistently found as Rank-1 cis genes in bootstrapped datasets with
greater than 0.75 fraction of inclusion, i.e., found as Rank-1 cis
genes in 75% of the resampled datasets, and the fraction of inclusion
is not biased by SCNA type (amplification or deletion) (Fig. [125]5A).
We further observed that the fraction of inclusion is positively
associated with the Weighted Fraction of Trans Mediation (WFTM), the
score used to rank cis genes within each SCNA based on the extent of
trans genes mediation (see Methods: Mediation testing and prediction of
cis drivers for calculation of WFTM) (Fig. [126]5B). In particular,
while Rank-1 cis genes with lower fraction of inclusion across
bootstrapped resamples tend to have lower WFTM, Rank-1 cis genes with
higher fractions of inclusion show more variability in WFTM but
generally tend to have higher WFTM. The fraction of inclusion is
negatively associated with the entropy of WFTM of cis genes in SCNAs of
interest (Fig. [127]5C). This is an indication that SCNAs with a single
dominant cis gene mediating the majority of trans genes (lower entropy)
tend to yield more reproducible Rank-1 cis genes across bootstrapped
resamples than SCNAs with multiple cis genes with similar trans
mediation (higher entropy).
Figure 5.
[128]Figure 5
[129]Open in a new tab
Reproducibility of Rank-1 cis genes across bootstrapped resamples. (A)
Fraction (i.e., frequency) of inclusion of Rank-1 cis genes in
bootstrapped resamples. (B) Fraction of inclusion vs. Weighted Fraction
of Trans Mediation (WFTM). (C) Fraction of inclusion vs. entropy of
WFTM of cis genes in the SCNA.
Graphical portal of iEDGE results enable targeted queries
In order to enable fast interactive browsing of iEDGE precomputed runs
on massive datasets such as the TCGA pan-cancer set, we developed a web
portal ([130]https://montilab.bu.edu/iEDGE) to allow users to query
iEDGE results selectively by cancer types, genes, and SCNAs^[131]11.
This portal displays graphical and tabular results for each step of the
iEDGE pipeline, including differential expression of cis and trans
genes, mediation analysis for driver prediction, and pathway enrichment
analysis.
An overview of an example walkthrough of a targeted query is
illustrated in Fig. [132]S2. Here, the TCGA breast cancer (BRCA) report
is selected and the gene query is ERBB2 (HER2). The table of
differential expression results is available for the cytoband 17q12 in
which ERBB2 resides in. Additionally, results of the mediation testing
and driver gene prediction is available in a bipartite graph format. In
this case, the graph indicates that ERBB2 is the top mediating cis gene
and the predicted driver of the SCNA.
Discussion
Methods developed for the integrative analysis of (epi-)DNA regulators
and gene expression data often focus only on the genes harbored by the
alteration regions, while not considering the downstream
(trans-)effects, which may limit a method’s ability to detect cancer
driver genes. We present a computational framework for the integrative
analysis of (epi-)DNA and gene expression data for large-scale
datasets, iEDGE, that is able to thoroughly catalogue the cis and trans
effects of epi-(DNA) alterations, and to predict the most likely
cis-driver genes based on the extent of their mediation of downstream
trans-genes.
The first step of the iEDGE pipeline uses differential expression
analysis to determine the cis and trans genes that are associated with
the presence/absence of a particular epi-DNA alteration across samples.
By measuring the alterations’ association with trans genes we capture
meaningful biological mechanisms representing downstream effects that
are generally missed by tools that only consider genes within the
alterations (e.g., GISTIC2.0). Trans genes are of potential high
relevance considering that (epi-)DNA regulators such as SCNAs harbor
many upstream genes in signaling pathways, e.g., transcription factors,
wherein the set of target genes effected can shed light on processes or
pathways associated with disease progression.
The second step of the pipeline, the mediation analysis, ranks the set
of cis genes by the extent of their mediation of trans genes. We showed
that mediation analysis captures important cancer driver genes in our
study of the TCGA breast cancer dataset. We then expanded these results
by performing a pan-cancer analysis across 19 cancer types in the TCGA,
further highlighting our tool’s ability to identify known, as well as
potentially novel drivers.
We conducted extensive in silico validation of predicted cis driver
genes, by first testing for their enrichment with known drivers from
multiple cancer driver databases. We then characterized predicted
drivers by testing for their “essentiality” against genetic screens and
cellular model data included in the DepMap, to explore SCNA-associated
gene dependencies. Both analyses showed that our list of predicted
genes is significantly enriched for cancer genes of functional
relevance, either as cancer drivers, potential cancer therapeutic
targets, or markers of disease progression. Further experimental
studies are needed to validate and characterize predicted drivers.
Furthermore, simulations were performed to validate the performance of
the Sobel test of mediation. Under assumptions of adequate sample size
and strong correlation effects between SCNA and cis acting genes,
conditions which were met in most of the TCGA datasets used in this
study, the Sobel test of mediation achieved high AUC for uncovering
true mediation effects.
Similar approaches have recognized the importance of integrating gene
expression and the coordinated expression of affected gene modules for
identifying drivers. One notable example is CONEXIC^[133]2, which
identified functional gene modules for each candidate regulator in the
form of copy number alteration (CNA), although a side-by-side
comparison of the two methods was not possible because the software is
not available. The conceptual steps of iEDGE are similar to the “Single
Modulator Step” outlined in Akavia et al., in which, first, cis genes
are defined in their “candidate driver gene” selection process, and
trans genes are defined in their “target gene modules” selection step,
and second, a scoring function is used to find the single candidate
driver gene that best associates with the target gene expression
module. The implementation details of the second step for the scoring
of driver genes is different compared to iEDGE. iEDGE considers each
3-layer relationship between SCNA status, cis, and trans gene
expression to calculate a mediation effect and to rank cis genes,
whereas the “Single Modulator Step” of CONEXIC computes the best
candidate driver using a Normal Gamma scoring function to measure each
target gene’s fit with each candidate driver’s gene expression.
Furthermore, while CONEXIC is not available for public use, iEDGE is
available as an open-source R package
([134]https://github.com/montilab/iEDGE)^[135]9 to enable analysis of
custom datasets, and it supports the automatic generation of
multi-level html reports in easily interpretable graphical formats as
shown for each of the TCGA cancer types in the iEDGE web portal
(montilab.bu.edu/iEDGE)^[136]11.
Other model-based methods incorporate models of causal relationships
between multiple levels of genomic data through the scores of
conditional dependencies, measured using partial correlation
coefficients for normal continuous features^[137]41, or conditional
mutual inclusive information for binary or mixed binary and continuous
features^[138]42,[139]43. These approaches are similar to the mediation
testing step of iEDGE, but they are more conservative models that
detect full mediation, e.g., significant hits are instances in which
the conditional independence given the mediator is zero, a condition
that is rarely satisfied by genomic data, whereas mediation tests are
also able to capture partial mediation in which the association between
the independent and response variable is significantly reduced in size
when the mediator is introduced but may still be different from zero.
One assumption used in the iEDGE analysis presented in this study is
that the number of trans genes that a cis gene mediates is a proxy
measure of a gene’s importance (i.e., of its likelihood of being a
cancer driver). This is a simple and intuitive heuristic to estimate
the extent of transcriptional impact from each (epi-)DNA regulator,
albeit it may not be an appropriate assumption for specific use cases.
For more targeted analysis, one may be only interested in predicting
the cis gene that mediates gene targets in a particular pathway. Our
tool is customizable in that the user can specify the set of trans
genes to consider for mediation based on their membership in a pathway
of interest or an experimentally derived gene signature.
We use SCNA as an example of epi-DNA events to demonstrate a convenient
use case for this tool as SCNAs can be used to easily define genomic
boundaries for distinguishing cis and trans acting genes. However, this
tool can also be applicable to the integrative analysis of other
genomic/epi-genomic data types such as DNA methylation, DNA mutations,
and microRNA regulatory networks. Similarly, gene expression dataset
can use a variety of quantitative gene-centric measures such as
RNA-seq, microarray, or proteomic assays.
Conclusion
We presented a computational toolbox for the integration of epi-DNA
regulators and gene expression profiles. This approach demonstrated
success in unbiased systemic discovery of cancer driver genes and genes
with context-specific (amplification-driven) gene dependencies
(potential therapeutic agents) across multiple cancer types in the
TCGA. iEDGE is a flexible framework that may also be applied towards
other disease contexts and data platforms including DNA methylation,
DNA mutation, microRNA networks integrated with transcriptomic or
proteomic datasets.
Supplementary information
[140]Supplementary Documentation^ (3.8MB, pdf)
[141]Supplementary Tables^ (17.3MB, xlsx)
Acknowledgements