Abstract
The identification of cancer-promoting genetic alterations is
challenging particularly in highly unstable and heterogeneous cancers,
such as esophageal adenocarcinoma (EAC). Here we describe a machine
learning algorithm to identify cancer genes in individual patients
considering all types of damaging alterations simultaneously. Analysing
261 EACs from the OCCAMS Consortium, we discover helper genes that,
alongside well-known drivers, promote cancer. We confirm the robustness
of our approach in 107 additional EACs. Unlike recurrent alterations of
known drivers, these cancer helper genes are rare or patient-specific.
However, they converge towards perturbations of well-known cancer
processes. Recurrence of the same process perturbations, rather than
individual genes, divides EACs into six clusters differing in their
molecular and clinical features. Experimentally mimicking the
alterations of predicted helper genes in cancer and pre-cancer cells
validates their contribution to disease progression, while reverting
their alterations reveals EAC acquired dependencies that can be
exploited in therapy.
Subject terms: Cancer, Cancer genomics, Oesophageal cancer
__________________________________________________________________
Identifying driver genes in unstable, heterogenous tumour types can be
challenging. Here, Mourikis, Benedetti, Foxall and colleagues present a
machine learning algorithm to tackle this problem in esophageal
adenocarcinoma.
Introduction
Genome instability enables the onset of several hallmarks of cancer
with some acquired alterations conferring selective advantages to the
mutated cells and driving their outgrowth and eventual dominance^[48]1.
The identification of driver genes (genes acquiring driver alterations)
is therefore critical to fully understand the molecular determinants of
cancer and to develop precision oncology. Since driver genes are under
positive selection during cancer progression, a reasonable assumption
is that their mutation is observed more frequently than expected. In
recent years, large-scale cancer genomic studies have provided the
required power to detect driver events recurring across samples with
good statistical confidence^[49]2,[50]3. However, the full
characterisation of driver events is challenging when a cancer’s
genomic landscape is highly variable and recurrent events are rare.
One such cancer is esophageal adenocarcinoma (EAC), whose incidence in
recent years has risen substantially in the western world^[51]4. EAC
exhibits high mutational and chromosomal instability leading to
widespread genetic heterogeneity. In over 400 EACs sequenced so far,
mutations in TP53, CDKN2A, SMARCA4, ARID1A, SMAD4, ERBB2, MYD88,
PIK3CA, KAT6A, ARID2, as well as amplifications of VEGFA, ERBB2, EGFR,
GATA4/6, CCNE1 are the most recurrent driver events^[52]5,[53]6.
However, a significant fraction of patients remain without known
genetic determinants and the number of identified drivers per sample is
too low to fully explain the disease. Consequently, the molecular
mechanisms that drive EAC have been difficult to characterise in full.
This impacts EAC diagnosis and treatment, with recent phase III
clinical trials of targeted agents failing to show benefits or reaching
inconclusive results^[54]7,[55]8.
Here we hypothesise that, alongside the critical role of recurrent and
well-known drivers, complementary somatic alterations of other genes
help cancer progression in individual patients. Therefore, the
comprehensive characterisation of the full compendium of cancer drivers
requires that both recurrent and rare events are considered. While
recurrent drivers can be identified based on the frequency of their
alterations, rare genes altered in few or even single patients are
difficult to identify. To this aim, we develop sysSVM, an algorithm
based on supervised machine learning that predicts cancer genes in
individual patients. The rationale of sysSVM is that somatic
alterations sustaining cancer affect genes with specific
properties^[56]9. It therefore uses these properties, rather than
recurrence, to identify cancer genes.
We apply sysSVM to 261 EACs from the UK OCCAMS Consortium, part of the
International Cancer Genome Consortium (ICGC). We first train the
classifier using 34 features derived from the biological properties
specific to known cancer genes and then prioritise 952 genes that,
together with the known drivers, help promote cancer development across
the whole EAC cohort. The large majority of these newly predicted
‘helper’ genes are rare or patient-specific but they converge towards
the perturbation of cancer-related processes including intracellular
signalling, cell cycle regulation, proteasome activity and Toll-like
receptor signalling. We use the recurrence of process perturbation,
rather than genes, to stratify the 261 EACs into six clusters that show
distinct molecular and clinical features and suggest differential
response to targeted treatment.
Results
The landscape of recurrent and rare EAC genes
sysSVM applies machine learning to predict altered genes contributing
to cancer in individual patients based on the similarity of their
molecular and systems-level properties to those of known cancer genes
(Supplementary Note [57]1). Molecular properties include somatic
alterations with a predicted damaging effect on the protein function
(gene gains and losses, translocations, inversions, insertions,
truncating and non-truncating damaging alterations and gain of function
mutations) as well as the overall mutation burden and the gene copy
number (Supplementary Table [58]1). Systems-level properties are
genomic, epigenomic, evolutionary, network and gene expression features
that distinguish cancer genes from other genes. They include gene
length and protein domain organisation^[59]9,[60]10, gene
duplicability^[61]11,[62]12, chromatin state^[63]13, connections and
position in the protein-protein interaction network^[64]11, number of
associated regulatory miRNAs^[65]12, gene evolutionary origin^[66]12
and breadth of gene expression in human tissues^[67]9,[68]10
(Supplementary Table [69]1).
sysSVM is composed of three steps (Fig. [70]1a, Supplementary
Note [71]1). In step 1, 34 features describing the gene molecular and
systems-level properties are mapped to all genes in each patient. In
step 2, known cancer genes altered in the patient cohort are used to
run a set of three-fold cross validations and identify the best models
in four kernels (linear, sigmoid, radial, polynomial). In step 3, these
best models are used for training and prediction. All altered genes
except the known cancer genes used for training are first scored in
each patient individually by combining the predictions of the four
kernels and then ranked according to the resulting score. Since the
hypothesis is that the strength of the contribution of a gene to cancer
depends on how similar its properties are to those of known cancer
genes, the top scoring genes in each patient are the most likely
contributors to cancer progression. The overall results are combined to
obtain the final list of predicted cancer genes.
Fig. 1.
[72]Fig. 1
[73]Open in a new tab
Cancer helper genes in 261 EACs. a Schematic workflow of the sysSVM
algorithm. b Application of sysSVM to 261 EACs. Genes with somatic
damaging alterations (n = 116,989) were extracted from 261 EACs and
divided into training (known cancer genes, blue) and prediction (rest
of altered genes, purple) sets. sysSVM was trained on the properties of
known drivers and the best models were used for prediction. All altered
genes were scored in each patient individually and the top 10 hits were
considered as the cancer helper genes in that patient, for a total of
2608 helper alterations, corresponding to 952 unique hits (red). c
t-distributed Stochastic Neighbour Embedding (t-SNE) plot of 116,989
altered genes in 261 EACs. Starting from the 34 properties used in
sysSVM, a 2-D map of the high-dimensional data was built using Rtsne
package ([74]https://github.com/jkrijthe/Rtsne) in R. Curves are
coloured according to the density of 476 known cancer genes altered
4091 times (blue) used as a training set and the rest of altered genes
are coloured according to their sysSVM score. d Distribution of
damaging alterations in 952 cancer helpers. Overall, these genes
acquire 2608 damaging alterations. e Expression of helper genes in EACs
where they are amplified compared to EACs where they are copy number
neutral. FPKM values from RNA-Seq were available from 92 EACs. Out of
the 952 helper genes, 389 had at least one amplification across these
samples, for a total of 751 amplification events. Significance was
assessed using the Wilcoxon rank-sum test. Lower and upper hinges and
middle line of boxplots correspond to 25th, 75th, and 50th percentiles.
Upper and lower whiskers extend less than 1.5 times the interquartile
range. f Recurrence of cancer helpers across 261 EACs. Only samples
acquiring alterations with a damaging effect are considered
We applied sysSVM to 261 EACs from OCCAMS, which are part of the ICGC
dataset (Fig. [75]1b, Supplementary Data [76]1). In step 1, we
extracted 17,078 genes with predicted damaging alterations (median of
382 damaged genes per patient) and mapped their 34 features. We
verified that there is no pairwise correlation between these features
(Supplementary Fig. [77]1). Moreover, 476 known cancer genes^[78]14
altered in the 261 EACs (Supplementary Data [79]2) tend to cluster in
distinct regions of the feature space (Supplementary Fig. [80]2). This
confirms that these features distinguish cancer genes from other genes.
In step 2, we ran 10,000 iterations of a three-fold cross validation
using the 476 known cancer genes and combined the results to obtain 500
best models for each kernel (Supplementary Table [81]2, Methods). In
step 3, we trained the four classifiers with these best models and used
them to score and rank the remaining 16,602 altered genes in each
patient. Since the gene score reflects a gradient between driver and
passenger activity, we considered the top 10 scoring genes in each EAC
as the main cancer contributors for that patient. We verified that the
main findings of our study hold true if we apply higher or lower cut
offs (see below). Overall, this produced 500 lists of top 10 scoring
genes in each sample (Supplementary Table [82]2, Methods). We
considered the list of 952 genes that occurred most frequently as the
final set of predicted cancer genes (Supplementary Data [83]3). Since
our hypothesis is that these genes help the known drivers to promote
cancer, we define them as helper genes.
We investigated the importance of each feature in the four classifiers
by ranking the 34 features based on their weight^[84]15 and observed
interesting properties of the four models. Firstly, categorical
features were the top contributors for linear kernels (linear and
polynomial), while both categorical and continuous features contributed
to non-linear kernels (radial and sigmoid, Supplementary Fig. [85]3).
This likely reflects intrinsic differences across kernels, and supports
their integration to capture different regions of the feature space and
increase the chances of identifying rare helpers. Secondly, no feature
had zero weight in all four kernels, indicating that all features
contributed to the final gene classification. Thirdly, despite the high
prevalence of copy number gains (Supplementary Table [86]1), gene
amplification was not the highest-ranking feature in any kernel
(Supplementary Fig. [87]3). We next checked whether nodes with similar
topological properties in the protein-protein interaction network had
similar functional properties, since this may bias the final
classification. A pathway enrichment analysis of the 2608 central hubs
of the network resulted in 528 enriched pathways (FDR = 0.01),
representing 46% of all Reactome pathways. This indicates a large
diversity in the functions of central hubs and excludes that nodes with
similar topological properties necessarily have similar functional
properties. Furthermore, a pathway enrichment analysis only considering
252 central hubs encoded by known cancer genes resulted in 158 enriched
pathways (FDR = 0.01). Of these, 135 are in common with all central
hubs, indicating that cancer-related functions constitute a small
fraction of pathways enriched in central hubs.
Helper genes localise to the same high-density regions of known cancer
genes (Fig. [88]1c), with lower scoring genes being further away
(Supplementary Fig. [89]3). The properties of top scoring genes
therefore resemble those of known cancer genes. Consistent with the
prevalence of gene amplification in EAC (Supplementary Table [90]1),
the vast majority of the 952 helpers undergo copy number gain
(Fig. [91]1d), resulting in their increased expression (Fig. [92]1e).
Although prevalent gene amplification does not bias the best models
(Supplementary Fig. [93]3), we investigated its impact on the final
predictions. First, only a minority of all amplified genes in the 261
EACs are predicted as helper genes (Supplementary Fig. [94]4). Second,
the 952 final helpers are amplified 6296 times in the 261 EACs, but
they are only predicted as helpers 2062 times (Supplementary
Fig. [95]4). Therefore, gene amplification alone is not sufficient for
a gene to be predicted as a helper. Despite the majority of helpers
being rare or patient-specific (Fig. [96]1f), some are altered in over
10% of EACs (Supplementary Data [97]3) and are usually associated with
frequently occurring amplifications^[98]16 (Supplementary Fig. [99]4).
We next assessed the robustness of our predictions. First, we evaluated
the performance of sysSVM using two independent cohorts, 86 EACs from
The Cancer Genome Atlas (TCGA) and 21 EACs from another study^[100]17
(Supplementary Data [101]1). We scored all altered genes, including
known cancer genes^[102]14, in each of the 107 EACs independently,
using the four best models trained on the ICGC cohort. In both
datasets, known cancer genes have significantly higher scores than the
rest of the altered genes (Supplementary Fig. [103]3), indicating that
sysSVM is able to recognise them as major cancer contributors. Second,
we checked whether any of the 952 helpers were previously identified as
cancer genes. We found that 41 helpers (4%) have recently been added to
the Cancer Gene Census^[104]14 and 171 helpers (18%) have been
predicted as candidate cancer genes in various cancers, including
EAC^[105]5 (Supplementary Data [106]3). Third, we searched for possible
false positive predictions using two lists. The first was composed of
49 genes predicted as false positives of recurrence-based
methods^[107]5 and contained only three helpers (PCLO, CNTNAP2 and
NRXN3; Supplementary Data [108]3). Interestingly, PCLO has recently
been shown to exert an oncogenic role in esophageal cancer by
interfering with EGFR signalling^[109]18. The second list was a
manually curated set of 488 putative false positives^[110]3 and
contained 44 helpers (4.6% of the total). This is less than the
fraction of known cancer genes^[111]14 present in the same list of
false positives (46/719, 6.4%). Altogether these analyses indicate that
sysSVM robustly predicts cancer genes in multiple patient cohorts, with
a minimal false positive rate.
Helper genes perturb cancer-related biological processes
To gather a comprehensive characterisation of the molecular
determinants of EAC, we analysed the biological processes perturbed by
helpers compared to drivers. We manually reviewed the 476 known cancer
genes^[112]14 with damaging alterations in the OCCAMS cohort and
retained 202 of them based on the concordance between the type of
acquired modification and literature evidence of their role as
oncogenes or tumour suppressors (Supplementary Data [113]2). The median
number of drivers per EAC is in accordance with recent
estimates^[114]19, with a prevalence of gene amplification
(Supplementary Fig. [115]4). We then performed two independent gene set
enrichment analyses, with either 202 known drivers or 952 helpers, to
dissect their relative functional contribution to EAC. This led to 212
and 189 enriched pathways out of the 1877 tested, respectively
(FDR < 0.01, Supplementary Data [116]4, Supplementary Fig. [117]4). The
analysis of known drivers resulted in a higher number of enriched
pathways than helpers, despite their lower number. This reflects the
higher number of pathways that drivers map to (median of four pathways
for drivers and two pathways for helpers).
Seventy-three pathways (over 34%) enriched in known drivers are
perturbed in over 50% of EACs (Supplementary Data [118]4, Supplementary
Fig. [119]5). These universal cancer pathways are involved in
well-known cancer-related processes, including intracellular
signalling, cell cycle control, apoptosis and DNA repair, and are
associated with the most recurrently altered known drivers (TP53,
CDKN2A, MYC, ERBB2, SMAD4, CDK6, KRAS; Supplementary Data [120]2).
Interestingly, 50 of the 73 (70%) are also enriched in helpers and 86
patients with altered helpers in a universal cancer pathway have no
known drivers in that pathway (Fig. [121]2a, Supplementary
Data [122]4). This indicates that helpers often contribute to the
perturbation of key cancer pathways and that their alteration may be
sufficient for cancer development in the absence of known drivers.
Fig. 2.
[123]Fig. 2
[124]Open in a new tab
Perturbed processes in 261 EACs. a Scatterplot of 51 universal pathways
enriched in known drivers and helpers. For each pathway, the number of
EACs with altered drivers and the number of EACs with altered drivers
and helpers is shown. The size of dots is proportional to the
additional EACs with perturbations in these pathways because of altered
helpers only. b Schematic of the procedure to cluster EACs according to
pathways enriched in known drivers or helpers. Enriched pathways are
mapped to individual EACs and the Jaccard index is calculated as the
proportion of shared pathways over the total pathways in each pair of
samples (i, j). Hierarchical clustering is then performed. c Clustering
of 261 EACs according to pathways enriched in known drivers and
helpers. Five clusters were identified using known drivers (1D-5D) and
six using helpers (1H-6H). Cluster-matching coloured lines show where
EACs clustered by pathways enriched in helpers map in the driver
clusters. d Mutational status of selected known drivers across 261
EACs. Drivers enriched in clusters of helpers are highlighted.
Significance was assessed using the Fisher’s exact test, after
correcting for False Discovery Rate (FDR)
Next, we clustered EACs according to the proportion of shared perturbed
pathways (Methods, Fig. [125]2b). When using pathways enriched in known
drivers, we identified five well-supported clusters (1D-5D,
Fig. [126]2c, median silhouette score = 0.5, Supplementary
Fig. [127]6). These clusters are clearly driven by the mutational
status of the most recurrent drivers. For example, TP53 is altered in
clusters 1D-3D, EGFR, ERBB2 and MYC are altered in cluster 1D and MYC
and KRAS are altered in cluster 2D (Fig. [128]2d, Supplementary
Data [129]1). Samples in clusters 4D and 5D show an overall lower
mutational burden (p = 0.03, Wilcoxon rank sum test), fewer known
drivers and consequently a lower number of enriched pathways
(p = 7 × 10^–6, Wilcoxon rank sum test).
When clustering EACs according to the pathways enriched in helpers, we
identified six well-supported clusters (1H-6H, Fig. [130]2c, median
silhouette score = 0.3, Supplementary Fig. [131]6). Here, samples are
brought together not by the recurrent alterations of the same helpers,
but by different helpers mapping to the same or related pathways
(Supplementary Data [132]1). For example, both clusters 1H and 3H show
diffuse perturbations in intracellular signalling (Fig. [133]2c), often
involving universal cancer pathways (Supplementary Data [134]4,
Supplementary Fig. [135]7). In over 43% of EACs in both clusters,
perturbations in universal cancer pathways occur in patients with no
drivers. Other pathways perturbed in cluster 1H, but not in 3H, involve
cell cycle regulation, Toll-like receptor signalling and proteasome
activity (Supplementary Data [136]1, Supplementary Fig. [137]7). EACs
in cluster 1H are also significantly associated with several known
drivers including RECQL4, RARA, MYC, SMARCE1 and ERBB2 (Fig. [138]2d),
which are often but not always co-altered (Supplementary Fig. [139]4).
They have a prevalence of mutational signature S3 and are enriched in
early (stage 2) tumours (Fig. [140]3a). Patients in cluster 3H are
instead enriched in tobacco smokers (Fig. [141]3a).
Fig. 3.
[142]Fig. 3
[143]Open in a new tab
Features of EAC clusters driven by pathways enriched in helpers. a For
each helper cluster (1H-6H) indicated are the molecular features
(mutational signatures, number of genes with damaging mutations,
undergoing amplification or deletion), the distribution of stage 2
tumours and the tobacco smoking habits of the patients that show
significant associations with one of the six clusters of helpers.
Enrichment in number of altered genes, tumour staging, smoking habits
and cancer predisposition was assessed using Fisher’s exact test.
Distributions of mutational signatures were compared using Wilcoxon
rank-sum test. FDR = false discovery rate after correction for multiple
testing. b Kaplan–Meier survival curves of EACs in cluster 4H (n = 37)
and the rest of EACs (n = 224). Analysis was performed using survival
([144]https://github.com/therneau/survival) and survminer
([145]https://github.com/kassambara/survminer) R packages with default
parameters. Significance was measured using the log-rank test
The processes perturbed in clusters 2H and 4H are also functionally
related, in this case to cell cycle regulation (Fig. [146]2c). All EACs
in cluster 2H have helpers involved in the regulation of the G1/S
transition (Supplementary Fig. [147]7), including members of the E2F
family of transcription factors and associated co-activators,
competitors and downstream targets (Supplementary Data [148]1). Cluster
4H instead harbours perturbations in DNA replication, with alterations
in the MCM complex, a downstream target of E2F^[149]20. Dysregulation
of E2F transcription factors or the MCM complex can induce genomic
instability through either aberrant cell-cycle control or replicative
stress^[150]21. Indeed, EACs in clusters 2H accumulate significantly
more damaged and amplified genes, while those in 4H show significantly
more deletions and are enriched in mutational signature 2
(Fig. [151]3a). Cluster 2H also shows significant alterations of the
known drivers GNAS, SS18L1, and FHIT (Fig. [152]2d). FHIT is linked to
increased genomic instability^[153]22 and regulates the expression of
cell cycle-related genes^[154]23, therefore potentially affecting the
G1/S transition pathways of this cluster. Cluster 4H shows frequent
alterations in the known drivers TRAPP and CDK6 (Fig. [155]2d). The
latter functions in various cell cycle-related pathways, including the
mitotic G1/S phase pathway altered in 100% of cluster 4H (Supplementary
Data [156]1, Supplementary Fig. [157]7). Interestingly, elevated
expression of the MCM complex has been associated with tumour
aggressiveness and poor outcome^[158]24. Their perturbation could
therefore explain the significantly lower survival observed for
patients in cluster 4H (Fig. [159]3b). Finally, cluster 5H shows
perturbations in the Toll-like receptor (TLR) signalling cascade
(Supplementary Fig. [160]7) that has recently been linked to
EAC^[161]25.
Overall, clusters 1H to 5H account for 166 EACs (64% of the total
cohort). The remaining 95 EACs in cluster 6H share fewer perturbed
pathways, although 55 of them (58%) have alterations in Rho GTPase
activity (Supplementary Fig. [162]7) with frequent modifications of Rho
GTPase effectors including ROCK1, PTK2, PAK1, LIMK1 and NDE1
(Supplementary Data [163]1). EACs in the six clusters obtained using
helpers are broadly dispersed in the clustering of known drivers
(Fig. [164]2c) indicating that helpers bring together patients with
similar perturbed processes that cannot be appreciated when focussing
only on recurrent drivers.
To test whether the germline genetic makeup of EAC patients was
associated with the somatic perturbation of specific processes, we
identified patients with potentially damaging germline variants in 152
known cancer predisposition genes^[165]26. Overall, we found 82
patients with damaging variants in 54 predisposition genes, with no
significant enrichment compared to European controls^[166]27
(Supplementary Fig. [167]8). This is expected since the heritable
component of EAC is spread over a large number of low-impact
loci^[168]28. We then tested for associations between cancer
predisposition genes and clusters of helpers or drivers. The only
significant result was a depletion of predisposition variants in
cluster 4H, which is characterised by diffuse perturbations in DNA
replication (Fig. [169]3a). Interestingly, the 54 cancer predisposition
genes damaged in EAC patients are enriched in DNA repair pathways
(FDR = 2.3 × 10^–10, Fisher’s exact test). It is tempting to speculate
that germline damages affecting DNA repair pathways would render
additional somatic perturbations in the same pathways lethal and
therefore be counter-selected.
Finally, we checked whether patient clustering is affected by the
number of helper genes considered in each patient. We performed the
same analysis considering the top five or top 15 scoring genes (528 and
1297 unique genes, respectively). We found that the vast majority (99
and 77%) of the pathways enriched in these datasets are also enriched
when considering the top 10 helpers (Supplementary Fig. [170]8),
indicating that the recurrently perturbed processes are highly
overlapping. We then clustered EACs according to the proportion of
shared perturbed pathways and verified that the six clusters obtained
using pathways enriched in top 10 genes recapitulated well the clusters
obtained using top five or 15 genes (Supplementary Fig. [171]8).
Therefore, the clustering is robust regardless of the applied ranking
cut-off.
Helper alterations lead to cancer phenotypes and dependence
To test the contribution of EAC helper genes to cancer, we used two
experimental approaches. In the first approach, we assessed the
consequences of altering (1) frequently or rarely altered helpers; (2)
loss of function alterations or gain of function/overexpression; and
(3) processes that define specific helper clusters or that are altered
across all helper clusters. We used diploid EAC FLO-1 cells that have
no alterations in any of the helpers selected for validation^[172]14 to
allow a clear evaluation of the effect of their induced alteration. We
measured cell proliferation as a main hallmark of cancer^[173]3 and
also performed gene-specific assays. In the second approach, we
evaluated the dependence of EAC on helper perturbations by assessing
the effect of reverting their alterations on cell growth. For this, we
used EAC cell lines with alterations similar to those observed in
patients.
First, we modified the most commonly altered helpers in clusters 2H and
4H, E2F1 (23 out of 24 samples in cluster 2H) and MCM7 (18 out of 37
samples in cluster 4H, Supplementary Data [174]1). Both E2F1 and MCM7
are amplified in EACs (Supplementary Data [175]3) leading to
significant gene overexpression (median two-fold increase,
p = 6 × 10^–3 and p = 8 × 10^–3, respectively, Wilcoxon rank-sum test;
Fig. [176]4a). We therefore overexpressed E2F1 and MCM7 in FLO-1 cells
to levels comparable to those observed in patients (Fig. [177]4b). In
both cases, this resulted in significantly increased proliferation of
overexpressing cells compared to control cells (p = 2 × 10^–4 and
p = 9 × 10^–4, respectively, two-tailed t-test; Fig. [178]4c). Since
E2F1 promotes cell cycle progression, we assessed DNA replication rate
by EdU incorporation. We observed increased EdU intensity throughout S
phase in E2F1 overexpressing cells compared to control cells
(p < 10^–4, Mann Whiney U test; Fig. [179]4d). This suggests that E2F1
may help cancer growth by promoting S phase entry. To assess the
consequence of MCM7 overexpression, we measured the loading of the MCM
complex onto chromatin. MCM7 overexpressing cells display a lower MCM
fluorescence intensity overall compared to control cells when staining
the chromatin-bound fraction for either MCM7 or MCM3 (p < 10^–4,
Mann–Whitney U test; Fig. [180]4e, f). This suggests that less MCM
complex is loaded onto chromatin by the start of S phase. Therefore,
MCM7 overexpression leads to both increased proliferation and
perturbation of MCM complex activity. Finally, we reduced MCM7
expression levels in MFD-1 cells derived from one of our EAC
patients^[181]29. MFD-1 cells have four-fold higher MCM7 basal
expression compared to FLO-1 cells (Fig. [182]4g). We therefore used a
doxycycline-inducible shRNA lentiviral vector (Supplementary
Table [183]3) to reduce MCM7 expression in MFD-1 cells to the level of
FLO-1 cells (Fig. [184]4h). This led to a significant decrease in
proliferation (p = 2 × 10^–5, two-tailed t-test; Fig. [185]4i),
indicating that MFD-1 cells rely on MCM7 overexpression for their
growth.
Fig. 4.
[186]Fig. 4
[187]Open in a new tab
Cancer helper role of E2F1 and MCM7. a E2F1 and MCM7 expression in EACs
with amplification (n = 11 samples each) compared to copy number
neutral EACs (n = 81 samples each). Significance was assessed by
Wilcoxon rank-sum test. b E2F1 and MCM7 mRNA expression in FLO-1 cells.
Lower and upper hinges and middle line of boxplots correspond to 25th,
75th, and 50th percentiles. Upper and lower whiskers extend less than
1.5 times the interquartile range. c Proliferation of FLO-1 cells
overexpressing E2F1 or MCM7 compared to control. d. EdU
(5-ethynyl-2′-deoxyuridine) incorporation by flow cytometry in E2F1
overexpressing cells compared to control. Cells were separated into G1,
S and G2 phases. S phase cells were subdivided into 4 gates (S1-S4,
Supplementary Fig. [188]9). Differences in EdU geometric mean
fluorescence intensity were assessed by Mann-Whitney U test. MCM
complex loading onto chromatin in MCM7 overexpressing or control cells
through MCM7 e or MCM3 f staining. Cells were pulsed with EdU and
chromatin was fractioned before staining for MCM7 or MCM3 to detect
chromatin-bound MCM complex. Cells were separated into cell cycle
phases using EdU and DAPI intensity. MCM7 or MCM3 fluorescence
intensity during S phase illustrates MCM complex unloading from
chromatin. Differences in geometric mean fluorescence intensity of MCM
staining were assessed using Mann-Whitney U test. For (d), (e), (f) one
representative of n = 3 biological replicates is shown. Corresponding
pseudocolour plots are in Supplementary Fig. [189]9. g MCM7 mRNA
expression in MFD-1 and FLO-1 cells. h qRT-PCR MCM7 expression in MFD-1
cells after transduction with a MCM7-shRNA inducible lentiviral vector.
Expression was assessed with or without 96 h doxycycline treatment. i
Proliferation curve of MFD-1 cells with or without doxycycline-induced
MCM7 knockdown. For all qRT-PCR experiments, expression was relativised
to β-2-microglobulin and normalised to FLO-1 cells. N = 2 biological
replicates were performed, each in technical triplicate. For all
proliferation assays, n = 2 biological replicates were performed, each
with four technical replicates. Proliferation was assessed every 24 h
and normalised to time zero. Mean values at 72 h were compared by
two-tailed Student’s t-test. Error bars show standard deviation. Source
data are available in the Source Data file
Next, we evaluated the role of rarely altered helpers, such as NCOR2
that is altered in eight EACs across five of the six clusters
(Supplementary Data [190]3). NCOR2 contributes to the nuclear receptor
corepressor complex that favours global chromatin deacetylation and
transcriptional repression^[191]5,[192]30 (Fig. [193]5a). Consistently
with the suggested tumour suppressor role of NCOR2 in lymphoma and
prostate cancer^[194]31, the most frequent NCOR2 alterations in EAC
lead to a loss of function. To reproduce these alterations, we edited
NCOR2 in FLO-1 cells using a vector-free CRISPR system^[195]32
(Methods, Supplementary Table [196]3) and the editing was quantified
using MiSeq (Fig. [197]5b). We observed a 1.3-fold increase in
proliferation of edited cells compared to the control cells
(p = 3 × 10^–3, two-tailed t-test test; Fig. [198]5c).
Fig. 5.
[199]Fig. 5
[200]Open in a new tab
Cancer helper role of selected genes. a Function of NCOR2 as part of
the nuclear receptor co-repressor complex, whose activity results in
chromatin deacetylation and transcriptional repression. b Editing of
the NCOR2 gene using three pooled crRNAs where cells are transiently
co-transfected with Cas9 protein, crRNAs and tracrRNA^[201]32. The
editing efficiency was measured using MiSeq. For NCOR2 two regions were
sequenced, one containing the region targeted by crRNA1 and 2 and the
other the region targeted bycrRNA3. For ABI2 only one amplicon was
sequenced, containing the region targeted by all three crRNAs. The
range of edited alleles and edited cells was derived considering the
two opposite scenarios where all three crRNAs edit the same
alleles/cells or different alleles/cells, respectively. c Proliferation
curve of NCOR2 or non-targeting control (NTC) edited FLO-1 cells. d
Manual curation of the helpers contributing to the Rho-GTPase effectors
pathway. Heatmap indicates the number of samples with alterations in
each gene. ABI2 (blue) and PAK1 (red) were selected for experimental
validation. e Induced alterations in ABI2 and PAK1 genes. Editing of
ABI2 was performed and assessed as described for NCOR2. PAK1 mRNA
expression in FLO-1 cells was assessed by qRT-PCR, relativised to
β-2-microglobulin and normalised to control cells. Experiments were
done in triplicate in n = 2 biological replicates. f Proliferation
curves of FLO-1 cells after ABI2 editing or PAK1 overexpression. For
all proliferation assays, n = 3 biological replicates were performed,
each with four technical replicates. Proliferation was assessed every
24 h and each time point was normalised to time zero. Mean values at
72 h were compared by two-tailed Student’s t-test. All error bars show
standard deviation. Source data are available in the Source Data file
Then, we tested the effect of altering members of the Rho GTPase
effector pathway, pervasively perturbed in all six clusters, often
through patient-specific alterations (Fig. [202]5d, Supplementary
Data [203]1). We modified ABI2 and PAK1, which undergo damaging
alterations and amplification in one and nine EACs, respectively
(Supplementary Data [204]3). We therefore edited ABI2 and overexpressed
PAK1 as described above (Supplementary Table [205]3, Fig. [206]5e) and
observed significantly increased proliferation compared to control
cells (ABI2: p = 4 × 10^–4, PAK1: p = 1 × 10^–3 two-tailed t-test;
Fig. [207]5f).
Finally, we focussed on PSMD3 encoding a subunit of the regulatory 19S
proteasome complex. PSMD3 is amplified and overexpressed in three EACs
of cluster 1H, which overall contains 14 samples with alterations in
six proteasome subunits (Fig. [208]6a and Supplementary Data [209]3).
Three EAC cell lines (MFD-1, OE19 and OE33) show higher basal
expression of PSMD3 compared to FLO-1 (2-, 3- and 4-fold increase
respectively, Fig. [210]6b). Using a doxycycline-inducible lentiviral
shRNA vector (Supplementary Table [211]3), we reduced PSMD3 expression
in MFD-1, OE19 and OE33 cells to levels equivalent to those of FLO-1
(Fig. [212]6c). In all three cell lines we observed a significant
reduction in cell proliferation following the reduction of PSMD3
expression (MFD-1: p = 4 × 10^–8; OE19: p = 2 × 10^–8; OE33:
p = 6 × 10^–3, two-tailed t-test; Fig. [213]6d). The effect was
particularly strong in OE19, where cell growth was arrested completely.
MFD-1 and OE33 showed 1.3- and 1.2-fold reductions in cell growth
(Fig. [214]6d). This suggests that the extent of EAC reliance upon
helper alterations is at least partially context dependent.
Fig. 6.
[215]Fig. 6
[216]Open in a new tab
EAC cell dependence on PSMD3 alteration. a Heatmap of proteasome
subunits predicted as helpers in 261 EACs. b PSMD3 basal mRNA
expression levels in FLO-1, MFD-1, OE19 and OE33 cells. Expression was
relativised to β-2-microglobulin and normalised to FLO-1 cells. c PSMD3
expression levels in MFD-1, OE19 and OE33 after transduction with a
lentiviral vector carrying an inducible shRNA against PSMD3. Expression
was assessed in the absence of doxycycline and after 96 h of
doxycycline treatment, relativised to β-2-microglobulin and normalised
to FLO-1 cells. d Proliferation curves of MFD-1, OE19 and OE33 cells
with or without doxycycline treatment to reduce PSMD3 expression to
levels comparable to those of FLO-1 cells. For all proliferation
assays, n = 2 biological replicates were performed, each with four
technical replicates. Proliferation was assessed every 24 h and each
time point was normalised to time zero. Mean values at 96 h were
compared by two-tailed Student’s t-test. All error bars show standard
deviation. Source data are available in the Source Data file
Taken together, our experimental data indicate that, independently of
the alteration frequency, the modification of helpers positively
affects EAC cell growth. The fold changes in proliferation rate
observed upon perturbation of helpers are in the same range as those
observed following alteration of known strong drivers including TP53 or
PIK3CA^[217]33,[218]34. Moreover, we provide evidence that EAC cells
become addicted to helper alterations, suggesting that targeting
helpers, or the pathways in which they act, could reduce EAC
progression.
Helper alterations promote growth in Barrett’s esophagus
To evaluate the role of helper perturbations in the early stages of
EAC, we quantified their alterations in 82 samples of Barrett’s
esophagus (BE)^[219]35, a pre-malignant condition associated with EAC.
When considering both damaging and non-damaging alterations, the
percentages of helpers and known drivers^[220]14 altered in the whole
BE cohort were comparable to that of cancer-unrelated genes
(Fig. [221]7a). However, when considering only damaging alterations,
helpers showed significant enrichment compared to the rest of genes
(p = 4 × 10^–13 and p = 3 × 10^–6, respectively, Fisher’s exact test;
Fig. [222]7b). Early alteration of helpers in pre-malignant lesions
suggests that they may favour the transition to cancer. To further
validate this hypothesis, we altered representative helpers in BE CP-A
cells and evaluated the impact of their alterations on cell growth. As
already reported^[223]6,[224]35, damaging point mutations were the most
common alterations in BE (Fig. [225]7c). Therefore, we edited ABI2 and
NCOR2 to mimic their putative loss-of-function alterations. Since CP-A
cells have wild-type TP53^[226]36, we also edited TP53 to compare the
effect of altering helpers to that of altering a strong driver. After
confirming high editing levels (Fig. [227]7d), we observed a
significant increase in cell proliferation compared to control cells in
all three cases (ABI2: p = 3 × 10^–4, NCOR2: p = 1 × 10^–2, TP53:
p = 1 × 10^–4, two-tailed t-test; Fig. [228]7e). Strikingly, both
helpers promote cell growth to comparable levels to TP53, suggesting
that helper alterations can favour early cancer progression to a
similar extent to driver alterations.
Fig. 7.
[229]Fig. 7
[230]Open in a new tab
Effect of helper alterations in Barrett’s esophagus cells. a
Percentages of altered helper genes (661/952), known cancer genes
(364/518) and rest of cancer unrelated genes (12,250/17,544) over the
corresponding total. b Percentages of damaged helper genes (261/952),
known cancer genes (135/518) and rest of cancer unrelated genes
(3,093/17,544) over the corresponding total. Statistical significance
was assessed with two-tailed Fisher’s exact test. c Distribution of
4862 damaging alterations in 3489 damaged genes in 82 samples of BE.
The damaging effect of BE variants was assessed as in EAC (Methods)
with the exception of gene gains. In this case, only ‘high gains’
defined as in the original publication^[231]35 were considered as
damaging. d Editing of ABI2, NCOR2 or TP53 in CP-A cells through
nucleofection of two crRNAs and Cas9 protein. The editing efficiency
was measured by MiSeq and quantified as described above. For all three
genes, a single amplicon was sequenced. NCOR2, ABI2 and TP53 crRNAs
overlapped for over 70% of their sequences and therefore it was not
possible to assess their editing efficiency independently. e
Proliferation curves of CP-A cells after ABI2, NCOR2 or TP53 editing
compared to control cells. For all proliferation experiments, n = 3
biological replicates were performed, each with four technical
replicates. Proliferation was assessed every 24 h and each time point
was normalised to time zero. Mean values at 72 h were compared by
two-tailed Student’s t-test. Error bars show standard deviation. Source
data are available in the Source Data file
Discussion
Most state-of-the-art approaches to discovering cancer driver events
rely on the detection of positively selected alterations of genes that
promote cancer development^[232]3,[233]19. Even ratiometric methods
based on gene properties^[234]37 ultimately assess the effect of
positive selection and distinguish the few selected drivers from the
many passenger events. As a result, the discovery of cancer drivers is
biased towards frequently altered genes, with significant limitations
for cancers such as EAC that have a highly variable but mostly flat
(i.e. with few recurrent events) mutational landscape. Indeed, the
overall selection acting on esophageal cancer genomes is among the
lowest across cancer types^[235]19, despite a median of 382 damaged
genes per EAC (Supplementary Data [236]1). Therefore, the exclusive
focus on genes under strong selection is likely to return only a
partial representation of the genes involved in EAC.
To overcome these limitations, our machine learning approach sysSVM
ranks somatically altered genes relevant to cancer development based on
their properties rather than mutation recurrence. sysSVM also considers
all types of gene alterations (SNVs, indels, CNVs, and structural
variations) simultaneously, providing a comprehensive overview of the
genetic modifications with a cancer role in individual patients. When
applied to 261 EACs, sysSVM prioritises 952 altered genes that,
together with known drivers, help cancer progression. The large number
reflects the positive correlation between mutational burden and number
of driver genes, which is only partially explained by a sample size
effect^[237]3. This positive correlation may indicate that the number
of functionally relevant genes increases with the number of altered
genes.
The heterogeneous landscape of EAC is substantially reduced by
considering the perturbed biological processes (Fig. [238]2c). Most of
these processes are well-known contributors to cancer development,
including intracellular signalling, cell cycle control, and DNA repair
(Supplementary Data [239]4). Interestingly, while the known drivers
tend to encode upstream players in these pathways, helpers are often
downstream effectors. For example, we found several Rho GTPase
effectors (Fig. [240]5d, Supplementary Data [241]3) or genes downstream
of previously reported EAC drivers in the Toll-like receptor cascade
(Supplementary Data [242]3). This supports a more local role of helpers
in contributing to cancer at the single patient level, through
sustaining or complementing driver functions. In this respect, helpers
are conceptually similar to mini-drivers^[243]38.
Clustering pathways disrupted by helpers allows the division of the 261
EACs into six clusters that are often functionally related. For
example, clusters 1H and 3H share perturbations in intracellular
signalling. Similarly, clusters 2H and 4H show perturbations in cell
cycle regulation, including S phase entry and DNA replication.
Consistent with this, they bring together the most genomically unstable
samples. By experimentally mimicking the amplification of E2F1
(representative of cluster 2H) and MCM7 (representative of cluster 4H),
we increased proliferation in EAC cells (Fig. [244]4c). We also provide
evidence that E2F1 increases proliferation by promoting S phase entry
(Fig. [245]4d). Interestingly, MCM7 overexpression reduced MCM complex
loading onto chromatin (Fig. [246]4e, f), maybe due to a stoichiometric
imbalance of complex subunits. This may indicate that MCM7 promotes
cell growth through a separate mechanism besides its function in the
MCM complex. For example, MCM7 interacts with the tumour suppressor
protein Rb, a well-characterised inhibitor of E2F1^[247]39. It is
possible that MCM7 overexpression may sequester Rb away from E2F1,
thereby promoting E2F1-mediated cell cycle progression. Moreover,
reducing MCM7 expression in cells with high basal expression led to
decreased cell proliferation, showing the dependence of cancer cells on
this alteration.
We also confirmed the cancer promoting role of very rare helpers,
including ABI2, NCOR2 and PAK1 that are altered in 1–4% of EACs
(Fig. [248]5c–f). Therefore, irrespective of the frequency across
patients, helpers have a substantial impact on cancer progression. We
therefore speculate that it is the contribution of several genes
perturbing the same pathways that promotes cancer progression rather
than the alteration of one gene alone. In line with this, the
dysregulation of a strong driver such as TP53^[249]33 or PIK3CA^[250]34
alone does not have a dramatic effect. This is confirmed by our data
where the alteration of one helper gene has a mild yet significant
effect on cell proliferation. Interestingly, helpers are also
frequently altered in pre-cancerous lesions known as Barrett’s
esophagus (Fig. [251]7b), indicating that their alteration may be an
early event in EAC transformation. Consistent with this, the
perturbation of helpers leads to increased proliferation in BE cells
and the effect is comparable to that of perturbing TP53 (Fig. [252]7e).
The expansion of the repertoire of cancer genes may indicate new,
patient-specific gene dependencies suggesting possible stratifications
that could inform the selection of targeted treatments. For example, 14
samples of cluster 1H have alterations of several proteasome subunits
(Fig. [253]6a, Supplementary Data [254]1). Experimentally reverting the
expression of the proteasome subunit PSMD3 to diploid levels reduced
proliferation in three different EAC cell lines (Fig. [255]6d),
suggesting that EACs become addicted to helper alterations and
vulnerable to their inhibition. Interestingly, proteasome inhibition
has been shown to have a synergic effect in combination with ERBB2
inhibitors^[256]40. Since ERBB2 is also significantly altered in
cluster 1H (Fig. [257]2d), a combined therapy may be beneficial to
patients in this cluster.
In summary, we provide an attempt to extend the discovery of acquired
perturbations contributing to cancer beyond those of recurrent drivers.
Additional efforts are required to fully exploit the potential of these
approaches to offer a more comprehensive view of the molecular
mechanisms behind cancer and to guide clinical interventions.
Methods
Annotation of molecular properties
Data on somatic single nucleotide variations (SNVs), small insertions
and deletions (indels), copy number variations (CNVs), structural
variations (SVs), and mutational signatures for 261 EACs were obtained
from ICGC and analysed as previously described^[258]16 (Supplementary
Data [259]1). Briefly, SNVs and indels were called using Strelka
v.1.0.13^[260]41 and subsequently filtered^[261]16. For CNVs, the
absolute copy number for each genomic region was obtained from
ASCAT-NGS v.2.1^[262]42 after correction for tumour content, using read
counts at germline heterozygous positions as derived from GATK
v.3.2–2^[263]43. To account for the high number of amplifications
occurring in EAC, copy number gains were corrected by the ploidy of
each sample as estimated by ASCAT-NGS. A gene was assigned with the
copy number of a CNV region if at least 25% of its length was contained
in that region. SVs (gene translocations, inversions, insertions) were
identified from discordant read pairs using Manta^[264]44 after
excluding SVs that were also present in more than two normal samples of
a panel of 15 esophagus and 50 blood samples^[265]16. In the case of
the TCGA validation cohort, SNVs, indels, and CNVs were derived from
level 3 TCGA annotation data of 86 EACs (Supplementary Data [266]1). In
the case of 21 EACs from a previous study^[267]17, SNVs, indels, and
CNVs were called as described for the ICGC samples (Supplementary
Data [268]1). The distribution of variant allele frequency of SNVs and
indels across all samples was used to remove outliers likely indicating
sequencing or calling artefacts. Variants with < 10% frequency and
indels longer than five base pairs were also removed. For CNVs, genomic
regions were considered as amplified or deleted if their segment mean
was higher than 0.3 or lower than −0.3, respectively, capping the
segment mean to 1.5 to avoid hypersegmentation^[269]45. A gene was
considered as amplified or deleted if at least 25% of its length was
contained in a CNV region and the resulting copy number (CN) was
estimated as:
[MATH: CN=2×<
mn>2segmentmean
:MATH]
1
No SV data were available for the validation cohorts.
Since only genes with predicted damaging alterations were used as input
for sysSVM, further annotation for the variant damaging effect was
performed. Stopgain, stoploss, frameshift, nonframeshift,
nonsynonymous, and splicing SNVs and indels were annotated using
ANNOVAR (December 2015)^[270]46. All truncating alterations (stopgain,
stoploss, and frameshift mutations) were considered as damaging.
Nonframeshift and nonsynonymous mutations were considered as
non-truncating damaging alterations if predicted by at least five of
seven function-based methods (SIFT, PolyPhen-2 HDIV, PolyPhen-2 HVAR,
MutationTaster, MutationAssessor, LRTand FATHMM) or by two out of three
conservation-based methods (PhyloP, GERP + + RS, SiPhy), using the
scores from dbNSFP v.3.0^[271]47. Splicing modifications were
considered as damaging if predicted by at least one of the two ensemble
algorithms as implemented in dbNSFP v3.0. Putative gain of function
alterations were predicted with OncodriveClust^[272]48 with default
parameters and applying a false discovery rate of 10%. The transcript
lengths to estimate mutation clustering were derived from the refGene
table of UCSC Table Browser
([273]https://genome.ucsc.edu/cgi-bin/hgTables). Gene gains, homozygous
losses, translocations, inversions, insertions were always considered
as putative damaging alterations.
Overall, 17,078 genes had at least one damaging alteration, for a total
of 116,989 redundant damaged genes across 261 EACs (Supplementary
Table [274]1). Of these, 476 were known cancer genes^[275]14,
corresponding to 4091 redundant genes (Supplementary Data [276]1). For
all 17,078 genes, the total number of exonic alterations (silent and
nonsilent) and the somatic copy number were used as additional
molecular features in sysSVM.
Annotation of systems-level properties
Protein sequences from RefSeq v.63^[277]49 were aligned to the human
reference genome assembly GRCh37 to define unique gene loci^[278]10.
The length of the longest coding sequence was taken as the gene length.
Genes aligning to more than one gene locus for at least 60% of the
protein length were considered as duplicated genes^[279]11. Data on
human ohnologs (gene duplicates retained after whole genome
duplications) were collected from Makino et al.^[280]50. The number of
protein domains was derived from CDD^[281]51. The gene chromatin state
based on Hi-C experiments^[282]13 was retrieved from the covariate
matrix of MutSigCV v1.2.01^[283]2. Data on protein-protein and
miRNA-gene interactions, gene evolutionary origin and gene expression
were retrieved as described in An et al.^[284]10. Briefly, the human
protein-protein interaction network was rebuilt from the integration of
BioGRID v.3.4.125;^[285]52 MIntAct v.190;^[286]53 DIP (April
2015);^[287]54 HPRD v.9;^[288]55 the miRNA-gene interactions were
derived from miRTarBase v.4.5^[289]56 and miRecords (April
2013);^[290]57 gene evolutionary origin was assessed as described in
D’Antonio et al.^[291]12 using gene orthology from EggNOG v.4;^[292]58
and gene expression in 30 normal tissues was retrieved from GTEx
v.1.1.8^[293]59. Except gene length, duplication and ohnologs, all
other systems-level properties had missing information for some of the
17,078 altered genes (Supplementary Table [294]1). To account for this,
median imputation for continuous properties and mode imputation for
categorical properties were implemented. Specifically, for each
property median or mode values were calculated for known cancer genes
and the rest of mutated genes. All missing values were replaced with
their corresponding median or mode values.
Application of sysSVM to EACs
The three steps of sysSVM were applied to 261 EACs (Fig. [295]1a, b,
Supplementary Note [296]1). In step 1, all 34 features derived from
molecular and systems-level properties (Supplementary Table [297]1)
were mapped to the 17,078 altered genes in the cohort. Each feature was
scaled to zero mean and unit variance to correct for the different
numerical ranges across them. In step 2, 476 known cancer genes with
damaging alterations (Supplementary Data [298]2) were used as a set of
true positives for model selection. To optimise the parameters of the
four kernels (linear, radial, sigmoid and polynomial) a grid search
using 10,000 iterations of a three-fold cross validation was performed.
At each iteration, the 476 known cancer genes were randomly split into
2/3 (around 317 genes) used as a training set and 1/3 (around 159
genes) used as the test set. At each increment of 100 cross validation
iterations, the four best models (one per kernel) were chosen based on
the median and variance of the sensitivity distribution across all
previous iterations of cross-validation. The selection of the 100 sets
of best models from all 10,000 cross-validation iterations was repeated
5 times, where all iterations were randomly re-ordered. In step 3, the
resulting 500 best models were trained with the whole training set and
used to rank the remaining 16,602 unique genes in each patient. A score
was measured to combine the predictions from the four kernels and the
genes not expressed in normal esophagus according GTEx annotation were
excluded. These produced 500 lists of top 10 genes. Out of 500 best
models, 38 had a unique set of parameters resulting in 24 unique lists
of top 10 genes (Supplementary Table [299]2). These 24 lists ranged
between 898 and 952 genes, with a core set of 598 genes shared across
all of them. The most frequent top 10 list occurred 207 times (952_A,
41.4%, Supplementary Table [300]2). It was followed by 952_B (32.2%,
161 times) and 951_A (8.6%, 43 times). These three lists accounted for
82.2% of the 500 sets of top 10 genes, they shared 950 genes and were
predicted by models differing in only one parameter (gamma in the
polynomial kernel, Supplementary Table [301]2). Furthermore, the most
frequent list was always predicted by the same set of best models.
Therefore, 952_A represented a robust set of prediction and was
considered as the final list of helper genes (Supplementary
Data [302]3). To quantify the relative contribution of the 34 features
to the four best models used to predict the final set of helper genes
in the 952_A final list, Recursive Feature Elimination (RFE)^[303]15
was implemented. RFE first defines the best set of parameters for each
kernel. Secondly, RFE trains the one-class classifier in each kernel
and computes the weight (w) of each feature, defined as the product of
the sysSVM coefficients with each support vector. Thirdly, RFE ranks
the features according to w^2 and recursively removes the feature with
the smallest value of w^2 until no feature remains. In the case of
sysSVM this was done 34 times until all sysSVM features were ranked
(Supplementary Fig. [304]3).
Identification of perturbed processes and patient clustering
To identify the perturbed biological processes in the EAC cohort, both
predicted cancer helper genes and known cancer genes were used. A
manual revision of 476 known cancer genes altered in the ICGC cohort
was performed and genes were considered as known drivers if (a) their
somatic alteration had been previously associated with EAC, (b) they
had a loss-of-function alteration and their tumour suppressor role had
been reported in other cancer types^[305]60, (c) they had a
gain-of-function alteration and their oncogenic role had been reported
in other cancer types^[306]60. The resulting 202 known cancer drivers
(Supplementary Data [307]2) and 952 cancer helpers were used for the
gene set enrichment analysis against Reactome v.58^[308]61, composed of
1877 pathways and 10,131 genes. After excluding pathways in levels 1
and 2 of Reactome hierarchy and those with less than 10 or over 500
genes, 1155 pathways were retained. These contained 9061 genes,
including 155 known drivers and 648 helpers. Gene set enrichment was
assessed using a one-sided hypergeometric test and the resulting P
values were corrected for multiple testing using the Benjamini &
Hochberg method (Supplementary Data [309]4). Enriched pathways within
the sets of known drivers or helpers were subsequently used to cluster
samples taking into account the proportion of perturbed processes
shared between samples. The Jaccard index (A) was calculated by
deriving the proportion of shared perturbed processes between all
possible sample pairs as:
[MATH:
Aij=Pi∩Pj<
/msub>∕Pi∪Pj<
/msub> :MATH]
2
where P[i] and P[j] are the perturbed processes in samples i and j,
respectively.
Complete linkage hierarchical clustering using Euclidean distance
between each row was performed on the resulting matrix. Clusters were
visualised using ComplexHeatmap R package^[310]62. To identify the
optimal number of clusters, the median silhouette value of the samples
for between 3 and 20 clusters was measured as a measure of clustering
robustness^[311]63.
Annotation of germline variants
Starting from germline variants (SNVs and indels)^[312]16, a series of
filters were applied. First, heterozygous calls with Variant Allele
Frequency (VAF) deviating from 50% by more than 2.6 standard deviations
and homozygous calls with VAF less than 95% were removed. Second,
variants were removed if their Minor Allele Frequency (MAF) in EACs
was > 10% and substantially larger than in ExAC^[313]64 and in the
European cohort from 1000 Genomes^[314]27 using the following formula:
[MATH:
MAFEA<
/mi>C>10×MAFrefe
rence
msqrt> :MATH]
3
were MAF[reference] was the MAF in ExAC or 1000 Genomes.
Finally, variants were also removed if they exhibited an excess of
heterozygosity determined as:
[MATH:
pAa>1.04-1.04
-3.74×pA
mi>×pa :MATH]
4
where p[Aa], p[A] and p[a] are the proportion of heterozygote variants,
the major allele frequency and the minor allele frequency respectively.
This formula was empirically chosen based on the principle of
Hardy-Weinberg equilibrium^[315]65. The resulting variants were
considered damaging if they were frameshift indels, introduced
stopgains or stoplosses, or if defined damaging in at least five out of
seven functional predictors as described above. Only rare damaging
variants (MAF < 1% in ExAC^[316]64 and in the Europeans from 1000
Genomes^[317]27) were further retained. A list of 152 cancer
predisposition genes was obtained from the literature^[318]26. EAC
patients that carried a rare damaging germline variant in one of these
genes were considered to carry a cancer predisposition variant.
Analysis of RNA sequencing data
Purified total RNA was extracted from 92 EACs from the ICGC cohort and
sequenced^[319]16. RNA sequencing reads were then aligned to human
reference genome hg19 and expression values were calculated using
Gencode v19. The summarise Overlaps function in the R GenomicAlignments
package was used to count any fragments overlapping with exons
(parameters mode = Union, singleEnd, invertStrand and inter.feature
were set according to the library protocol, fragments = TRUE,
ignore.strand = FALSE). Gene length was derived as the number of base
pairs in the exons after concatenating the exons per gene in
non-overlapping regions. FPKM (Fragments Per Kilobase Million) were
calculated for each gene as:
[MATH:
FPKM=genereadcount
mi>(librarysize∕
mo>1000000)×(genelengt
mi>h∕1000) :MATH]
5
Cell lines
Overexpression and editing experiments were carried out using the FLO-1
esophageal adenocarcinoma cell line obtained from the ECACC General
Cell Collection and CP-A (KR-42421) Barrett’s Esophagus cells from ATCC
(catalogue number CRL-4027). FLO-1 cells were grown at 37 °C and five
per cent CO[2] in DMEM + 2 mM Glutamine + 10% FBS (Biosera) + 1/10,000
units of penicillin–streptomycin. CP-A cells were grown at 37 °C and
five per cent CO[2] in Keratinocyte serum-free medium with 50 µg ml^−1
bovine pituitary extract and 5 ng ml^−1 recombinant human EGF
(17005042, Thermo Fisher). For passaging of CP-A cells, 250 mg L^−1
soybean trypsin inhibitor in PBS was used (17075029, Thermo Fisher).
Gene knockdown experiments were performed on OE19 cells obtained from
the Francis Crick Institute cell services, OE33 cells obtained from the
ECACC General Cell Collection and MFD1 cells obtained from the OCCAMS
Consortium. OE19 and OE33 cells were grown in RPMI + 2 mM
Glutamine + 10% FBS (Biosera) + 1/10,000 units of
penicillin–streptomycin. MFD1 cells were grown in DMEM + 2 mM
Glutamine + 10% FBS (Biosera) + 1/10,000 units of
penicillin–streptomycin. All cells were maintained at 37 °C and five
per cent CO[2], validated by short tandem repeat analysis and routinely
checked for mycoplasma contamination.
Gene overexpression
The vectors pCMVHA E2F1^[320]66 (Item ID 24225, Addgene), pLX_TRC317
(TRCN0000481188, Sigma-Aldrich) and pcDNA3.1 + /C-(K)-DYK (Clone ID:
OHu19407D, Genscript) were used to induce E2F1, MCM7, and PAK1
overexpression, respectively. FLO-1 cells were transfected according to
the manufacturer’s protocol, while CP-A cells were nucleofected
following the Neon™ kit protocol (Thermo Fisher), with 2 pulses of
1200 V for 20 ms. Overexpressing cells were selected with either
G481/Geneticin (E2F1, PAK1) or Puromycin (MCM7). Empty vectors carrying
G418 (pcDNA3.1 + /C-(K)-DYK, Genscript) or Puromycin (Item ID 85966,
Addgene) resistance were used as controls. The RNA from transfected
cells was used to assess gene overexpression via quantitative RT-PCR
using predesigned SYBR green primers (Sigma-Aldrich; Supplementary
Table [321]3) and Brilliant III Ultra-Fast SYBR Green QRT-PCR Master
Mix (Agilent Technologies). The average expression level across
triplicates (e) was relativised to the average expression level of
β-2-microglobulin (c):
[MATH: r=e-c :MATH]
6
where r is the relative gene expression. The fold change (fc) between
the relative gene expression after overexpression and the relative gene
expression in the control condition (r[c]) was calculated as:
[MATH:
fc=2rc-rKD :MATH]
7
Each sample was assessed in triplicate and each experiment was repeated
in biological duplicate.
Gene editing
To induce ABI2 and NCOR2 gene knock-out (KO) in FLO-1 cells, the
vector-free CRISPR-mediated editing approach was used as previously
described^[322]32. Briefly, cells were co-transfected using
lipofectamine CRISPR max (Life technologies) with a 69-mer tracrRNA
(Sigma-Aldrich), three gene-specific crRNAs (Sigma-Aldrich,
Supplementary Table [323]3) and GeneArt Platinum Cas9 nuclease (Life
technologies). To avoid off-target editing, all crRNAs used were
verified to map only the gene of interest with a perfect match and
additional hits in the genome with at least three mismatches. Control
cells were transfected with the same protocol but using three
non-targeting crRNAs. In CP-A cells, vector-free CRISPR-mediated
editing of ABI2, NCOR2 or TP53 was performed by introducing two
gene-specific crRNAs (Synthego, Supplementary Table [324]3) and GeneArt
Platinum Cas9 nuclease (Life technologies) into the cells by
nucleofection following the Neon™ kit protocol (Thermo Fisher), with 2
pulses of 1200 V for 20 ms. In all cases, gene editing was confirmed
with Illumina MiSeq sequencing. The regions surrounding the targeted
sites were amplified from genomic DNA of edited cells with primers
containing Illumina adapters (Supplementary Table [325]3) using Q5 Hot
Start High-Fidelity 2× Master Mix (New England Biolabs). DNA barcodes
were added with a PCR reaction before pooling the samples for
sequencing on Illumina MiSeq with the 250 base-pair paired-end
protocol. Sequencing reads were merged into single reads and aligned to
the human reference genome hg19 using BBMerge and BBMap functions of
BBTools^[326]67, obtaining an average of 78,864 aligned reads per
experiment. SNVs and small indels in the regions corresponding to each
crRNA (Supplementary Table [327]3) were called using the CrispRVariants
package in R^[328]68 and the percentage of edited alleles was estimated
as the percentage of variant reads in each experiment.
Gene knockdown
Inducible gene knockdown was carried out using lentiviral
pTRIPZ-TurboRFP (MCM7) or pSMART-TurboGFP (PSMD3) shRNA vectors
(Dharmacon). For each gene, three shRNA vectors were tested
(Supplementary Table [329]3). Virus was produced by co-transfecting
HEK293T cells with pTRIPZ or pSMART constructs alongside psPAX2 and
pMD2.G vectors (Addgene) using Fugene HD (Promega). Viral supernatant
was collected at 24 and 48 h and used for two rounds of infection of
OE19, OE33 or MFD1 cells, using 8 μg ml^−1 hexadimethrine bromide
(Sigma-Aldrich). Infected cells were selected after 48 h with
2 μg ml^−1 puromycin for 7 days. To induce shRNA expression, cells were
treated with 1 μg ml^−1 doxycycline (Sigma-Aldrich) for 16 h. Gene
expression with or without doxycycline was assessed by qRT-PCR using
predesigned SYBR green primers (Sigma-Aldrich; Supplementary
Table [330]3). Cells with the highest level of knockdown were then
sorted by FACS to isolate medium expressing cells (the middle 30% of
cells based on TurboRFP or TurboGFP fluorescence). Gene expression
after sorting was measured by qRT-PCR 24 h post-induction with
0–1 μg ml^−1 doxycycline, to determine the concentration of doxycycline
required to reduce expression to levels equivalent to FLO-1 cells. The
determined concentrations of doxycycline used for proliferation assays
were 0.05 μg ml^−1 for OE19 PSMD3 shRNA3, 0.25 μg ml^−1 for OE33 PSMD3
shRNA3, 0.25 μg ml^−1 for MFD1 PSMD3 shRNA3, 0.75 μg ml^−1 for MFD1
MCM7 shRNA3.
Cell proliferation
Cell proliferation was measured every 24 h for three or four days,
starting three hours after seeding the cells (time zero) using crystal
violet staining, CellTiter 96 Non-Radioactive Cell Proliferation Assay
(Promega) or CellTiter-Glo Luminescent Cell Viability Assay (Promega).
Briefly, 4.5 × 10^3 cells per well were seeded on 96-well plates in a
final volume of 100 μl per well. For inducible shRNA-expressing cells,
doxycycline was added 48 h prior to the start of the assay, and culture
media replaced every 24–48 h with fresh media containing doxycycline.
For the CellTiter 96 Non-Radioactive Cell Proliferation Assay, 15 μl of
the dye solution was added into each well and cells were incubated at
37 °C for two hours. The converted dye was released from the cells
using 100 μl of the solubilisation/Stop solution and absorbance was
measured at 570 nm after one hour using the Paradigm detection platform
(Beckman Coulter). For the CellTiter-Glo Luminescent Cell Viability
Assay, 100 μl of the CellTiter-Glo reagent was added into each well and
luminescence was measured after 30 min using the Paradigm detection
platform (Beckman Coulter). For all proliferation assays, four
replicates per condition were measured at each time point and each
measure was normalised to the average time zero measure for each
condition. Each experiment was repeated at least twice. Conditions were
compared using the two-tailed Student’s t-test.
Flow cytometry
EdU incorporation and MCM loading were assessed using a modified
version of the protocol described in Galanos et al.^[331]69. Briefly,
in each condition, 3 × 10^6 cells were pulsed for 30 min with 10 µM EdU
(Invitrogen) before washing in 1% BSA/PBS. Chromatin fractionation was
performed by incubating on ice for 10 min in CSK buffer (10 mM HEPES,
100 mM NaCl, 3 mM MgCl[2], 1 mM EGTA, 300 mM sucrose, 1% BSA, 0.2%
Triton-X100, 1 mM DTT, cOmplete EDTA-free protease inhibitor cocktail
tablets, Roche). Cells were then fixed in 4% formaldehyde/PBS for
10 min at room temperature before washing in 1% BSA/PBS. Cells were
permeabilised and barcoded^[332]70 by incubating in 70% ethanol
containing 0–15 µg ml^−1 Alexa Fluor 488 (Thermo Fisher) for 15 min,
then washed twice in 1% BSA/PBS. Barcoded cells were subsequently
pooled before incubating in primary antibody (mouse monoclonal
anti-MCM7: Santa Cruz Biotechnology sc-56324, or rabbit polyclonal
anti-MCM3: Bethyl Laboratories A300–192A) diluted 1:100 in 1% BSA/PBS
for 1 h. After washing in 1% BSA/PBS, samples were incubated for 30 min
in secondary antibody (Alexa Fluor 555-conjugated donkey anti-mouse:
A-31570, or donkey anti-rabbit: A-31572) diluted 1:500 in 1% BSA/PBS,
then washed again in 1% BSA/PBS. EdU labelling with Alexa Fluor 647
azide was performed using Click-iT EdU flow cytometry assay kit
(Invitrogen, [333]C10424) following the manufacturer’s instructions
before washing samples in 1% BSA/PBS. Samples were then incubated in 1%
BSA/PBS containing RNase and 10 mg ml^−1 DAPI for 15 min before
analysing with a BD LSR II Fortessa flow cytometer (BD Biosciences).
Lasers and filters used include: 407 nm laser with 450/50 bandpass
filter; 488 nm laser with 505 longpass and 530/30 bandpass filters;
561 nm laser with 570 longpass and 590/30 bandpass filters; 640 nm
laser with 670/14 bandpass filter. Compensation was performed manually
with single colour controls, using BD FACSDiva software (BD
Biosciences). FlowJo 10.3 software was used to analyse MCM loading onto
chromatin and EdU incorporation. Cells were gated to remove debris
using FSC-A/SSC-A, then gated to isolate singlets using DAPI-H/DAPI-A
(Supplementary Fig. [334]9). The cells were then separated by gating
the barcoded populations using 488-A/DAPI-A. Cells were finally
separated into cell cycle gates (G1, S1-4, G2) based on EdU-647-A and
DAPI-A (Supplementary Fig. [335]9), and the geometric mean fluorescence
intensity was obtained for each channel (MCM-555 or EdU-647).
ETHICS
All subjects gave informed consent, and this study was registered
(UKCRNID 8880) and approved by the Institutional Ethics Committees,
Cambridgeshire 4 Research Ethics Committee (REC 07/H0305/52 and
10/H0305/1).
Reporting Summary
Further information on research design is available in the [336]Nature
Research Reporting Summary linked to this article.
Supplementary information
[337]Supplementary Information^ (6MB, pdf)
[338]Peer Review File^ (91.1KB, pdf)
[339]41467_2019_10898_MOESM3_ESM.pdf^ (81.1KB, pdf)
Description of Additional Supplementary Files
[340]Supplementary Data 1^ (188KB, xlsx)
[341]Supplementary Data 2^ (49.7KB, xlsx)
[342]Supplementary Data 3^ (109.9KB, xlsx)
[343]Supplementary Data 4^ (77.7KB, xlsx)
[344]Reporting Summary^ (75.2KB, pdf)
[345]Source data^ (39.6KB, xlsx)
Acknowledgements