Graphical abstract
graphic file with name fx1.jpg
[25]Open in a new tab
Highlights
* •
Highly performing cancer survival prediction with XGBoost
* •
Pan-cancer training outperforms single-cohort training
* •
Combined approach consisting of machine learning and network
propagation
* •
Tumor microenvironment is most strongly involved in cancer survival
prediction
__________________________________________________________________
Bioinformatics; Cancer systems biology; Mathematical biosciences
Introduction
Patient survival is the ultimate goal of cancer therapy and predicting
patient survival from the molecular features of the individual tumor is
an important computational task that has implications for tumor
progression, therapy, and patient care ([26]Hoadley et al., 2018).
Large population studies have shown that cancer survival is a
multi-factorial problem and varies broadly between cancer types
([27]Allemani et al., 2015, [28]2018). This has given rise to the
identification of numerous gene expression signatures specific for
given cancer sub-types or treatments; however, such signatures are
often hardly reproducible ([29]Venet et al., 2011) and depend on the
statistical approach, the individual patient cohorts used, and even the
normalization of the data ([30]Patil et al., 2015). In addition to gene
expression signatures, it has recently also been reported for
mutation-based signatures that there is a lack of biological cause and
interpretation ([31]Alexandrov et al., 2020; [32]Kim et al., 2020).
This is due to rather small sample sizes of subtype-specific patient
cohorts and high inter-patient heterogeneity of the molecular features
of the patients within the cancer subtypes.
To overcome these issues, pan-cancer approaches have been conducted
([33]The ICGC/TCGA Pan-Cancer Analysis of Whole Genomes Consortium,
2020), and these studies have indeed shown that there are common
regulatory mechanisms and features that are shared by patients of
different cancer sub-types, for example, on the level of signaling
pathways ([34]Sanchez-Vega et al., 2018). Additionally, it has been
shown that the immune status of the patients is an important factor in
cancer survival ([35]Thorsson et al., 2018). Furthermore, cancer onset
and progression are correlated with age, and many of the hallmarks of
cancer are shared in aging ([36]Aunan et al., 2017). In particular,
there is interplay between aging cells in the microenvironment and
cancer cells that may impact tumor metastasis and therapy responses
([37]Fane and Weeraratna, 2020). This suggests that there might be
common patterns for survival prognosis across different cancer types
and only recently survival prognoses based on pan-cancer approaches
have been developed ([38]Cheerla and Gevaert, 2019; [39]Kim et al.,
2020).
Computational approaches predicting survival from patients’ molecular
data have been conducted based on different machine learning techniques
([40]Kourou et al., 2015). For example, [41]Ishwaran et al. (2008)
developed a method for right-censored data that is based on random
forests and uses ensembles of decision trees for survival prediction,
while [42]Khan and Zubek (2008) adapted support vector machines to
predict patient survival. More recently, [43]Li et al. (2016)
introduced a multitask learning formulation for survival prediction,
where the survival time is predicted by estimating the patients’
survival status at certain time intervals over the study period. Other
recent methods for predicting cancer patient survival include Cox-nnet
([44]Ching et al., 2018), which uses a neural network in combination
with Cox regression as an output layer, and DeepSurv ([45]Katzman
et al., 2018), a deep neural network survival prediction method that
besides predicting survival also models the relationship between
patient covariates and treatment effectiveness in order to make
personalized treatment recommendations. While these approaches are
mainly gene-centered, it has been argued that they provide limited
understanding of mechanistic cancer survival processes. Therefore,
machine learning approaches have been developed that incorporate
pathway and network features directly in the learning process so that
derived features highlight potential molecular mechanisms of cancer
survival such as the Path2Surv method ([46]Dereli et al., 2019).
A key computational concern of machine learning is interpretability,
i.e., the ability to recall how the algorithm has come to its decisions
on global and local levels. This definition has recently been expanded
by defining interpretability as the “extraction of relevant knowledge”,
i.e., to also generate plausible results ([47]Murdoch et al., 2019).
Because complex machine learning methods, in particular tree ensemble
methods and (deep) neural networks, have generally a lower level of
interpretability than, for example, simple decision trees or linear
regression approaches, this might necessitate additional methods that
add plausibility in the context of the concrete data domain, such as
the incorporation of biological networks. An approach to investigate
biological mechanisms from molecular data is network propagation
([48]Cowen et al., 2017). Network propagation allows combining
experimental data with molecular interaction networks, such that the
topology of the network is used to propagate the data effects
throughout the network, and by that amplifying and functionally
interpreting the experimental data. This approach covers a wide range
of data domains and has been applied, for example, to associate genetic
variants with phenotypic disease traits ([49]Barel and Herwig, 2020;
[50]Leiserson et al., 2015).
In this work we combine state-of-the-art tree ensemble learning and
network propagation in order to derive and characterize a pan-cancer
molecular signature for survival prognosis that is both computationally
and biologically interpretable. We have analyzed gene expression data
from the TCGA cancer cohorts summarizing to 8,024 patients from 25
different cancer cohorts ([51]Liu et al., 2018). Our machine learning
approach is based on the XGBoost tree ensemble method ([52]Chen and
Guestrin, 2016), which has been shown to provide good performance in
many application domains, for instance, in diagnosing chronic kidney
disease ([53]Ogunleye and Wang, 2020) or identifying patients with
epilepsy based on their cerebral activity ([54]Torlay et al., 2017). We
compare the XGBoost method against four other state-of-the-art methods
and show highly competitive performance in single-cancer cohort
survival prediction. From the comparison, we observed that age, in
particular, is a confounding cross-cohort factor and that overall
survival prediction performance of all methods decreased with the age
of the patients. This led us to question whether there is a common
signature of cancer survival in the patients that is independent of the
particular cancer type. We thus set up a workflow for pan-cancer
training and showed increased prognostic performance compared to
single-cancer prediction. Features dominant for pan-cancer prediction
were further quantified and network propagation was carried out with
the trained feature importance weights in order to identify molecular
mechanisms related to pan-cancer survival. Network propagation
identified a subnetwork consisting of 103 genes that is strongly
associated with the tumor microenvironment (TME) and regulated by
important factors of the TME such as TNF, TP53, and STAT3. Furthermore,
we can show that the immune status of the patients can partially be
recovered by the survival network signature.
Our study shows that gradient tree boosting can be efficiently applied
to pan-cancer survival prediction and that the combination of machine
learning and network propagation can identify biologically meaningful
subnetworks that highlight the importance of the TME for patient
survival. The combination of machine learning and subsequent
feature-informed network propagation is fairly generic and can be
generalized to other disease domains and biological research questions.
Results
XGBoost gradient boosting predicts survival of cancer patients
In order to assess the ability of the XGBoost method to predict
survival from gene expression data of cancer patients, we compared the
single-cohort XGBoost method against three well-established survival
prediction methods, namely random survival forest ([55]Ishwaran et al.,
2008), survival support vector machine ([56]Khan and Zubek, 2008;
[57]Shivaswamy et al., 2007), and the Path2Surv multiple-kernel
learning method ([58]Dereli et al., 2019) (cf. STAR Methods).
[59]Figure 1A shows the prediction performances of random survival
forest (RF), survival SVM (SVM), the Path2Surv multiple-kernel learning
method on two different pathway/gene set collections (MKL[H] on the
Hallmark gene sets ([60]Liberzon et al., 2015) and MKL[P] on the
Pathway Interaction Database (PID) ([61]Schaefer et al., 2009)), and
the single-cohort XGBoost method (XGB[SINGLE]) on 25 different TCGA
cancer cohorts ([62]Table S1). These cohorts represent a wide spectrum
of the human organ system ([63]Parris, 2020). For ten of the cohorts
(TCGA-BLCA, TCGA-BRCA, TCGA-CESC, TCGA-COAD, TCGA-HNSC, TCGA-LGG,
TGCA-OV, TGCA-PAAD, TCGA-SARC, and TCGA-STAD) single-cohort XGBoost
showed the best median performance, while RF performed best for seven
cohorts (TCGA-ACC, TCGA-KIRC, TCGA-KIRP, TCGA-LAML, TCGA-READ,
TCGA-UCS, and TCGA-UVM). The Path2Surv method outperformed the other
methods in four (TCGA-LIHC, TCGA-LUAD, TCGA-LUSC, and TCGA-MESO) and
three (TCGA-ESCA, TCGA-GBM, and TCGA-SKCM) cohorts for MKL[P] and
MKL[H], respectively, while SVM could achieve the best median
performance in only one cohort (TCGA-UCEC). Overall, single-cohort
XGBoost significantly outperformed RF for 13, SVM for 17, MKL[H] for 9
and MKL[P] for 10 out of 25 TCGA cohorts.
Figure 1.
[64]Figure 1
[65]Open in a new tab
Single-cohort prediction performances
(A) C-Index boxplots over 100 replications of model training for random
survival forest (RF), survival SVM (SVM), the Path2Surv multiple-kernel
learning on the Hallmark gene sets (MKL[H]) and the Pathway Interaction
Database (MKL[P]), and the single-cohort XGBoost method (XGB[SINGLE])
on 25 different TCGA cancer cohorts. Mean C-Indices were compared with
Wilcoxon’s unpaired rank-sum test and significance levels are defined
as
[MATH: ns: p>0.05
:MATH]
,
[MATH: ∗:p≤0.05 :MATH]
,
[MATH: ∗∗:p≤0.01 :MATH]
,
[MATH: ∗∗∗:p≤0.001 :MATH]
,
[MATH: ∗∗∗∗:p≤0.0001 :MATH]
.
(B) Spearman correlations between predictions of the different methods
for test patients from the cohorts TCGA-BLCA (left) and TCGA-UVM
(right). Larger circles correspond to a greater correlation, blue
indicates a positive correlation and red indicates a negative
correlation.
(C) Spearman correlation (R) between median C-Indices of single-cohort
XGBoost predictions and median ages for 25 different TCGA cohorts. The
blue line shows the linear regression fit to the data and the gray area
indicates the 95% confidence interval.
For two of the 25 cohorts (TCGA-BLCA and TCGA-UVM) we also assessed how
well the survival predictions of the different methods were correlated.
To this end, for each of the two selected cohorts, we trained and
tested all methods on the same train-test-split of patients and
computed the Spearman correlation between survival predictions for the
test patients. Since Path2Surv (MKL[H] and MKL[P]) and SVM predict
survival times, while RF and XGBoost predict risk scores, where a
higher risk score corresponds to a shorter predicted survival,
predictions of Path2Surv and SVM are expected to be negatively
correlated to predictions of RF and XGBoost. [66]Figure 1B shows the
correlation values between predictions of the different methods. As
expected, predictions of the same metric (either survival time or risk
score) are positively correlated, while predictions of survival time
are negatively correlated to risk score predictions in most cases.
However, correlation strengths differ between methods and among the
analyzed cohorts. The survival predictions of the different methods
seem to be more highly correlated in TCGA-UVM than in TCGA-BLCA. In
TCGA-UVM predictions are most correlated between MKL[P] and RF and
between XGB[SINGLE] and RF. In TCGA-BLCA, the highest Spearman
correlation can be observed between XGB[SINGLE] and SVM, while the
other correlations are rather weak.
Age is the predominant indicator of tumor development and the
probability of developing cancer increases from 3.4% in the time span
from birth to 49 years of age (male; female 5.5%) to 32.2% in the time
span above 70 years of age (female 26%) ([67]Siegel et al., 2018). We
observed that survival prediction performance to a certain degree is
dependent on the age distribution of the cohort under study. For
example, predictions for TCGA-ACC and TCGA-LGG were exceptionally good
(median C-Index > 0.8 with the XGBoost single-cohort method) and ages
of patients in these cohorts were consistently low with an overall
median of 49 and 41 years. This overall median increases to 61 and 61.5
years with TCGA-KIRP and TCGA-UVM and, likewise, the prediction
performance drops below 0.8 in both cohorts. All prediction methods
performed bad for the TCGA-UCS cohort (∼0.5 prediction performance)
with a median age of 68.5 years. In fact, the median C-Indices of the
predictions made by the single-cohort XGBoost approach and the median
age of the correspondent cohort are negatively correlated (Spearman R =
−0.55, p = 0.004; [68]Figure 1C), meaning that prognosis performances
tend to be better for cancers of younger patients. This indicates
age-specific gene expression signatures in the datasets under study
that are easier to be resolved by machine learning methods in younger,
and presumably more intact, states.
Pan-cancer training improves over single-cohort training
We next investigated the gene expression features that were used by the
single-cohort XGBoost method to predict survival in the different TCGA
cancer cohorts. XGBoost has several built-in types of importance values
that are computed during the training process. For each cohort and each
of the training replications we computed the ‘gain’ of each gene
expression feature which refers to the improvement in the used
evaluation metric that the particular feature gives to the different
branches it is on (cf. [69]STAR Methods). Since the single-cohort
XGBoost method was trained for 100 replications on each cohort with
varying splits of training and test data, the genes selected in the
feature selection step as well as the final feature importance values
can vary among the replications and even more so among the different
cancer cohorts depending on the patients in the training set. Across
all 25 cohorts and all 100 replications per cohort, there is a total of
46,642 different RNA molecules that were captured by the TCGA data
analysis pipeline ([70]Gao et al., 2019) and used for prognosis by the
XGBoost method, including protein coding genes, processed pseudogenes
([71]Cheetham et al., 2020), and long non-coding RNAs ([72]Statello
et al., 2021), among others. In [73]Figure 2A, we analyzed to what
extent these gene features were shared between the different cohorts.
It can be observed that most genes gain feature importance for a
smaller number of cohorts (<10) while there are only a very small
number of genes that are among the important features for a larger
fraction of the cohorts (>15). This accounts for cancer subtype
differences, tissue-specificity as well as inter-patient heterogeneity.
Figure 2.
[74]Figure 2
[75]Open in a new tab
Pan-cancer XGBoost training improves over single-cohort training
(A) Histogram depicting fractions of gene features shared over
different numbers of training cohorts (x axis: number of TCGA cohorts a
gene feature is shared over; y axis: fraction of all 46,642 genes used
in at least one single-cohort model).
(B) Prediction performances of the single-cohort XGBoost method
(XGB[SINGLE]) and the pan-cancer XGBoost method (XGB[PAN]) on 25
different TCGA cancer cohorts, depicted as C-Index boxplots over 100
replications of model training. Mean C-Indices were compared with the
Wilcoxon unpaired rank-sum test and significance levels are defined as
[MATH: ns: p>0.05
:MATH]
,
[MATH: ∗:p≤0.05 :MATH]
,
[MATH: ∗∗:p≤0.01 :MATH]
,
[MATH: ∗∗∗:p≤0.001 :MATH]
,
[MATH: ∗∗∗∗:p≤0.0001 :MATH]
. See also [76]Figure S5.
(C) Venn-diagram comparing features used for prediction in the
single-cohort XGBoost method (pink) with those selected in the
pan-cancer XGBoost method (blue). See also [77]Figure S1.
(D) Prediction performances (C-Indices) of single-cohort XGBoost (pink)
and pan-cancer XGBoost (blue) for eight new cancer cohorts (not used in
model training). For the single-cohort method, the mean C-Index over
all 25 models trained on different TCGA cohorts is shown.
To overcome this heterogeneity across subtypes and patients and to
retrieve features with more general importance, we thus trained the
XGBoost method on a joint dataset combining the different cohorts
(pan-cancer training; cf. [78]STAR Methods) by using pooled gene
expression data from all 25 cancer cohorts instead of training a
separate model for each cohort. [79]Figure 2B compares the C-Indices
over 100 replications between the pan-cancer and single-cohort XGBoost
versions. For 15 out of the 25 cohorts, pan-cancer training
significantly improved the method’s prediction performance in terms of
C-Index, and for nine cohorts performance was comparably good for both
versions. Only in one cohort, namely TCGA-LAML, the prediction
performance of the pan-cancer-trained XGBoost method was significantly
worse than single-cohort-trained XGBoost. When comparing the features
selected in the single-cohort training procedures with those selected
in the pan-cancer version of the XGBoost method ([80]Figure 2C), it
becomes apparent that the vast majority (98.6%) of genes selected as
important features in at least one of the 100 replications of the
pan-cancer training were also among the important features of the
single-cohort approach. However, while for the single-cohort XGBoost
approach the set of important features comprises a total of 46,642
different genes, the pan-cancer approach results in a 74% reduction and
identifies only 12,082 genes as important features ([81]Figure 2C).
Additionally, the type of features that are of importance change
([82]Figure S1): While in the single-cohort approach 40.4% of the
important features represent protein coding genes, 25.0% represent
lncRNAs and 15.8% processed pseudogenes, in the pan-cancer approach we
observe a shift toward protein coding genes (56.5%) and a reduction of
lncRNAs (20.7%) and processed pseudogenes (11.9%). This shift might be
driven by the tissue specificity of lncRNAs and the rather
patient-specific nature of mRNA retrotransposition leading to less
feature importance in the pan-cancer training.
Another benefit of pan-cancer training is that the selected features
are not specific to a particular cancer type and can be extrapolated to
predict survival in yet unseen patient cohorts. We challenged this
claim and additionally tested the prediction performance of the
pan-cancer and single-cohort trained models on eight additional TCGA
cohorts, which had previously been excluded because the number of
uncensored patients in these cohorts was too small (cf. [83]STAR
Methods; [84]Table S1). [85]Figure 2D shows the C-Indices of the
predicted survival on these eight cohorts. Since there are 25 different
single-cohort models (each trained on one TCGA cohort), but only one
pan-cancer model trained on all cohorts jointly, we compared the
prediction performance of the pan-cancer model to the mean performance
of the 25 single-cohort models. For all eight cohorts, the C-Index of
the pan-cancer prediction was better than the mean C-Index of the
single-cohort models, and for seven cohorts, the C-Index of the
pan-cancer XGBoost method was greater than 0.5, even though the entire
cancer types of the test patients had not been seen in model training.
This supports the hypothesis that the genes identified in the
pan-cancer XGBoost approach indeed have higher predictive power for
patient survival across yet unseen cancer types than the single-cohort
approaches.
Pan-cancer feature importance weights identify genes with high biological
plausibility
We next explored the biological plausibility of the genes identified as
important survival features in the pan-cancer XGBoost approach. We
first investigated the distribution of summarized feature importance
weights over all 100 replications of the pan-cancer training procedure.
[86]Figure 3A shows the weight distribution for the 100 genes with the
highest weights, where Ensembl gene identifiers were converted to HUGO
symbols with the MyGene Python package (version 3.1,
[87]http://mygene.info) ([88]Wu et al., 2013; [89]Xin et al., 2016) and
identifiers that did not map to a HUGO symbol are named with their
Ensembl gene identifier. It is noticeable that a few genes, and in
particular the gene IGF2BP3 (Insulin Like Growth Factor 2 MRNA Binding
Protein 3), have especially high feature importance weights, indicating
a high prognostic potential. Indeed, IGF2BP3 has been found to be
overexpressed in various cancers and has been associated with
metastasis and poor survival in a number of cancer types
([90]Mancarella and Scotlandi, 2020) including colon cancer
([91]Lochhead et al., 2012).
Figure 3.
[92]Figure 3
[93]Open in a new tab
Pan-cancer features are biologically plausible
(A) Weight distribution for the 100 genes with the highest feature
importance (sums of feature importance scores over 100 model
replications) for pan-cancer XGBoost training (gene identifiers that
did not map to a Hugo symbol are named with their Ensembl identifiers).
The different colors indicate gene types (blue: protein coding, orange:
lncRNA, green: processed pseudogenes, purple: transcribed unprocessed
pseudogene, red: gene type unknown). These genes types were obtained
using the MyGene Python package (version 3.1, [94]http://mygene.info)
([95]Wu et al., 2013; [96]Xin et al., 2016). See also [97]Figure S3.
(B) Comparison of entropy distributions between the top 100 genes with
the highest feature importance (feature importance is measured as sums
of feature importance scores over 100 model replications) from the
single-cohort approach and the pan-cancer approach (mean entropies are
indicated as dashed lines). The entropy measure (xaxis) is based on the
genes used in the single-cohort approach (cf. [98]STAR Methods). The
density of the entropy distribution is displayed on the yaxis.
(C) Kaplan-Meier plots for the four most important gene features from
pan-cancer XGBoost and the cancer type with the lowest FDR-corrected p
value in Cox regression, respectively. As a cutoff for gene expression,
the 50^th percentile was selected. Cox regression data and Kaplan-Meier
plots were retrieved from OncoLnc ([99]Anaya, 2016). See also
[100]Figure S2.
In order to judge whether the genes identified as important features in
the pan-cancer approach could be prognostic for several cancer types we
applied an entropy score derived from their importance values in the
single-cohort training models (cf. [101]STAR Methods). For each of the
single cohorts we summed the gene feature importance scores over the
100 different training repetitions to obtain a feature importance
weight for each gene in each cohort. Genes have a high entropy score if
they have similar feature importance weights in many different cohorts
and a low entropy score if they have feature importance in only a few
cohorts (range [0, 4.64]). [102]Figure 3B compares the distributions of
the entropy scores of the top 100 genes from the single-cohort approach
and the top 100 genes from the pan-cancer approach. The top 100
pan-cancer genes had a significantly higher entropy than the top 100
genes from the single-cohort approach (p = 1.145 × 10^−14 in a
one-sided Wilcoxon unpaired rank-sum test), meaning the pan-cancer
method generalizes better over different cancer types than the
single-cohort method.
To validate the association of these pan-cancer genes with survival, we
analyzed the four top-scoring protein-coding genes (IGF2BP3, IL1RAP,
PIK3R3, and CISH) in more detail by querying OncoLnc ([103]Anaya,
2016), a tool that provides Cox regression analyses as well as
Kaplan-Meier survival plots on TCGA gene expression data for different
cancer types. [104]Figure 3C displays Kaplan-Meier plots of these four
genes for the cancer type with the lowest FDR-corrected p value in the
Cox regression, respectively. For example, IGF2BP3 has the highest
predictive performance in survival for LGG (FDR = 3.59 × 10^−9 in Cox
regression), while IL1RAP, PIK3R3, and CISH are most predictive for
KIRP, KIRC, and LGG, respectively. Additionally, IGF2BP3, IL1RAP,
PIK3R3, and CISH have significant survival performance (FDR <0.05 in
Cox regression) for four (KIRP, KIRC, LUAD and PAAD), two (LGG and
PAAD), two (LGG and HNSC), and four (LUAD, LIHC, KIRP, and KIRC) other
cohorts, respectively. Kaplan-Meier plots for these cohorts can be
found in the [105]Supplemental information ([106]Figure S2). For
generating the Kaplan-Meier plots in [107]Figure 3C, the patients were
first split into two groups by expression of the respective gene (i.e.,
IGF2BP3, IL1RAP, PIK3R3, or CISH). As a cutoff, we selected the 50^th
percentile to ensure that both groups had equal sizes and all patients
were included in the Kaplan-Meier analysis. To measure whether there is
a significant difference in survival between the low-expression and the
high-expression groups, OncoLnc uses the logrank test. For all but one
(CISH in LIHC) gene-cohort pairs that had a significant FDR corrected p
value in the Cox regression, the logrank p values were also significant
(p < 0.05), indicating that these genes are indeed predictive for
survival.
Inferring a pan-cancer survival network
Even though the pan-cancer approach reduces substantially the number of
genes used for survival prediction (74%) as compared to the
single-cohort approach, with the most important features having a high
biological plausibility, there are still 12,082 features that are
selected as potentially important in at least one of the 100
replications of the pan-cancer training procedure. In each replication,
there are a maximum of 500 genes selected for survival prediction (cf.
[108]STAR Methods), meaning that among the total 12,082 genes, there
are many genes that are only selected in a small number of replications
and the selected genes thus still highly depend on the composition of
the training set. Furthermore, this still large number of genes makes
it difficult to infer mechanistic information and lacks biological
focus. On the other hand, these features might still be relevant for a
subset of patients, and simply neglecting the features with low feature
importance weights with some threshold might be somewhat arbitrary. The
resulting distribution of feature importance weights resembles a
“long-tail” distribution often visible with cancer-associated SNPs
([109]Armenia et al., 2018) ([110]Figure S3).
It has been argued that plausibility of machine learning methods could
be improved by incorporating prior knowledge on biological networks in
the analysis workflow ([111]Camacho et al., 2018), and we challenged
this claim by applying our recently developed network propagation
method, NetCore, to the feature importance weights of the pan-cancer
training approach (cf. [112]STAR Methods). Network propagation has
emerged as a useful tool for inferring mechanistic information, in
particular from high-throughput data ([113]Cowen et al., 2017). In
network propagation genes are organized in a graph and edges between
genes are set through biological a priori knowledge. Typically,
protein-protein interaction networks are used as a scaffold for the
interaction context ([114]Huang et al., 2018). For this study, we used
the large integrated protein-protein interaction network from the
ConsensusPathDB comprising 10,586 nodes and 114,341 interactions
([115]Herwig et al., 2016). Gene nodes were initially weighted with the
sums of the feature importance scores over 100 replications of the
pan-cancer training and the NetCore propagation method was used to
spread these weights over the network until a steady-state condition
was reached ([116]Barel and Herwig, 2020). Then we inferred a
pan-cancer survival network by applying NetCore’s module identification
approach to the network propagation results with the top 100 genes that
are contained in the protein-protein interaction network and have the
largest feature importance sums as seed genes. NetCore extends these
seed genes by genes that were assigned a significantly high weight
after the network propagation step to form connected subgraphs, which
are called modules (cf. [117]STAR Methods). [118]Figure 4A shows the
largest network module identified by NetCore. Overall, 13 network
modules containing between 2 and 79 genes were identified. These
modules comprised 103 genes in total, 76 genes with high pan-cancer
feature importance and 27 additionally inferred genes added in the
network propagation steps. The genes and their initial as well as
propagated weights are listed in [119]Table S2.
Figure 4.
[120]Figure 4
[121]Open in a new tab
Pan-cancer survival network
(A) Largest network module identified by NetCore ([122]Barel and
Herwig, 2020) network propagation and module identification based on
pan-cancer important features. Orange nodes correspond to seed genes,
while genes that were inferred during network propagation are colored
in gray.
(B) Feature importance of the 103 module genes in single-cohort
training (100 replications). Top: Sum of feature importance scores of
the module genes per cohort. Bottom: Number of genes (of the 103 module
genes) per cohort that are among the important features in
single-cohort training (feature importance > 0). See also [123]Tables
S2 and [124]S3.
The 103 genes are informative on all different cancer subtypes with an
average of 41.48 genes gaining feature importance for the cohorts in
single-cohort training. TCGA-LUSC (59 genes) shares the largest number
of genes followed by TCGA-HNSC (54) and TCGA-OV (54) ([125]Figure 4B).
On the level of feature importance weights, however, the difference is
more drastic. While for many cohorts the contributions are high, for
TCGA-MESO, TCGA-UCS, TCGA-SKCM, TCGA-READ and TCGA-UVM, the feature
importance weights of the 103 genes are rather low. This might in part
due to the fact that these cohorts were in the lower range of patient
numbers that contributed to pan-cancer training (<200; [126]Table S1).
The pan-cancer survival network has an overlap of 27 genes with
previously annotated cancer genes ([127]Repana et al., 2019).
Interestingly, the majority of these cancer genes (16 out of 27) were
inferred by network propagation, meaning that they had very little to
no feature importance with respect to survival prognosis but were
re-ranked significantly because of their high connectivity in the
protein-protein interaction network ([128]Table S2). The largest
re-ranked pan-cancer feature importance is attributed to the gene
PIK3R3 (Phosphoinositide-3-Kinase Regulatory Subunit 3) ([129]Table
S2). PIK3R3 is an enzyme that participates in multiple cancer-related
signaling pathways, most importantly PI3K-AKT-mTOR signaling. It can
act as an oncogene in colorectal cancer and high expression of PIK3R3
inhibits cell senescence and promotes cell proliferation ([130]Chen
et al., 2020).
Pan-cancer survival network is strongly associated with the tumor
microenvironment
In order to characterize the genes of the pan-cancer survival network
we performed over-representation analysis and interrogated canonical
pathways and upstream regulators using QIAGEN Ingenuity Pathway
Analysis (IPA) ([131]Krämer et al., 2014) ([132]Table 1).
Table 1.
Over-represented pathways (p < 0.001) computed with QIAGEN Ingenuity
Pathway Analysis (IPA)
Ingenuity canonical pathway -log(p value) Ratio Molecules
Tumor microenvironment pathway 9.34 6.25 × 10^−2 FGF2, IDO1, IGF2,
JAK2, MMP1, MMP14, MMP3, PIK3R3, PLAU, SPP1, TGFB1
Glucocorticoid receptor signaling 8.60 3.25 × 10^−2 A2M, CAV1, ESR1,
JAK2, MMP1, MMP3, PGR, PIK3R3, PLA2G4A, PLA2G5, PLAU, RPS6KA5,
SERPINE1, TGFB1, TGFBR2
Role of tissue factor in cancer 8.55 7.76 × 10^−2 FRK, FYN, ITGA3,
JAK2, MMP1, PIK3R3, PLAUR, RPS6KA3, RPS6KA5
Hepatic fibrosis signaling pathway 6.85 3.17 × 10^−2 FGF2, GNAI3, INS,
ITGA3, JAK2, MMP1, PIK3R3, SERPINE1, SPP1, TGFB1, TGFBR2, TLR4
Hepatic fibrosis/Hepatic stellate cell activation 6.77 4.84 × 10^−2
A2M, COL17A1, FGF2, IGF2, MMP1, SERPINE1, TGFB1, TGFBR2, TLR4
Coagulation system 6.31 1.43 × 10^−1 A2M, PLAU, PLAUR, PLG, SERPINE1
HOTAIR regulatory pathway 6.18 5.00 × 10^−2 ESR1, MMP1, MMP14, MMP3,
PIK3R3, SPP1, TGFB1, TLR4
Osteoarthritis pathway 6.15 4.09 × 10^−2 DKK1, FGF2, ITGA3, MMP1, MMP3,
SPP1, TGFB1, TGFBR2, TLR4
Growth hormone signaling 6.08 8.45 × 10^−2 A2M, IGF2, JAK2, PIK3R3,
RPS6KA3, RPS6KA5
Inhibition of matrix metalloproteases 6.06 1.28 × 10^−1 A2M, MMP1,
MMP14, MMP3, TIMP4
Glioma invasiveness signaling 6.01 8.22 × 10^−2 PIK3R3, PLAU, PLAUR,
PLG, TIMP4, VTN
Reelin signaling in neurons 5.87 5.74 × 10^−2 APP, ARHGEF3, CDK5, FRK,
FYN, ITGA3, PIK3R3
Axonal Guidance signaling 5.62 2.43 × 10^−2 ADAM9, CDK5, DPYSL5, EPHA2,
FYN, GNAI3, ITGA3, MMP1, MMP14, MMP3, PIK3R3, SEMA7A
Estrogen receptor signaling 5.62 3.05 × 10^−2 CAV1, ESR1, GNAI3, IGF2,
JAK2, MMP1, MMP14, MMP3, PGR, PIK3R3
Leukocyte extravasation signaling 5.57 4.15 × 10^−2 CLDN4, GNAI3,
ITGA3, MMP1, MMP14, MMP3, PIK3R3, TIMP4
HIF1A signaling 5.37 3.90 × 10^−2 FGF2, IGF2, MMP1, MMP14, MMP3,
PIK3R3, SERPINE1, TGFB1
Semaphorin signaling in neurons 5.12 8.33 × 10^−2 CDK5, DPYSL3, DPYSL5,
FYN, SEMA7A
Neuroinflammation signaling pathway 5.05 3.00 × 10^−2 APP, JAK2, MMP3,
PIK3R3, PLA2G4A, PLA2G5, TGFB1, TGFBR2, TLR4
Molecular mechanisms of cancer 4.87 2.50 × 10^−2 ARHGEF3, CDK5, CDK6,
FYN, GNAI3, ITGA3, JAK2, PIK3R3, TGFB1, TGFBR2
Tec kinase signaling 4.86 4.05 × 10^−2 FRK, FYN, GNAI3, ITGA3, JAK2,
PIK3R3, TLR4
p38 MAPK signaling 4.80 5.08 × 10^−2 PLA2G4A, PLA2G5, RPS6KA3, RPS6KA5,
TGFB1, TGFBR2
Colorectal cancer metastasis signaling 4.71 3.16 × 10^−2 JAK2, MMP1,
MMP14, MMP3, PIK3R3, TGFB1, TGFBR2, TLR4
Caveolar-mediated endocytosis signaling 4.70 6.85 × 10^−2 CAV1, FLNC,
FYN, INS, ITGA3
Atherosclerosis signaling 4.61 4.72 × 10^−2 MMP1, MMP3, PLA2G4A,
PLA2G5, TGFB1, TNFRSF14
ERK/MAPK signaling 4.43 3.47 × 10^−2 ESR1, FYN, ITGA3, PIK3R3, PLA2G4A,
PLA2G5, RPS6KA5
Semaphorin neuronal repulsive signaling pathway 4.39 4.32 × 10^−2 CDK5,
DPYSL3, DPYSL5, FYN, ITGA3, PIK3R3
Oncostatin M signaling 4.37 9.30 × 10^−2 JAK2, MMP1, MMP3, PLAU
Role of osteoblasts, osteoclasts and Chondrocytes in rheumatoid
arthritis 4.22 3.21 × 10^−2 DKK1, MMP1, MMP14, MMP3, PIK3R3, SPP1,
TGFB1
Sperm motility 4.16 3.14 × 10^−2 EPHA2, ERBB4, FRK, FYN, JAK2, PLA2G4A,
PLA2G5
Bladder cancer signaling 4.10 5.15 × 10^−2 FGF2, MMP1, MMP14, MMP3,
RPS6KA5
Cardiac hypertrophy signaling (enhanced) 4.07 2.01 × 10^−2 ADRA1A,
ADRA1D, FGF2, GNAI3, ITGA3, JAK2, PIK3R3, RPS6KA5, TGFB1, TGFBR2
Role of macrophages, fibroblasts and endothelial cells in rheumatoid
arthritis 4.05 2.55 × 10^−2 DKK1, FGF2, JAK2, MMP1, MMP3, PIK3R3,
TGFB1, TLR4
Chronic myeloid leukemia signaling 3.98 4.85 × 10^−2 CDK6, PIK3R3,
RBL2, TGFB1, TGFBR2
Insulin secretion signaling pathway 3.91 2.87 × 10^−2 EIF4G3, FYN, INS,
JAK2, PCSK2, PIK3R3, RPS6KA5
CNTF signaling 3.89 7.02 × 10^−2 JAK2, PIK3R3, RPS6KA3, RPS6KA5
T cell exhaustion signaling pathway 3.84 3.43 × 10^−2 BTLA, JAK2,
PIK3R3, TGFB1, TGFBR2, TNFRSF14
Regulation of the epithelial mesenchymal transition by growth factors
pathway 3.67 3.19 × 10^−2 FGF2, JAK2, MMP1, PIK3R3, TGFB1, TGFBR2
RhoGDI signaling 3.66 3.17 × 10^−2 ARHGEF3, CDH10, CDH6, ESR1, GNAI3,
ITGA3
IL-15 production 3.65 4.13 × 10^−2 EPHA2, ERBB4, FRK, FYN, JAK2
Agranulocyte adhesion and diapedesis 3.61 3.11 × 10^−2 CLDN4, GNAI3,
ITGA3, MMP1, MMP14, MMP3
Senescence pathway 3.60 2.55 × 10^−2 CDK6, PIK3R3, RBL2, RPS6KA5,
SERPINE1, TGFB1, TGFBR2
Role of MAPK signaling in inhibiting the pathogenesis of influenza 3.43
5.33 × 10^−2 PLA2G4A, PLA2G5, RPS6KA3, TLR4
mTOR signaling 3.41 2.86 × 10^−2 EIF4G3, FKBP1A, INS, PIK3R3, RPS6KA3,
RPS6KA5
Inhibition of angiogenesis by TSP1 3.32 8.82 × 10^−2 FYN, TGFB1, TGFBR2
MIF-mediated glucocorticoid regulation 3.32 8.82 × 10^−2 PLA2G4A,
PLA2G5, TLR4
Necroptosis signaling pathway 3.13 3.18 × 10^−2 FKBP1A, PLA2G4A,
PLA2G5, RBL2, TLR4
Cardiac hypertrophy signaling 3.11 2.50 × 10^−2 ADRA1A, ADRA1D, GNAI3,
PIK3R3, TGFB1, TGFBR2
MIF regulation of innate immunity 3.04 7.14 × 10^−2 PLA2G4A, PLA2G5,
TLR4
[133]Open in a new tab
Pathway,annotated pathway name; -log(p value),-log of enrichment p
value computed with Fisher's exact test; ratio,proportion of genes in
the network module that map to the respective pathway and overall
number of genes in the pathway; molecules, network module genes that
overlap with the pathway.
The genes in the survival network are most strongly enriched with genes
from the TME (p = 4.57E-10; FGF2, IDO1, IGF2, JAK2, MMP1, MMP14, MMP3,
PIK3R3, PLAU, SPP1, TGFB1; [134]Table 1). The TME is composed of a
variety of host cells, secreted factors as well as extracellular matrix
proteins that are strongly interacting with the cancer cells, and this
interaction is critical for tumor progression and influences the
survival of the patient. The TME is built of several specialized
microenvironments such as hypoxic niche, immune microenvironment, and
metabolism microenvironment ([135]Jin and Jin, 2020). Several of these
specialized TME microenvironments are enriched by the pan-cancer
survival network, for example HIF1A signaling (p = 4.27 × 10^−6; FGF2,
IGF2, MMP1, MMP14, MMP3, PIK3R3, SERPINE1, TGFB1). Hypoxia is
correlated with cancer progression. It is a potent factor that
influences the characteristics of tumor and stromal cells to support
metastasis and HIF-1 and HIF-2 genes are associated with poor patient
survival ([136]Rankin et al., 2016). Hypoxia leads to the upregulation
of VEGF and other growth factors such as FGF2 and, thus, stimulates
angiogenesis ([137]De Palma et al., 2017).
Additionally, the pan-cancer survival network holds strong
immune-related enrichment signals. For example, glucocorticoid receptor
(GR) signaling has been found enriched by the survival network (p =
2.51 × 10^−9; A2M, CAV1, ESR1, JAK2, MMP1, MMP3, PGR, PIK3R3, PLA2G4A,
PLA2G5, PLAU, RPS6KA5, SERPINE1, TGFB1, TGFBR2). Increasing GR
signaling in the TME has been recently associated with dysfunctional
CD8+ tumor-infiltrating lymphocytes (TILs), which might interfere with
the positive effect of TILs on the survival ([138]Acharya et al.,
2020). Such positive effects have been shown for example in colon
cancer ([139]Idos et al., 2020) and subtypes of breast cancer
([140]Denkert et al., 2018). Further immune cell phenotypes relate to
dysfunctions of the immune system and major mechanisms by which tumors
escape immunosurveillance. One such phenotype is T cell exhaustion (p =
1.44 × 10^−4; BTLA, JAK2, PIK3R3, TGFB1, TGFBR2, TNFRSF14). T cell
exhaustion manifests in decreased effector cytokine production and
impaired cytotoxicity and determines a chronic infectious state which
is prone to cancer immune evasion ([141]Jiang et al., 2015).
The pan-cancer survival network further enriches the inhibition of
matrix metalloproteases (A2M, MMP1, MMP14, MMP3, TIMP4). Matrix
metalloproteinases (MMPs) are among the most prominent family of
proteinases associated with tumorigenesis. They show increased
expression in cancer patients and are potent regulators of the TME and
other signaling pathways related to cell growth, inflammation and
angiogenesis ([142]Kessenbrock et al., 2010).
The strong involvement of the TME raises the question to what extent
the gene expression signals of the 103 genes are indicative of the
immune status of the patients. We explored a recent work that
classified the TCGA patients into six immune subtypes ([143]Thorsson
et al., 2018). For 7,475 of the 8,024 patients used here the immune
subtype could be assigned and principal component analysis of these
patients with respect to the gene expression matrix of the 103 genes
shows a partial distinction of these immune subtypes ([144]Figure 5).
In particular, immune subtype C5 (immunologically quiet), that
consisted mostly of brain lower-grade gliomas (LGG), could be separated
from the other groups as well as parts of the patients of subtype 2
(IFN-γ dominant). This shows that the gene expression of the pan-cancer
survival network genes holds information on the immune status of the
patients.
Figure 5.
[145]Figure 5
[146]Open in a new tab
Association of the pan-cancer survival network with immune subtypes.
Principal component analysis (PCA) of the patients that can be assigned
to an immune subtype according to ([147]Thorsson et al., 2018). The PCA
is based on the 103 module genes and patients are colored by their
assigned immune subtype. PCA is generated with the R library ggplot2
([148]Wickham, 2009).
We were further interested in potential regulators of the pan-cancer
survival network that are frequently mutated in cancer samples. We
therefore performed an enrichment analysis with the annotation sets of
“upstream regulators” in the IPA system and found a total of 47
significantly enriched upstream regulators (p value < 1 × 10^−5) that
had been identified as cancer driver or potential cancer driver genes
in a recent cancer gene annotation effort ([149]Repana et al., 2019).
The top 15 cancer-relevant upstream regulators are ([150]Table S3): JUN
(20 target genes in the network), TNF (34), IL1B (25), TP53 (33), IL1A
(13), FGF2 (15), MAP3K1 (7), EGFR (15), STAT3 (16), HRAS (16), CDH1
(7), AKT1 (11), PTEN (16), FOXO1 (13), and SMARCA4 (15). Interestingly,
several of these factors have been associated with the TME. For
example, TNF (tumor necrosis factor alpha) is a factor in the
regulation of the TME. It is a multifunctional key cytokine in
apoptosis and cell survival as well as in inflammation and immunity
([151]van Horssen et al., 2006). STAT3 (signal transducer and activator
of transcription 3) is key in regulating the anti-tumor immune response
and a therapeutic target of immune therapies in different cancer
subtypes such as celecoxib in colorectal cancer and pyrimethamine in
chronic lymphocytic leukemia ([152]Zou et al., 2020). TP53 is among the
most studied genes in cancer and has numerous roles in particular in
escaping apoptosis. Additionally, it has been shown that TP53 can act
non-cell autonomously to suppress tumorigenesis by promoting an
antitumor microenvironment, in part through secreted factors that
modulate macrophage function ([153]Lujambio et al., 2013).
Thus, we conclude that the pan-cancer survival network emphasizes the
role of the TME for the immune status and the survival prognosis of
cancer patients.
Discussion
Cancer survival is a multi-factorial problem and has important
implications for patient care and therapy choices. The purpose of this
work was to derive a prediction model for cancer survival with high
biological plausibility. We achieved this by combining an ensemble tree
approach based on gradient boosting along with network propagation with
a comprehensive protein-protein interaction network.
In the first part of this work, we applied XGBoost tree ensemble
learning to gene expression data to predict patient survival in 25
different cancer types. To overcome the observable shortcomings of
single-cohort training such as low number of training samples and
intra-cohort heterogeneity of patients, and to empower the
identification of cross-cohort features, we next applied pan-cancer
training and showed improved performance of the method, although the
number of features used is reduced by 74%. In order to reduce the
number of features further and to gain biological plausibility of the
prediction model we revealed a pan-cancer network of 103 genes with our
recent network propagation algorithm, NetCore, utilizing a
high-confidence protein-protein interaction network. The computed
network module is predominantly indicative of the TME as was shown by
enrichment analysis and correlation to the immune status of the
patients. The role of the TME for cancer progression and metastases as
well as for the response to therapies has been emphasized ([154]Quail
and Joyce, 2013), and our findings particularly highlight the
importance of the hypoxic and immune-related parts of the TME.
Additionally, we identified a decent negative correlation (R = −0.55)
of the methods performances with the age distribution of the patients.
This may also suggest that the aging TME influences cancer progression
and survival ([155]Fane and Weeraratna, 2020), and that this aging TME
is hard to resolve by machine learning approaches, while signatures
appear more predictive in younger, and presumably more intact, states.
Survival prognosis can benefit from these findings by taking into
account age-specific training and validation cohorts.
Besides the TME- and immune-related pathways, the pan-cancer survival
network enriches further signaling pathways that have high cross-cohort
relevance for survival such as mTOR-signaling. The mechanistic target
of rapamycin (mTOR) pathway regulates fundamental cell processes and
dysregulation of the pathway is relevant for the progression of cancer
([156]Saxton and Sabatini, 2017). In our network mTOR signaling (p =
2.86 × 10^−2; EIF4G3, FKBP1A, INS, PIK3R3, RPS6KA3, RPS6KA5) is
activated by PI3K/AKT in response to insulin (INS) ([157]Zoncu et al.,
2011). INS and PI3KR3 were among the top 100 genes judged by the
feature importance weights of the pan-cancer XGBoost prediction model
and were used as seed genes for the network propagation. Another very
prominent pathway in cancer is ERK/MAPK signaling (p = 3.47 × 10^−2;
ESR1, FYN, ITGA3, PIK3R3, PLA2G4A, PLA2G5, RPS6KA5), which is the core
of the signaling network involved in regulating cell growth,
development, and division and a target of many cancer therapeutics.
Interestingly, associated proteins highlight the cooperation with the
estrogen receptor 1 (ESR1). It has been reported that the cooperation
of the estrogen and MAPK signaling is due to ESR1-ERK2 cooperative
binding and enhances the proliferative effect of estrogen signaling
([158]Madak-Erdogan et al., 2011). ESR1 is a frequently mutated cancer
gene, in particular in metastatic breast cancer ([159]Razavi et al.,
2018), and was identified among the top 100 important features by the
XGBoost model.
We based our survival prediction method on gene expression data of
cancer patients from multiple cancer types. Gene expression data has
been found to be the most informative omics datatype in different
biomedical prediction tasks ([160]Costello et al., 2014; [161]Vale
Silva and Rohr, 2020). In principle, the XGBoost survival prediction
approach described in this work can easily be extended by additional
clinical and molecular data types like DNA methylation, mutation, or
copy number variation (CNV) data and such a multi-omics approach has
recently been applied to pan-cancer survival prediction using deep
learning ([162]Cheerla and Gevaert, 2019). These additional data types
could complement the information contained in gene expression data, but
on the other hand, incorporating additional data and, thus, also
additional model features, while the number of patients does not
increase, would exacerbate the curse of dimensionality ([163]Bellman,
2015; [164]Keogh and Mueen, 2017), a phenomenon regularly encountered
in machine learning that is often responsible for model overfitting.
There are different strategies to counteract this phenomenon, including
the application of dimensionality reduction techniques such as
principal component analysis (PCA) ([165]Hotelling, 1933; [166]Pearson,
1901) to the data before feeding it to the machine learning algorithm,
or the incorporation of feature selection steps into the learning
procedure. Feature selection techniques for supervised learning
problems can be divided into three categories, namely filter methods,
wrapper methods, and embedded methods ([167]Saeys et al., 2007). Filter
methods select features by considering intrinsic properties of the data
like correlation between features or the chi-square test. However, they
ignore the relation between the features and the target variable. In
contrast, wrapper methods do consider dependencies between features and
target variable by evaluating different feature subsets with the
machine learning method used for the prediction task and selecting the
subset of features that yields the best model performance. A
disadvantage of this type of feature selection is however that it is
computationally intensive, especially for datasets with many features,
because the number of potential feature subsets that need to be tested
grows exponentially with the number of features. The third type of
feature selection methods, the embedded techniques, also consider
relations between the features and the target variable by using the
selected machine learning method to evaluate features, but instead of
evaluating model performance for different sets of features, they use
built-in feature importance measures of the machine learning method and
are thus computationally far less intensive than wrapper methods. In
our XGBoost-based cancer survival prediction method, we chose to
integrate an embedded feature selection approach, which uses 4-fold
cross-validation and applies simple XGBoost models to the training data
in order to reduce the number of features used in each model
replication from 60,483 to 500.
When combining the results of the survival prediction model with
network propagation, we are able to explore the functional content of
the genes identified as important features by XGBoost. As a measure of
feature importance, we chose ‘gain’ over other possible feature
importance measures such as ‘weight’ and ‘cover’ (see
[168]https://xgboost.readthedocs.io/en/latest/python/python_api.html),
because ‘gain’ measures the relative importance of a feature for the
prediction result with respect to the prediction improvement that this
feature brings. In contrast, ‘weight’ simply counts how often a feature
is used to split the data in any of the regression trees and does not
consider where in the tree the corresponding split is used. This is
relevant because splits closer to the root of the tree are typically
more important for the prediction result than splits further down,
where only few samples are affected by the split (an example of a
regression tree from the pan-cancer XGBoost model can be found in
[169]Figure S4). The feature importance measure ‘cover’ on the other
hand does take into account where in a tree the split is made; however,
it does so by only considering the number of samples affected by the
splits this feature is used for, and does not consider how often the
feature is used or how it affects the prediction result. Thus, we
deemed ‘gain’ the most appropriate feature importance measure for the
purpose of identifying genes relevant for survival.
We showed improved survival prediction performance when using
pan-cancer instead of single-cohort training. On the one hand, it is
likely that a considerable proportion of this improvement can be
attributed to the much larger number of training patients. This major
advantage for machine learning has been shown recently in other
application domains, for example, drug sensitivity testing ([170]Lloyd
et al., 2021). In fact, when performing the pan-cancer XGBoost training
procedure (including feature selection and hyperparameter tuning) on
randomly sampled subsets of different sizes, it becomes apparent that
for many of the cancer types the performance drops with smaller
training samples ([171]Figure S5). On the other hand, however, we
observe that pan-cancer features are generalizable whereas
single-cohort features are very specific to the respective cancer type.
Interestingly, the type of features is also changing drastically. While
with single-cohort training 40.4% of the important features refer to
protein coding genes, this number increases to 56.5% in pan-cancer
training. Accordingly, the proportion of the other feature types
(lncRNAs and processed pseudogenes) drops considerably. This reflects
the high level of specificity of lncRNAs and pseudogenes.
To the best of our knowledge this is the first study that applies
XGBoost to pan-cancer survival prognosis. Pan-cancer survival
prediction has recently also been addressed by deep learning methods
like VAECox ([172]Kim et al., 2020) and MultiSurv ([173]Vale-Silva and
Rohr, 2021). VAECox pre-trains a variational autoencoder on pan-cancer
gene expression data and then transfers the learned weights to a
survival prediction neural network architecture, which is then
fine-tuned separately on each cancer cohort. MultiSurv uses multimodal
data including omics data as well as clinical information and imaging
data to predict the conditional survival probabilities for a predefined
set of follow-up time intervals. While both methods successfully
integrate pan-cancer molecular data to predict cancer survival and
report good results, biological interpretation of MultiSurv is limited
to visualizing the learned feature representations with t-distributed
stochastic neighbor embedding (t-SNE) and no gene- or pathway-level
explanations are available. For VAECox, the nodes with the highest
variance are extracted from the hidden layers of the model and the
Pearson correlations between these nodes and the expression of the gene
features are analyzed. While this approach provides biological
interpretation to some extent, there are a large number of genes whose
expression is correlated with the most variable hidden nodes at
different levels, and pathway enrichment analysis conducted directly on
these genes might not be able to fully discover the underlying survival
mechanisms.
In contrast to these methods, our approach applies network propagation
using the NetCore tool to extract biological plausibility of the
learned feature representations, which is a fairly generic step that
could be applied to any machine learning approach. NetCore performs
random walk with restart propagation on protein-protein interactions
with a node measure that is different from the node degree and thus is
more robust against the inherent node degree bias of PPIs ([174]Barel
and Herwig, 2020). This leads to an emphasis on so-called influential
nodes that are typically in the core of the network rather than its
periphery and thus is highly suited to infer common subnetworks from
otherwise highly dispersed nodes. It uses a semi-supervised module
identification step by connecting seed nodes of interest with the nodes
that become significantly up-weighted after the propagation step. Seed
nodes can be identified either by literature knowledge (supervised) or
by top experimental signals (unsupervised). Since we wanted to
interrogate the ability of the machine learning approach to generate
plausible features, we used the unsupervised approach in this study
where seed nodes were chosen by the top 100 feature importance weights
from the pan-cancer training. Alternative approaches could, for
example, incorporate seed nodes derived from existing signatures.
Network propagation has been combined with machine learning in other
application domains, for example, drug sensitivity prediction
([175]Manica et al., 2019). This approach leverages the information
conferred by protein-protein interactions a priori to machine learning
to reduce the feature space to genes associated with known drug targets
of the evaluated drugs. Additionally, reweighted random survival forest
(RRSF) ([176]Wang and Liu, 2018) integrates network propagation in
order to incorporate a priori information on gene interactions into the
learning procedure. Instead of using the network propagation results in
a feature selection step where only genes are kept that have a high
weight after network propagation, RRSF directly incorporates the
weights into model training. This is done by splitting the nodes in the
tree growing process of the random forest according to their weight
after network propagation, where topologically important genes in the
gene interaction network receive higher weights and are thus selected
with a higher probability in the node splitting procedure. In contrast
to these methods, our approach separates the two analysis components
and applies network propagation a posteriori to enhance biological
plausibility, while leaving the machine learning part unchanged in
order to explore all available features and to gain maximal prediction
accuracy.
In summary, we have introduced a machine learning approach to
pan-cancer survival prediction, which combines ensemble tree boosting
with network propagation and highlights the role of the aging TME for
survival prognosis.
Limitations of the study
Our method is based on cancer patient gene expression data provided by
the TCGA consortium. Although TCGA comprises a variety of cancer
patients from diverse cancer types, and we have implemented rigorous
train-test splits and also tested our method on cancer types withheld
in model training, we have not evaluated the method on any cancer
patient data completely unrelated to the TCGA database.
Furthermore, we considered only one molecular data modality, namely
RNA-seq gene expression data, for model training and identification of
the pan-cancer survival network. Incorporation of additional molecular
data modalities like methylation and mutations could add more
information to the model and complement the gene expression data.
However, this would come at the cost of introducing more features into
the machine learning method, which could increase the risk of
overfitting.
STAR★Methods
Key resources table
REAGENT or RESOURCE SOURCE IDENTIFIER
Deposited data
__________________________________________________________________
HTSeq-FPKM gene expression and clinical files The Cancer Genome Atlas
(TCGA) [177]https://portal.gdc.cancer.gov/
GDC data releases v22.0 and v24.0
ConsensusPathDB (CPDB) protein-protein interaction network (version 34)
[178]Herwig et al. (2016)
[179]https://github.molgen.mpg.de/barel/NetCore/blob/6860331ae9ff8725e6
66d936e9b853ef28893a00/data/CPDB_high_confidence.txt
Known and candidate cancer genes from the network of cancer genes (NCG)
[180]Repana et al. (2019) [181]http://ncg.kcl.ac.uk/index.php
NCG version 6.0
__________________________________________________________________
Software and algorithms
__________________________________________________________________
NetCore [182]Barel and Herwig (2020)
[183]https://github.molgen.mpg.de/barel/NetCore
OncoLnc [184]Anaya (2016) [185]http://www.oncolnc.org/
Single-cohort and pan-cancer XGBoost survival prediction This Paper
[186]https://github.molgen.mpg.de/thedinga/xgb_survival_network
R Implementations of Path2Surv (MKL[H] and MKL[P]), random survival
forest (RF), and survival support vector machine (SVM) [187]Dereli
et al. (2019) [188]https://github.com/mehmetgonen/path2surv
XGBoost python package [189]Chen and Guestrin (2016)
[190]https://github.com/dmlc/xgboost/tree/master/python-package
MyGene python package [191]Xin et al. (2016); [192]Wu et al. (2013)
[193]http://mygene.info
version 3.1
QIAGEN ingenuity pathway analysis (QIAGEN IPA) [194]Krämer et al., 2014
[195]https://digitalinsights.qiagen.com/products-overview/discovery-ins
ights-portfolio/analysis-and-visualization/qiagen-ipa/
[196]Open in a new tab
Resource availability
Lead contact
Further information and requests for resources should be directed to
and will be fulfilled by the lead contact, Ralf Herwig
([197]herwig@molgen.mpg.de).
Materials availability
This study did not generate new unique reagents.
Method details
Data and preprocessing
This study is based on gene expression data from The Cancer Genome
Atlas (TCGA) consortium ([198]https://www.cancer.gov/tcga), which
provides genomic and clinical data for more than 10,000 cancer patients
and 33 cancer types through the Genomic Data Commons (GDC) data portal
([199]https://portal.gdc.cancer.gov/). We used gene expression data
from primary tumors of these 33 cancer cohorts. We chose to only use
samples with sample type annotations "Primary Tumor" and "Primary Blood
Derived Cancer - Peripheral Blood" and excluded samples derived from
normal tissue and metastatic tumors because these sample types are
expected to have different gene expression characteristics than primary
tumors. For each cancer type, we downloaded all HTSeq-FPKM files, which
contain FPKM-normalized gene expression data, and the corresponding
clinical files to extract patient survival data. For the cohorts
TCGA-COAD, TCGA-LAML, TCGA-LUAD, and TCGA-LUSC, all data was obtained
from GDC data release v22.0 (released January 16, 2020) and for the 29
remaining cohorts, all files were retrieved from GDC data release v24.0
(released May 7, 2020). For training the survival prediction model and
comparing performance between methods, we only used cohorts with at
least 20 uncensored patients (25 cohorts) since splitting cohorts with
less uncensored patients into training and test data and using 20% of
patients for testing would have led to only one to four test patients
per cohort, thus limiting the informative value of predictions on these
cohorts. The remaining eight cohorts with less than 20 uncensored
patients (TCGA-CHOL, TCGA-DLBC, TCGA-KICH, TCGA-PCPG, TCGA-PRAD,
TCGA-TGCT, TCGA-THCA, and TCGA-THYM) were not part of the training
procedure and could thus be used to demonstrate the transferability of
the pan-cancer XGBoost survival prediction model to cancer types not
seen in training. We furthermore excluded all patients for whom either
gene expression measurements or key clinical data, namely age, gender,
vital status, and time to either death or censoring, were unavailable
or inconsistent. This led to a total of 8,024 patients in the 25 cancer
cohorts used for single-cohort and pan-cancer XGBoost training and an
additional 1,571 patients in the eight cohorts used for prediction
only.
The XGBoost method
XGBoost ([200]Chen and Guestrin, 2016) is a supervised machine learning
method based on tree ensembles. It implements gradient tree boosting,
which predicts the output by additive functions represented by
regression trees. More formally, the predicted output
[MATH: yˆi∈R :MATH]
for a sample
[MATH: i :MATH]
and its
[MATH: m :MATH]
features
[MATH: xi∈Rm :MATH]
is computed as:
[MATH: yˆi=φ(xi)=∑k=1Kfk(x
i),fk∈F, :MATH]
where
[MATH: F :MATH]
is the space of regression trees and
[MATH: K :MATH]
is the number of functions used to predict the output. Each of the
trees
[MATH: fk :MATH]
is associated with a set of continuous leaf weights w and maps each
sample to one of the tree’s leaf indices and its corresponding weight
w[i]. The final prediction for a sample is then made by summing up the
leaf weights assigned to the sample by the K different trees. In order
to learn the tree functions that map samples to leaves, the following
objective is minimized:
[MATH: L(φ)<
mo
linebreak="badbreak">=∑i<
mi>l(yˆi,yi)+∑k
Ω(fk)
mrow>, :MATH]
where
[MATH: Ω(f)=γT+12λ
mi>w2 :MATH]
is a regularization term that penalizes complexity of the tree
functions to reduce overfitting. Here, T is the number of leaves in the
tree,
[MATH: w :MATH]
represents the leaf weights and γ and λ are model hyperparameters. In
the first term of the model objective, l is a differentiable convex
loss function measuring the difference between the predicted value
[MATH: yˆi
:MATH]
and the true value Y[i]. For our task of survival prediction, the loss
function is selected to be the negative partial loglikelihood for Cox
proportional hazards regression (objective = survival:cox in the
XGBoost model parameters). This loss function is defined as:
[MATH:
lCox(β)=−∑i:Ei=1(xiβ−log∑j:Tj≥Tiexjβ), :MATH]
where
[MATH: Ei :MATH]
is a censoring flag with
[MATH: Ei=1 :MATH]
indicating that patient
[MATH: i :MATH]
is dead (i.e., uncensored) and
[MATH: Ei=0 :MATH]
indicating that the patient’s survival time is censored,
[MATH: Ti :MATH]
is the survival time of patient
[MATH: i :MATH]
,
[MATH: xi :MATH]
are the features of this patient and
[MATH: β :MATH]
is a vector of unknown variables that need to be learned by the model
([201]Cox, 1972). Intuitively, this loss is calculated by comparing
each of the uncensored patients to all patients that survived at least
as long as this patient in terms of predicted death risk or hazard.
Single-cohort and pan-cancer survival prediction with XGBoost
In this study, we applied XGBoost with Cox proportional hazards
regression as a learning objective to predict the survival of cancer
patients from 25 different cancer types. A graphical overview of how
survival risk scores were predicted based on gene expression data can
be found in [202]Figure S6.
To assess the performance of the single-cohort and pan-cancer
approaches, we trained and evaluated both versions of the method for
100 replications on gene expression datasets from 25 TCGA cohorts. In
each replication, the gene expression data was randomly split into 80%
training data and 20% test data in a stratified manner, such that
training and test sets always contained approximately the same
percentage of censored and uncensored patients, and in the pan-cancer
approach 80% of each cohort were assigned to the training data and 20%
to the test data. Then, the XGBoost-based learning procedure was
performed on the training data as described below and the prediction
performance was evaluated on the test data by computing the C-Index of
the predicted risk scores.
Feature selection: In the first step, a
[MATH:
patient
×gene
mrow> :MATH]
expression matrix was constructed, containing the gene expression
values from patients either belonging to the same TCGA cohort
(single-cohort approach) or to multiple different cohorts (pan-cancer
approach). Patients with missing or inconsistent data for age, gender,
or survival/censoring time were excluded from the data. Then, an
XGBoost-based feature selection procedure was performed to reduce the
60,483 genes measured in TCGA to 500 genes meaningful for survival. In
order to identify genes that are relevant to survival not only for a
specific training set composition, a stratified 4-fold cross-validation
was performed on the training data to find genes that are relevant to
survival across all folds. Stratified 4-fold cross-validation in this
context means that each of the four data splits contained approximately
equal proportions of censored and uncensored patients to ensure
comparability across folds. In each step of the cross-validation
procedure, we first removed genes with zero mean absolute deviation
(MAD) across the three data splits used as training data in this fold.
By removing genes with
[MATH: MAD=0 :MATH]
, we excluded genes without variability from further feature selection
because these genes hold no information value for patient survival.
Next, we used XGBoost to identify predictive features for each fold. To
this end, we trained 20 XGBoost models with different sets of model
hyperparameters and limited size for each fold of the cross-validation
in order to identify features that do not depend on the yet untuned
model hyperparameters but are relevant to survival on a more general
scale. Limited size in this context means that we allowed the XGBoost
models to grow a maximum of 5 to 20 regression trees with maximal tree
depths between 1 and 3 in order to avoid overfitting and allow the
models to select only the most informative features. From each of the
trained small XGBoost models, we then extracted the feature importance
scores of all genes used as features by the respective model and for
each gene calculated the average feature importance across all four
rounds of cross-validation and all 20 models per round. As a measure of
feature importance we selected ‘gain’, which is defined as the average
gain the respective feature brings to the evaluation metric across all
decision tree splits in which it is used and thus measures the relative
importance of this feature for the prediction result
([203]https://xgboost.readthedocs.io/en/latest/python/python_api.html).
In the last step, we then selected the 500 genes with the highest
average feature importance scores across all models in the 4-fold
cross-validation as features.
Hyperparameter tuning: We performed another 4-fold cross-validation on
the training data in each model replication to tune the XGBoost model
hyperparameters including maximum tree depth, number of boosting trees,
and regularization parameters. We first generated 500 sets of random
hyperparameter combinations. In each round of the cross-validation we
then trained an XGBoost model for each of these hyperparameter sets on
the three training data splits of this round. The training data splits
each contained gene expression measurements for the 500 genes
identified in the feature selection procedure. The performance of each
model was then evaluated on the remaining data split of the round by
computing the concordance index (C-Index) of the prediction. The
formula for computing the C-Index was adapted from [204]Dereli et al.
(2019) and is defined as:
[MATH: C−Index=∑i
=1N<
msub>∑j≠iΔij1((yi−yj
)(yˆj−yˆi)
>0)∑i=1N<
msub>∑j≠iΔij, :MATH]
with
[MATH:
Δij=1
:MATH]
if patients
[MATH: i :MATH]
and
[MATH: j :MATH]
are comparable, meaning that either both patients are uncensored or
patient
[MATH: i :MATH]
is uncensored and patient
[MATH: j :MATH]
is censored, but patient
[MATH: i :MATH]
has a shorter survival time than patient
[MATH: j :MATH]
. Otherwise, if patients
[MATH: i :MATH]
and
[MATH: j :MATH]
are not comparable,
[MATH:
Δij=0
:MATH]
.
[MATH: N :MATH]
is the total number of patients for which survival has been predicted,
[MATH: yi :MATH]
is the true survival time of patient
[MATH: i :MATH]
,
[MATH: yˆi
:MATH]
is the patients predicted risk or hazard score, and the indicator
function
[MATH: 1(·) :MATH]
evaluates to 1 whenever its argument is true and 0 otherwise.
Intuitively, the C-Index measures the fraction of patient pairs with
concordant survival times – meaning the patient with the shorter true
survival time is predicted to have a higher death risk – among all
comparable pairs of patients. Thus, a larger value of the C-Index
indicates a better prediction performance of the evaluated model, while
for random predictions the C-Index would be expected to assume a value
of 0.5. To select the optimal hyperparameters in each model
replication, the average C-Index across all four rounds of the 4-fold
cross-validation was computed for each of the 500 tested parameter
combinations and the combination of hyperparameters was selected that
yielded the best average C-Index. At last, we trained the final XGBoost
model on all training patients of the respective model replication,
using the gene expression values of the genes identified in the feature
selection step and the model hyperparameters selected in the
hyperparameter tuning step.
Survival prediction validation: Finally, in each of the 100 model
replications, we applied the fully trained model to the 20% left-out
patients and computed the C-Index on these patients in order to
retrieve patient survival risk scores.
The whole survival prediction procedure, including feature selection,
hyperparameter tuning and the final training of the XGBoost survival
prediction model, was implemented in Python (release 3.7) and is based
on the Python XGBoost package
([205]https://github.com/dmlc/xgboost/tree/master/python-package). The
code is available in the following GitHub repository:
[206]https://github.molgen.mpg.de/thedinga/xgb_survival_network. The
runtime for one model replication of the pan-cancer XGBoost approach –
including feature selection and hyperparameter optimization – on all
patients from 25 TCGA cancer cohorts, split into 80% training and 20%
test data, was approximately 16 hours (on a Linux machine with 64
cores).
Other survival prediction methods
We compared our single-cohort XGBoost method against the survival
prediction methods random survival forest ([207]Ishwaran et al., 2008),
survival support vector machine ([208]Khan and Zubek, 2008;
[209]Shivaswamy et al., 2007), and the multiple-kernel learning method
Path2Surv ([210]Dereli et al., 2019). Path2Surv is an extension of
survival support vector machines that– instead of a single kernel
function like in traditional support vector machines – combines
multiple kernels in a weighted sum to learn a survival function. Each
of these kernels is calculated on expression values of genes forming a
pathway or gene set and only informative kernels are considered in the
sum by assigning them non-zero weights. This way, Path2Surv uses less
gene expression features than for instance survival support vector
machines and offers a possibility to assess which pathways or gene sets
were relevant for predicting survival. In their publication, Dereli
et al. tested Path2Surv for two different gene/pathway sets, namely the
Hallmark gene sets ([211]Liberzon et al., 2015) and the Pathway
Interaction Database (PID) ([212]Schaefer et al., 2009). We adopted
these two versions of Path2Surv to compare against our XGBoost-based
survival prediction method. For random survival forest (RF), survival
support vector machine (SVM) and Path2Surv, R implementations from
Dereli et al. were adapted and the R Optimization Infrastructure (ROI)
([213]Theuβl et al., 2020) was used to solve the quadratic optimization
problems in the SVM and Path2Surv methods. While all methods were
supplied with all 60,483 gene expression features measured in TCGA and
prior to training RF and SVM only genes with zero standard deviation
were excluded, Path2Surv additionally excluded features from kernels it
considered as uninformative. In each replication of training, the
regularization parameter C was tuned for SVM and Path2Surv (range
between 1 × 10^−4 and 1 × 10^5), and the number of trees to grow was
tuned for RF (range between 500 and 2,500 trees) performing a 4-fold
cross-validation on the training data, as described in ([214]Dereli
et al., 2019). All other parameters were kept as default. For further
details we refer the reader to ([215]Dereli et al., 2019).
Entropy measurement
Entropy is a measure of uncertainty or information content
([216]Shannon, 1948) and is defined as:
[MATH: H(X)=−∑i=1nP(xi)
log2(<
mi>P(xi)), :MATH]
where
[MATH: x1,…,xn :MATH]
are the possible outcomes of a random variable
[MATH: X :MATH]
and
[MATH:
P(xi)<
/mo> :MATH]
is the probability of
[MATH: xi :MATH]
. If all outcomes
[MATH: x1,…,xn :MATH]
are equally likely, the entropy will be maximal, while the entropy is
minimal in the scenario where only one of the possible outcomes occurs
with absolute certainty. We transferred this concept to the
single-cohort feature importance weights, which were computed by
building the sum of feature importance scores over the 100 model
replications of the XGBoost model for each cohort, to measure how well
a gene generalizes as a feature over different cancer types. To this
end, we computed a probability matrix from the single-cohort feature
importance weights by dividing each weight score by the sum of scores
for this gene over all 25 cohorts and used this probability matrix to
compute the entropies of all genes. If a gene was identified as an
important feature across all 25 TCGA cohorts with similar feature
importance weights it has high entropy, while a gene that is only among
the predictive features in one cancer cohort has minimal entropy of
zero. Thus, genes with a high entropy are equally important for
predicting survival with the single-cohort XGBoost method in different
cancer types and are identified to generalize well, while genes with a
low entropy are likely to be cancer-type specific.
Inference of a pan-cancer survival network with NetCore
We used network propagation to infer a pan-cancer survival network from
the important features identified in the 100 replications of the
pan-cancer XGBoost approach. For each gene from the pan-cancer
important features, we calculated a feature importance weight as the
sum of feature importance scores over all 100 replications, where the
feature had a positive feature importance score if it was among the
important features of this replications and a score of 0 if it was not
used for survival prediction by XGBoost in this replication. Ensembl
Gene Identifiers were converted to Hugo Symbols using the MyGene Python
package (version 3.1, [217]http://mygene.info) ([218]Wu et al., 2013;
[219]Xin et al., 2016) and gene entities that did not map to a Hugo
Symbol were removed. All genes together with their feature importance
weights were then fed to NetCore ([220]Barel and Herwig, 2020). NetCore
is a random walk with restart network propagation method that uses node
coreness instead of node degree for normalization in order to address
the node degree bias of protein-protein interaction networks (PPIs).
Following the network propagation procedure, NetCore also provides a
module identification approach, where phenotype-associated network
modules are identified in a semi-supervised manner. These network
modules are sub-networks of the PPI and comprise so called seed genes –
in our case the top 100 genes from the important features that are
included in the PPI and have the highest feature importance weights –
as well as additional genes identified in the network propagation step
that function as links between seed genes. For the cancer survival
module identification, NetCore was applied on a high confidence
protein-protein interaction network from ConsensusPathDB ([221]Herwig
et al., 2016), which was obtained from the NetCore GitHub repository
([222]https://github.molgen.mpg.de/barel/NetCore) and comprises 10,586
genes and 114,341 interactions. For the random walk with restart, the
default restart probability of 0.8 was chosen.
Over-representation analysis
Over-representation analysis (ORA) measures the overlap between a set
of genes (i.e., the most important survival features) and another
pre-defined gene set called functional gene set. We used QUIAGEN's IPA
software ([223]Krämer et al., 2014) to derive ORA results with respect
to canonical pathway and upstream regulator target sets. The ORA was
judged with Fisher’s exact test.
Quantification and statistical analysis
Comparison of model performances
To compare the performance of the single-cohort XGBoost method with the
RF, SVM, and Path2Surv methods and the pan-cancer with the
single-cohort XGBoost method, we trained each method for 100
replications and in each replication computed the concordance index
(C-Index). The mean C-Indices of the compared methods were then
analyzed with the Wilcoxon unpaired rank-sum test, as implemented in
the R function stat_compare_means, which is part of the ggpubr library
([224]https://www.rdocumentation.org/packages/ggpubr). We selected a
significance threshold of 0.05 for the Wilcoxon unpaired rank-sum test
and adopted the method’s default values for symbols indicating further
significance levels (
[MATH: ns: p>0.05
:MATH]
,
[MATH: ∗:p≤0.05 :MATH]
,
[MATH: ∗∗:p≤0.01 :MATH]
,
[MATH: ∗∗∗:p≤0.001 :MATH]
,
[MATH: ∗∗∗∗:p≤0.0001 :MATH]
).
Significance thresholds
Throughout the entire manuscript, we consider a pvalue below 0.05 as
significant.
Acknowledgments