Graphical abstract

   graphic file with name fx1.jpg
   [25]Open in a new tab

Highlights

     * •
       Highly performing cancer survival prediction with XGBoost
     * •
       Pan-cancer training outperforms single-cohort training
     * •
       Combined approach consisting of machine learning and network
       propagation
     * •
       Tumor microenvironment is most strongly involved in cancer survival
       prediction
     __________________________________________________________________

   Bioinformatics; Cancer systems biology; Mathematical biosciences

Introduction

   Patient survival is the ultimate goal of cancer therapy and predicting
   patient survival from the molecular features of the individual tumor is
   an important computational task that has implications for tumor
   progression, therapy, and patient care ([26]Hoadley et al., 2018).
   Large population studies have shown that cancer survival is a
   multi-factorial problem and varies broadly between cancer types
   ([27]Allemani et al., 2015, [28]2018). This has given rise to the
   identification of numerous gene expression signatures specific for
   given cancer sub-types or treatments; however, such signatures are
   often hardly reproducible ([29]Venet et al., 2011) and depend on the
   statistical approach, the individual patient cohorts used, and even the
   normalization of the data ([30]Patil et al., 2015). In addition to gene
   expression signatures, it has recently also been reported for
   mutation-based signatures that there is a lack of biological cause and
   interpretation ([31]Alexandrov et al., 2020; [32]Kim et al., 2020).
   This is due to rather small sample sizes of subtype-specific patient
   cohorts and high inter-patient heterogeneity of the molecular features
   of the patients within the cancer subtypes.

   To overcome these issues, pan-cancer approaches have been conducted
   ([33]The ICGC/TCGA Pan-Cancer Analysis of Whole Genomes Consortium,
   2020), and these studies have indeed shown that there are common
   regulatory mechanisms and features that are shared by patients of
   different cancer sub-types, for example, on the level of signaling
   pathways ([34]Sanchez-Vega et al., 2018). Additionally, it has been
   shown that the immune status of the patients is an important factor in
   cancer survival ([35]Thorsson et al., 2018). Furthermore, cancer onset
   and progression are correlated with age, and many of the hallmarks of
   cancer are shared in aging ([36]Aunan et al., 2017). In particular,
   there is interplay between aging cells in the microenvironment and
   cancer cells that may impact tumor metastasis and therapy responses
   ([37]Fane and Weeraratna, 2020). This suggests that there might be
   common patterns for survival prognosis across different cancer types
   and only recently survival prognoses based on pan-cancer approaches
   have been developed ([38]Cheerla and Gevaert, 2019; [39]Kim et al.,
   2020).

   Computational approaches predicting survival from patients’ molecular
   data have been conducted based on different machine learning techniques
   ([40]Kourou et al., 2015). For example, [41]Ishwaran et al. (2008)
   developed a method for right-censored data that is based on random
   forests and uses ensembles of decision trees for survival prediction,
   while [42]Khan and Zubek (2008) adapted support vector machines to
   predict patient survival. More recently, [43]Li et al. (2016)
   introduced a multitask learning formulation for survival prediction,
   where the survival time is predicted by estimating the patients’
   survival status at certain time intervals over the study period. Other
   recent methods for predicting cancer patient survival include Cox-nnet
   ([44]Ching et al., 2018), which uses a neural network in combination
   with Cox regression as an output layer, and DeepSurv ([45]Katzman
   et al., 2018), a deep neural network survival prediction method that
   besides predicting survival also models the relationship between
   patient covariates and treatment effectiveness in order to make
   personalized treatment recommendations. While these approaches are
   mainly gene-centered, it has been argued that they provide limited
   understanding of mechanistic cancer survival processes. Therefore,
   machine learning approaches have been developed that incorporate
   pathway and network features directly in the learning process so that
   derived features highlight potential molecular mechanisms of cancer
   survival such as the Path2Surv method ([46]Dereli et al., 2019).

   A key computational concern of machine learning is interpretability,
   i.e., the ability to recall how the algorithm has come to its decisions
   on global and local levels. This definition has recently been expanded
   by defining interpretability as the “extraction of relevant knowledge”,
   i.e., to also generate plausible results ([47]Murdoch et al., 2019).
   Because complex machine learning methods, in particular tree ensemble
   methods and (deep) neural networks, have generally a lower level of
   interpretability than, for example, simple decision trees or linear
   regression approaches, this might necessitate additional methods that
   add plausibility in the context of the concrete data domain, such as
   the incorporation of biological networks. An approach to investigate
   biological mechanisms from molecular data is network propagation
   ([48]Cowen et al., 2017). Network propagation allows combining
   experimental data with molecular interaction networks, such that the
   topology of the network is used to propagate the data effects
   throughout the network, and by that amplifying and functionally
   interpreting the experimental data. This approach covers a wide range
   of data domains and has been applied, for example, to associate genetic
   variants with phenotypic disease traits ([49]Barel and Herwig, 2020;
   [50]Leiserson et al., 2015).

   In this work we combine state-of-the-art tree ensemble learning and
   network propagation in order to derive and characterize a pan-cancer
   molecular signature for survival prognosis that is both computationally
   and biologically interpretable. We have analyzed gene expression data
   from the TCGA cancer cohorts summarizing to 8,024 patients from 25
   different cancer cohorts ([51]Liu et al., 2018). Our machine learning
   approach is based on the XGBoost tree ensemble method ([52]Chen and
   Guestrin, 2016), which has been shown to provide good performance in
   many application domains, for instance, in diagnosing chronic kidney
   disease ([53]Ogunleye and Wang, 2020) or identifying patients with
   epilepsy based on their cerebral activity ([54]Torlay et al., 2017). We
   compare the XGBoost method against four other state-of-the-art methods
   and show highly competitive performance in single-cancer cohort
   survival prediction. From the comparison, we observed that age, in
   particular, is a confounding cross-cohort factor and that overall
   survival prediction performance of all methods decreased with the age
   of the patients. This led us to question whether there is a common
   signature of cancer survival in the patients that is independent of the
   particular cancer type. We thus set up a workflow for pan-cancer
   training and showed increased prognostic performance compared to
   single-cancer prediction. Features dominant for pan-cancer prediction
   were further quantified and network propagation was carried out with
   the trained feature importance weights in order to identify molecular
   mechanisms related to pan-cancer survival. Network propagation
   identified a subnetwork consisting of 103 genes that is strongly
   associated with the tumor microenvironment (TME) and regulated by
   important factors of the TME such as TNF, TP53, and STAT3. Furthermore,
   we can show that the immune status of the patients can partially be
   recovered by the survival network signature.

   Our study shows that gradient tree boosting can be efficiently applied
   to pan-cancer survival prediction and that the combination of machine
   learning and network propagation can identify biologically meaningful
   subnetworks that highlight the importance of the TME for patient
   survival. The combination of machine learning and subsequent
   feature-informed network propagation is fairly generic and can be
   generalized to other disease domains and biological research questions.

Results

XGBoost gradient boosting predicts survival of cancer patients

   In order to assess the ability of the XGBoost method to predict
   survival from gene expression data of cancer patients, we compared the
   single-cohort XGBoost method against three well-established survival
   prediction methods, namely random survival forest ([55]Ishwaran et al.,
   2008), survival support vector machine ([56]Khan and Zubek, 2008;
   [57]Shivaswamy et al., 2007), and the Path2Surv multiple-kernel
   learning method ([58]Dereli et al., 2019) (cf. STAR Methods).
   [59]Figure 1A shows the prediction performances of random survival
   forest (RF), survival SVM (SVM), the Path2Surv multiple-kernel learning
   method on two different pathway/gene set collections (MKL[H] on the
   Hallmark gene sets ([60]Liberzon et al., 2015) and MKL[P] on the
   Pathway Interaction Database (PID) ([61]Schaefer et al., 2009)), and
   the single-cohort XGBoost method (XGB[SINGLE]) on 25 different TCGA
   cancer cohorts ([62]Table S1). These cohorts represent a wide spectrum
   of the human organ system ([63]Parris, 2020). For ten of the cohorts
   (TCGA-BLCA, TCGA-BRCA, TCGA-CESC, TCGA-COAD, TCGA-HNSC, TCGA-LGG,
   TGCA-OV, TGCA-PAAD, TCGA-SARC, and TCGA-STAD) single-cohort XGBoost
   showed the best median performance, while RF performed best for seven
   cohorts (TCGA-ACC, TCGA-KIRC, TCGA-KIRP, TCGA-LAML, TCGA-READ,
   TCGA-UCS, and TCGA-UVM). The Path2Surv method outperformed the other
   methods in four (TCGA-LIHC, TCGA-LUAD, TCGA-LUSC, and TCGA-MESO) and
   three (TCGA-ESCA, TCGA-GBM, and TCGA-SKCM) cohorts for MKL[P] and
   MKL[H], respectively, while SVM could achieve the best median
   performance in only one cohort (TCGA-UCEC). Overall, single-cohort
   XGBoost significantly outperformed RF for 13, SVM for 17, MKL[H] for 9
   and MKL[P] for 10 out of 25 TCGA cohorts.

Figure 1.

   [64]Figure 1
   [65]Open in a new tab

   Single-cohort prediction performances

   (A) C-Index boxplots over 100 replications of model training for random
   survival forest (RF), survival SVM (SVM), the Path2Surv multiple-kernel
   learning on the Hallmark gene sets (MKL[H]) and the Pathway Interaction
   Database (MKL[P]), and the single-cohort XGBoost method (XGB[SINGLE])
   on 25 different TCGA cancer cohorts. Mean C-Indices were compared with
   Wilcoxon’s unpaired rank-sum test and significance levels are defined
   as
   [MATH: <mrow><mtext>ns</mtext><mo>:</mo><mtext> p</mtext><mo
   linebreak="goodbreak" linebreakstyle="after">></mo><mn>0.05</mn></mrow>
   :MATH]
   ,
   [MATH: <mrow><mo linebreak="goodbreak"
   linebreakstyle="after">∗</mo><mo>:</mo><mspace
   width="0.25em"></mspace><mi>p</mi><mo linebreak="goodbreak"
   linebreakstyle="after">≤</mo><mn>0.05</mn></mrow> :MATH]
   ,
   [MATH: <mrow><mo linebreak="goodbreak" linebreakstyle="after">∗</mo><mo
   linebreak="goodbreak" linebreakstyle="after">∗</mo><mo>:</mo><mspace
   width="0.25em"></mspace><mi>p</mi><mo linebreak="goodbreak"
   linebreakstyle="after">≤</mo><mn>0.01</mn></mrow> :MATH]
   ,
   [MATH: <mrow><mo linebreak="goodbreak" linebreakstyle="after">∗</mo><mo
   linebreak="goodbreak" linebreakstyle="after">∗</mo><mo
   linebreak="goodbreak" linebreakstyle="after">∗</mo><mo>:</mo><mspace
   width="0.25em"></mspace><mi>p</mi><mo linebreak="goodbreak"
   linebreakstyle="after">≤</mo><mn>0.001</mn></mrow> :MATH]
   ,
   [MATH: <mrow><mo linebreak="goodbreak" linebreakstyle="after">∗</mo><mo
   linebreak="goodbreak" linebreakstyle="after">∗</mo><mo
   linebreak="goodbreak" linebreakstyle="after">∗</mo><mo
   linebreak="goodbreak" linebreakstyle="after">∗</mo><mo>:</mo><mspace
   width="0.25em"></mspace><mi>p</mi><mo linebreak="goodbreak"
   linebreakstyle="after">≤</mo><mn>0.0001</mn></mrow> :MATH]
   .

   (B) Spearman correlations between predictions of the different methods
   for test patients from the cohorts TCGA-BLCA (left) and TCGA-UVM
   (right). Larger circles correspond to a greater correlation, blue
   indicates a positive correlation and red indicates a negative
   correlation.

   (C) Spearman correlation (R) between median C-Indices of single-cohort
   XGBoost predictions and median ages for 25 different TCGA cohorts. The
   blue line shows the linear regression fit to the data and the gray area
   indicates the 95% confidence interval.

   For two of the 25 cohorts (TCGA-BLCA and TCGA-UVM) we also assessed how
   well the survival predictions of the different methods were correlated.
   To this end, for each of the two selected cohorts, we trained and
   tested all methods on the same train-test-split of patients and
   computed the Spearman correlation between survival predictions for the
   test patients. Since Path2Surv (MKL[H] and MKL[P]) and SVM predict
   survival times, while RF and XGBoost predict risk scores, where a
   higher risk score corresponds to a shorter predicted survival,
   predictions of Path2Surv and SVM are expected to be negatively
   correlated to predictions of RF and XGBoost. [66]Figure 1B shows the
   correlation values between predictions of the different methods. As
   expected, predictions of the same metric (either survival time or risk
   score) are positively correlated, while predictions of survival time
   are negatively correlated to risk score predictions in most cases.
   However, correlation strengths differ between methods and among the
   analyzed cohorts. The survival predictions of the different methods
   seem to be more highly correlated in TCGA-UVM than in TCGA-BLCA. In
   TCGA-UVM predictions are most correlated between MKL[P] and RF and
   between XGB[SINGLE] and RF. In TCGA-BLCA, the highest Spearman
   correlation can be observed between XGB[SINGLE] and SVM, while the
   other correlations are rather weak.

   Age is the predominant indicator of tumor development and the
   probability of developing cancer increases from 3.4% in the time span
   from birth to 49 years of age (male; female 5.5%) to 32.2% in the time
   span above 70 years of age (female 26%) ([67]Siegel et al., 2018). We
   observed that survival prediction performance to a certain degree is
   dependent on the age distribution of the cohort under study. For
   example, predictions for TCGA-ACC and TCGA-LGG were exceptionally good
   (median C-Index > 0.8 with the XGBoost single-cohort method) and ages
   of patients in these cohorts were consistently low with an overall
   median of 49 and 41 years. This overall median increases to 61 and 61.5
   years with TCGA-KIRP and TCGA-UVM and, likewise, the prediction
   performance drops below 0.8 in both cohorts. All prediction methods
   performed bad for the TCGA-UCS cohort (∼0.5 prediction performance)
   with a median age of 68.5 years. In fact, the median C-Indices of the
   predictions made by the single-cohort XGBoost approach and the median
   age of the correspondent cohort are negatively correlated (Spearman R =
   −0.55, p = 0.004; [68]Figure 1C), meaning that prognosis performances
   tend to be better for cancers of younger patients. This indicates
   age-specific gene expression signatures in the datasets under study
   that are easier to be resolved by machine learning methods in younger,
   and presumably more intact, states.

Pan-cancer training improves over single-cohort training

   We next investigated the gene expression features that were used by the
   single-cohort XGBoost method to predict survival in the different TCGA
   cancer cohorts. XGBoost has several built-in types of importance values
   that are computed during the training process. For each cohort and each
   of the training replications we computed the ‘gain’ of each gene
   expression feature which refers to the improvement in the used
   evaluation metric that the particular feature gives to the different
   branches it is on (cf. [69]STAR Methods). Since the single-cohort
   XGBoost method was trained for 100 replications on each cohort with
   varying splits of training and test data, the genes selected in the
   feature selection step as well as the final feature importance values
   can vary among the replications and even more so among the different
   cancer cohorts depending on the patients in the training set. Across
   all 25 cohorts and all 100 replications per cohort, there is a total of
   46,642 different RNA molecules that were captured by the TCGA data
   analysis pipeline ([70]Gao et al., 2019) and used for prognosis by the
   XGBoost method, including protein coding genes, processed pseudogenes
   ([71]Cheetham et al., 2020), and long non-coding RNAs ([72]Statello
   et al., 2021), among others. In [73]Figure 2A, we analyzed to what
   extent these gene features were shared between the different cohorts.
   It can be observed that most genes gain feature importance for a
   smaller number of cohorts (<10) while there are only a very small
   number of genes that are among the important features for a larger
   fraction of the cohorts (>15). This accounts for cancer subtype
   differences, tissue-specificity as well as inter-patient heterogeneity.

Figure 2.

   [74]Figure 2
   [75]Open in a new tab

   Pan-cancer XGBoost training improves over single-cohort training

   (A) Histogram depicting fractions of gene features shared over
   different numbers of training cohorts (x axis: number of TCGA cohorts a
   gene feature is shared over; y axis: fraction of all 46,642 genes used
   in at least one single-cohort model).

   (B) Prediction performances of the single-cohort XGBoost method
   (XGB[SINGLE]) and the pan-cancer XGBoost method (XGB[PAN]) on 25
   different TCGA cancer cohorts, depicted as C-Index boxplots over 100
   replications of model training. Mean C-Indices were compared with the
   Wilcoxon unpaired rank-sum test and significance levels are defined as
   [MATH: <mrow><mtext>ns</mtext><mo>:</mo><mtext> p</mtext><mo
   linebreak="goodbreak" linebreakstyle="after">></mo><mn>0.05</mn></mrow>
   :MATH]
   ,
   [MATH: <mrow><mo linebreak="goodbreak"
   linebreakstyle="after">∗</mo><mo>:</mo><mspace
   width="0.25em"></mspace><mi>p</mi><mo linebreak="goodbreak"
   linebreakstyle="after">≤</mo><mn>0.05</mn></mrow> :MATH]
   ,
   [MATH: <mrow><mo linebreak="goodbreak" linebreakstyle="after">∗</mo><mo
   linebreak="goodbreak" linebreakstyle="after">∗</mo><mo>:</mo><mspace
   width="0.25em"></mspace><mi>p</mi><mo linebreak="goodbreak"
   linebreakstyle="after">≤</mo><mn>0.01</mn></mrow> :MATH]
   ,
   [MATH: <mrow><mo linebreak="goodbreak" linebreakstyle="after">∗</mo><mo
   linebreak="goodbreak" linebreakstyle="after">∗</mo><mo
   linebreak="goodbreak" linebreakstyle="after">∗</mo><mo>:</mo><mspace
   width="0.25em"></mspace><mi>p</mi><mo linebreak="goodbreak"
   linebreakstyle="after">≤</mo><mn>0.001</mn></mrow> :MATH]
   ,
   [MATH: <mrow><mo linebreak="goodbreak" linebreakstyle="after">∗</mo><mo
   linebreak="goodbreak" linebreakstyle="after">∗</mo><mo
   linebreak="goodbreak" linebreakstyle="after">∗</mo><mo
   linebreak="goodbreak" linebreakstyle="after">∗</mo><mo>:</mo><mspace
   width="0.25em"></mspace><mi>p</mi><mo linebreak="goodbreak"
   linebreakstyle="after">≤</mo><mn>0.0001</mn></mrow> :MATH]
   . See also [76]Figure S5.

   (C) Venn-diagram comparing features used for prediction in the
   single-cohort XGBoost method (pink) with those selected in the
   pan-cancer XGBoost method (blue). See also [77]Figure S1.

   (D) Prediction performances (C-Indices) of single-cohort XGBoost (pink)
   and pan-cancer XGBoost (blue) for eight new cancer cohorts (not used in
   model training). For the single-cohort method, the mean C-Index over
   all 25 models trained on different TCGA cohorts is shown.

   To overcome this heterogeneity across subtypes and patients and to
   retrieve features with more general importance, we thus trained the
   XGBoost method on a joint dataset combining the different cohorts
   (pan-cancer training; cf. [78]STAR Methods) by using pooled gene
   expression data from all 25 cancer cohorts instead of training a
   separate model for each cohort. [79]Figure 2B compares the C-Indices
   over 100 replications between the pan-cancer and single-cohort XGBoost
   versions. For 15 out of the 25 cohorts, pan-cancer training
   significantly improved the method’s prediction performance in terms of
   C-Index, and for nine cohorts performance was comparably good for both
   versions. Only in one cohort, namely TCGA-LAML, the prediction
   performance of the pan-cancer-trained XGBoost method was significantly
   worse than single-cohort-trained XGBoost. When comparing the features
   selected in the single-cohort training procedures with those selected
   in the pan-cancer version of the XGBoost method ([80]Figure 2C), it
   becomes apparent that the vast majority (98.6%) of genes selected as
   important features in at least one of the 100 replications of the
   pan-cancer training were also among the important features of the
   single-cohort approach. However, while for the single-cohort XGBoost
   approach the set of important features comprises a total of 46,642
   different genes, the pan-cancer approach results in a 74% reduction and
   identifies only 12,082 genes as important features ([81]Figure 2C).
   Additionally, the type of features that are of importance change
   ([82]Figure S1): While in the single-cohort approach 40.4% of the
   important features represent protein coding genes, 25.0% represent
   lncRNAs and 15.8% processed pseudogenes, in the pan-cancer approach we
   observe a shift toward protein coding genes (56.5%) and a reduction of
   lncRNAs (20.7%) and processed pseudogenes (11.9%). This shift might be
   driven by the tissue specificity of lncRNAs and the rather
   patient-specific nature of mRNA retrotransposition leading to less
   feature importance in the pan-cancer training.

   Another benefit of pan-cancer training is that the selected features
   are not specific to a particular cancer type and can be extrapolated to
   predict survival in yet unseen patient cohorts. We challenged this
   claim and additionally tested the prediction performance of the
   pan-cancer and single-cohort trained models on eight additional TCGA
   cohorts, which had previously been excluded because the number of
   uncensored patients in these cohorts was too small (cf. [83]STAR
   Methods; [84]Table S1). [85]Figure 2D shows the C-Indices of the
   predicted survival on these eight cohorts. Since there are 25 different
   single-cohort models (each trained on one TCGA cohort), but only one
   pan-cancer model trained on all cohorts jointly, we compared the
   prediction performance of the pan-cancer model to the mean performance
   of the 25 single-cohort models. For all eight cohorts, the C-Index of
   the pan-cancer prediction was better than the mean C-Index of the
   single-cohort models, and for seven cohorts, the C-Index of the
   pan-cancer XGBoost method was greater than 0.5, even though the entire
   cancer types of the test patients had not been seen in model training.
   This supports the hypothesis that the genes identified in the
   pan-cancer XGBoost approach indeed have higher predictive power for
   patient survival across yet unseen cancer types than the single-cohort
   approaches.

Pan-cancer feature importance weights identify genes with high biological
plausibility

   We next explored the biological plausibility of the genes identified as
   important survival features in the pan-cancer XGBoost approach. We
   first investigated the distribution of summarized feature importance
   weights over all 100 replications of the pan-cancer training procedure.
   [86]Figure 3A shows the weight distribution for the 100 genes with the
   highest weights, where Ensembl gene identifiers were converted to HUGO
   symbols with the MyGene Python package (version 3.1,
   [87]http://mygene.info) ([88]Wu et al., 2013; [89]Xin et al., 2016) and
   identifiers that did not map to a HUGO symbol are named with their
   Ensembl gene identifier. It is noticeable that a few genes, and in
   particular the gene IGF2BP3 (Insulin Like Growth Factor 2 MRNA Binding
   Protein 3), have especially high feature importance weights, indicating
   a high prognostic potential. Indeed, IGF2BP3 has been found to be
   overexpressed in various cancers and has been associated with
   metastasis and poor survival in a number of cancer types
   ([90]Mancarella and Scotlandi, 2020) including colon cancer
   ([91]Lochhead et al., 2012).

Figure 3.

   [92]Figure 3
   [93]Open in a new tab

   Pan-cancer features are biologically plausible

   (A) Weight distribution for the 100 genes with the highest feature
   importance (sums of feature importance scores over 100 model
   replications) for pan-cancer XGBoost training (gene identifiers that
   did not map to a Hugo symbol are named with their Ensembl identifiers).
   The different colors indicate gene types (blue: protein coding, orange:
   lncRNA, green: processed pseudogenes, purple: transcribed unprocessed
   pseudogene, red: gene type unknown). These genes types were obtained
   using the MyGene Python package (version 3.1, [94]http://mygene.info)
   ([95]Wu et al., 2013; [96]Xin et al., 2016). See also [97]Figure S3.

   (B) Comparison of entropy distributions between the top 100 genes with
   the highest feature importance (feature importance is measured as sums
   of feature importance scores over 100 model replications) from the
   single-cohort approach and the pan-cancer approach (mean entropies are
   indicated as dashed lines). The entropy measure (xaxis) is based on the
   genes used in the single-cohort approach (cf. [98]STAR Methods). The
   density of the entropy distribution is displayed on the yaxis.

   (C) Kaplan-Meier plots for the four most important gene features from
   pan-cancer XGBoost and the cancer type with the lowest FDR-corrected p
   value in Cox regression, respectively. As a cutoff for gene expression,
   the 50^th percentile was selected. Cox regression data and Kaplan-Meier
   plots were retrieved from OncoLnc ([99]Anaya, 2016). See also
   [100]Figure S2.

   In order to judge whether the genes identified as important features in
   the pan-cancer approach could be prognostic for several cancer types we
   applied an entropy score derived from their importance values in the
   single-cohort training models (cf. [101]STAR Methods). For each of the
   single cohorts we summed the gene feature importance scores over the
   100 different training repetitions to obtain a feature importance
   weight for each gene in each cohort. Genes have a high entropy score if
   they have similar feature importance weights in many different cohorts
   and a low entropy score if they have feature importance in only a few
   cohorts (range [0, 4.64]). [102]Figure 3B compares the distributions of
   the entropy scores of the top 100 genes from the single-cohort approach
   and the top 100 genes from the pan-cancer approach. The top 100
   pan-cancer genes had a significantly higher entropy than the top 100
   genes from the single-cohort approach (p = 1.145 × 10^−14 in a
   one-sided Wilcoxon unpaired rank-sum test), meaning the pan-cancer
   method generalizes better over different cancer types than the
   single-cohort method.

   To validate the association of these pan-cancer genes with survival, we
   analyzed the four top-scoring protein-coding genes (IGF2BP3, IL1RAP,
   PIK3R3, and CISH) in more detail by querying OncoLnc ([103]Anaya,
   2016), a tool that provides Cox regression analyses as well as
   Kaplan-Meier survival plots on TCGA gene expression data for different
   cancer types. [104]Figure 3C displays Kaplan-Meier plots of these four
   genes for the cancer type with the lowest FDR-corrected p value in the
   Cox regression, respectively. For example, IGF2BP3 has the highest
   predictive performance in survival for LGG (FDR = 3.59 × 10^−9 in Cox
   regression), while IL1RAP, PIK3R3, and CISH are most predictive for
   KIRP, KIRC, and LGG, respectively. Additionally, IGF2BP3, IL1RAP,
   PIK3R3, and CISH have significant survival performance (FDR <0.05 in
   Cox regression) for four (KIRP, KIRC, LUAD and PAAD), two (LGG and
   PAAD), two (LGG and HNSC), and four (LUAD, LIHC, KIRP, and KIRC) other
   cohorts, respectively. Kaplan-Meier plots for these cohorts can be
   found in the [105]Supplemental information ([106]Figure S2). For
   generating the Kaplan-Meier plots in [107]Figure 3C, the patients were
   first split into two groups by expression of the respective gene (i.e.,
   IGF2BP3, IL1RAP, PIK3R3, or CISH). As a cutoff, we selected the 50^th
   percentile to ensure that both groups had equal sizes and all patients
   were included in the Kaplan-Meier analysis. To measure whether there is
   a significant difference in survival between the low-expression and the
   high-expression groups, OncoLnc uses the logrank test. For all but one
   (CISH in LIHC) gene-cohort pairs that had a significant FDR corrected p
   value in the Cox regression, the logrank p values were also significant
   (p < 0.05), indicating that these genes are indeed predictive for
   survival.

Inferring a pan-cancer survival network

   Even though the pan-cancer approach reduces substantially the number of
   genes used for survival prediction (74%) as compared to the
   single-cohort approach, with the most important features having a high
   biological plausibility, there are still 12,082 features that are
   selected as potentially important in at least one of the 100
   replications of the pan-cancer training procedure. In each replication,
   there are a maximum of 500 genes selected for survival prediction (cf.
   [108]STAR Methods), meaning that among the total 12,082 genes, there
   are many genes that are only selected in a small number of replications
   and the selected genes thus still highly depend on the composition of
   the training set. Furthermore, this still large number of genes makes
   it difficult to infer mechanistic information and lacks biological
   focus. On the other hand, these features might still be relevant for a
   subset of patients, and simply neglecting the features with low feature
   importance weights with some threshold might be somewhat arbitrary. The
   resulting distribution of feature importance weights resembles a
   “long-tail” distribution often visible with cancer-associated SNPs
   ([109]Armenia et al., 2018) ([110]Figure S3).

   It has been argued that plausibility of machine learning methods could
   be improved by incorporating prior knowledge on biological networks in
   the analysis workflow ([111]Camacho et al., 2018), and we challenged
   this claim by applying our recently developed network propagation
   method, NetCore, to the feature importance weights of the pan-cancer
   training approach (cf. [112]STAR Methods). Network propagation has
   emerged as a useful tool for inferring mechanistic information, in
   particular from high-throughput data ([113]Cowen et al., 2017). In
   network propagation genes are organized in a graph and edges between
   genes are set through biological a priori knowledge. Typically,
   protein-protein interaction networks are used as a scaffold for the
   interaction context ([114]Huang et al., 2018). For this study, we used
   the large integrated protein-protein interaction network from the
   ConsensusPathDB comprising 10,586 nodes and 114,341 interactions
   ([115]Herwig et al., 2016). Gene nodes were initially weighted with the
   sums of the feature importance scores over 100 replications of the
   pan-cancer training and the NetCore propagation method was used to
   spread these weights over the network until a steady-state condition
   was reached ([116]Barel and Herwig, 2020). Then we inferred a
   pan-cancer survival network by applying NetCore’s module identification
   approach to the network propagation results with the top 100 genes that
   are contained in the protein-protein interaction network and have the
   largest feature importance sums as seed genes. NetCore extends these
   seed genes by genes that were assigned a significantly high weight
   after the network propagation step to form connected subgraphs, which
   are called modules (cf. [117]STAR Methods). [118]Figure 4A shows the
   largest network module identified by NetCore. Overall, 13 network
   modules containing between 2 and 79 genes were identified. These
   modules comprised 103 genes in total, 76 genes with high pan-cancer
   feature importance and 27 additionally inferred genes added in the
   network propagation steps. The genes and their initial as well as
   propagated weights are listed in [119]Table S2.

Figure 4.

   [120]Figure 4
   [121]Open in a new tab

   Pan-cancer survival network

   (A) Largest network module identified by NetCore ([122]Barel and
   Herwig, 2020) network propagation and module identification based on
   pan-cancer important features. Orange nodes correspond to seed genes,
   while genes that were inferred during network propagation are colored
   in gray.

   (B) Feature importance of the 103 module genes in single-cohort
   training (100 replications). Top: Sum of feature importance scores of
   the module genes per cohort. Bottom: Number of genes (of the 103 module
   genes) per cohort that are among the important features in
   single-cohort training (feature importance > 0). See also [123]Tables
   S2 and [124]S3.

   The 103 genes are informative on all different cancer subtypes with an
   average of 41.48 genes gaining feature importance for the cohorts in
   single-cohort training. TCGA-LUSC (59 genes) shares the largest number
   of genes followed by TCGA-HNSC (54) and TCGA-OV (54) ([125]Figure 4B).
   On the level of feature importance weights, however, the difference is
   more drastic. While for many cohorts the contributions are high, for
   TCGA-MESO, TCGA-UCS, TCGA-SKCM, TCGA-READ and TCGA-UVM, the feature
   importance weights of the 103 genes are rather low. This might in part
   due to the fact that these cohorts were in the lower range of patient
   numbers that contributed to pan-cancer training (<200; [126]Table S1).

   The pan-cancer survival network has an overlap of 27 genes with
   previously annotated cancer genes ([127]Repana et al., 2019).
   Interestingly, the majority of these cancer genes (16 out of 27) were
   inferred by network propagation, meaning that they had very little to
   no feature importance with respect to survival prognosis but were
   re-ranked significantly because of their high connectivity in the
   protein-protein interaction network ([128]Table S2). The largest
   re-ranked pan-cancer feature importance is attributed to the gene
   PIK3R3 (Phosphoinositide-3-Kinase Regulatory Subunit 3) ([129]Table
   S2). PIK3R3 is an enzyme that participates in multiple cancer-related
   signaling pathways, most importantly PI3K-AKT-mTOR signaling. It can
   act as an oncogene in colorectal cancer and high expression of PIK3R3
   inhibits cell senescence and promotes cell proliferation ([130]Chen
   et al., 2020).

Pan-cancer survival network is strongly associated with the tumor
microenvironment

   In order to characterize the genes of the pan-cancer survival network
   we performed over-representation analysis and interrogated canonical
   pathways and upstream regulators using QIAGEN Ingenuity Pathway
   Analysis (IPA) ([131]Krämer et al., 2014) ([132]Table 1).

Table 1.

   Over-represented pathways (p < 0.001) computed with QIAGEN Ingenuity
   Pathway Analysis (IPA)
   Ingenuity canonical pathway -log(p value) Ratio Molecules
   Tumor microenvironment pathway 9.34 6.25 × 10^−2 FGF2, IDO1, IGF2,
   JAK2, MMP1, MMP14, MMP3, PIK3R3, PLAU, SPP1, TGFB1
   Glucocorticoid receptor signaling 8.60 3.25 × 10^−2 A2M, CAV1, ESR1,
   JAK2, MMP1, MMP3, PGR, PIK3R3, PLA2G4A, PLA2G5, PLAU, RPS6KA5,
   SERPINE1, TGFB1, TGFBR2
   Role of tissue factor in cancer 8.55 7.76 × 10^−2 FRK, FYN, ITGA3,
   JAK2, MMP1, PIK3R3, PLAUR, RPS6KA3, RPS6KA5
   Hepatic fibrosis signaling pathway 6.85 3.17 × 10^−2 FGF2, GNAI3, INS,
   ITGA3, JAK2, MMP1, PIK3R3, SERPINE1, SPP1, TGFB1, TGFBR2, TLR4
   Hepatic fibrosis/Hepatic stellate cell activation 6.77 4.84 × 10^−2
   A2M, COL17A1, FGF2, IGF2, MMP1, SERPINE1, TGFB1, TGFBR2, TLR4
   Coagulation system 6.31 1.43 × 10^−1 A2M, PLAU, PLAUR, PLG, SERPINE1
   HOTAIR regulatory pathway 6.18 5.00 × 10^−2 ESR1, MMP1, MMP14, MMP3,
   PIK3R3, SPP1, TGFB1, TLR4
   Osteoarthritis pathway 6.15 4.09 × 10^−2 DKK1, FGF2, ITGA3, MMP1, MMP3,
   SPP1, TGFB1, TGFBR2, TLR4
   Growth hormone signaling 6.08 8.45 × 10^−2 A2M, IGF2, JAK2, PIK3R3,
   RPS6KA3, RPS6KA5
   Inhibition of matrix metalloproteases 6.06 1.28 × 10^−1 A2M, MMP1,
   MMP14, MMP3, TIMP4
   Glioma invasiveness signaling 6.01 8.22 × 10^−2 PIK3R3, PLAU, PLAUR,
   PLG, TIMP4, VTN
   Reelin signaling in neurons 5.87 5.74 × 10^−2 APP, ARHGEF3, CDK5, FRK,
   FYN, ITGA3, PIK3R3
   Axonal Guidance signaling 5.62 2.43 × 10^−2 ADAM9, CDK5, DPYSL5, EPHA2,
   FYN, GNAI3, ITGA3, MMP1, MMP14, MMP3, PIK3R3, SEMA7A
   Estrogen receptor signaling 5.62 3.05 × 10^−2 CAV1, ESR1, GNAI3, IGF2,
   JAK2, MMP1, MMP14, MMP3, PGR, PIK3R3
   Leukocyte extravasation signaling 5.57 4.15 × 10^−2 CLDN4, GNAI3,
   ITGA3, MMP1, MMP14, MMP3, PIK3R3, TIMP4
   HIF1A signaling 5.37 3.90 × 10^−2 FGF2, IGF2, MMP1, MMP14, MMP3,
   PIK3R3, SERPINE1, TGFB1
   Semaphorin signaling in neurons 5.12 8.33 × 10^−2 CDK5, DPYSL3, DPYSL5,
   FYN, SEMA7A
   Neuroinflammation signaling pathway 5.05 3.00 × 10^−2 APP, JAK2, MMP3,
   PIK3R3, PLA2G4A, PLA2G5, TGFB1, TGFBR2, TLR4
   Molecular mechanisms of cancer 4.87 2.50 × 10^−2 ARHGEF3, CDK5, CDK6,
   FYN, GNAI3, ITGA3, JAK2, PIK3R3, TGFB1, TGFBR2
   Tec kinase signaling 4.86 4.05 × 10^−2 FRK, FYN, GNAI3, ITGA3, JAK2,
   PIK3R3, TLR4
   p38 MAPK signaling 4.80 5.08 × 10^−2 PLA2G4A, PLA2G5, RPS6KA3, RPS6KA5,
   TGFB1, TGFBR2
   Colorectal cancer metastasis signaling 4.71 3.16 × 10^−2 JAK2, MMP1,
   MMP14, MMP3, PIK3R3, TGFB1, TGFBR2, TLR4
   Caveolar-mediated endocytosis signaling 4.70 6.85 × 10^−2 CAV1, FLNC,
   FYN, INS, ITGA3
   Atherosclerosis signaling 4.61 4.72 × 10^−2 MMP1, MMP3, PLA2G4A,
   PLA2G5, TGFB1, TNFRSF14
   ERK/MAPK signaling 4.43 3.47 × 10^−2 ESR1, FYN, ITGA3, PIK3R3, PLA2G4A,
   PLA2G5, RPS6KA5
   Semaphorin neuronal repulsive signaling pathway 4.39 4.32 × 10^−2 CDK5,
   DPYSL3, DPYSL5, FYN, ITGA3, PIK3R3
   Oncostatin M signaling 4.37 9.30 × 10^−2 JAK2, MMP1, MMP3, PLAU
   Role of osteoblasts, osteoclasts and Chondrocytes in rheumatoid
   arthritis 4.22 3.21 × 10^−2 DKK1, MMP1, MMP14, MMP3, PIK3R3, SPP1,
   TGFB1
   Sperm motility 4.16 3.14 × 10^−2 EPHA2, ERBB4, FRK, FYN, JAK2, PLA2G4A,
   PLA2G5
   Bladder cancer signaling 4.10 5.15 × 10^−2 FGF2, MMP1, MMP14, MMP3,
   RPS6KA5
   Cardiac hypertrophy signaling (enhanced) 4.07 2.01 × 10^−2 ADRA1A,
   ADRA1D, FGF2, GNAI3, ITGA3, JAK2, PIK3R3, RPS6KA5, TGFB1, TGFBR2
   Role of macrophages, fibroblasts and endothelial cells in rheumatoid
   arthritis 4.05 2.55 × 10^−2 DKK1, FGF2, JAK2, MMP1, MMP3, PIK3R3,
   TGFB1, TLR4
   Chronic myeloid leukemia signaling 3.98 4.85 × 10^−2 CDK6, PIK3R3,
   RBL2, TGFB1, TGFBR2
   Insulin secretion signaling pathway 3.91 2.87 × 10^−2 EIF4G3, FYN, INS,
   JAK2, PCSK2, PIK3R3, RPS6KA5
   CNTF signaling 3.89 7.02 × 10^−2 JAK2, PIK3R3, RPS6KA3, RPS6KA5
   T cell exhaustion signaling pathway 3.84 3.43 × 10^−2 BTLA, JAK2,
   PIK3R3, TGFB1, TGFBR2, TNFRSF14
   Regulation of the epithelial mesenchymal transition by growth factors
   pathway 3.67 3.19 × 10^−2 FGF2, JAK2, MMP1, PIK3R3, TGFB1, TGFBR2
   RhoGDI signaling 3.66 3.17 × 10^−2 ARHGEF3, CDH10, CDH6, ESR1, GNAI3,
   ITGA3
   IL-15 production 3.65 4.13 × 10^−2 EPHA2, ERBB4, FRK, FYN, JAK2
   Agranulocyte adhesion and diapedesis 3.61 3.11 × 10^−2 CLDN4, GNAI3,
   ITGA3, MMP1, MMP14, MMP3
   Senescence pathway 3.60 2.55 × 10^−2 CDK6, PIK3R3, RBL2, RPS6KA5,
   SERPINE1, TGFB1, TGFBR2
   Role of MAPK signaling in inhibiting the pathogenesis of influenza 3.43
   5.33 × 10^−2 PLA2G4A, PLA2G5, RPS6KA3, TLR4
   mTOR signaling 3.41 2.86 × 10^−2 EIF4G3, FKBP1A, INS, PIK3R3, RPS6KA3,
   RPS6KA5
   Inhibition of angiogenesis by TSP1 3.32 8.82 × 10^−2 FYN, TGFB1, TGFBR2
   MIF-mediated glucocorticoid regulation 3.32 8.82 × 10^−2 PLA2G4A,
   PLA2G5, TLR4
   Necroptosis signaling pathway 3.13 3.18 × 10^−2 FKBP1A, PLA2G4A,
   PLA2G5, RBL2, TLR4
   Cardiac hypertrophy signaling 3.11 2.50 × 10^−2 ADRA1A, ADRA1D, GNAI3,
   PIK3R3, TGFB1, TGFBR2
   MIF regulation of innate immunity 3.04 7.14 × 10^−2 PLA2G4A, PLA2G5,
   TLR4
   [133]Open in a new tab

   Pathway,annotated pathway name; -log(p value),-log of enrichment p
   value computed with Fisher's exact test; ratio,proportion of genes in
   the network module that map to the respective pathway and overall
   number of genes in the pathway; molecules, network module genes that
   overlap with the pathway.

   The genes in the survival network are most strongly enriched with genes
   from the TME (p = 4.57E-10; FGF2, IDO1, IGF2, JAK2, MMP1, MMP14, MMP3,
   PIK3R3, PLAU, SPP1, TGFB1; [134]Table 1). The TME is composed of a
   variety of host cells, secreted factors as well as extracellular matrix
   proteins that are strongly interacting with the cancer cells, and this
   interaction is critical for tumor progression and influences the
   survival of the patient. The TME is built of several specialized
   microenvironments such as hypoxic niche, immune microenvironment, and
   metabolism microenvironment ([135]Jin and Jin, 2020). Several of these
   specialized TME microenvironments are enriched by the pan-cancer
   survival network, for example HIF1A signaling (p = 4.27 × 10^−6; FGF2,
   IGF2, MMP1, MMP14, MMP3, PIK3R3, SERPINE1, TGFB1). Hypoxia is
   correlated with cancer progression. It is a potent factor that
   influences the characteristics of tumor and stromal cells to support
   metastasis and HIF-1 and HIF-2 genes are associated with poor patient
   survival ([136]Rankin et al., 2016). Hypoxia leads to the upregulation
   of VEGF and other growth factors such as FGF2 and, thus, stimulates
   angiogenesis ([137]De Palma et al., 2017).

   Additionally, the pan-cancer survival network holds strong
   immune-related enrichment signals. For example, glucocorticoid receptor
   (GR) signaling has been found enriched by the survival network (p =
   2.51 × 10^−9; A2M, CAV1, ESR1, JAK2, MMP1, MMP3, PGR, PIK3R3, PLA2G4A,
   PLA2G5, PLAU, RPS6KA5, SERPINE1, TGFB1, TGFBR2). Increasing GR
   signaling in the TME has been recently associated with dysfunctional
   CD8+ tumor-infiltrating lymphocytes (TILs), which might interfere with
   the positive effect of TILs on the survival ([138]Acharya et al.,
   2020). Such positive effects have been shown for example in colon
   cancer ([139]Idos et al., 2020) and subtypes of breast cancer
   ([140]Denkert et al., 2018). Further immune cell phenotypes relate to
   dysfunctions of the immune system and major mechanisms by which tumors
   escape immunosurveillance. One such phenotype is T cell exhaustion (p =
   1.44 × 10^−4; BTLA, JAK2, PIK3R3, TGFB1, TGFBR2, TNFRSF14). T cell
   exhaustion manifests in decreased effector cytokine production and
   impaired cytotoxicity and determines a chronic infectious state which
   is prone to cancer immune evasion ([141]Jiang et al., 2015).

   The pan-cancer survival network further enriches the inhibition of
   matrix metalloproteases (A2M, MMP1, MMP14, MMP3, TIMP4). Matrix
   metalloproteinases (MMPs) are among the most prominent family of
   proteinases associated with tumorigenesis. They show increased
   expression in cancer patients and are potent regulators of the TME and
   other signaling pathways related to cell growth, inflammation and
   angiogenesis ([142]Kessenbrock et al., 2010).

   The strong involvement of the TME raises the question to what extent
   the gene expression signals of the 103 genes are indicative of the
   immune status of the patients. We explored a recent work that
   classified the TCGA patients into six immune subtypes ([143]Thorsson
   et al., 2018). For 7,475 of the 8,024 patients used here the immune
   subtype could be assigned and principal component analysis of these
   patients with respect to the gene expression matrix of the 103 genes
   shows a partial distinction of these immune subtypes ([144]Figure 5).
   In particular, immune subtype C5 (immunologically quiet), that
   consisted mostly of brain lower-grade gliomas (LGG), could be separated
   from the other groups as well as parts of the patients of subtype 2
   (IFN-γ dominant). This shows that the gene expression of the pan-cancer
   survival network genes holds information on the immune status of the
   patients.

Figure 5.

   [145]Figure 5
   [146]Open in a new tab

   Association of the pan-cancer survival network with immune subtypes.
   Principal component analysis (PCA) of the patients that can be assigned
   to an immune subtype according to ([147]Thorsson et al., 2018). The PCA
   is based on the 103 module genes and patients are colored by their
   assigned immune subtype. PCA is generated with the R library ggplot2
   ([148]Wickham, 2009).

   We were further interested in potential regulators of the pan-cancer
   survival network that are frequently mutated in cancer samples. We
   therefore performed an enrichment analysis with the annotation sets of
   “upstream regulators” in the IPA system and found a total of 47
   significantly enriched upstream regulators (p value < 1 × 10^−5) that
   had been identified as cancer driver or potential cancer driver genes
   in a recent cancer gene annotation effort ([149]Repana et al., 2019).
   The top 15 cancer-relevant upstream regulators are ([150]Table S3): JUN
   (20 target genes in the network), TNF (34), IL1B (25), TP53 (33), IL1A
   (13), FGF2 (15), MAP3K1 (7), EGFR (15), STAT3 (16), HRAS (16), CDH1
   (7), AKT1 (11), PTEN (16), FOXO1 (13), and SMARCA4 (15). Interestingly,
   several of these factors have been associated with the TME. For
   example, TNF (tumor necrosis factor alpha) is a factor in the
   regulation of the TME. It is a multifunctional key cytokine in
   apoptosis and cell survival as well as in inflammation and immunity
   ([151]van Horssen et al., 2006). STAT3 (signal transducer and activator
   of transcription 3) is key in regulating the anti-tumor immune response
   and a therapeutic target of immune therapies in different cancer
   subtypes such as celecoxib in colorectal cancer and pyrimethamine in
   chronic lymphocytic leukemia ([152]Zou et al., 2020). TP53 is among the
   most studied genes in cancer and has numerous roles in particular in
   escaping apoptosis. Additionally, it has been shown that TP53 can act
   non-cell autonomously to suppress tumorigenesis by promoting an
   antitumor microenvironment, in part through secreted factors that
   modulate macrophage function ([153]Lujambio et al., 2013).

   Thus, we conclude that the pan-cancer survival network emphasizes the
   role of the TME for the immune status and the survival prognosis of
   cancer patients.

Discussion

   Cancer survival is a multi-factorial problem and has important
   implications for patient care and therapy choices. The purpose of this
   work was to derive a prediction model for cancer survival with high
   biological plausibility. We achieved this by combining an ensemble tree
   approach based on gradient boosting along with network propagation with
   a comprehensive protein-protein interaction network.

   In the first part of this work, we applied XGBoost tree ensemble
   learning to gene expression data to predict patient survival in 25
   different cancer types. To overcome the observable shortcomings of
   single-cohort training such as low number of training samples and
   intra-cohort heterogeneity of patients, and to empower the
   identification of cross-cohort features, we next applied pan-cancer
   training and showed improved performance of the method, although the
   number of features used is reduced by 74%. In order to reduce the
   number of features further and to gain biological plausibility of the
   prediction model we revealed a pan-cancer network of 103 genes with our
   recent network propagation algorithm, NetCore, utilizing a
   high-confidence protein-protein interaction network. The computed
   network module is predominantly indicative of the TME as was shown by
   enrichment analysis and correlation to the immune status of the
   patients. The role of the TME for cancer progression and metastases as
   well as for the response to therapies has been emphasized ([154]Quail
   and Joyce, 2013), and our findings particularly highlight the
   importance of the hypoxic and immune-related parts of the TME.
   Additionally, we identified a decent negative correlation (R = −0.55)
   of the methods performances with the age distribution of the patients.
   This may also suggest that the aging TME influences cancer progression
   and survival ([155]Fane and Weeraratna, 2020), and that this aging TME
   is hard to resolve by machine learning approaches, while signatures
   appear more predictive in younger, and presumably more intact, states.
   Survival prognosis can benefit from these findings by taking into
   account age-specific training and validation cohorts.

   Besides the TME- and immune-related pathways, the pan-cancer survival
   network enriches further signaling pathways that have high cross-cohort
   relevance for survival such as mTOR-signaling. The mechanistic target
   of rapamycin (mTOR) pathway regulates fundamental cell processes and
   dysregulation of the pathway is relevant for the progression of cancer
   ([156]Saxton and Sabatini, 2017). In our network mTOR signaling (p =
   2.86 × 10^−2; EIF4G3, FKBP1A, INS, PIK3R3, RPS6KA3, RPS6KA5) is
   activated by PI3K/AKT in response to insulin (INS) ([157]Zoncu et al.,
   2011). INS and PI3KR3 were among the top 100 genes judged by the
   feature importance weights of the pan-cancer XGBoost prediction model
   and were used as seed genes for the network propagation. Another very
   prominent pathway in cancer is ERK/MAPK signaling (p = 3.47 × 10^−2;
   ESR1, FYN, ITGA3, PIK3R3, PLA2G4A, PLA2G5, RPS6KA5), which is the core
   of the signaling network involved in regulating cell growth,
   development, and division and a target of many cancer therapeutics.
   Interestingly, associated proteins highlight the cooperation with the
   estrogen receptor 1 (ESR1). It has been reported that the cooperation
   of the estrogen and MAPK signaling is due to ESR1-ERK2 cooperative
   binding and enhances the proliferative effect of estrogen signaling
   ([158]Madak-Erdogan et al., 2011). ESR1 is a frequently mutated cancer
   gene, in particular in metastatic breast cancer ([159]Razavi et al.,
   2018), and was identified among the top 100 important features by the
   XGBoost model.

   We based our survival prediction method on gene expression data of
   cancer patients from multiple cancer types. Gene expression data has
   been found to be the most informative omics datatype in different
   biomedical prediction tasks ([160]Costello et al., 2014; [161]Vale
   Silva and Rohr, 2020). In principle, the XGBoost survival prediction
   approach described in this work can easily be extended by additional
   clinical and molecular data types like DNA methylation, mutation, or
   copy number variation (CNV) data and such a multi-omics approach has
   recently been applied to pan-cancer survival prediction using deep
   learning ([162]Cheerla and Gevaert, 2019). These additional data types
   could complement the information contained in gene expression data, but
   on the other hand, incorporating additional data and, thus, also
   additional model features, while the number of patients does not
   increase, would exacerbate the curse of dimensionality ([163]Bellman,
   2015; [164]Keogh and Mueen, 2017), a phenomenon regularly encountered
   in machine learning that is often responsible for model overfitting.

   There are different strategies to counteract this phenomenon, including
   the application of dimensionality reduction techniques such as
   principal component analysis (PCA) ([165]Hotelling, 1933; [166]Pearson,
   1901) to the data before feeding it to the machine learning algorithm,
   or the incorporation of feature selection steps into the learning
   procedure. Feature selection techniques for supervised learning
   problems can be divided into three categories, namely filter methods,
   wrapper methods, and embedded methods ([167]Saeys et al., 2007). Filter
   methods select features by considering intrinsic properties of the data
   like correlation between features or the chi-square test. However, they
   ignore the relation between the features and the target variable. In
   contrast, wrapper methods do consider dependencies between features and
   target variable by evaluating different feature subsets with the
   machine learning method used for the prediction task and selecting the
   subset of features that yields the best model performance. A
   disadvantage of this type of feature selection is however that it is
   computationally intensive, especially for datasets with many features,
   because the number of potential feature subsets that need to be tested
   grows exponentially with the number of features. The third type of
   feature selection methods, the embedded techniques, also consider
   relations between the features and the target variable by using the
   selected machine learning method to evaluate features, but instead of
   evaluating model performance for different sets of features, they use
   built-in feature importance measures of the machine learning method and
   are thus computationally far less intensive than wrapper methods. In
   our XGBoost-based cancer survival prediction method, we chose to
   integrate an embedded feature selection approach, which uses 4-fold
   cross-validation and applies simple XGBoost models to the training data
   in order to reduce the number of features used in each model
   replication from 60,483 to 500.

   When combining the results of the survival prediction model with
   network propagation, we are able to explore the functional content of
   the genes identified as important features by XGBoost. As a measure of
   feature importance, we chose ‘gain’ over other possible feature
   importance measures such as ‘weight’ and ‘cover’ (see
   [168]https://xgboost.readthedocs.io/en/latest/python/python_api.html),
   because ‘gain’ measures the relative importance of a feature for the
   prediction result with respect to the prediction improvement that this
   feature brings. In contrast, ‘weight’ simply counts how often a feature
   is used to split the data in any of the regression trees and does not
   consider where in the tree the corresponding split is used. This is
   relevant because splits closer to the root of the tree are typically
   more important for the prediction result than splits further down,
   where only few samples are affected by the split (an example of a
   regression tree from the pan-cancer XGBoost model can be found in
   [169]Figure S4). The feature importance measure ‘cover’ on the other
   hand does take into account where in a tree the split is made; however,
   it does so by only considering the number of samples affected by the
   splits this feature is used for, and does not consider how often the
   feature is used or how it affects the prediction result. Thus, we
   deemed ‘gain’ the most appropriate feature importance measure for the
   purpose of identifying genes relevant for survival.

   We showed improved survival prediction performance when using
   pan-cancer instead of single-cohort training. On the one hand, it is
   likely that a considerable proportion of this improvement can be
   attributed to the much larger number of training patients. This major
   advantage for machine learning has been shown recently in other
   application domains, for example, drug sensitivity testing ([170]Lloyd
   et al., 2021). In fact, when performing the pan-cancer XGBoost training
   procedure (including feature selection and hyperparameter tuning) on
   randomly sampled subsets of different sizes, it becomes apparent that
   for many of the cancer types the performance drops with smaller
   training samples ([171]Figure S5). On the other hand, however, we
   observe that pan-cancer features are generalizable whereas
   single-cohort features are very specific to the respective cancer type.
   Interestingly, the type of features is also changing drastically. While
   with single-cohort training 40.4% of the important features refer to
   protein coding genes, this number increases to 56.5% in pan-cancer
   training. Accordingly, the proportion of the other feature types
   (lncRNAs and processed pseudogenes) drops considerably. This reflects
   the high level of specificity of lncRNAs and pseudogenes.

   To the best of our knowledge this is the first study that applies
   XGBoost to pan-cancer survival prognosis. Pan-cancer survival
   prediction has recently also been addressed by deep learning methods
   like VAECox ([172]Kim et al., 2020) and MultiSurv ([173]Vale-Silva and
   Rohr, 2021). VAECox pre-trains a variational autoencoder on pan-cancer
   gene expression data and then transfers the learned weights to a
   survival prediction neural network architecture, which is then
   fine-tuned separately on each cancer cohort. MultiSurv uses multimodal
   data including omics data as well as clinical information and imaging
   data to predict the conditional survival probabilities for a predefined
   set of follow-up time intervals. While both methods successfully
   integrate pan-cancer molecular data to predict cancer survival and
   report good results, biological interpretation of MultiSurv is limited
   to visualizing the learned feature representations with t-distributed
   stochastic neighbor embedding (t-SNE) and no gene- or pathway-level
   explanations are available. For VAECox, the nodes with the highest
   variance are extracted from the hidden layers of the model and the
   Pearson correlations between these nodes and the expression of the gene
   features are analyzed. While this approach provides biological
   interpretation to some extent, there are a large number of genes whose
   expression is correlated with the most variable hidden nodes at
   different levels, and pathway enrichment analysis conducted directly on
   these genes might not be able to fully discover the underlying survival
   mechanisms.

   In contrast to these methods, our approach applies network propagation
   using the NetCore tool to extract biological plausibility of the
   learned feature representations, which is a fairly generic step that
   could be applied to any machine learning approach. NetCore performs
   random walk with restart propagation on protein-protein interactions
   with a node measure that is different from the node degree and thus is
   more robust against the inherent node degree bias of PPIs ([174]Barel
   and Herwig, 2020). This leads to an emphasis on so-called influential
   nodes that are typically in the core of the network rather than its
   periphery and thus is highly suited to infer common subnetworks from
   otherwise highly dispersed nodes. It uses a semi-supervised module
   identification step by connecting seed nodes of interest with the nodes
   that become significantly up-weighted after the propagation step. Seed
   nodes can be identified either by literature knowledge (supervised) or
   by top experimental signals (unsupervised). Since we wanted to
   interrogate the ability of the machine learning approach to generate
   plausible features, we used the unsupervised approach in this study
   where seed nodes were chosen by the top 100 feature importance weights
   from the pan-cancer training. Alternative approaches could, for
   example, incorporate seed nodes derived from existing signatures.

   Network propagation has been combined with machine learning in other
   application domains, for example, drug sensitivity prediction
   ([175]Manica et al., 2019). This approach leverages the information
   conferred by protein-protein interactions a priori to machine learning
   to reduce the feature space to genes associated with known drug targets
   of the evaluated drugs. Additionally, reweighted random survival forest
   (RRSF) ([176]Wang and Liu, 2018) integrates network propagation in
   order to incorporate a priori information on gene interactions into the
   learning procedure. Instead of using the network propagation results in
   a feature selection step where only genes are kept that have a high
   weight after network propagation, RRSF directly incorporates the
   weights into model training. This is done by splitting the nodes in the
   tree growing process of the random forest according to their weight
   after network propagation, where topologically important genes in the
   gene interaction network receive higher weights and are thus selected
   with a higher probability in the node splitting procedure. In contrast
   to these methods, our approach separates the two analysis components
   and applies network propagation a posteriori to enhance biological
   plausibility, while leaving the machine learning part unchanged in
   order to explore all available features and to gain maximal prediction
   accuracy.

   In summary, we have introduced a machine learning approach to
   pan-cancer survival prediction, which combines ensemble tree boosting
   with network propagation and highlights the role of the aging TME for
   survival prognosis.

Limitations of the study

   Our method is based on cancer patient gene expression data provided by
   the TCGA consortium. Although TCGA comprises a variety of cancer
   patients from diverse cancer types, and we have implemented rigorous
   train-test splits and also tested our method on cancer types withheld
   in model training, we have not evaluated the method on any cancer
   patient data completely unrelated to the TCGA database.

   Furthermore, we considered only one molecular data modality, namely
   RNA-seq gene expression data, for model training and identification of
   the pan-cancer survival network. Incorporation of additional molecular
   data modalities like methylation and mutations could add more
   information to the model and complement the gene expression data.
   However, this would come at the cost of introducing more features into
   the machine learning method, which could increase the risk of
   overfitting.

STAR★Methods

Key resources table

   REAGENT or RESOURCE SOURCE IDENTIFIER
   Deposited data
     __________________________________________________________________

   HTSeq-FPKM gene expression and clinical files The Cancer Genome Atlas
   (TCGA) [177]https://portal.gdc.cancer.gov/
   GDC data releases v22.0 and v24.0
   ConsensusPathDB (CPDB) protein-protein interaction network (version 34)
   [178]Herwig et al. (2016)
   [179]https://github.molgen.mpg.de/barel/NetCore/blob/6860331ae9ff8725e6
   66d936e9b853ef28893a00/data/CPDB_high_confidence.txt
   Known and candidate cancer genes from the network of cancer genes (NCG)
   [180]Repana et al. (2019) [181]http://ncg.kcl.ac.uk/index.php
   NCG version 6.0
     __________________________________________________________________

   Software and algorithms
     __________________________________________________________________

   NetCore [182]Barel and Herwig (2020)
   [183]https://github.molgen.mpg.de/barel/NetCore
   OncoLnc [184]Anaya (2016) [185]http://www.oncolnc.org/
   Single-cohort and pan-cancer XGBoost survival prediction This Paper
   [186]https://github.molgen.mpg.de/thedinga/xgb_survival_network
   R Implementations of Path2Surv (MKL[H] and MKL[P]), random survival
   forest (RF), and survival support vector machine (SVM) [187]Dereli
   et al. (2019) [188]https://github.com/mehmetgonen/path2surv
   XGBoost python package [189]Chen and Guestrin (2016)
   [190]https://github.com/dmlc/xgboost/tree/master/python-package
   MyGene python package [191]Xin et al. (2016); [192]Wu et al. (2013)
   [193]http://mygene.info
   version 3.1
   QIAGEN ingenuity pathway analysis (QIAGEN IPA) [194]Krämer et al., 2014
   [195]https://digitalinsights.qiagen.com/products-overview/discovery-ins
   ights-portfolio/analysis-and-visualization/qiagen-ipa/
   [196]Open in a new tab

Resource availability

Lead contact

   Further information and requests for resources should be directed to
   and will be fulfilled by the lead contact, Ralf Herwig
   ([197]herwig@molgen.mpg.de).

Materials availability

   This study did not generate new unique reagents.

Method details

Data and preprocessing

   This study is based on gene expression data from The Cancer Genome
   Atlas (TCGA) consortium ([198]https://www.cancer.gov/tcga), which
   provides genomic and clinical data for more than 10,000 cancer patients
   and 33 cancer types through the Genomic Data Commons (GDC) data portal
   ([199]https://portal.gdc.cancer.gov/). We used gene expression data
   from primary tumors of these 33 cancer cohorts. We chose to only use
   samples with sample type annotations "Primary Tumor" and "Primary Blood
   Derived Cancer - Peripheral Blood" and excluded samples derived from
   normal tissue and metastatic tumors because these sample types are
   expected to have different gene expression characteristics than primary
   tumors. For each cancer type, we downloaded all HTSeq-FPKM files, which
   contain FPKM-normalized gene expression data, and the corresponding
   clinical files to extract patient survival data. For the cohorts
   TCGA-COAD, TCGA-LAML, TCGA-LUAD, and TCGA-LUSC, all data was obtained
   from GDC data release v22.0 (released January 16, 2020) and for the 29
   remaining cohorts, all files were retrieved from GDC data release v24.0
   (released May 7, 2020). For training the survival prediction model and
   comparing performance between methods, we only used cohorts with at
   least 20 uncensored patients (25 cohorts) since splitting cohorts with
   less uncensored patients into training and test data and using 20% of
   patients for testing would have led to only one to four test patients
   per cohort, thus limiting the informative value of predictions on these
   cohorts. The remaining eight cohorts with less than 20 uncensored
   patients (TCGA-CHOL, TCGA-DLBC, TCGA-KICH, TCGA-PCPG, TCGA-PRAD,
   TCGA-TGCT, TCGA-THCA, and TCGA-THYM) were not part of the training
   procedure and could thus be used to demonstrate the transferability of
   the pan-cancer XGBoost survival prediction model to cancer types not
   seen in training. We furthermore excluded all patients for whom either
   gene expression measurements or key clinical data, namely age, gender,
   vital status, and time to either death or censoring, were unavailable
   or inconsistent. This led to a total of 8,024 patients in the 25 cancer
   cohorts used for single-cohort and pan-cancer XGBoost training and an
   additional 1,571 patients in the eight cohorts used for prediction
   only.

The XGBoost method

   XGBoost ([200]Chen and Guestrin, 2016) is a supervised machine learning
   method based on tree ensembles. It implements gradient tree boosting,
   which predicts the output by additive functions represented by
   regression trees. More formally, the predicted output
   [MATH: <mrow><msub><mover
   accent="true"><mi>y</mi><mo>ˆ</mo></mover><mi>i</mi></msub><mo
   linebreak="goodbreak" linebreakstyle="after">∈</mo><mi
   mathvariant="double-struck">R</mi></mrow> :MATH]
   for a sample
   [MATH: <mrow><mi>i</mi></mrow> :MATH]
   and its
   [MATH: <mrow><mi>m</mi></mrow> :MATH]
   features
   [MATH: <mrow><msub><mi>x</mi><mi>i</mi></msub><mo linebreak="goodbreak"
   linebreakstyle="after">∈</mo><msup><mi
   mathvariant="double-struck">R</mi><mi>m</mi></msup></mrow> :MATH]
   is computed as:
   [MATH: <mrow><msub><mover
   accent="true"><mi>y</mi><mo>ˆ</mo></mover><mi>i</mi></msub><mo
   linebreak="badbreak">=</mo><mi>φ</mi><mrow><mo>(</mo><msub><mi>x</mi><m
   i>i</mi></msub><mo>)</mo></mrow><mo
   linebreak="goodbreak">=</mo><munderover><mo>∑</mo><mrow><mi>k</mi><mo
   linebreak="badbreak">=</mo><mn>1</mn></mrow><mi>K</mi></munderover><mro
   w><msub><mi>f</mi><mi>k</mi></msub><mrow><mo>(</mo><msub><mi>x</mi><mi>
   i</mi></msub><mo>)</mo></mrow></mrow><mo>,</mo><mspace
   width="0.25em"></mspace><msub><mi>f</mi><mi>k</mi></msub><mo
   linebreak="goodbreak">∈</mo><mi
   mathvariant="script">F</mi><mo>,</mo></mrow> :MATH]

   where
   [MATH: <mrow><mi mathvariant="script">F</mi></mrow> :MATH]
   is the space of regression trees and
   [MATH: <mrow><mi>K</mi></mrow> :MATH]
   is the number of functions used to predict the output. Each of the
   trees
   [MATH: <mrow><msub><mi>f</mi><mi>k</mi></msub></mrow> :MATH]
   is associated with a set of continuous leaf weights w and maps each
   sample to one of the tree’s leaf indices and its corresponding weight
   w[i]. The final prediction for a sample is then made by summing up the
   leaf weights assigned to the sample by the K different trees. In order
   to learn the tree functions that map samples to leaves, the following
   objective is minimized:
   [MATH: <mrow><mi
   mathvariant="script">L</mi><mrow><mo>(</mo><mi>φ</mi><mo>)</mo></mrow><
   mo
   linebreak="badbreak">=</mo><munder><mo>∑</mo><mi>i</mi></munder><mrow><
   mi>l</mi><mrow><mo>(</mo><mrow><msub><mover
   accent="true"><mi>y</mi><mo>ˆ</mo></mover><mi>i</mi></msub><mo>,</mo><m
   sub><mi>y</mi><mi>i</mi></msub></mrow><mo>)</mo></mrow></mrow><mo
   linebreak="goodbreak">+</mo><munder><mo>∑</mo><mi>k</mi></munder><mrow>
   <mi>Ω</mi><mrow><mo>(</mo><msub><mi>f</mi><mi>k</mi></msub><mo>)</mo></
   mrow></mrow><mo>,</mo></mrow> :MATH]

   where
   [MATH: <mrow><mi>Ω</mi><mrow><mo>(</mo><mi>f</mi><mo>)</mo></mrow><mo
   linebreak="goodbreak"
   linebreakstyle="after">=</mo><mi>γ</mi><mi>T</mi><mo
   linebreak="goodbreak"
   linebreakstyle="after">+</mo><mfrac><mn>1</mn><mn>2</mn></mfrac><mi>λ</
   mi><msup><mi>w</mi><mn>2</mn></msup></mrow> :MATH]
   is a regularization term that penalizes complexity of the tree
   functions to reduce overfitting. Here, T is the number of leaves in the
   tree,
   [MATH: <mrow><mi>w</mi></mrow> :MATH]
   represents the leaf weights and γ and λ are model hyperparameters. In
   the first term of the model objective, l is a differentiable convex
   loss function measuring the difference between the predicted value
   [MATH: <mrow><msub><mover
   accent="true"><mi>y</mi><mo>ˆ</mo></mover><mi>i</mi></msub></mrow>
   :MATH]
   and the true value Y[i]. For our task of survival prediction, the loss
   function is selected to be the negative partial loglikelihood for Cox
   proportional hazards regression (objective = survival:cox in the
   XGBoost model parameters). This loss function is defined as:
   [MATH:
   <mrow><msub><mi>l</mi><mrow><mi>C</mi><mi>o</mi><mi>x</mi></mrow></msub
   ><mrow><mo>(</mo><mtext>β</mtext><mo>)</mo></mrow><mo
   linebreak="badbreak">=</mo><mo
   linebreak="goodbreak">−</mo><munder><mo>∑</mo><mrow><mi>i</mi><mo>:</mo
   ><msub><mi>E</mi><mi>i</mi></msub><mo
   linebreak="badbreak">=</mo><mn>1</mn></mrow></munder><mrow><mo>(</mo><m
   row><msub><mi>x</mi><mi>i</mi></msub><mtext>β</mtext><mo
   linebreak="badbreak">−</mo><mi>log</mi><munder><mo>∑</mo><mrow><mi>j</m
   i><mo>:</mo><msub><mi>T</mi><mi>j</mi></msub><mo
   linebreak="badbreak">≥</mo><msub><mi>T</mi><mi>i</mi></msub></mrow></mu
   nder><msup><mi>e</mi><mrow><msub><mi>x</mi><mi>j</mi></msub><mtext>β</m
   text></mrow></msup></mrow><mo>)</mo></mrow><mo>,</mo></mrow> :MATH]

   where
   [MATH: <mrow><msub><mi>E</mi><mi>i</mi></msub></mrow> :MATH]
   is a censoring flag with
   [MATH: <mrow><msub><mi>E</mi><mi>i</mi></msub><mo linebreak="goodbreak"
   linebreakstyle="after">=</mo><mn>1</mn></mrow> :MATH]
   indicating that patient
   [MATH: <mrow><mtext>i</mtext></mrow> :MATH]
   is dead (i.e., uncensored) and
   [MATH: <mrow><msub><mi>E</mi><mi>i</mi></msub><mo linebreak="goodbreak"
   linebreakstyle="after">=</mo><mn>0</mn></mrow> :MATH]
   indicating that the patient’s survival time is censored,
   [MATH: <mrow><msub><mi>T</mi><mi>i</mi></msub></mrow> :MATH]
   is the survival time of patient
   [MATH: <mrow><mtext>i</mtext></mrow> :MATH]
   ,
   [MATH: <mrow><msub><mi>x</mi><mi>i</mi></msub></mrow> :MATH]
   are the features of this patient and
   [MATH: <mrow><mtext>β</mtext></mrow> :MATH]
   is a vector of unknown variables that need to be learned by the model
   ([201]Cox, 1972). Intuitively, this loss is calculated by comparing
   each of the uncensored patients to all patients that survived at least
   as long as this patient in terms of predicted death risk or hazard.

Single-cohort and pan-cancer survival prediction with XGBoost

   In this study, we applied XGBoost with Cox proportional hazards
   regression as a learning objective to predict the survival of cancer
   patients from 25 different cancer types. A graphical overview of how
   survival risk scores were predicted based on gene expression data can
   be found in [202]Figure S6.

   To assess the performance of the single-cohort and pan-cancer
   approaches, we trained and evaluated both versions of the method for
   100 replications on gene expression datasets from 25 TCGA cohorts. In
   each replication, the gene expression data was randomly split into 80%
   training data and 20% test data in a stratified manner, such that
   training and test sets always contained approximately the same
   percentage of censored and uncensored patients, and in the pan-cancer
   approach 80% of each cohort were assigned to the training data and 20%
   to the test data. Then, the XGBoost-based learning procedure was
   performed on the training data as described below and the prediction
   performance was evaluated on the test data by computing the C-Index of
   the predicted risk scores.

   Feature selection: In the first step, a
   [MATH:
   <mrow><mi>p</mi><mi>a</mi><mi>t</mi><mi>i</mi><mi>e</mi><mi>n</mi><mi>t
   </mi><mo linebreak="goodbreak"
   linebreakstyle="after">×</mo><mi>g</mi><mi>e</mi><mi>n</mi><mi>e</mi></
   mrow> :MATH]
   expression matrix was constructed, containing the gene expression
   values from patients either belonging to the same TCGA cohort
   (single-cohort approach) or to multiple different cohorts (pan-cancer
   approach). Patients with missing or inconsistent data for age, gender,
   or survival/censoring time were excluded from the data. Then, an
   XGBoost-based feature selection procedure was performed to reduce the
   60,483 genes measured in TCGA to 500 genes meaningful for survival. In
   order to identify genes that are relevant to survival not only for a
   specific training set composition, a stratified 4-fold cross-validation
   was performed on the training data to find genes that are relevant to
   survival across all folds. Stratified 4-fold cross-validation in this
   context means that each of the four data splits contained approximately
   equal proportions of censored and uncensored patients to ensure
   comparability across folds. In each step of the cross-validation
   procedure, we first removed genes with zero mean absolute deviation
   (MAD) across the three data splits used as training data in this fold.
   By removing genes with
   [MATH: <mrow><mi>M</mi><mi>A</mi><mi>D</mi><mo linebreak="goodbreak"
   linebreakstyle="after">=</mo><mn>0</mn></mrow> :MATH]
   , we excluded genes without variability from further feature selection
   because these genes hold no information value for patient survival.
   Next, we used XGBoost to identify predictive features for each fold. To
   this end, we trained 20 XGBoost models with different sets of model
   hyperparameters and limited size for each fold of the cross-validation
   in order to identify features that do not depend on the yet untuned
   model hyperparameters but are relevant to survival on a more general
   scale. Limited size in this context means that we allowed the XGBoost
   models to grow a maximum of 5 to 20 regression trees with maximal tree
   depths between 1 and 3 in order to avoid overfitting and allow the
   models to select only the most informative features. From each of the
   trained small XGBoost models, we then extracted the feature importance
   scores of all genes used as features by the respective model and for
   each gene calculated the average feature importance across all four
   rounds of cross-validation and all 20 models per round. As a measure of
   feature importance we selected ‘gain’, which is defined as the average
   gain the respective feature brings to the evaluation metric across all
   decision tree splits in which it is used and thus measures the relative
   importance of this feature for the prediction result
   ([203]https://xgboost.readthedocs.io/en/latest/python/python_api.html).
   In the last step, we then selected the 500 genes with the highest
   average feature importance scores across all models in the 4-fold
   cross-validation as features.

   Hyperparameter tuning: We performed another 4-fold cross-validation on
   the training data in each model replication to tune the XGBoost model
   hyperparameters including maximum tree depth, number of boosting trees,
   and regularization parameters. We first generated 500 sets of random
   hyperparameter combinations. In each round of the cross-validation we
   then trained an XGBoost model for each of these hyperparameter sets on
   the three training data splits of this round. The training data splits
   each contained gene expression measurements for the 500 genes
   identified in the feature selection procedure. The performance of each
   model was then evaluated on the remaining data split of the round by
   computing the concordance index (C-Index) of the prediction. The
   formula for computing the C-Index was adapted from [204]Dereli et al.
   (2019) and is defined as:
   [MATH: <mrow><mtext>C</mtext><mo
   linebreak="badbreak">−</mo><mtext>Index</mtext><mo
   linebreak="goodbreak">=</mo><mfrac><mrow><msubsup><mo>∑</mo><mrow><mi>i
   </mi><mo
   linebreak="badbreak">=</mo><mn>1</mn></mrow><mi>N</mi></msubsup><mrow><
   msub><mo>∑</mo><mrow><mi>j</mi><mo
   linebreak="badbreak">≠</mo><mi>i</mi></mrow></msub><mrow><msub><mo>Δ</m
   o><mrow><mi>i</mi><mi>j</mi></mrow></msub><mn>1</mn><mrow><mo>(</mo><mr
   ow><mrow><mo>(</mo><mrow><msub><mi>y</mi><mi>i</mi></msub><mo
   linebreak="badbreak">−</mo><msub><mi>y</mi><mi>j</mi></msub></mrow><mo>
   )</mo></mrow><mrow><mo>(</mo><mrow><msub><mover
   accent="true"><mi>y</mi><mo>ˆ</mo></mover><mi>j</mi></msub><mo
   linebreak="badbreak">−</mo><msub><mover
   accent="true"><mi>y</mi><mo>ˆ</mo></mover><mi>i</mi></msub></mrow><mo>)
   </mo></mrow><mo
   linebreak="badbreak">></mo><mn>0</mn></mrow><mo>)</mo></mrow></mrow></m
   row></mrow><mrow><msubsup><mo>∑</mo><mrow><mi>i</mi><mo
   linebreak="badbreak">=</mo><mn>1</mn></mrow><mi>N</mi></msubsup><mrow><
   msub><mo>∑</mo><mrow><mi>j</mi><mo
   linebreak="badbreak">≠</mo><mi>i</mi></mrow></msub><msub><mo>Δ</mo><mro
   w><mi>i</mi><mi>j</mi></mrow></msub></mrow></mrow></mfrac><mo>,</mo></m
   row> :MATH]

   with
   [MATH:
   <mrow><msub><mtext>Δ</mtext><mrow><mi>i</mi><mi>j</mi></mrow></msub><mo
   linebreak="goodbreak" linebreakstyle="after">=</mo><mn>1</mn></mrow>
   :MATH]
   if patients
   [MATH: <mrow><mi>i</mi></mrow> :MATH]
   and
   [MATH: <mrow><mi>j</mi></mrow> :MATH]
   are comparable, meaning that either both patients are uncensored or
   patient
   [MATH: <mrow><mi>i</mi></mrow> :MATH]
   is uncensored and patient
   [MATH: <mrow><mi>j</mi></mrow> :MATH]
   is censored, but patient
   [MATH: <mrow><mi>i</mi></mrow> :MATH]
   has a shorter survival time than patient
   [MATH: <mrow><mi>j</mi></mrow> :MATH]
   . Otherwise, if patients
   [MATH: <mrow><mi>i</mi></mrow> :MATH]
   and
   [MATH: <mrow><mi>j</mi></mrow> :MATH]
   are not comparable,
   [MATH:
   <mrow><msub><mtext>Δ</mtext><mrow><mi>i</mi><mi>j</mi></mrow></msub><mo
   linebreak="goodbreak" linebreakstyle="after">=</mo><mn>0</mn></mrow>
   :MATH]
   .
   [MATH: <mrow><mi>N</mi></mrow> :MATH]
   is the total number of patients for which survival has been predicted,
   [MATH: <mrow><msub><mi>y</mi><mi>i</mi></msub></mrow> :MATH]
   is the true survival time of patient
   [MATH: <mrow><mi>i</mi></mrow> :MATH]
   ,
   [MATH: <mrow><msub><mover
   accent="true"><mi>y</mi><mo>ˆ</mo></mover><mi>i</mi></msub></mrow>
   :MATH]
   is the patients predicted risk or hazard score, and the indicator
   function
   [MATH: <mrow><mn>1</mn><mrow><mo>(</mo><mo
   linebreak="badbreak">·</mo><mo>)</mo></mrow></mrow> :MATH]
   evaluates to 1 whenever its argument is true and 0 otherwise.
   Intuitively, the C-Index measures the fraction of patient pairs with
   concordant survival times – meaning the patient with the shorter true
   survival time is predicted to have a higher death risk – among all
   comparable pairs of patients. Thus, a larger value of the C-Index
   indicates a better prediction performance of the evaluated model, while
   for random predictions the C-Index would be expected to assume a value
   of 0.5. To select the optimal hyperparameters in each model
   replication, the average C-Index across all four rounds of the 4-fold
   cross-validation was computed for each of the 500 tested parameter
   combinations and the combination of hyperparameters was selected that
   yielded the best average C-Index. At last, we trained the final XGBoost
   model on all training patients of the respective model replication,
   using the gene expression values of the genes identified in the feature
   selection step and the model hyperparameters selected in the
   hyperparameter tuning step.

   Survival prediction validation: Finally, in each of the 100 model
   replications, we applied the fully trained model to the 20% left-out
   patients and computed the C-Index on these patients in order to
   retrieve patient survival risk scores.

   The whole survival prediction procedure, including feature selection,
   hyperparameter tuning and the final training of the XGBoost survival
   prediction model, was implemented in Python (release 3.7) and is based
   on the Python XGBoost package
   ([205]https://github.com/dmlc/xgboost/tree/master/python-package). The
   code is available in the following GitHub repository:
   [206]https://github.molgen.mpg.de/thedinga/xgb_survival_network. The
   runtime for one model replication of the pan-cancer XGBoost approach –
   including feature selection and hyperparameter optimization – on all
   patients from 25 TCGA cancer cohorts, split into 80% training and 20%
   test data, was approximately 16 hours (on a Linux machine with 64
   cores).

Other survival prediction methods

   We compared our single-cohort XGBoost method against the survival
   prediction methods random survival forest ([207]Ishwaran et al., 2008),
   survival support vector machine ([208]Khan and Zubek, 2008;
   [209]Shivaswamy et al., 2007), and the multiple-kernel learning method
   Path2Surv ([210]Dereli et al., 2019). Path2Surv is an extension of
   survival support vector machines that– instead of a single kernel
   function like in traditional support vector machines – combines
   multiple kernels in a weighted sum to learn a survival function. Each
   of these kernels is calculated on expression values of genes forming a
   pathway or gene set and only informative kernels are considered in the
   sum by assigning them non-zero weights. This way, Path2Surv uses less
   gene expression features than for instance survival support vector
   machines and offers a possibility to assess which pathways or gene sets
   were relevant for predicting survival. In their publication, Dereli
   et al. tested Path2Surv for two different gene/pathway sets, namely the
   Hallmark gene sets ([211]Liberzon et al., 2015) and the Pathway
   Interaction Database (PID) ([212]Schaefer et al., 2009). We adopted
   these two versions of Path2Surv to compare against our XGBoost-based
   survival prediction method. For random survival forest (RF), survival
   support vector machine (SVM) and Path2Surv, R implementations from
   Dereli et al. were adapted and the R Optimization Infrastructure (ROI)
   ([213]Theuβl et al., 2020) was used to solve the quadratic optimization
   problems in the SVM and Path2Surv methods. While all methods were
   supplied with all 60,483 gene expression features measured in TCGA and
   prior to training RF and SVM only genes with zero standard deviation
   were excluded, Path2Surv additionally excluded features from kernels it
   considered as uninformative. In each replication of training, the
   regularization parameter C was tuned for SVM and Path2Surv (range
   between 1 × 10^−4 and 1 × 10^5), and the number of trees to grow was
   tuned for RF (range between 500 and 2,500 trees) performing a 4-fold
   cross-validation on the training data, as described in ([214]Dereli
   et al., 2019). All other parameters were kept as default. For further
   details we refer the reader to ([215]Dereli et al., 2019).

Entropy measurement

   Entropy is a measure of uncertainty or information content
   ([216]Shannon, 1948) and is defined as:
   [MATH: <mrow><mi>H</mi><mrow><mo>(</mo><mi>X</mi><mo>)</mo></mrow><mo
   linebreak="badbreak">=</mo><mo
   linebreak="goodbreak">−</mo><munderover><mo>∑</mo><mrow><mi>i</mi><mo
   linebreak="badbreak">=</mo><mn>1</mn></mrow><mi>n</mi></munderover><mro
   w><mi>P</mi><mrow><mo>(</mo><msub><mi>x</mi><mi>i</mi></msub><mo>)</mo>
   </mrow><msub><mrow><mi>log</mi><mspace
   width="0.25em"></mspace></mrow><mn>2</mn></msub><mrow><mo>(</mo><mrow><
   mi>P</mi><mrow><mo>(</mo><msub><mi>x</mi><mi>i</mi></msub><mo>)</mo></m
   row></mrow><mo>)</mo></mrow></mrow><mo>,</mo></mrow> :MATH]

   where
   [MATH: <mrow><msub><mi>x</mi><mn>1</mn></msub><mo>,</mo><mspace
   width="0.25em"></mspace><mo>…</mo><mo>,</mo><mspace
   width="0.25em"></mspace><msub><mi>x</mi><mi>n</mi></msub></mrow> :MATH]
   are the possible outcomes of a random variable
   [MATH: <mrow><mi>X</mi></mrow> :MATH]
   and
   [MATH:
   <mrow><mi>P</mi><mrow><mo>(</mo><msub><mi>x</mi><mi>i</mi></msub><mo>)<
   /mo></mrow></mrow> :MATH]
   is the probability of
   [MATH: <mrow><msub><mi>x</mi><mi>i</mi></msub></mrow> :MATH]
   . If all outcomes
   [MATH: <mrow><msub><mi>x</mi><mn>1</mn></msub><mo>,</mo><mspace
   width="0.25em"></mspace><mo>…</mo><mo>,</mo><mspace
   width="0.25em"></mspace><msub><mi>x</mi><mi>n</mi></msub></mrow> :MATH]
   are equally likely, the entropy will be maximal, while the entropy is
   minimal in the scenario where only one of the possible outcomes occurs
   with absolute certainty. We transferred this concept to the
   single-cohort feature importance weights, which were computed by
   building the sum of feature importance scores over the 100 model
   replications of the XGBoost model for each cohort, to measure how well
   a gene generalizes as a feature over different cancer types. To this
   end, we computed a probability matrix from the single-cohort feature
   importance weights by dividing each weight score by the sum of scores
   for this gene over all 25 cohorts and used this probability matrix to
   compute the entropies of all genes. If a gene was identified as an
   important feature across all 25 TCGA cohorts with similar feature
   importance weights it has high entropy, while a gene that is only among
   the predictive features in one cancer cohort has minimal entropy of
   zero. Thus, genes with a high entropy are equally important for
   predicting survival with the single-cohort XGBoost method in different
   cancer types and are identified to generalize well, while genes with a
   low entropy are likely to be cancer-type specific.

Inference of a pan-cancer survival network with NetCore

   We used network propagation to infer a pan-cancer survival network from
   the important features identified in the 100 replications of the
   pan-cancer XGBoost approach. For each gene from the pan-cancer
   important features, we calculated a feature importance weight as the
   sum of feature importance scores over all 100 replications, where the
   feature had a positive feature importance score if it was among the
   important features of this replications and a score of 0 if it was not
   used for survival prediction by XGBoost in this replication. Ensembl
   Gene Identifiers were converted to Hugo Symbols using the MyGene Python
   package (version 3.1, [217]http://mygene.info) ([218]Wu et al., 2013;
   [219]Xin et al., 2016) and gene entities that did not map to a Hugo
   Symbol were removed. All genes together with their feature importance
   weights were then fed to NetCore ([220]Barel and Herwig, 2020). NetCore
   is a random walk with restart network propagation method that uses node
   coreness instead of node degree for normalization in order to address
   the node degree bias of protein-protein interaction networks (PPIs).
   Following the network propagation procedure, NetCore also provides a
   module identification approach, where phenotype-associated network
   modules are identified in a semi-supervised manner. These network
   modules are sub-networks of the PPI and comprise so called seed genes –
   in our case the top 100 genes from the important features that are
   included in the PPI and have the highest feature importance weights –
   as well as additional genes identified in the network propagation step
   that function as links between seed genes. For the cancer survival
   module identification, NetCore was applied on a high confidence
   protein-protein interaction network from ConsensusPathDB ([221]Herwig
   et al., 2016), which was obtained from the NetCore GitHub repository
   ([222]https://github.molgen.mpg.de/barel/NetCore) and comprises 10,586
   genes and 114,341 interactions. For the random walk with restart, the
   default restart probability of 0.8 was chosen.

Over-representation analysis

   Over-representation analysis (ORA) measures the overlap between a set
   of genes (i.e., the most important survival features) and another
   pre-defined gene set called functional gene set. We used QUIAGEN's IPA
   software ([223]Krämer et al., 2014) to derive ORA results with respect
   to canonical pathway and upstream regulator target sets. The ORA was
   judged with Fisher’s exact test.

Quantification and statistical analysis

Comparison of model performances

   To compare the performance of the single-cohort XGBoost method with the
   RF, SVM, and Path2Surv methods and the pan-cancer with the
   single-cohort XGBoost method, we trained each method for 100
   replications and in each replication computed the concordance index
   (C-Index). The mean C-Indices of the compared methods were then
   analyzed with the Wilcoxon unpaired rank-sum test, as implemented in
   the R function stat_compare_means, which is part of the ggpubr library
   ([224]https://www.rdocumentation.org/packages/ggpubr). We selected a
   significance threshold of 0.05 for the Wilcoxon unpaired rank-sum test
   and adopted the method’s default values for symbols indicating further
   significance levels (
   [MATH: <mrow><mtext>ns</mtext><mo>:</mo><mtext> p</mtext><mo
   linebreak="goodbreak" linebreakstyle="after">></mo><mn>0.05</mn></mrow>
   :MATH]
   ,
   [MATH: <mrow><mo linebreak="goodbreak"
   linebreakstyle="after">∗</mo><mo>:</mo><mspace
   width="0.25em"></mspace><mi>p</mi><mo linebreak="goodbreak"
   linebreakstyle="after">≤</mo><mn>0.05</mn></mrow> :MATH]
   ,
   [MATH: <mrow><mo linebreak="goodbreak" linebreakstyle="after">∗</mo><mo
   linebreak="goodbreak" linebreakstyle="after">∗</mo><mo>:</mo><mspace
   width="0.25em"></mspace><mi>p</mi><mo linebreak="goodbreak"
   linebreakstyle="after">≤</mo><mn>0.01</mn></mrow> :MATH]
   ,
   [MATH: <mrow><mo linebreak="goodbreak" linebreakstyle="after">∗</mo><mo
   linebreak="goodbreak" linebreakstyle="after">∗</mo><mo
   linebreak="goodbreak" linebreakstyle="after">∗</mo><mo>:</mo><mspace
   width="0.25em"></mspace><mi>p</mi><mo linebreak="goodbreak"
   linebreakstyle="after">≤</mo><mn>0.001</mn></mrow> :MATH]
   ,
   [MATH: <mrow><mo linebreak="goodbreak" linebreakstyle="after">∗</mo><mo
   linebreak="goodbreak" linebreakstyle="after">∗</mo><mo
   linebreak="goodbreak" linebreakstyle="after">∗</mo><mo
   linebreak="goodbreak" linebreakstyle="after">∗</mo><mo>:</mo><mspace
   width="0.25em"></mspace><mi>p</mi><mo linebreak="goodbreak"
   linebreakstyle="after">≤</mo><mn>0.0001</mn></mrow> :MATH]
   ).

Significance thresholds

   Throughout the entire manuscript, we consider a pvalue below 0.05 as
   significant.

Acknowledgments