Abstract Previous studies have found that gene expression levels are associated with prognosis and some genes can be used to predict the survival risk of glioblastoma (GBM) patients. However, most of them just built the survival-related gene signature, and personal survival risk can be evaluated only in group. This study aimed to find the prognostic survival related genes of GBM, and construct survival risk prediction model, which can be used to evaluate survival risk by individual. We collected gene expression data and clinical information from the Gene Expression Omnibus (GEO) and The Cancer Genome Atlas (TCGA) databases. Cox regression analysis and LASSO-cox regression analysis were performed to get survival-related genes and establish the overall survival prediction model. The ROC curve and Kaplan Meier analysis were used to evaluate the prediction ability of the model in training set and two independent cohorts. We also analyzed the biological functions of survival-related genes by GO and KEGG enrichment analysis. We identified 99 genes associated with overall survival and selected 16 genes (IGFBP2, GPRASP1, C1R, CHRM3, CLSTN2, NELL1, SEZ6L2, NMB, ICAM5, HPCAL4, SNAP91, PCSK1N, PGBD5, INA, UCHL1 and LHX6) to establish the survival risk prediction model. Multivariate Cox regression analysis indicted that the risk score could predict overall survival independent of age and gender. ROC analyses showed that our model was more robust than four existing signatures. The sixteen genes can also be potential transcriptional biomarkers and the model can assist doctors on clinical decision-making and personalized treatment of GBM patients. Keywords: GBM, prognostic risk model, overall survival, survival-related genes 1. Introduction According to the 2016 WHO classification of tumors of the central nervous system, gliomas were classified into grade I–IV based on histopathology and genomics. Glioblastoma multiforme (GBM) is the highest malignant brain tumor as grade IV [[28]1]. In the newest version of the classification of tumors of the central nervous system, numerous molecular changes with clinicopathologic utility that are important for the most accurate classification of CNS neoplasms were listed, such as IDH, TERT promoter, chromosomes 7/10, EGFR, which are important for glioblastoma [[29]2]. Glioblastoma is the most common malignant primary brain tumor, accounting for 57.7% of all gliomas and 48.6% of all primary malignant central nervous system (CNS) tumors [[30]3]. Although the treatment of glioblastoma has been greatly improved in diagnosis, surgery, traditional therapy such as chemoradiotherapy, and nursing, there are still efforts in immunotherapy, the overall prognosis is still discouraging, and the overall survival is still very poor due to the heterogeneity of glioblastoma [[31]4]. For patients diagnosed with GBM, the median survival was about 12–15 months [[32]5,[33]6], and the 5-year survival rate was less than 10% [[34]7]. With the development of high-throughput sequencing technology, many potential biomarkers related to the diagnosis and prognosis for treatment decision making of glioblastoma have been found, such as MGMT (O6 methylguanine DNA methyltransferase), H3F3A (H3 histone, family 3A), IDH (isocitrate dehydrogenase), EGFR (epidermal growth factor receptor), and PTEN (phosphatase and tensin homolog) [[35]7,[36]8,[37]9,[38]10,[39]11]. The status of these biomarkers can provide a basis for individualized and targeted diagnosis and treatment management of patients. Previous studies have shown that gene expression profiling can be used to classify glioma patients and identify patients with different overall survival and clinical characteristics [[40]12,[41]13,[42]14,[43]15,[44]16,[45]17]. GBM is divided into Proneural, Neural, Classical, and Mesenchymal subtypes by gene expression-based molecular classification, the result shows that different subtypes may require different therapeutic approaches [[46]11,[47]16]. Gene expression profiling can also be used to identify risk genes associated with survival and prognosis, and to assess the survival risk of patients. Zhang et al. identified 16 endoplasmic reticulum (ER) stress-related genes (CYP2E1, SLN, BRCA1, CISD2, LRRK2, BMP2, MYH7, HSPB1, DNM1L, SHISA5, RNF185, RCN1, SPP1, RPN2, PDIA3 and ATP2A2), and established an ER stress risk model based on The Cancer Genome Atlas (TCGA) glioma database to reflect the immune characteristics and predict the prognosis of glioma patients [[48]18]. Yin et al. analyzed the gene expression profiles and identified five novel biomarkers (PTPRN, RGS14, G6PC3, IGFBP2 and TIMP4) that have potential in the prognosis prediction of GBM [[49]19]. Wang et al. identified 14 autophagy-related genes (MTMR14, LENG9, P4HB, TCIRG1, HSPA5, DRAM1, CTSD, S100A8, CCL2, MSTN, UBQLN4, PHYHIP, RRAGB and ZKSCAN3) associated with the overall survival of patients with glioblastoma, and built a novel autophagy-related signature for the prediction of prognosis [[50]20]. Cao et al. built a four-gene signature-derived (OSMR, HOXC10, SCARA3 and SLC39A10) risk score model to predict the survival and treatment response of GBM patients [[51]21]. Pan et al. identified two genes (GRIA2 and RYR3) strongly associated with survival of GBM, and the two-gene signature was a robust prognostic model to predict GBM survival [[52]22]. However, most of the studies just identified the survival-related genes and constructed a survival risk prediction model based on the prognostic signature, the model can generate a risk score for each patient, the risk score threshold divided patients into different risk groups. However, the risk score thresholds (usually the median risk score) are changed in different groups of patients collected from different institutions in these studies, which may be not convenient in clinical application. Based on public databases, we aim to explore novel prognostic biomarkers for survival prediction. By analyzing the genes expression profiles of GBM patients downloaded from Gene Expression Omnibus (GEO) and TCGA databases, we identified the prognostic genes and constructed a robust risk score model based on 16 genes, and we take the same risk score threshold in different groups. Therefore, our model can be used to predict survival risk for the individual patient with newly diagnosed glioblastoma. The workflow diagram of this study was shown in [53]Figure 1. Figure 1. [54]Figure 1 [55]Open in a new tab Schematic diagram of the study design and development of survival risk prediction model. 2. Materials and Methods 2.1. TCGA Dataset We downloaded the GBM gene expression array dataset (Affymetrix HT Human Genome U133 Array Plate Set, level 1) and clinical information from TCGA [[56]10]. After removing duplicated samples, ten normal control samples and 524 disease samples were obtained, the clinical information included age, gender, survival time, survival status, and so on. TCGA data set served as the training set. 2.2. The GEO Dataset We downloaded three raw data sets of gene expression from GEO: [57]GSE4412 [[58]15] and [59]GSE4271 [[60]16] are all generated by Affymetrix human genome U133A array Plate Set. After removing duplicate samples, [61]GSE4412 contained 50 glioblastoma samples and [62]GSE4271 contained 56 glioblastoma samples. We also downloaded the gene expression profiles of [63]GSE16011 [[64]17], which contained 155 glioblastoma samples, gene expression profiles were generated by Affymetrix Gene Chip Human Genome U133 Plus 2.0 Array Plate Set. [65]GSE4412 and [66]GSE4271 patients’ data were combined as validation set 1 and [67]GSE16011 patients’ data were used as validation set 2. All the raw data was preprocessed by the R package AFFY and the background correction and normalization were performed using robust multi-array analysis (RMA). SVA package [[68]23] was used to remove batch effects. The clinical information of all patients is shown in [69]Table S1. 2.3. Differential Expression Analysis There are 10 normal samples served as the control group, and 10 tumor samples were randomly selected from TCGA training set served as the GBM group. The differentially expressed genes (DEGs) between the GBM group and the control group were analyzed by the limma software package [[70]24] in R (version 4.0.5). Generally, the genes whose [MATH: |log< mo>(FC)| :MATH] > 1 and adjusted p-value < 0.05 is considered to be statistically significant DEGs [[71]25]. In this study, the genes were considered as DEGs by the standards of whose adjusted p-value < 0.01 and [MATH: |log< mo>(FC)| :MATH] > 2.5. R package pheatmap [[72]26] was used to draw heatmap while ggplot2 ([73]https://cran.r-project.org/web/packages/ggplot2/, accessed on 20 August 2021) was used to draw volcano map. 2.4. Identify Genes Associated with Survival Univariable Cox regression analysis was performed with differentially expressed genes in the survival R software package ([74]https://cran.r-project.org/web/views/Survival.html, accessed on 20 August 2021) in order to analyze the correlation between each gene and overall survival. Genes with log-rank p < 0.05 were considered to be associated with overall survival [[75]27]. The survival-related genes were further filtered by the Lasso-Cox regression model [[76]28,[77]29,[78]30] in the glmnet R package ([79]https://cran.r-project.org/web/packages/glmnet/index.html, accessed on 20 August 2021) to reduce the dimensionality of inputted variables. 2.5. Construction of Survival Risk Model In order to optimize the model, step-wise Multivariate Cox analysis was used to further filter the survival risk related genes and construct the survival risk model by the “survival”, “survminer” packages: Risk score = h[0] × exp(gene[1] × coefficient[1] + gene[2] × coefficient[2] + gene[i] × coefficient[i]) (1) Here, h[0] is a constant, gene[i] is the gene expression level. We assigned risk scores to each sample in the training cohort according to these gene expression levels. The patients were divided into low-risk group and high-risk group by the median risk score. The prognostic value of the model was evaluated by Kaplan-Meier survival analysis, Log-rank test, and time-dependent receiver operating characteristic (ROC) curve with “survival”, “survminer”, and “timeROC” packages in R. 2.6. Risk Model Validation The validation set 1 and validation set 2 were employed to validate the robustness of diagnostic accuracy on overall survival by the prediction model. ROC curve and Kaplan Meier analysis were used to verify the prognostic value of GBM patients. p-value < 0.05 was considered statistically significant. Multivariate Cox analysis was used to validate whether the risk score was an independent risk factor for GBM survival. 2.7. Compare with Existing Signatures As mentioned at first, lots of genes have been found related to overall survival and could be used to predict the prognosis of patients with GBM. We compared the prognostic capability of our model with signatures reported in other studies, including the five-gene signature (PTPRN, RGS14, G6PC3, IGFBP2 and TIMP4) screened by Yin et al. [[80]19], the two-gene signature (GRIA2 and RYR3) screened by Pan et al. [[81]22], the five-gene signature (DES, RANBP17, CLEC5A, HOXC11 and POSTN) screened by Wang et al. [[82]25], the four-gene signature (LHX2, MEOX2, SNAI2 and ZNF22) derived by Cheng et al. [[83]31]. For each signature, we first conducted Multivariate Cox regression analysis and built the prognostic model with its genes in TGCA cohort. Then according to the median risk score, the patients were divided into low- and high-risk groups in TGCA cohort and two validation cohorts. Kaplan-Meier analysis and ROC curves were performed to evaluate the prognostic power of each model. 2.8. GO and KEGG Enrichment Analysis of Survival-Related Genes In order to identify potential molecular biomechanisms of genes related to survival, we conducted functional enrichment analyses including GO (Gene Ontology) and KEGG (Kyoto Encyclopedia of Genes and Genomes) pathways analysis and visualized the results by “ggplot2” package. 3. Results 3.1. Differentially Expressed Genes Can Clearly Distinguish GBM Patients from Normal Samples In total, 424 differentially expressed genes were identified by differential expression analysis of 10 normal brain tissue samples and 10 GBM tumor samples. Among these genes, 327 were down regulated and 97 were up regulated in GBM tumor tissue ([84]Figure 2A). These differentially expressed genes can clearly distinguish GBM samples from normal samples, tumor samples were tightly clustered together and notably separated from normal samples ([85]Figure 2B). The 424 differently expressed genes were listed in [86]Table S2. Figure 2. [87]Figure 2 [88]Open in a new tab Identification of DEGs between GBM and normal brain tissues. (A) Volcano plots of differentially expressed genes showing the log (Fold Change) of mRNA in GBM compared to normal brain tissues, and the corresponding–log10 (adjusted p-value). Genes with adjusted p-value below 0.01 and fold change above 2.5 (below −2.5) were marked with red (blue) dots; (B) Heatmap of differentially expressed genes. GBM samples can be clearly distinguished from normal samples according to differentially expressed genes that are from (A). The color bar means the expression level of differentially expressed genes. 3.2. 99 DEGs Were Significantly Related to Survival Univariate Cox regression analysis was performed to identify the overall survival related genes in the training set. 99 genes were identified as potential survival-related biomarkers of GBM patients (p < 0.05) ([89]Table S3). LASSO Cox regression analysis was performed to further analyze these 99 survival-related genes to avoid overfitting [[90]32,[91]33], genes were filtered by 10-fold cross-validation with a maximum of 1000 iterations, and 28 genes were screened for subsequent analysis ([92]Figure 3). Figure 3. [93]Figure 3 [94]Open in a new tab Lasso-cox gene screening process. (A) LASSO coefficient profiles of the 99 genes in TCGA GBM cohort; (B) Selection of the best parameter (lambda) in the LASSO-Cox model. 3.3. The Risk Score of Sixteen-Gene Model Were Strongly Associated with Overall Survival of GBM Patients Furthermore, multivariate Cox analysis narrowed the list down further to 16 survival related genes and we constructed the survival risk model with these genes, which can be used to predict the overall survival risk of patients with GBM ([95]Figure 4A). Of these genes, CLSTN2, NMB, SNAP91, PCSK1N, INA, and LHX6 were favorable prognostic factors for glioblastoma survival, whose risk ratio (HR) < 1, the regression coefficient < 0. IGFBP2, GPRASP1, C1R, CHRM3, NELL1, SEZ6L2, ICAM5, HPCAL4, PGBD5, and UCHL1 were risk factors, whose HR > 1, the regression coefficient > 0. The risk score of each patient was generated as follows: 0.04701 × exp(0.0842 × IGFBP2 + 0.1187 × GPRASP1 + 0.0962 × C1R + 0.0965 × CHRM3 − 0.1409 × CLSTN2 + 0.1231 × NELL1 + 0.0965 × SEZ6L2 − 0.149 × NMB + 0.4547 × ICAM5 + 0.3367 × HPCAL4 − 0.0876 × SNAP91 − 0.1406 × PCSK1N + 0.1966 × PGBD5 − 0.0901 × INA + 0.0897 × UCHL1 − 0.5555 × LHX6) ([96]Table 1). Figure 4. [97]Figure 4 [98]Open in a new tab Establishment of the survival risk model. (A) 16 genes were eventually selected to establish prognostic model. ***: p < 0.001, **: p < 0.01, *: p < 0.05; (B) Kaplan-Meier curve of overall survival in high- and low-risk group of the GBM patients in the training cohort; (C) Heatmap of 16 genes expression patterns; (D) ROC curves of the model for overall survival at 1-, 2-, and 3- years; (E) The distribution risk scores in the TCGA cohort. Table 1. The details of the 16 genes used in the prognostic prediction model. Gene Name Coef. HR 95% CI p-Value IGFBP2 0.0842 1.0878 1.0036–1.1790 0.0405 GPRASP1 0.1187 1.1260 1.0099–1.2555 0.0325 C1R 0.0962 1.1010 1.0142–1.1951 0.0216 CHRM3 0.0965 1.1013 1.0063–1.2054 0.0361 CLSTN2 −0.1409 0.8686 0.7665–0.9842 0.0272 NELL1 0.1231 1.1310 1.0151–1.2600 0.0256 SEZ6L2 0.0965 1.1013 0.9918–1.2230 0.0710 NMB −0.1490 0.8616 0.7935–0.9354 0.0004 ICAM5 0.4547 1.5757 1.0990–2.2592 0.0134 HPCAL4 0.3367 1.4004 1.1277–1.7389 0.0023 SNAP91 −0.0876 0.9162 0.8245–1.0180 0.1034 PCSK1N −0.1406 0.8688 0.8007–0.9427 0.0007 PGBD5 0.1966 1.2173 1.0611–1.3966 0.0050 INA −0.0901 0.9139 0.8208–1.0175 0.1003 UCHL1 0.0897 1.0938 0.9874–1.2117 0.0859 LHX6 −0.5555 0.5738 0.4645–0.7087 0.0000 [99]Open in a new tab The patients were divided into high-risk group and low-risk group according to the median risk score. Kaplan-Meier survival analysis and log-rank test of high-risk and low-risk GBM patients showed that there was a significant difference in the overall survival between the two groups (p < 0.0001). Patients in the low-risk group had a better prognosis, that was to say, the patients had longer overall survival, the median overall survival of high and low risk groups respectively were 10.23 and 16.12 months ([100]Figure 4B). The time-dependent ROC curves were generated to assess the ability to discriminate the prognostic risk of the model. The areas under the curve (AUC) of predicting 1-, 2-, and 3-years OS in TCGA dataset were 0.7, 0.79, and 0.86, respectively ([101]Figure 4D). The expression patterns of 16 genes in high and low risk groups of TCGA cohort were shown in [102]Figure 4C. With the increase of risk score, the expression level of CLSTN2, NMB, SNAP91, PCSK1N, INA, and LHX6 were lower and lower, and the expression level of the other genes were higher and higher. The distribution of risk scores and survival information for patients in high and low risk groups were shown in [103]Figure 4E, higher risk score patients have a shorter survival time. 3.4. The Model Was Sufficient and Effective in Predicting Overall Survival in GBM External Verification Two external validation sets were then used to evaluate the prediction efficiency of the model. Using the same model and parameters, patients in the validation sets were also divided into high-risk and low-risk groups. Similar to the training set, the overall survival of the high-risk group was significantly shorter than that of the low-risk group (p < 0.05), the median overall survival of high and low risk groups in validation set 1 respectively are 9.42 and 18.78 months, the median overall survival of high and low risk groups in validation set 2 respectively were 7.79 and 16.18 months ([104]Figure 5A,B). The AUC values of 1-, 2-, 3-years overall survival prediction in validation sets, respectively, are 0.75, 0.74, 0.79 and 0.61, 0.71, 0.74 ([105]Figure 5C,D). The expression patterns of 16 genes in verification set 1 and 2 were shown in [106]Figure 5E,F, the expression patterns with the increase of risk score of these genes in validation cohorts have similar trends as in the training set. [107]Figure 5G,H showed the distribution of risk scores and survival information of patients in the validation sets, it also shows that the higher the risk score, the shorter the patient’s survival. These results indicated the accuracy of the survival risk model constructed by 16 genes in predicting the prognosis of patients with glioblastoma. Figure 5. [108]Figure 5 [109]Open in a new tab Validation and evaluation of the effectiveness of the survival risk model. (A,B) Kaplan-Meier survival analysis curves of validation set 1, 2; (C,D) Time-dependent ROC curves of validation set 1, 2 for the first, second, and third years; (E,F) Heatmap of 16 gene expression profiles in validation set 1, 2. Red parts indicate higher expression and green parts indicate lower expression; (G,H) Risk scores distribution and survival status scatter plots of patients in validation set 1, 2. To further verify the independence of the prognostic model, Univariate and Multivariate Cox regression analyses were performed. The result showed that the prognosis risk score was significantly associated with overall survival, independent of clinical factors including age and gender ([110]Table 2). Table 2. Univariate and multivariate analyses of prognostic factors and overall survival of GBM patients. Training Set Univariate Cox Regression Analysis Multivariate Cox Regression Analysis HR 95% CI p-Value HR 95% CI p-Value Risk (high/low) 2.719 2.218–3.334 6.352 × 10^−22 2.471 2.006–3.044 1.839 × 10^−17 Age (≥60/<60) 1.866 1.544–2.255 1.093 × 10^−10 1.583 1.304–1.922 3.341 × 10^−06 Gender (male/female) 0.852 0.703–1.033 0.103 0.957 0.789–1.162 0.657 Validation Set 1 Risk (high/low) 2.476 1.584–3.872 6.99 × 10^−05 2.286 1.429–3.657 5.605 × 10^−04 Age (≥60/<60) 1.796 1.061–3.041 0.029 1.365 0.788–2.364 0.267 Gender (male/female) 1.233 0.798–1.905 0.346 1.099 0.708–1.707 0.673 Validation Set 2 Risk (high/low) 2.293 1.573–3.343 1.592 × 10^−05 2.17 1.481–3.179 7.026 × 10^−05 Age (≥60/<60) 2.449 1.734–3.459 3.637 × 10^−07 2.35 1.655–3.337 1.785 × 10^−06 Gender (male/female) 0.902 0.637–1.277 0.56 0.823 0.579–1.168 0.275 [111]Open in a new tab 3.5. The Sixteen-Gene Model Was More Robust and Effective Compaired with Four Existing-Survival-Related Gene Signatures We also built the survival prediction model with four existing survival-related signatures We evaluated the ability of these models for predicting the survival risk of GBM patients in the three cohorts. The result showed that Yin’s signature also was robust in predicting the prognosis risk of patients with GBM (log-rank test p < 0.05 in all three cohorts; [112]Figure 6A and [113]Figure S1A,H). The smaller the p-value, the greater the difference in survival between the two groups. The results of Kaplan-Meier survival analysis, ROC curves, and AUC values of different models for predicting 1-, 2-, 3-years overall survival in training cohort were shown in [114]Figure 6A–G. The results of Kaplan-Meier survival analysis, ROC curves, and AUC values of these models to predict 1-, 2-, 3-years overall survival in validation sets were shown in [115]Figure S1. Although the models can divide GBM patients into low and high-risk groups, the difference in survival between the two groups was mostly insignificant. The performance of our model and the models of the existing four signatures was summarized in [116]Table 3. As shown in the table, most of the AUC values of our model were bigger than the existing models in the three cohorts. The performance of these four models for survival risk prediction was very poor in the validation set 1, the p-values of Pan’s, Wang’s and Cheng’s signature are greater than 0.05, and the AUC values of the four models were only about 0.5–0.6 for predicting 1-, 2-, 3-years overall survival. And the AUC values of our model for predicting 1-, 2-, 3-years overall survival in three cohorts were more stable. These results demonstrated that our model was more robust and powerful to predict overall survival than the four existing signatures. Figure 6. [117]Figure 6 [118]Open in a new tab Comparison of ability for survival prediction by the sixteen-gene signature and four published prognostic signatures in the training cohort. (A–D) Kaplan-Meier analysis of four published signatures for the overall survival of glioblastoma patients in training cohort; (E–G) The ROC curves of different models for predicting 1-, 2-, 3-year overall survival in the training cohort. Table 3. The performance of our model and the models of existing four signatures. Training Set Our Model The Model of Yin’s Signature The Model of Pan’s Signature The Model of Wang’s Signature The Model of Cheng’s Signature p-value (between low and high risk) <0.0001 0.0012 0.0092 <0.0001 0.00046 AUC value of 1 year 0.7 0.59 0.53 0.67 0.57 AUC value of 2 years 0.79 0.64 0.57 0.68 0.61 AUC value of 3 years 0.86 0.63 0.57 0.7 0.66 Validation Set 1 p-value (between low and high risk) <0.0001 0.015 0.1 0.092 0.19 AUC value of 1 year 0.75 0.68 0.62 0.42 0.57 AUC value of 2 years 0.74 0.7 0.6 0.55 0.68 AUC value of 3 years 0.79 0.65 0.64 0.49 0.63 Validation Set 2 p-value (between low and high risk) <0.0001 0.00039 0.0013 0.0047 <0.0001 AUC value of 1 year 0.61 0.6 0.55 0.63 0.61 AUC value of 2 years 0.71 0.57 0.68 0.72 0.74 AUC value of 3 years 0.74 0.61 0.74 0.85 0.82 [119]Open in a new tab 3.6. The Risk Scores Were Association with Some Critical Clinicopathological Parameters In training set and validation set 2, the correlation between risk score and some important clinicopathological features (therapeutic modalities, key molecular biomarkers, etc.) was analyzed. The results showed that there was a statistically significant difference in risk score between GBM patients with IDH wild type and IDH mutation type, GBM patients with IDH wild type have higher risk scores both in TCGA cohort and [120]GSE16011 cohort ([121]Figure 7A,H), which was also demonstrated in previous studies [[122]20,[123]21,[124]34]. The higher risk score was associated with MGMT promoter unmethylation subtype, non-methylated subtype, Mesenchymal subtype, TERT mutation type, ATRX wild type, and radiotherapy alone ([125]Figure 7B–G), and high-risk score means poor prognosis, they were all risk factors for survival. All these results indicate that a high-risk score was associated with short survival. Figure 7. [126]Figure 7 [127]Open in a new tab Correlation between risk score and clinicopathology. (A,H) Comparison of risk scores in IDH-WT GBM and IDH-Mutant GBM in TCGA cohort and [128]GSE16011 cohort; (B) Comparison of risk scores in MGMT-promoter- methylation, and MGMT-promoter- unmethylation GBM in the TCGA cohort; (C) Comparison of risk scores in G-CIMP and non-G-CIMP methylation GBM in the TCGA cohort; (D) Comparison of risk scores in Proneural, Neural, Classical, and Mesenchymal GBM in the TCGA cohort; (E) Comparison of risk scores in TERT-WT and TERT-Mutant GBM in TCGA cohort; (F) Comparison of risk scores in ATRX-WT and ATRX-Mutant GBM in the TCGA cohort; (G) Comparison of risk scores for therapeutic modalities in the TCGA cohort (only radiotherapy, radiotherapy and chemotherapy, only chemotherapy). ****: p ≤ 0.0001, ***: p ≤ 0.001, **: p ≤ 0.01, *: p ≤ 0.05, ns: p > 0.05. To confirm and further convince the independent of the prognostic efficacy of the risk model, the clinical-pathological factors such as therapeutic modalities, key genes status, and expression subtypes, GBM patients were stratified by these factors, and Kaplan Meier survival analysis was performed in the training cohort and validation set 2 cohort. Kaplan Meier survival analysis was also performed for high-risk and low-risk groups of GBM patients combined with clinical parameters. As shown in [129]Figures S2 and S3, there were generally and significantly differences in overall survival between high-risk and low-risk groups of patients among stratified GBM patients by clinical-pathological factors. These results suggest that the model can distinguish the high-risk subgroup and the low-risk subgroup in each clinical subtype, the risk score can effectively predict the survival risk of patients with glioblastoma. Univariate and Multivariate Cox regression analyses were also performed with patients who having the clinical-pathological and molecular characteristics information in TCGA training set (206 samples) and validation set 2 (92 samples). The results showed that the prognosis risk score of our model was significantly associated with overall survival, independent of clinical and molecular characteristics such as IDH status, MGMT promoter status ([130]Table S4). The analysis also shows that some molecular markers and clinical information also have the ability of survival risk assessment, such as Karnofsky Performance Status (KPS) score. 3.7. The Biological Pathway of Survival-Related Genes Involved In We constructed the model using 16 survival-related genes, due to the number of these 16 genes being too small to enrich for significant pathways, we used all 99 survival-related genes for pathway analysis. Functional enrichment analysis of 99 survival-related genes was conducted, and the results are shown in [131]Figure 8. For molecular functions (MF), the major enriched GO terms were enzyme inhibitor activity, phospholipase inhibitor activity, ion channel binding, etc. ([132]Figure 8A). For cellular components (CC), the genes were mainly enriched in the synaptic membrane, transport vesicle, synaptic vesicle, etc. ([133]Figure 8B). For the biological processes (BP), the major enriched GO terms were the modulation of chemical synaptic transmission, regulation of trans-synaptic signaling, synapse organization, etc. ([134]Figure 8C). According to KEGG analysis, these genes are mainly enriched in Pertussis, Complement and coagulation cascades, ECM receptor interaction, and many other pathways ([135]Figure 8D). Figure 8. [136]Figure 8 [137]Open in a new tab The Gene Ontology (GO) and Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway enrichment analysis of genes associated with overall survival in GBM patients. (A) The top 5 statistically significant enriched terms in molecular function; (B) The top 5 statistically significant enriched terms in cellular components; (C) The top 5 statistically significant enriched terms in biological process; (D) The top 5 enriched KEGG pathways (Some are not statistically significant). In 16 survival-related genes, GPRASP1, CHRM3, CLSTN2, NELL1, SEZ6L2, ICAM5, HPCAL4, SNAP91, PCSK1N, PGBD5, INA, UCHL1 and LHX6 were down regulated, IGFBP2, C1R and NMB were up regulated compared with normal tissue samples. GPRASP1 has been found overexpressed in brain, pancreatic, and breast cancers as compared to their respective normal tissues, and it may be a biomarker for early-stage cancer [[138]35]. CHRM3 may play a vital role in the deficits of the thalamus–cortical connectivities and is associated with schizophrenia dysfunction [[139]36]. Overexpression of CHRM3 or activation of CHRM3 by carbachol promoted cell proliferation, migration, and castration resistance in prostate cancer [[140]37]. CLSTN2 is associated with episodic memory and late-onset Alzheimer’s disease [[141]38,[142]39]. NELL1 has been found to be associated with a variety of tumors, which may inhibit the progress of cancer. That makes it a promising candidate biomarker of tumor suppressor genes and a potential target for tumor therapy [[143]40,[144]41,[145]42,[146]43,[147]44]. SEZ6L2 is highly expressed in colorectal cancer and high expression means a poor prognosis [[148]45]. It also is an important regulator of drug-resistant cells and tumor spheroid cells in lung adenocarcinoma [[149]46]. ICAM5 is an intercellular adhesion molecule and may play a role in tumorigenesis and perineural invasion, most likely through P13K/Akt signaling pathway [[150]47]. ICAM-5 regulates T cell activity and participates in the immune privilege of the brain. It may be a useful anti-inflammatory agent for the treatment of various inflammatory brain diseases [[151]48]. Some studies have found that HPCAL4 may be the core gene of GBM and a potential therapeutic target [[152]49]. It also has been found can affect many cellular processes, such as homeostasis, learning and memory, cancer, and pain [[153]50]. SNAP91 is associated with Alzheimer’s disease [[154]51], schizophrenia [[155]52], Parkinson’s disease [[156]53] and colorectal cancer [[157]54]. PCSK1N has an association with various neurodegenerative diseases, widely expressed in neurons throughout the brain [[158]55,[159]56]. PGBD5 which encodes an active DNA transposase was highly expressed in various childhood and adult solid tumors [[160]57]. The PGBD5 DNA transposase confers a synthetic dependency on DNA damage repair and signaling, and expression of PGBD5 induces DNA damage, which requires both DNA damage repair and DNA damage signaling, resulting in apoptosis if impaired by their selective inhibitors [[161]58]. It also has been found to be associated with frontotemporal dementia [[162]59]. INA is a strong prognostic factor in gliomas [[163]60]. UCHL1 may play an important role in maintaining axonal function after cerebral ischemia [[164]61]. UCHL1 is also associated with many types of cancers including lung, colorectal, breast cancer, and pancreatic, which can promote tumor progression [[165]62,[166]63]. LHX6 is associated with many kinds of cancers and can inhibit the proliferation, invasion, and migration of breast cancer cells by modulating the PI3K/Akt/mTOR and Wnt/β-catenin signaling pathways [[167]64,[168]65,[169]66]. LHX6 plays a tumor-suppressing role in MC-LR-induced liver cancer through the Wnt/β-catenin and P53 pathways [[170]67]. IGFBP2 plays an essential role in cognitive development during early life, it is a potential target for learning and memory impairment therapies and neurodegenerative diseases [[171]68]. IGFBP2 is overexpressed and promotes many key oncogenic processes, including epithelial-to-mesenchymal transition, cellular migration, invasion, angiogenesis, stemness, transcriptional activation, and epigenetic programming in many solid tumors [[172]69]. IGFBP2 promotes tumor progression in pancreatic ductal adenocarcinoma through the STAT3 pathway [[173]70]. C1R is up-regulated in cutaneous squamous cell carcinoma [[174]71]. DNA-methylation of C1R is significantly correlated with overall survival in acute myeloid leukemia [[175]72]. NMB is expressed in non-tumorigenic cells, but its expression is low to undetectable in malignant cells [[176]73], and several studies have shown that NMB may act as a tumor suppressor [[177]74,[178]75,[179]76,[180]77]. 4. Discussion Glioblastoma is a kind of central nervous system tumor with a poor prognosis. Combined with clinical and overall survival information, we analyzed in-depth the gene expression data of GBM patients and constructed the survival risk prediction model. We found some survival-related genes and analyzed their molecular biomechanisms. These genes can be used as potential biomarkers in diagnosis and treatment. The results and our model can support clinical decision-making for the diagnosis and prognosis of the disease, so as to provide effective treatment and obtain better prognostic outcomes. In this study, a total of 424 differentially expressed genes were obtained between GBM and normal brain tissue. Univariate Cox analysis showed that 99 genes were associated with overall survival. The survival-related genes were selected for further analysis. In the training set, 16 genes related to overall survival were screened by LASSO-Cox regression analysis and multiple Cox regression analysis. We found that the expression levels of CLSTN2, NMB, SNAP91, PCSK1N, INA and LHX6 were positively correlated with the overall survival of glioblastoma, while the expression levels of IGFBP2, GPRASP1, C1R, CHRM3, NELL1, SEZ6L2, ICAM5, HPCAL4, PGBD5 and UCHL1 were negatively correlated with overall survival. All these genes have been found to be associated with diseases or cancers. It reflects from the side the biological significance of the survival-related genes we selected and the effectiveness of our model. Based on these 16 genes, we built a prognostic survival risk model. The patients were divided into low-risk group and high-risk group by the median risk score. Two external datasets verified the effectiveness of the model, the results show that the model was more robust and can effectively predict the survival risk of patients. Because we took the same risk score threshold in different groups, we didn’t have to adjust the risk cutoff value of high and low risk groups in different GBM patients’ datasets, so our model can be used to predict survival risk for a new individual patient of glioblastoma. We also found that GBM patients with IDH wild type, MGMT promoter unmethylation, Mesenchymal subtype, TERT mutation have higher risk scores, which means these are all negative factors. The model can distinguish the high-risk subgroup and the low-risk subgroup in each clinical subtype. Our model was more robust and effective compared with the existing four gene-related signatures, maybe because we used three steps to selected survival-related genes and constructed the model, trying to search for survival-related gene combinations as best as possible and reduce the redundancy in genes to construct the optimal model. There are some limitations to our work. First of all, our differential expression analysis only contained a very limited number of normal samples. The number of genes included in the study is only more than 12,000, which may overlook some potential mRNAs. Secondly, this study did not explore the protein expression of these 16 genes and their prognostic effects. Thirdly, the reliability of the scoring model needs further clinical and experimental verification. Fourthly, the genes expression data we used were only generated by Affymetrix human genome U133A array Plate Set and Affymetrix Gene Chip Human Genome U133 Plus 2.0 Array Plate Set, our model may not be suitable for data generated by other chips, because the data generated by different types of chips may be very different. Acknowledgments