Abstract
Previous studies have found that gene expression levels are associated
with prognosis and some genes can be used to predict the survival risk
of glioblastoma (GBM) patients. However, most of them just built the
survival-related gene signature, and personal survival risk can be
evaluated only in group. This study aimed to find the prognostic
survival related genes of GBM, and construct survival risk prediction
model, which can be used to evaluate survival risk by individual. We
collected gene expression data and clinical information from the Gene
Expression Omnibus (GEO) and The Cancer Genome Atlas (TCGA) databases.
Cox regression analysis and LASSO-cox regression analysis were
performed to get survival-related genes and establish the overall
survival prediction model. The ROC curve and Kaplan Meier analysis were
used to evaluate the prediction ability of the model in training set
and two independent cohorts. We also analyzed the biological functions
of survival-related genes by GO and KEGG enrichment analysis. We
identified 99 genes associated with overall survival and selected 16
genes (IGFBP2, GPRASP1, C1R, CHRM3, CLSTN2, NELL1, SEZ6L2, NMB, ICAM5,
HPCAL4, SNAP91, PCSK1N, PGBD5, INA, UCHL1 and LHX6) to establish the
survival risk prediction model. Multivariate Cox regression analysis
indicted that the risk score could predict overall survival independent
of age and gender. ROC analyses showed that our model was more robust
than four existing signatures. The sixteen genes can also be potential
transcriptional biomarkers and the model can assist doctors on clinical
decision-making and personalized treatment of GBM patients.
Keywords: GBM, prognostic risk model, overall survival,
survival-related genes
1. Introduction
According to the 2016 WHO classification of tumors of the central
nervous system, gliomas were classified into grade I–IV based on
histopathology and genomics. Glioblastoma multiforme (GBM) is the
highest malignant brain tumor as grade IV [[28]1]. In the newest
version of the classification of tumors of the central nervous system,
numerous molecular changes with clinicopathologic utility that are
important for the most accurate classification of CNS neoplasms were
listed, such as IDH, TERT promoter, chromosomes 7/10, EGFR, which are
important for glioblastoma [[29]2]. Glioblastoma is the most common
malignant primary brain tumor, accounting for 57.7% of all gliomas and
48.6% of all primary malignant central nervous system (CNS) tumors
[[30]3]. Although the treatment of glioblastoma has been greatly
improved in diagnosis, surgery, traditional therapy such as
chemoradiotherapy, and nursing, there are still efforts in
immunotherapy, the overall prognosis is still discouraging, and the
overall survival is still very poor due to the heterogeneity of
glioblastoma [[31]4]. For patients diagnosed with GBM, the median
survival was about 12–15 months [[32]5,[33]6], and the 5-year survival
rate was less than 10% [[34]7].
With the development of high-throughput sequencing technology, many
potential biomarkers related to the diagnosis and prognosis for
treatment decision making of glioblastoma have been found, such as MGMT
(O6 methylguanine DNA methyltransferase), H3F3A (H3 histone, family
3A), IDH (isocitrate dehydrogenase), EGFR (epidermal growth factor
receptor), and PTEN (phosphatase and tensin homolog)
[[35]7,[36]8,[37]9,[38]10,[39]11]. The status of these biomarkers can
provide a basis for individualized and targeted diagnosis and treatment
management of patients.
Previous studies have shown that gene expression profiling can be used
to classify glioma patients and identify patients with different
overall survival and clinical characteristics
[[40]12,[41]13,[42]14,[43]15,[44]16,[45]17]. GBM is divided into
Proneural, Neural, Classical, and Mesenchymal subtypes by gene
expression-based molecular classification, the result shows that
different subtypes may require different therapeutic approaches
[[46]11,[47]16]. Gene expression profiling can also be used to identify
risk genes associated with survival and prognosis, and to assess the
survival risk of patients. Zhang et al. identified 16 endoplasmic
reticulum (ER) stress-related genes (CYP2E1, SLN, BRCA1, CISD2, LRRK2,
BMP2, MYH7, HSPB1, DNM1L, SHISA5, RNF185, RCN1, SPP1, RPN2, PDIA3 and
ATP2A2), and established an ER stress risk model based on The Cancer
Genome Atlas (TCGA) glioma database to reflect the immune
characteristics and predict the prognosis of glioma patients [[48]18].
Yin et al. analyzed the gene expression profiles and identified five
novel biomarkers (PTPRN, RGS14, G6PC3, IGFBP2 and TIMP4) that have
potential in the prognosis prediction of GBM [[49]19]. Wang et al.
identified 14 autophagy-related genes (MTMR14, LENG9, P4HB, TCIRG1,
HSPA5, DRAM1, CTSD, S100A8, CCL2, MSTN, UBQLN4, PHYHIP, RRAGB and
ZKSCAN3) associated with the overall survival of patients with
glioblastoma, and built a novel autophagy-related signature for the
prediction of prognosis [[50]20]. Cao et al. built a four-gene
signature-derived (OSMR, HOXC10, SCARA3 and SLC39A10) risk score model
to predict the survival and treatment response of GBM patients
[[51]21]. Pan et al. identified two genes (GRIA2 and RYR3) strongly
associated with survival of GBM, and the two-gene signature was a
robust prognostic model to predict GBM survival [[52]22].
However, most of the studies just identified the survival-related genes
and constructed a survival risk prediction model based on the
prognostic signature, the model can generate a risk score for each
patient, the risk score threshold divided patients into different risk
groups. However, the risk score thresholds (usually the median risk
score) are changed in different groups of patients collected from
different institutions in these studies, which may be not convenient in
clinical application. Based on public databases, we aim to explore
novel prognostic biomarkers for survival prediction. By analyzing the
genes expression profiles of GBM patients downloaded from Gene
Expression Omnibus (GEO) and TCGA databases, we identified the
prognostic genes and constructed a robust risk score model based on 16
genes, and we take the same risk score threshold in different groups.
Therefore, our model can be used to predict survival risk for the
individual patient with newly diagnosed glioblastoma. The workflow
diagram of this study was shown in [53]Figure 1.
Figure 1.
[54]Figure 1
[55]Open in a new tab
Schematic diagram of the study design and development of survival risk
prediction model.
2. Materials and Methods
2.1. TCGA Dataset
We downloaded the GBM gene expression array dataset (Affymetrix HT
Human Genome U133 Array Plate Set, level 1) and clinical information
from TCGA [[56]10]. After removing duplicated samples, ten normal
control samples and 524 disease samples were obtained, the clinical
information included age, gender, survival time, survival status, and
so on. TCGA data set served as the training set.
2.2. The GEO Dataset
We downloaded three raw data sets of gene expression from GEO:
[57]GSE4412 [[58]15] and [59]GSE4271 [[60]16] are all generated by
Affymetrix human genome U133A array Plate Set. After removing duplicate
samples, [61]GSE4412 contained 50 glioblastoma samples and [62]GSE4271
contained 56 glioblastoma samples. We also downloaded the gene
expression profiles of [63]GSE16011 [[64]17], which contained 155
glioblastoma samples, gene expression profiles were generated by
Affymetrix Gene Chip Human Genome U133 Plus 2.0 Array Plate Set.
[65]GSE4412 and [66]GSE4271 patients’ data were combined as validation
set 1 and [67]GSE16011 patients’ data were used as validation set 2.
All the raw data was preprocessed by the R package AFFY and the
background correction and normalization were performed using robust
multi-array analysis (RMA). SVA package [[68]23] was used to remove
batch effects.
The clinical information of all patients is shown in [69]Table S1.
2.3. Differential Expression Analysis
There are 10 normal samples served as the control group, and 10 tumor
samples were randomly selected from TCGA training set served as the GBM
group. The differentially expressed genes (DEGs) between the GBM group
and the control group were analyzed by the limma software package
[[70]24] in R (version 4.0.5). Generally, the genes whose
[MATH:
|log<
mo>(FC)|
:MATH]
> 1 and adjusted p-value < 0.05 is considered to be statistically
significant DEGs [[71]25]. In this study, the genes were considered as
DEGs by the standards of whose adjusted p-value < 0.01 and
[MATH:
|log<
mo>(FC)|
:MATH]
> 2.5. R package pheatmap [[72]26] was used to draw heatmap while
ggplot2 ([73]https://cran.r-project.org/web/packages/ggplot2/, accessed
on 20 August 2021) was used to draw volcano map.
2.4. Identify Genes Associated with Survival
Univariable Cox regression analysis was performed with differentially
expressed genes in the survival R software package
([74]https://cran.r-project.org/web/views/Survival.html, accessed on 20
August 2021) in order to analyze the correlation between each gene and
overall survival. Genes with log-rank p < 0.05 were considered to be
associated with overall survival [[75]27]. The survival-related genes
were further filtered by the Lasso-Cox regression model
[[76]28,[77]29,[78]30] in the glmnet R package
([79]https://cran.r-project.org/web/packages/glmnet/index.html,
accessed on 20 August 2021) to reduce the dimensionality of inputted
variables.
2.5. Construction of Survival Risk Model
In order to optimize the model, step-wise Multivariate Cox analysis was
used to further filter the survival risk related genes and construct
the survival risk model by the “survival”, “survminer” packages:
Risk score = h[0] × exp(gene[1] × coefficient[1] + gene[2] ×
coefficient[2] + gene[i] × coefficient[i]) (1)
Here, h[0] is a constant, gene[i] is the gene expression level. We
assigned risk scores to each sample in the training cohort according to
these gene expression levels. The patients were divided into low-risk
group and high-risk group by the median risk score. The prognostic
value of the model was evaluated by Kaplan-Meier survival analysis,
Log-rank test, and time-dependent receiver operating characteristic
(ROC) curve with “survival”, “survminer”, and “timeROC” packages in R.
2.6. Risk Model Validation
The validation set 1 and validation set 2 were employed to validate the
robustness of diagnostic accuracy on overall survival by the prediction
model. ROC curve and Kaplan Meier analysis were used to verify the
prognostic value of GBM patients. p-value < 0.05 was considered
statistically significant. Multivariate Cox analysis was used to
validate whether the risk score was an independent risk factor for GBM
survival.
2.7. Compare with Existing Signatures
As mentioned at first, lots of genes have been found related to overall
survival and could be used to predict the prognosis of patients with
GBM. We compared the prognostic capability of our model with signatures
reported in other studies, including the five-gene signature (PTPRN,
RGS14, G6PC3, IGFBP2 and TIMP4) screened by Yin et al. [[80]19], the
two-gene signature (GRIA2 and RYR3) screened by Pan et al. [[81]22],
the five-gene signature (DES, RANBP17, CLEC5A, HOXC11 and POSTN)
screened by Wang et al. [[82]25], the four-gene signature (LHX2, MEOX2,
SNAI2 and ZNF22) derived by Cheng et al. [[83]31].
For each signature, we first conducted Multivariate Cox regression
analysis and built the prognostic model with its genes in TGCA cohort.
Then according to the median risk score, the patients were divided into
low- and high-risk groups in TGCA cohort and two validation cohorts.
Kaplan-Meier analysis and ROC curves were performed to evaluate the
prognostic power of each model.
2.8. GO and KEGG Enrichment Analysis of Survival-Related Genes
In order to identify potential molecular biomechanisms of genes related
to survival, we conducted functional enrichment analyses including GO
(Gene Ontology) and KEGG (Kyoto Encyclopedia of Genes and Genomes)
pathways analysis and visualized the results by “ggplot2” package.
3. Results
3.1. Differentially Expressed Genes Can Clearly Distinguish GBM Patients from
Normal Samples
In total, 424 differentially expressed genes were identified by
differential expression analysis of 10 normal brain tissue samples and
10 GBM tumor samples. Among these genes, 327 were down regulated and 97
were up regulated in GBM tumor tissue ([84]Figure 2A). These
differentially expressed genes can clearly distinguish GBM samples from
normal samples, tumor samples were tightly clustered together and
notably separated from normal samples ([85]Figure 2B). The 424
differently expressed genes were listed in [86]Table S2.
Figure 2.
[87]Figure 2
[88]Open in a new tab
Identification of DEGs between GBM and normal brain tissues. (A)
Volcano plots of differentially expressed genes showing the log (Fold
Change) of mRNA in GBM compared to normal brain tissues, and the
corresponding–log10 (adjusted p-value). Genes with adjusted p-value
below 0.01 and fold change above 2.5 (below −2.5) were marked with red
(blue) dots; (B) Heatmap of differentially expressed genes. GBM samples
can be clearly distinguished from normal samples according to
differentially expressed genes that are from (A). The color bar means
the expression level of differentially expressed genes.
3.2. 99 DEGs Were Significantly Related to Survival
Univariate Cox regression analysis was performed to identify the
overall survival related genes in the training set. 99 genes were
identified as potential survival-related biomarkers of GBM patients (p
< 0.05) ([89]Table S3). LASSO Cox regression analysis was performed to
further analyze these 99 survival-related genes to avoid overfitting
[[90]32,[91]33], genes were filtered by 10-fold cross-validation with a
maximum of 1000 iterations, and 28 genes were screened for subsequent
analysis ([92]Figure 3).
Figure 3.
[93]Figure 3
[94]Open in a new tab
Lasso-cox gene screening process. (A) LASSO coefficient profiles of the
99 genes in TCGA GBM cohort; (B) Selection of the best parameter
(lambda) in the LASSO-Cox model.
3.3. The Risk Score of Sixteen-Gene Model Were Strongly Associated with
Overall Survival of GBM Patients
Furthermore, multivariate Cox analysis narrowed the list down further
to 16 survival related genes and we constructed the survival risk model
with these genes, which can be used to predict the overall survival
risk of patients with GBM ([95]Figure 4A). Of these genes, CLSTN2, NMB,
SNAP91, PCSK1N, INA, and LHX6 were favorable prognostic factors for
glioblastoma survival, whose risk ratio (HR) < 1, the regression
coefficient < 0. IGFBP2, GPRASP1, C1R, CHRM3, NELL1, SEZ6L2, ICAM5,
HPCAL4, PGBD5, and UCHL1 were risk factors, whose HR > 1, the
regression coefficient > 0. The risk score of each patient was
generated as follows: 0.04701 × exp(0.0842 × IGFBP2 + 0.1187 × GPRASP1
+ 0.0962 × C1R + 0.0965 × CHRM3 − 0.1409 × CLSTN2 + 0.1231 × NELL1 +
0.0965 × SEZ6L2 − 0.149 × NMB + 0.4547 × ICAM5 + 0.3367 × HPCAL4 −
0.0876 × SNAP91 − 0.1406 × PCSK1N + 0.1966 × PGBD5 − 0.0901 × INA +
0.0897 × UCHL1 − 0.5555 × LHX6) ([96]Table 1).
Figure 4.
[97]Figure 4
[98]Open in a new tab
Establishment of the survival risk model. (A) 16 genes were eventually
selected to establish prognostic model. ***: p < 0.001, **: p < 0.01,
*: p < 0.05; (B) Kaplan-Meier curve of overall survival in high- and
low-risk group of the GBM patients in the training cohort; (C) Heatmap
of 16 genes expression patterns; (D) ROC curves of the model for
overall survival at 1-, 2-, and 3- years; (E) The distribution risk
scores in the TCGA cohort.
Table 1.
The details of the 16 genes used in the prognostic prediction model.
Gene Name Coef. HR 95% CI p-Value
IGFBP2 0.0842 1.0878 1.0036–1.1790 0.0405
GPRASP1 0.1187 1.1260 1.0099–1.2555 0.0325
C1R 0.0962 1.1010 1.0142–1.1951 0.0216
CHRM3 0.0965 1.1013 1.0063–1.2054 0.0361
CLSTN2 −0.1409 0.8686 0.7665–0.9842 0.0272
NELL1 0.1231 1.1310 1.0151–1.2600 0.0256
SEZ6L2 0.0965 1.1013 0.9918–1.2230 0.0710
NMB −0.1490 0.8616 0.7935–0.9354 0.0004
ICAM5 0.4547 1.5757 1.0990–2.2592 0.0134
HPCAL4 0.3367 1.4004 1.1277–1.7389 0.0023
SNAP91 −0.0876 0.9162 0.8245–1.0180 0.1034
PCSK1N −0.1406 0.8688 0.8007–0.9427 0.0007
PGBD5 0.1966 1.2173 1.0611–1.3966 0.0050
INA −0.0901 0.9139 0.8208–1.0175 0.1003
UCHL1 0.0897 1.0938 0.9874–1.2117 0.0859
LHX6 −0.5555 0.5738 0.4645–0.7087 0.0000
[99]Open in a new tab
The patients were divided into high-risk group and low-risk group
according to the median risk score. Kaplan-Meier survival analysis and
log-rank test of high-risk and low-risk GBM patients showed that there
was a significant difference in the overall survival between the two
groups (p < 0.0001). Patients in the low-risk group had a better
prognosis, that was to say, the patients had longer overall survival,
the median overall survival of high and low risk groups respectively
were 10.23 and 16.12 months ([100]Figure 4B). The time-dependent ROC
curves were generated to assess the ability to discriminate the
prognostic risk of the model. The areas under the curve (AUC) of
predicting 1-, 2-, and 3-years OS in TCGA dataset were 0.7, 0.79, and
0.86, respectively ([101]Figure 4D). The expression patterns of 16
genes in high and low risk groups of TCGA cohort were shown in
[102]Figure 4C. With the increase of risk score, the expression level
of CLSTN2, NMB, SNAP91, PCSK1N, INA, and LHX6 were lower and lower, and
the expression level of the other genes were higher and higher. The
distribution of risk scores and survival information for patients in
high and low risk groups were shown in [103]Figure 4E, higher risk
score patients have a shorter survival time.
3.4. The Model Was Sufficient and Effective in Predicting Overall Survival in
GBM External Verification
Two external validation sets were then used to evaluate the prediction
efficiency of the model. Using the same model and parameters, patients
in the validation sets were also divided into high-risk and low-risk
groups. Similar to the training set, the overall survival of the
high-risk group was significantly shorter than that of the low-risk
group (p < 0.05), the median overall survival of high and low risk
groups in validation set 1 respectively are 9.42 and 18.78 months, the
median overall survival of high and low risk groups in validation set 2
respectively were 7.79 and 16.18 months ([104]Figure 5A,B). The AUC
values of 1-, 2-, 3-years overall survival prediction in validation
sets, respectively, are 0.75, 0.74, 0.79 and 0.61, 0.71, 0.74
([105]Figure 5C,D). The expression patterns of 16 genes in verification
set 1 and 2 were shown in [106]Figure 5E,F, the expression patterns
with the increase of risk score of these genes in validation cohorts
have similar trends as in the training set. [107]Figure 5G,H showed the
distribution of risk scores and survival information of patients in the
validation sets, it also shows that the higher the risk score, the
shorter the patient’s survival. These results indicated the accuracy of
the survival risk model constructed by 16 genes in predicting the
prognosis of patients with glioblastoma.
Figure 5.
[108]Figure 5
[109]Open in a new tab
Validation and evaluation of the effectiveness of the survival risk
model. (A,B) Kaplan-Meier survival analysis curves of validation set 1,
2; (C,D) Time-dependent ROC curves of validation set 1, 2 for the
first, second, and third years; (E,F) Heatmap of 16 gene expression
profiles in validation set 1, 2. Red parts indicate higher expression
and green parts indicate lower expression; (G,H) Risk scores
distribution and survival status scatter plots of patients in
validation set 1, 2.
To further verify the independence of the prognostic model, Univariate
and Multivariate Cox regression analyses were performed. The result
showed that the prognosis risk score was significantly associated with
overall survival, independent of clinical factors including age and
gender ([110]Table 2).
Table 2.
Univariate and multivariate analyses of prognostic factors and overall
survival of GBM patients.
Training Set Univariate Cox Regression Analysis Multivariate Cox
Regression Analysis
HR 95% CI p-Value HR 95% CI p-Value
Risk (high/low) 2.719 2.218–3.334 6.352 × 10^−22 2.471 2.006–3.044
1.839 × 10^−17
Age (≥60/<60) 1.866 1.544–2.255 1.093 × 10^−10 1.583 1.304–1.922 3.341
× 10^−06
Gender (male/female) 0.852 0.703–1.033 0.103 0.957 0.789–1.162 0.657
Validation Set 1
Risk (high/low) 2.476 1.584–3.872 6.99 × 10^−05 2.286 1.429–3.657 5.605
× 10^−04
Age (≥60/<60) 1.796 1.061–3.041 0.029 1.365 0.788–2.364 0.267
Gender (male/female) 1.233 0.798–1.905 0.346 1.099 0.708–1.707 0.673
Validation Set 2
Risk (high/low) 2.293 1.573–3.343 1.592 × 10^−05 2.17 1.481–3.179 7.026
× 10^−05
Age (≥60/<60) 2.449 1.734–3.459 3.637 × 10^−07 2.35 1.655–3.337 1.785 ×
10^−06
Gender (male/female) 0.902 0.637–1.277 0.56 0.823 0.579–1.168 0.275
[111]Open in a new tab
3.5. The Sixteen-Gene Model Was More Robust and Effective Compaired with Four
Existing-Survival-Related Gene Signatures
We also built the survival prediction model with four existing
survival-related signatures We evaluated the ability of these models
for predicting the survival risk of GBM patients in the three cohorts.
The result showed that Yin’s signature also was robust in predicting
the prognosis risk of patients with GBM (log-rank test p < 0.05 in all
three cohorts; [112]Figure 6A and [113]Figure S1A,H). The smaller the
p-value, the greater the difference in survival between the two groups.
The results of Kaplan-Meier survival analysis, ROC curves, and AUC
values of different models for predicting 1-, 2-, 3-years overall
survival in training cohort were shown in [114]Figure 6A–G. The results
of Kaplan-Meier survival analysis, ROC curves, and AUC values of these
models to predict 1-, 2-, 3-years overall survival in validation sets
were shown in [115]Figure S1. Although the models can divide GBM
patients into low and high-risk groups, the difference in survival
between the two groups was mostly insignificant. The performance of our
model and the models of the existing four signatures was summarized in
[116]Table 3. As shown in the table, most of the AUC values of our
model were bigger than the existing models in the three cohorts. The
performance of these four models for survival risk prediction was very
poor in the validation set 1, the p-values of Pan’s, Wang’s and Cheng’s
signature are greater than 0.05, and the AUC values of the four models
were only about 0.5–0.6 for predicting 1-, 2-, 3-years overall
survival. And the AUC values of our model for predicting 1-, 2-,
3-years overall survival in three cohorts were more stable. These
results demonstrated that our model was more robust and powerful to
predict overall survival than the four existing signatures.
Figure 6.
[117]Figure 6
[118]Open in a new tab
Comparison of ability for survival prediction by the sixteen-gene
signature and four published prognostic signatures in the training
cohort. (A–D) Kaplan-Meier analysis of four published signatures for
the overall survival of glioblastoma patients in training cohort; (E–G)
The ROC curves of different models for predicting 1-, 2-, 3-year
overall survival in the training cohort.
Table 3.
The performance of our model and the models of existing four
signatures.
Training Set Our Model The Model of Yin’s Signature The Model of Pan’s
Signature The Model of Wang’s Signature The Model of Cheng’s Signature
p-value (between low and high risk) <0.0001 0.0012 0.0092 <0.0001
0.00046
AUC value of 1 year 0.7 0.59 0.53 0.67 0.57
AUC value of 2 years 0.79 0.64 0.57 0.68 0.61
AUC value of 3 years 0.86 0.63 0.57 0.7 0.66
Validation Set 1
p-value (between low and high risk) <0.0001 0.015 0.1 0.092 0.19
AUC value of 1 year 0.75 0.68 0.62 0.42 0.57
AUC value of 2 years 0.74 0.7 0.6 0.55 0.68
AUC value of 3 years 0.79 0.65 0.64 0.49 0.63
Validation Set 2
p-value (between low and high risk) <0.0001 0.00039 0.0013 0.0047
<0.0001
AUC value of 1 year 0.61 0.6 0.55 0.63 0.61
AUC value of 2 years 0.71 0.57 0.68 0.72 0.74
AUC value of 3 years 0.74 0.61 0.74 0.85 0.82
[119]Open in a new tab
3.6. The Risk Scores Were Association with Some Critical Clinicopathological
Parameters
In training set and validation set 2, the correlation between risk
score and some important clinicopathological features (therapeutic
modalities, key molecular biomarkers, etc.) was analyzed. The results
showed that there was a statistically significant difference in risk
score between GBM patients with IDH wild type and IDH mutation type,
GBM patients with IDH wild type have higher risk scores both in TCGA
cohort and [120]GSE16011 cohort ([121]Figure 7A,H), which was also
demonstrated in previous studies [[122]20,[123]21,[124]34]. The higher
risk score was associated with MGMT promoter unmethylation subtype,
non-methylated subtype, Mesenchymal subtype, TERT mutation type, ATRX
wild type, and radiotherapy alone ([125]Figure 7B–G), and high-risk
score means poor prognosis, they were all risk factors for survival.
All these results indicate that a high-risk score was associated with
short survival.
Figure 7.
[126]Figure 7
[127]Open in a new tab
Correlation between risk score and clinicopathology. (A,H) Comparison
of risk scores in IDH-WT GBM and IDH-Mutant GBM in TCGA cohort and
[128]GSE16011 cohort; (B) Comparison of risk scores in MGMT-promoter-
methylation, and MGMT-promoter- unmethylation GBM in the TCGA cohort;
(C) Comparison of risk scores in G-CIMP and non-G-CIMP methylation GBM
in the TCGA cohort; (D) Comparison of risk scores in Proneural, Neural,
Classical, and Mesenchymal GBM in the TCGA cohort; (E) Comparison of
risk scores in TERT-WT and TERT-Mutant GBM in TCGA cohort; (F)
Comparison of risk scores in ATRX-WT and ATRX-Mutant GBM in the TCGA
cohort; (G) Comparison of risk scores for therapeutic modalities in the
TCGA cohort (only radiotherapy, radiotherapy and chemotherapy, only
chemotherapy). ****: p ≤ 0.0001, ***: p ≤ 0.001, **: p ≤ 0.01, *: p ≤
0.05, ns: p > 0.05.
To confirm and further convince the independent of the prognostic
efficacy of the risk model, the clinical-pathological factors such as
therapeutic modalities, key genes status, and expression subtypes, GBM
patients were stratified by these factors, and Kaplan Meier survival
analysis was performed in the training cohort and validation set 2
cohort. Kaplan Meier survival analysis was also performed for high-risk
and low-risk groups of GBM patients combined with clinical parameters.
As shown in [129]Figures S2 and S3, there were generally and
significantly differences in overall survival between high-risk and
low-risk groups of patients among stratified GBM patients by
clinical-pathological factors. These results suggest that the model can
distinguish the high-risk subgroup and the low-risk subgroup in each
clinical subtype, the risk score can effectively predict the survival
risk of patients with glioblastoma. Univariate and Multivariate Cox
regression analyses were also performed with patients who having the
clinical-pathological and molecular characteristics information in TCGA
training set (206 samples) and validation set 2 (92 samples). The
results showed that the prognosis risk score of our model was
significantly associated with overall survival, independent of clinical
and molecular characteristics such as IDH status, MGMT promoter status
([130]Table S4). The analysis also shows that some molecular markers
and clinical information also have the ability of survival risk
assessment, such as Karnofsky Performance Status (KPS) score.
3.7. The Biological Pathway of Survival-Related Genes Involved In
We constructed the model using 16 survival-related genes, due to the
number of these 16 genes being too small to enrich for significant
pathways, we used all 99 survival-related genes for pathway analysis.
Functional enrichment analysis of 99 survival-related genes was
conducted, and the results are shown in [131]Figure 8. For molecular
functions (MF), the major enriched GO terms were enzyme inhibitor
activity, phospholipase inhibitor activity, ion channel binding, etc.
([132]Figure 8A). For cellular components (CC), the genes were mainly
enriched in the synaptic membrane, transport vesicle, synaptic vesicle,
etc. ([133]Figure 8B). For the biological processes (BP), the major
enriched GO terms were the modulation of chemical synaptic
transmission, regulation of trans-synaptic signaling, synapse
organization, etc. ([134]Figure 8C). According to KEGG analysis, these
genes are mainly enriched in Pertussis, Complement and coagulation
cascades, ECM receptor interaction, and many other pathways
([135]Figure 8D).
Figure 8.
[136]Figure 8
[137]Open in a new tab
The Gene Ontology (GO) and Kyoto Encyclopedia of Genes and Genomes
(KEGG) pathway enrichment analysis of genes associated with overall
survival in GBM patients. (A) The top 5 statistically significant
enriched terms in molecular function; (B) The top 5 statistically
significant enriched terms in cellular components; (C) The top 5
statistically significant enriched terms in biological process; (D) The
top 5 enriched KEGG pathways (Some are not statistically significant).
In 16 survival-related genes, GPRASP1, CHRM3, CLSTN2, NELL1, SEZ6L2,
ICAM5, HPCAL4, SNAP91, PCSK1N, PGBD5, INA, UCHL1 and LHX6 were down
regulated, IGFBP2, C1R and NMB were up regulated compared with normal
tissue samples. GPRASP1 has been found overexpressed in brain,
pancreatic, and breast cancers as compared to their respective normal
tissues, and it may be a biomarker for early-stage cancer [[138]35].
CHRM3 may play a vital role in the deficits of the thalamus–cortical
connectivities and is associated with schizophrenia dysfunction
[[139]36]. Overexpression of CHRM3 or activation of CHRM3 by carbachol
promoted cell proliferation, migration, and castration resistance in
prostate cancer [[140]37]. CLSTN2 is associated with episodic memory
and late-onset Alzheimer’s disease [[141]38,[142]39]. NELL1 has been
found to be associated with a variety of tumors, which may inhibit the
progress of cancer. That makes it a promising candidate biomarker of
tumor suppressor genes and a potential target for tumor therapy
[[143]40,[144]41,[145]42,[146]43,[147]44]. SEZ6L2 is highly expressed
in colorectal cancer and high expression means a poor prognosis
[[148]45]. It also is an important regulator of drug-resistant cells
and tumor spheroid cells in lung adenocarcinoma [[149]46]. ICAM5 is an
intercellular adhesion molecule and may play a role in tumorigenesis
and perineural invasion, most likely through P13K/Akt signaling pathway
[[150]47]. ICAM-5 regulates T cell activity and participates in the
immune privilege of the brain. It may be a useful anti-inflammatory
agent for the treatment of various inflammatory brain diseases
[[151]48]. Some studies have found that HPCAL4 may be the core gene of
GBM and a potential therapeutic target [[152]49]. It also has been
found can affect many cellular processes, such as homeostasis, learning
and memory, cancer, and pain [[153]50]. SNAP91 is associated with
Alzheimer’s disease [[154]51], schizophrenia [[155]52], Parkinson’s
disease [[156]53] and colorectal cancer [[157]54]. PCSK1N has an
association with various neurodegenerative diseases, widely expressed
in neurons throughout the brain [[158]55,[159]56]. PGBD5 which encodes
an active DNA transposase was highly expressed in various childhood and
adult solid tumors [[160]57]. The PGBD5 DNA transposase confers a
synthetic dependency on DNA damage repair and signaling, and expression
of PGBD5 induces DNA damage, which requires both DNA damage repair and
DNA damage signaling, resulting in apoptosis if impaired by their
selective inhibitors [[161]58]. It also has been found to be associated
with frontotemporal dementia [[162]59]. INA is a strong prognostic
factor in gliomas [[163]60]. UCHL1 may play an important role in
maintaining axonal function after cerebral ischemia [[164]61]. UCHL1 is
also associated with many types of cancers including lung, colorectal,
breast cancer, and pancreatic, which can promote tumor progression
[[165]62,[166]63]. LHX6 is associated with many kinds of cancers and
can inhibit the proliferation, invasion, and migration of breast cancer
cells by modulating the PI3K/Akt/mTOR and Wnt/β-catenin signaling
pathways [[167]64,[168]65,[169]66]. LHX6 plays a tumor-suppressing role
in MC-LR-induced liver cancer through the Wnt/β-catenin and P53
pathways [[170]67]. IGFBP2 plays an essential role in cognitive
development during early life, it is a potential target for learning
and memory impairment therapies and neurodegenerative diseases
[[171]68]. IGFBP2 is overexpressed and promotes many key oncogenic
processes, including epithelial-to-mesenchymal transition, cellular
migration, invasion, angiogenesis, stemness, transcriptional
activation, and epigenetic programming in many solid tumors [[172]69].
IGFBP2 promotes tumor progression in pancreatic ductal adenocarcinoma
through the STAT3 pathway [[173]70]. C1R is up-regulated in cutaneous
squamous cell carcinoma [[174]71]. DNA-methylation of C1R is
significantly correlated with overall survival in acute myeloid
leukemia [[175]72]. NMB is expressed in non-tumorigenic cells, but its
expression is low to undetectable in malignant cells [[176]73], and
several studies have shown that NMB may act as a tumor suppressor
[[177]74,[178]75,[179]76,[180]77].
4. Discussion
Glioblastoma is a kind of central nervous system tumor with a poor
prognosis. Combined with clinical and overall survival information, we
analyzed in-depth the gene expression data of GBM patients and
constructed the survival risk prediction model. We found some
survival-related genes and analyzed their molecular biomechanisms.
These genes can be used as potential biomarkers in diagnosis and
treatment. The results and our model can support clinical
decision-making for the diagnosis and prognosis of the disease, so as
to provide effective treatment and obtain better prognostic outcomes.
In this study, a total of 424 differentially expressed genes were
obtained between GBM and normal brain tissue. Univariate Cox analysis
showed that 99 genes were associated with overall survival. The
survival-related genes were selected for further analysis. In the
training set, 16 genes related to overall survival were screened by
LASSO-Cox regression analysis and multiple Cox regression analysis. We
found that the expression levels of CLSTN2, NMB, SNAP91, PCSK1N, INA
and LHX6 were positively correlated with the overall survival of
glioblastoma, while the expression levels of IGFBP2, GPRASP1, C1R,
CHRM3, NELL1, SEZ6L2, ICAM5, HPCAL4, PGBD5 and UCHL1 were negatively
correlated with overall survival. All these genes have been found to be
associated with diseases or cancers. It reflects from the side the
biological significance of the survival-related genes we selected and
the effectiveness of our model.
Based on these 16 genes, we built a prognostic survival risk model. The
patients were divided into low-risk group and high-risk group by the
median risk score. Two external datasets verified the effectiveness of
the model, the results show that the model was more robust and can
effectively predict the survival risk of patients. Because we took the
same risk score threshold in different groups, we didn’t have to adjust
the risk cutoff value of high and low risk groups in different GBM
patients’ datasets, so our model can be used to predict survival risk
for a new individual patient of glioblastoma.
We also found that GBM patients with IDH wild type, MGMT promoter
unmethylation, Mesenchymal subtype, TERT mutation have higher risk
scores, which means these are all negative factors. The model can
distinguish the high-risk subgroup and the low-risk subgroup in each
clinical subtype.
Our model was more robust and effective compared with the existing four
gene-related signatures, maybe because we used three steps to selected
survival-related genes and constructed the model, trying to search for
survival-related gene combinations as best as possible and reduce the
redundancy in genes to construct the optimal model.
There are some limitations to our work. First of all, our differential
expression analysis only contained a very limited number of normal
samples. The number of genes included in the study is only more than
12,000, which may overlook some potential mRNAs. Secondly, this study
did not explore the protein expression of these 16 genes and their
prognostic effects. Thirdly, the reliability of the scoring model needs
further clinical and experimental verification. Fourthly, the genes
expression data we used were only generated by Affymetrix human genome
U133A array Plate Set and Affymetrix Gene Chip Human Genome U133 Plus
2.0 Array Plate Set, our model may not be suitable for data generated
by other chips, because the data generated by different types of chips
may be very different.
Acknowledgments