Abstract Gliomas are the most common lethal brain tumours and remain great heterogeneity in terms of histopathology and clinical outcomes. Among them, glioblastomas are the most aggressive tumours that lead to a median of less than one-year survival in patients. Despite the little improvement of in diagnosis and treatments for last decades, there is an urgent need for prognostic markers to distinguish high- and low-risk patients before treatment.Here, we generated a list of genes associated with glioblastoma progressions and then performed a comprehensive statistical modelling strategy to derive a 10-gene (GLO10) score from genome wide expression profiles of a large glioblastoma cohort (n=844). Our study demonstrated that the GLO10 score could successfully distinguish high- and low-risk patients with glioblastomas regardless their traditional pathological factors. Validated in four independent cohorts, the utility of GLO10 score could provide clinicians a robust prognostic prediction tool to assess risk levels upfront treatments. Keywords: glioblastoma, prognosis, biomarker, DNA microarray, LASSO INTRODUCTION Glioma is among the most common brain tumour type. With median survivals of less than a year for patients, glioblastoma is considered as a vital cause of cancer mortality in both adults and children despite aggressive surgery, chemotherapy and radiation [[38]1, [39]2]. By the year of 2007, according to the World Health Organization (WHO), gliomas were assessed based on histopathological features and categorized into astrocytomas, oligodendrogliomas, mixed oligoastrocytomas, and ependymomas [[40]3]. However, limitations of this grading system were addressed, such as high rate of inter- and intra-observer variability [[41]4, [42]5]. Therefore, in 2016, molecular signature was introduced in diagnosing tumours due to the explosion of genomic information during the last decade [[43]6]. DNA microarrays were introduced two decades ago. As this technology provides more comprehensive and objective information than traditional microscopic morphology, it has revolutionized cancer research. To date, there are molecular based diagnostic tests using DNA microarrays. Some have been incorporated in clinical practice guidelines, including Mamma Print and Oncotye Dx [[44]7]. A number of DNA microarray based studies have identified prognostically distinct molecular subtypes of gliomas [[45]8–[46]12]. These approaches were based on unsupervised hierarchical or k-means clustering of genes. As reported in these studies, patients belonging to various subgroups (clusters) showed significantly different prognostic outcomes independent to some known clinical factors, such as age and grade. However, the clustering-based prognostic signatures remain challengeable in clinical practice as the results were dramatically affected by the selected genes. Previous studies remain inconsistency in the prognostic gene candidates, which increases the difficulties in making such methods practical applicable. Here, we generated a 10-gene (GLO10) score for the prediction of glioblastoma overall survival from DNA microarray expression datasets. We trained the prognostic models in a large cohort (n=470) and demonstrated that the GLO10 score successfully distinguished high- and low-risk groups in other three independent cohorts comprising patients with glioblastomas (n=374). Our results unveiled novel glioma prognostic biomarkers that could be easily applied with great potential in producing robust results in clinical practices. RESULTS Identification of glioblastoma associated genes We hypothesized that the prognostic biomarkers were associated with tumour genesis and progression. In this case, the glioblastoma associated genes were identified by comparing gene expression profiles of tumour tissues and normal brain samples. The data was retrieved from one previously published large scale study (n= 256), Repository of Molecular Brain Neoplasia Data (Rembrandt)[[47]13]. After data preprocessing and differential expression analyses, a total of 723 probe sets corresponding to 552 genes were identified as significantly differentially expressed genes (DEGs) in glioblastomas. Among the DEGs, 137 were up-regulated and 415 were down-regulated (Figure [48]1). The gene functional analyses revealed that the glioblastoma associated genes were significantly enriched in the Gene Ontology Cellular Components including postsynapse (P=1.36×10^-27), axon(P=8.14×10^-23), ion channel complex(P=4.75×10^-16), asymmetric synapse(P=1.90×10^-15), neuron to neuron synapse(P=2.33×10^-15), exocytic vesicle membrane(P=5.92×10^-11), extracellular matrix component (P=3.34×10^-06) and cytoplasmic region (P=5.43×10^-03); Biological Processes including modulation of synaptic transmission(P=3.94×10^-15), regulation of neurotransmitter levels(P=1.94×10^-11), regulation of neuron projection development(P=3.24×10^-10), regulation of cell-substrate adhesion(P=5.05×10^-04), positive regulation of fibroblast migration(P=5.49×10^-03); Molecular Functions including gated channel activity(P=1.143×10^-13), GABA receptor activity(P=8.95×10^-09), channel regulator activity(P=1.45×10^-07), calmodulin binding(P=7.21×10^-07), ligand-gated anion channel activity(P=1.02×10^-05), ion channel regulator activity(P=2.50×10^-05), ligand-gated cation channel activity(P=1.27×10^-03) and structural constituent of myelin sheath(P=1.54×10^-03)(Figure [49]2). The pathway enrichment analysis (Figure [50]3) revealed that the identified glioblastoma associated genes were enriched in various pathways that involved in glioblastoma progression, for example GABA receptor activation [[51]14], Ion channel transport [[52]15, [53]16] and Extracellular matrix organization [[54]17–[55]19]. Figure 1. Volcano plot of differentially expressed genes in glioblastomas. [56]Figure 1 [57]Open in a new tab The red dots indicate those were considered as significant up- and down- regulated genes. Figure 2. Gene set enrichment analysis of glioblastoma associated genes using Gene Ontology (Cellular Component, Biological Process and Molecular Function). Figure 2 [58]Open in a new tab Figure 3. Gene set enrichment analysis of glioblastoma associated genes using Reactome Pathway database. [59]Figure 3 [60]Open in a new tab Univariate prognostic analysis by Cox proportional hazards models To investigate the prognostic potentiality, we integrated a larger multi-institutional data set of 470 glioblastoma patients from the TCGA cohort (Discovery set). Previously identified glioblastoma associated genes were fitted into Cox promotional hazards models. Based on the overall survival association significance assessed by log rank test, these probe sets were ranked and filtered at the cutoff of P < 0.001. A total of 28 probe sets were selected for statistical modelling. LASSSO statistical modelling to derive GLO10 score A statistical regression algorithm based on the least absolute shrinkage and selection operator (LASSO) was applied to relate the 28 probe sets to the patient survival in the discovery set. We have optimized a 10-gene (Table [61]2) signature (GLO10) as the weighted sum of expression levels of the 10 genes for each patient. High GLO10 scores were significantly associated with shorter overall survival times. After the parameter optimization, we have categorized the patients into high-, or low-risk groups based on the GLO10 score threshold of 3.58. The high- and low-risk groups showed significant differences in terms of overall survival (Figure [62]4, Table [63]1). Table 2. List of gens in the GLO10 score. Probeset ID^§ Symbol Gene ID* Chromosomal location Description 221898_at PDPN 10630 1p36.21 podoplanin 202133_at WWTR1 25937 3q25.1 WW domain containing transcription regulator 1 203706_s_at FZD7 8324 2q33.1 frizzled class receptor 7 201792_at AEBP1 165 7p13 AE binding protein 1 221766_s_at FAM46A 55603 6q14.1 family with sequence similarity 46 member A 202718_at IGFBP2 3485 2q35 insulin like growth factor binding protein 2 203729_at EMP3 2014 19q13.33 epithelial membrane protein 3 212063_at CD44 960 11p13 CD44 molecule (Indian blood group) 203504_s_at ABCA1 19 9q31.1 ATP binding cassette subfamily A member 1 201761_at MTHFD2 10797 2p13.1 methylenetetrahydrofolate dehydrogenase (NADP+ dependent) 2, methenyltetrahydrofolate cyclohydrolase [64]Open in a new tab § Affymetrix Human Genome Array probe set identifier. * NCBI Gene Database [65]https://www.ncbi.nlm.nih.gov/gene. Figure 4. Evaluation of the association of GLO10 score with glioblastoma survival. Figure 4 [66]Open in a new tab Left: Kaplan-Mert plot of high- and low-risk groups, P value was calculated using the log-rank test. Middle: Comparison of gender ratios in high- and low-groups, P value was calculated using Pearson’s chi-squared test. Right: Comparison of the distributions of age in high- and low-risk groups, P value was calculated using Wilcoxon rank-sum test. Table 1. Clinical characteristics of discovery and validation sets. Discovery Set: TCGA Characteristics Overall High risk group Low risk group P-value (n=470) (n=193) (n=277) Median overall survival[months](range) 10.37 (0.10-127.55) 10.03 (0.10-127.55) 10.58 (0.10-108.81) 0.020^§ Female [n (%)] 181 (38.51%) 77 (39.90%) 104 (37.54%) 0.68^‡ Median age[years](range) 59.00 (5.00-89.00) 58.00 (10.00-88.00) 59 (5.00-89.00) 0.20^† Validation Set 1: [67]GSE13041 Characteristics Overall High risk group Low risk group P-value (n=160) (n=24) (n=136) Median overall survival[months](range) 12.84 (0.23-110.22) 9.04 (1.74-72.02) 13.58 (0.23-110.22) 0.044^§ Female [n (%)] 63 (39.38%) 7 (29.17%) 56 (41.18%) 0.38^‡ Median age[years](range) 51.50 (18.00-86.00) 57.50 (40.00-85.00) 49.50 (18.00-86.00) 0.049^† Validation Set 2: [68]GSE16011 Characteristics Overall High risk group Low risk group P-value (n=155) (n=89) (n=66) Median overall survival[months](range) 8.76 (0.24-150.72) 7.08 (0.24-66.72) 14.16 (0.48-150.72) <0.0001^§ Female [n (%)] 50 (32.26%) 27 (30.34%) 23 (34.85%) 0.67^‡ Median age[years](range) 55.00 (14.00-80.00) 58.00 (14.00-79.00) 46.50 (15.00-80.00) 1.45×10^-6† Validation Set 3: [69]GSE83294 Characteristics Overall High risk group Low risk group P-value (n=59) (n=5) (n=54) Median overall survival[months](range) 7.80 (0.23-41.00 ) 5.03 (1.74-7.80) 9.29 (0.23-41.00) 0.014^§ Female [n (%)] 32 (54.24% ) 1 (20.00%) 31 (57.41% ) 0.26^‡ Median age[years](range) 47.00 (1.00-82.00 ) 56.00 (39.00-66.00) 44.50 (1.00-82.00) 0.35^† [70]Open in a new tab ‡ P value calculated using the Pearson’s chi-squared test. † P value calculated using the Wilcoxon rank-sum test. § P value calculated using the log-rank test. To assess the robustness of the GLO10 score, we then evaluated its performance in other three independent glioblastoma cohorts with GEO accessions [71]GSE13041 (validation set 1, n=160), [72]GSE16011 (validation set 2, n=155) and [73]GSE83294 (validation set 3, n=59). Patients with GLO10 scores higher than 3.58 were assigned into the high risk group, while those with GLO10 scores lower than the threshold were considered as low risk. As shown in Figure [74]4, there are significant differences between high-risk and low-risk groups (validation set 1: P = 0.044, validation set 2: P < 0.0001, validation set 3: P=0.014). Surprisingly, although high-risk patients tended to have higher age, the gender ratio showed insignificant different patterns (Figure [75]4, Table [76]1 ). This indicated that the GLO10 score retained prognostic power independent of other traditional factors, such as age and gender. The dominance of GLO10 score in the prognosis prediction revealed the importance of the 10 genes (Table [77]2) in glioblastoma diagnosis and treatment researches. DISCUSSION For the prognostic biomarker discovery in glioblastomas, there remains discordance among scientific publications given various sample requirements, data complexity, evolving technologies and lack of golden standard practice guidelines [[78]7, [79]20]. However, due to high complexity of genomics in glioma patients, considerable challenges still present in implementing such strategies. To address these challenges, we applied statistical modelling approaches and leveraged large scale of gene expression profiles from multi-institutional cohorts. The robustness of the GLO10 score was demonstrated by successfully implemented in more than three independent cohorts. Functional analyses revealed that genes consisting the GLO10 score were involved in the genesis and progression of glioblastomas (Figures [80]2 and [81]3). Among them, the prognostic associations with glioblastoma outcomes have already been reported. For example, Insulin-like growth factor-binding protein 2 (IGFBP2) has been considered as a glioma oncogene [[82]21]. Increasing expression of IGFBP2 could associated to poor glioma prognosis as it may play major role in glioma tumour progression [[83]22–[84]27]. It is also reported that in IDH-mutant glioma, IGFBP2 was inhibited so that patients’ survival could be improved [[85]28]. The expressions of FZD7, along with other two genes, SFRP1 and SFRP4, were identified to be associated with poor prognosis in glioma patients [[86]29]. The up-regulated expression of FZD7 could promote glioma cell proliferation [[87]30]. However, to our understanding, GLO10 the first gene signature that utilize the expression data and sum into a unified score that ease its applications in clinical practices. For the future work, we will collect more datasets and re-train the model to improve the robustness and predictive power. We would also assess its prognostic value by comparing other biomarkers such as IDH1 and 1p19q status. In summary, our analysis of data from different independent cohorts demonstrates that the utility of GLO10 score as a tool for glioblastoma prognosis prediction. The incorporation of the GLO10 into the prognosis prediction for newly diagnosed glioma patients will facilitate the development of biomarker and drug target discovery. MATERIALS AND METHODS Datasets collection and preprocess In this study, we collected datasets from various cohorts including The Cancer Genome Atlas (TCGA)[[88]31], Repository for Molecular Brain Neoplasia Data (Rembrandt)[[89]13] and other published studies. Gene expression profiling datasets and clinical information of five independent cohorts were retrieved NCBI GEO Database with accession numbers: [90]GSE68848 (Rembrandt cohort), [91]GSE83130 (TCGA cohort), [92]GSE13041 [[93]11], [94]GSE16011 [[95]32] and [96]GSE83294 [[97]8]. Given the complexity of data sets, we restricted the samples to grade three and four glioblastomas in this study. For the Rembrandt cohort, a total to 28 normal and 228 glioblastoma samples were considered. For the glioblastoma prognostic gene signature training, the numbers of samples considered for this study were 470, 160, 155 and 59 for the TCGA cohort (discovery set), [98]GSE13041 (validation set 1), [99]GSE16011 (validation set 2) and [100]GSE83294 (validation set 3). While searching for the data sets, Affymetrix HG-U133A and HG-U133 Plus 2.0 microarrays were considered. Since they are sharing 22,277 probe sets, the performances of the derived gene signatures could be easily evaluated in different studies without gene ID mapping. The raw fluorescence intensity profiles (*.CEL) were preprocessed, background corrected and normalized with RMA algorithm [[101]33] using Bioconductor package, affy [[102]34], in the R environment. Differential gene expression analysis We applied the differentially expressed gene analysis by using the limma (linear models for microarray data, [[103]35]) algorithm. All the gene expression data was converted into the log base-2 scale before comparison using the Welch's t-test with Benjamini & Hochberg correction [[104]36]. The functional analysis was based on the gene set enrichment analysis in Gene Ontology and Reactome Pathway database [[105]37]. Statistical modelling For the development of prognostic gene signature, we used the datasets from the discovery set (TCGA cohort, n=470). A linear regression modelling based on the LASSO algorithm was implemented by the glmnet [[106]38] R package. Briefly, the glmnet package fits a generalized linear model via penalized maximum likelihood and extract the prognostic gene signature by10-fold cross validation approaches. A subset of 10 genes was selected as their weighted combined gene expression data was significantly correlated to the overall survival outcomes of glioblastoma patients in the discovery set. Footnotes CONFLICTS OF INTEREST The authors do not have any conflicts of interest. REFERENCES