Abstract Background Gene expression can be employed for the discovery of prognostic gene or multigene signatures cancer. In this study, we assessed the prognostic value of a 35-gene expression signature selected by pathway and machine learning based methods in adjuvant therapy-linked glioblastoma multiforme (GBM) patients from the Cancer Genome Atlas. Methods Genes with high expression variance was subjected to pathway enrichment analysis and those having roles in chemoradioresistance pathways were used in expression-based feature selection. A modified Support Vector Machine Recursive Feature Elimination algorithm was employed to select a subset of these genes that discriminated between rapidly-progressing and slowly-progressing patients. Results Survival analysis on TCGA samples not used in feature selection and samples from four GBM subclasses, as well as from an entirely independent study, showed that the 35-gene signature discriminated between the survival groups in all cases (p<0.05) and could accurately predict survival irrespective of the subtype. In a multivariate analysis, the signature predicted progression-free and overall survival independently of other factors considered. Conclusion We propose that the performance of the signature makes it an attractive candidate for further studies to assess its utility as a clinical prognostic and predictive biomarker in GBM patients. Additionally, the signature genes may also be useful therapeutic targets to improve both progression-free and overall survival in GBM patients. Electronic supplementary material The online version of this article (10.1186/s12885-018-4103-5) contains supplementary material, which is available to authorized users. Keywords: Glioblastoma multiforme, Prognostic genes, Risk groups, Chemoradiation resistance pathways Background Glioblastoma multiforme (GBM) is the most common and highly aggressive brain tumour. Patients with GBM have very poor prognosis, with the median OS time of 14.5 months [[27]1]. Chemotherapy and radiotherapies are intended to improve patient survival, but are, however, hampered by development of resistance. Methylation of the promoter of the MGMT gene, which encodes O-6-methylguanine-DNA methyl-transferase, a DNA-repair enzyme that removes alkylating groups at the O6 of guanine residues, is a predictor of treatment response in GBM. Most studies that considered progression-free survival assessed only the prognostic value of MGMT promoter methylation [[28]2–[29]4]. Tumours with hypermethylated MGMT promoters are expected to benefit from temozolomide, an alkylating agent used for treating GBM, but reports regarding the prognostic value of this biomarker have been conflicting [[30]5, [31]6]. Several gene expression prognostic and predictive signatures have been translated into clinical applications for cancer treatment. Oncotype DX is a 21-gene qRT-PCR assay used to predict likelihood of recurrence in women with estrogen receptor positive breast cancer [[32]7, [33]8]. Mammostrat is prognostic immunohistochemical test that uses antibodies specific for SLC7A5, p53, HTF9C, NDRG1, and CEACAM5 to classify ER-positive, lymph node negative breast cancer cases into low-, moderate- or high-risk groups [[34]9, [35]10]. Mammaprint is a 70-gene microarray-based test for predicting risk of metastasis in breast cancer [[36]11]. In light of the lack of standardised prognostic biomarkers for GBM, we aimed to identify a mRNA expression derived prognostic signature using data from the Cancer Genome Atlas (TCGA - [37]http://cancergenome.nih.gov/). As current prognostic feature selection approaches lack reproducibility and do not take chemoradioresistant pathways into consideration, we used a combination of pathway enrichment analysis and Support Vector Machine based Recursive Feature Elimination (SVM-RFE) to ensure that the genes selected as having predictive potential would also be biologically relevant to the phenoptype. We here describe a multigene signature that successfully predicts both progression-free and overall survival in glioblastoma multiforme. Methods Gene-centric expression data Five hundred fifty eight GBM gene expression profiles generated by the Cancer Genome Atlas (TCGA) were downloaded from the NCI Genomic Data Commons Data Portal ([38]https://portal.gdc.cancer.gov/projects/TCGA-GBM). Five hundred forty eight of the these profiles were obtained from GBM patients, and ten were from non-neoplastic patients. One profile was selected for each of the samples profiled two or more times. Five hundred twenty nine profiles left after removing those of non-neoplastic samples were used in this study (Additional file [39]1). The expression were profiled on Affymetrix HT HG-U133A platform. As gene expression of the TCGA samples was profiled in batches which could introduce bias in classification analysis [[40]12], the statistical significance of batch effect was assessed as a function of the selected genes using guided Principal Component Analysis (gPCA) from the R package gPCA [[41]13]. The approach used by TCGA (2008) [[42]14] and Verhaak et al. (2011) [[43]15] was employed to generate gene-centric expression data. The probe sequences of HT HG-U133A downloaded from Affymetrix were mapped against a database composed of RefSeq version 41 and GenBank 178 complete coding sequences using SpliceMiner [[44]16]. Only perfect matches were considered and probes mapping to more than one gene were excluded. The output file from SpliceMiner and the HT HG-U133A chip definition file (cdf) were passed to the alternate cdf-generating function makealtcdf of AffyProbeMiner [[45]17]. Probe sets with less than five probes were excluded from the resulting alternative cdf, which was then converted to an R package using makecdfenv. The cdf was used to perform Robust Multi-array Average normalization and summarization of the gene expression data, resulting in gene-centric data for 12161 genes. An independent validation data set ([46]GSE7696) profiled on HG-U133 Plus 2 Affymetrix platform and downloaded from the NCBI Gene Expression Omnibus ([47]https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE7696) was equivalently treated. This data set contained gene expression data for 80 GBM and four non-neoplastic samples, and was chosen because of the availability of patients’ treatment information. Sample selection To ensure that treatment did not introduce confounding effects, samples from patients that received adjuvant chemotherapy and radiation and had uncensored days to death or progression were selected. Figure [48]1 shows sample selection for the identification of genes with prognostic value. Four hundred fifteen patients received the standard GBM treatment. Semantically, tumour progression is a radiologically documented increase in tumour size after a subtotal surgical excision [[49]18]. The time for this to occur is known as time to progression, which is the same as uncensored progression-free survival (PFS) [[50]19]. Two hundred one patients had associated uncensored progression-free survival (PFS) times, and 380 had overall survival OS times (censored or uncensored). Fig. 1. Fig. 1 [51]Open in a new tab Sample selection for the identification of prognostic genes in glioblastoma multiforme. PFS: progression-free survival (days); OS: overall survival (days); adjuvant treatment: chemotherapy and radiation Clinical data for all the patients used in this study were obtained from TCGA. PFS times for patients who experienced tumour progression within the follow-up period were obtained from the TCGA file for new tumour events. The GBM subtypes of samples used in our study were obtained from the supplementary clinical file provided by Brennan et al., (2013) [[52]20]. There is no standard for classifying patients as rapid and slow GBM progressors after standard treatment. While the median PFS after treatment could be used as a separation point, it does not provide a ’buffer zone’ to filter out borderline samples close to the median that may fall in the incorrect group due to unknown confounding factors. Rather than defining an arbitrary exclusion range, we used the first (Q1) and third (Q3) quartiles, 120 and 341 days respectively, as boundaries to divide patients into three classes, since they are still dependent on the median and not influenced by extreme outliers. Class 1 contained 48 patients having PFS times between 6 and 120 days (rapidly-progressing) and class 2 contained 35 patients having PFS times between 358 and 720 days (slow progressing). Classes 1 and 2 were used in feature selection and the 118 remaining samples (Class 3) that fell within the inter-quartile range were used in PFS and OS analysis. Selection of genes discriminating between rapidly and slowly progressing GBM patients In this present study, genes in the cancer-related pathways were considered in our feature selection because of their known roles in chemoradiation resistance, and to reduce the likelihood of selecting genes related to survival by chance. Studies have identified pathways and processes that drive resistance to chemotherapy and radiotherapy in cancer. Several of these genes are found in known cancer pathways [[53]21–[54]28]. Several genes in the NF- κB and PI3K/Akt signaling pathways are associated with chemoresistance development in cancer [[55]29, [56]30]. Also, genes involved in drug inactivation and efflux, DNA repair, and epithelial-mesenchymal transition have been shown to enhance drug resistance mechanisms [[57]26, [58]31]. Pathway enrichment analysis was performed on the genes with high expression variance (median absolute deviation ≥ 0.5) across the 529 samples using the Set Analyser web service provided by the Comparative Toxicogenomics Database [[59]32]. Genes were selected from the pathway categories related to cancer signaling pathways, reactive oxygen species metabolism, DNA repair, and drug transport and metabolism. A set of genes that discriminated between the rapidly-progressing and slowly-progressing groups were selected using a modified Support Vector Machine-Recursive Feature Elimination (SVM-RFE). SVM-RFE, proposed by [[60]33], was modified by introducing 5-fold cross-validation into the SVM classifier step and capturing the error rate generated at this step (the figure showing the workflow for SVM-RFE is attached as Additional file [61]2). Survival analysis The 118 Class 3 patients not used in the feature selection step were used to calculate regression coefficients (β) for the selected genes using univariate Cox proportional hazards analysis. The β’s were computed for the genes using coxph from the R survival package. Prognostic index, PI, was then calculated for each of the patients who received adjuvant chemotherapy and radiation and had PFS and/or OS data using the equation [MATH: PI=β< mrow>1gene 1+β2< /mn>gene2< mo>++βgeneg< /mrow> :MATH] where β[g] and gene[g] are the regression coefficient and the gene expression value for gene g, respectively. Patients in Class 3 were classified into low-risk and high-risk groups by choosing a value between the highest and lowest PI that ensured proper patients distribution based on PI. Patients with PI scores greater or equal than the chosen value were assigned to the high-risk group, whereas those with PI scores less than the value were assigned to the low-risk group. 380 patients with OS times were also classified into low-risk and high-risk groups in the same way. Assessment of signature prognostic value in GBM subtypes Verhaak et al. (2010) [[62]15] identified four subtypes of GBM, namely proneural, neural, classical and mesenchymal, using gene expression data from 200 GBM samples. Brennan et al. (2013) [[63]20] assigned additional 342 TGCA samples into the four subtypes using single-sample gene set enrichment analysis. A summarised clinical file provided by the authors was used in our study to assign patients to GBM subtypes. 95, 60, 105 and 120 of the 380 patients with available OS times were assigned to proneural, neural, classical and mesenchymal subtypes, respectively. 51, 33, 51 and 66 of the 201 patient group having associated PFS times were assigned to proneural, neural, classical and mesenchymal subtypes, respectively. We further categorised patients in each subtype into low-risk and high-risk groups. Assessment of signature prognostic value in an independent dataset The prognostic value of the selected gene signature was validated with the data from patients in the Murat et al. [[64]34] validation dataset who had primary tumours and received adjuvant chemo- and radiotherapy. PI was calculated for the patients using the β’s obtained from the TCGA training set and the expression values of the selected genes in the samples from the patients. They were classified into low-risk and high-risk groups in such a manner as to ensure proper patient distribution between the two groups. Survival of the low-risk and high-risk groups were determined for both the TCGA and validation cohorts using the Kaplan-Meier method. Differences in survival between the risk groups were estimated statistically by log rank test. Survival differences between groups was said to be statistically significant if p<0.05. Hazard ratios (HR) between risk groups were determined by Cox proportional hazards regression model. Mutivariate survival analysis to assess independent prognostic value A multivariate Cox survival model was built using three variables: our prognostic index, MGMT promoter methylation, and age. Ages of patients at diagnosis were obtained from the clinical file provided by TCGA. MGMT promoter methylation status data were obtained from the clinical file provided by Brennan et al. [[65]20] The univariate Cox analysis was first carried out on each variable followed by multivariate Cox analysis on all the variables. The coxph function in the R survival package was used for the analysis. Using the median PI value, the patients were assigned into low-risk or high-risk groups. Those with PI values lower than the median were assigned to low-risk groups, and those with PI to high-risk groups. The low-risk and the MGMT methylated promoter groups were used as references for prognostic index and MGMT