Abstract

Background

   Gene expression can be employed for the discovery of prognostic gene or
   multigene signatures cancer. In this study, we assessed the prognostic
   value of a 35-gene expression signature selected by pathway and machine
   learning based methods in adjuvant therapy-linked glioblastoma
   multiforme (GBM) patients from the Cancer Genome Atlas.

Methods

   Genes with high expression variance was subjected to pathway enrichment
   analysis and those having roles in chemoradioresistance pathways were
   used in expression-based feature selection. A modified Support Vector
   Machine Recursive Feature Elimination algorithm was employed to select
   a subset of these genes that discriminated between rapidly-progressing
   and slowly-progressing patients.

Results

   Survival analysis on TCGA samples not used in feature selection and
   samples from four GBM subclasses, as well as from an entirely
   independent study, showed that the 35-gene signature discriminated
   between the survival groups in all cases (p<0.05) and could accurately
   predict survival irrespective of the subtype. In a multivariate
   analysis, the signature predicted progression-free and overall survival
   independently of other factors considered.

Conclusion

   We propose that the performance of the signature makes it an attractive
   candidate for further studies to assess its utility as a clinical
   prognostic and predictive biomarker in GBM patients. Additionally, the
   signature genes may also be useful therapeutic targets to improve both
   progression-free and overall survival in GBM patients.

Electronic supplementary material

   The online version of this article (10.1186/s12885-018-4103-5) contains
   supplementary material, which is available to authorized users.

   Keywords: Glioblastoma multiforme, Prognostic genes, Risk groups,
   Chemoradiation resistance pathways

Background

   Glioblastoma multiforme (GBM) is the most common and highly aggressive
   brain tumour. Patients with GBM have very poor prognosis, with the
   median OS time of 14.5 months [[27]1]. Chemotherapy and radiotherapies
   are intended to improve patient survival, but are, however, hampered by
   development of resistance. Methylation of the promoter of the MGMT
   gene, which encodes O-6-methylguanine-DNA methyl-transferase, a
   DNA-repair enzyme that removes alkylating groups at the O6 of guanine
   residues, is a predictor of treatment response in GBM. Most studies
   that considered progression-free survival assessed only the prognostic
   value of MGMT promoter methylation [[28]2–[29]4]. Tumours with
   hypermethylated MGMT promoters are expected to benefit from
   temozolomide, an alkylating agent used for treating GBM, but reports
   regarding the prognostic value of this biomarker have been conflicting
   [[30]5, [31]6].

   Several gene expression prognostic and predictive signatures have been
   translated into clinical applications for cancer treatment. Oncotype DX
   is a 21-gene qRT-PCR assay used to predict likelihood of recurrence in
   women with estrogen receptor positive breast cancer [[32]7, [33]8].
   Mammostrat is prognostic immunohistochemical test that uses antibodies
   specific for SLC7A5, p53, HTF9C, NDRG1, and CEACAM5 to classify
   ER-positive, lymph node negative breast cancer cases into low-,
   moderate- or high-risk groups [[34]9, [35]10]. Mammaprint is a 70-gene
   microarray-based test for predicting risk of metastasis in breast
   cancer [[36]11].

   In light of the lack of standardised prognostic biomarkers for GBM, we
   aimed to identify a mRNA expression derived prognostic signature using
   data from the Cancer Genome Atlas (TCGA -
   [37]http://cancergenome.nih.gov/). As current prognostic feature
   selection approaches lack reproducibility and do not take
   chemoradioresistant pathways into consideration, we used a combination
   of pathway enrichment analysis and Support Vector Machine based
   Recursive Feature Elimination (SVM-RFE) to ensure that the genes
   selected as having predictive potential would also be biologically
   relevant to the phenoptype. We here describe a multigene signature that
   successfully predicts both progression-free and overall survival in
   glioblastoma multiforme.

Methods

Gene-centric expression data

   Five hundred fifty eight GBM gene expression profiles generated by the
   Cancer Genome Atlas (TCGA) were downloaded from the NCI Genomic Data
   Commons Data Portal
   ([38]https://portal.gdc.cancer.gov/projects/TCGA-GBM). Five hundred
   forty eight of the these profiles were obtained from GBM patients, and
   ten were from non-neoplastic patients. One profile was selected for
   each of the samples profiled two or more times. Five hundred twenty
   nine profiles left after removing those of non-neoplastic samples were
   used in this study (Additional file [39]1). The expression were
   profiled on Affymetrix HT HG-U133A platform. As gene expression of the
   TCGA samples was profiled in batches which could introduce bias in
   classification analysis [[40]12], the statistical significance of batch
   effect was assessed as a function of the selected genes using guided
   Principal Component Analysis (gPCA) from the R package gPCA [[41]13].
   The approach used by TCGA (2008) [[42]14] and Verhaak et al. (2011)
   [[43]15] was employed to generate gene-centric expression data. The
   probe sequences of HT HG-U133A downloaded from Affymetrix were mapped
   against a database composed of RefSeq version 41 and GenBank 178
   complete coding sequences using SpliceMiner [[44]16]. Only perfect
   matches were considered and probes mapping to more than one gene were
   excluded. The output file from SpliceMiner and the HT HG-U133A chip
   definition file (cdf) were passed to the alternate cdf-generating
   function makealtcdf of AffyProbeMiner [[45]17]. Probe sets with less
   than five probes were excluded from the resulting alternative cdf,
   which was then converted to an R package using makecdfenv. The cdf was
   used to perform Robust Multi-array Average normalization and
   summarization of the gene expression data, resulting in gene-centric
   data for 12161 genes.

   An independent validation data set ([46]GSE7696) profiled on HG-U133
   Plus 2 Affymetrix platform and downloaded from the NCBI Gene Expression
   Omnibus
   ([47]https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE7696) was
   equivalently treated. This data set contained gene expression data for
   80 GBM and four non-neoplastic samples, and was chosen because of the
   availability of patients’ treatment information.

Sample selection

   To ensure that treatment did not introduce confounding effects, samples
   from patients that received adjuvant chemotherapy and radiation and had
   uncensored days to death or progression were selected. Figure [48]1
   shows sample selection for the identification of genes with prognostic
   value. Four hundred fifteen patients received the standard GBM
   treatment. Semantically, tumour progression is a radiologically
   documented increase in tumour size after a subtotal surgical excision
   [[49]18]. The time for this to occur is known as time to progression,
   which is the same as uncensored progression-free survival (PFS)
   [[50]19]. Two hundred one patients had associated uncensored
   progression-free survival (PFS) times, and 380 had overall survival OS
   times (censored or uncensored).

Fig. 1.

   Fig. 1
   [51]Open in a new tab

   Sample selection for the identification of prognostic genes in
   glioblastoma multiforme. PFS: progression-free survival (days); OS:
   overall survival (days); adjuvant treatment: chemotherapy and radiation

   Clinical data for all the patients used in this study were obtained
   from TCGA. PFS times for patients who experienced tumour progression
   within the follow-up period were obtained from the TCGA file for new
   tumour events. The GBM subtypes of samples used in our study were
   obtained from the supplementary clinical file provided by Brennan et
   al., (2013) [[52]20].

   There is no standard for classifying patients as rapid and slow GBM
   progressors after standard treatment. While the median PFS after
   treatment could be used as a separation point, it does not provide a
   ’buffer zone’ to filter out borderline samples close to the median that
   may fall in the incorrect group due to unknown confounding factors.
   Rather than defining an arbitrary exclusion range, we used the first
   (Q1) and third (Q3) quartiles, 120 and 341 days respectively, as
   boundaries to divide patients into three classes, since they are still
   dependent on the median and not influenced by extreme outliers. Class 1
   contained 48 patients having PFS times between 6 and 120 days
   (rapidly-progressing) and class 2 contained 35 patients having PFS
   times between 358 and 720 days (slow progressing). Classes 1 and 2 were
   used in feature selection and the 118 remaining samples (Class 3) that
   fell within the inter-quartile range were used in PFS and OS analysis.

Selection of genes discriminating between rapidly and slowly progressing GBM
patients

   In this present study, genes in the cancer-related pathways were
   considered in our feature selection because of their known roles in
   chemoradiation resistance, and to reduce the likelihood of selecting
   genes related to survival by chance. Studies have identified pathways
   and processes that drive resistance to chemotherapy and radiotherapy in
   cancer. Several of these genes are found in known cancer pathways
   [[53]21–[54]28]. Several genes in the NF- κB and PI3K/Akt signaling
   pathways are associated with chemoresistance development in cancer
   [[55]29, [56]30]. Also, genes involved in drug inactivation and efflux,
   DNA repair, and epithelial-mesenchymal transition have been shown to
   enhance drug resistance mechanisms [[57]26, [58]31]. Pathway enrichment
   analysis was performed on the genes with high expression variance
   (median absolute deviation ≥ 0.5) across the 529 samples using the Set
   Analyser web service provided by the Comparative Toxicogenomics
   Database [[59]32]. Genes were selected from the pathway categories
   related to cancer signaling pathways, reactive oxygen species
   metabolism, DNA repair, and drug transport and metabolism. A set of
   genes that discriminated between the rapidly-progressing and
   slowly-progressing groups were selected using a modified Support Vector
   Machine-Recursive Feature Elimination (SVM-RFE). SVM-RFE, proposed by
   [[60]33], was modified by introducing 5-fold cross-validation into the
   SVM classifier step and capturing the error rate generated at this step
   (the figure showing the workflow for SVM-RFE is attached as Additional
   file [61]2).

Survival analysis

   The 118 Class 3 patients not used in the feature selection step were
   used to calculate regression coefficients (β) for the selected genes
   using univariate Cox proportional hazards analysis. The β’s were
   computed for the genes using coxph from the R survival package.
   Prognostic index, PI, was then calculated for each of the patients who
   received adjuvant chemotherapy and radiation and had PFS and/or OS data
   using the equation
   [MATH: <mrow><mtext
   mathvariant="italic">PI</mtext><mo>=</mo><msub><mrow><mi>β</mi></mrow><
   mrow><mn>1</mn></mrow></msub><mo>∗</mo><mtext
   mathvariant="italic">gen</mtext><msub><mrow><mi>e</mi></mrow><mrow><mn>
   1</mn></mrow></msub><mo>+</mo><msub><mrow><mi>β</mi></mrow><mrow><mn>2<
   /mn></mrow></msub><mo>∗</mo><msub><mrow><mtext
   mathvariant="italic">gene</mtext></mrow><mrow><mn>2</mn></mrow></msub><
   mo>+</mo><mo>…</mo><mo>+</mo><mi>β</mi><mo>∗</mo><msub><mrow><mtext
   mathvariant="italic">gene</mtext></mrow><mrow><mi>g</mi></mrow></msub><
   /mrow> :MATH]

   where β[g] and gene[g] are the regression coefficient and the gene
   expression value for gene g, respectively. Patients in Class 3 were
   classified into low-risk and high-risk groups by choosing a value
   between the highest and lowest PI that ensured proper patients
   distribution based on PI. Patients with PI scores greater or equal than
   the chosen value were assigned to the high-risk group, whereas those
   with PI scores less than the value were assigned to the low-risk group.
   380 patients with OS times were also classified into low-risk and
   high-risk groups in the same way.

Assessment of signature prognostic value in GBM subtypes

   Verhaak et al. (2010) [[62]15] identified four subtypes of GBM, namely
   proneural, neural, classical and mesenchymal, using gene expression
   data from 200 GBM samples. Brennan et al. (2013) [[63]20] assigned
   additional 342 TGCA samples into the four subtypes using single-sample
   gene set enrichment analysis. A summarised clinical file provided by
   the authors was used in our study to assign patients to GBM subtypes.
   95, 60, 105 and 120 of the 380 patients with available OS times were
   assigned to proneural, neural, classical and mesenchymal subtypes,
   respectively. 51, 33, 51 and 66 of the 201 patient group having
   associated PFS times were assigned to proneural, neural, classical and
   mesenchymal subtypes, respectively. We further categorised patients in
   each subtype into low-risk and high-risk groups.

Assessment of signature prognostic value in an independent dataset

   The prognostic value of the selected gene signature was validated with
   the data from patients in the Murat et al. [[64]34] validation dataset
   who had primary tumours and received adjuvant chemo- and radiotherapy.
   PI was calculated for the patients using the β’s obtained from the TCGA
   training set and the expression values of the selected genes in the
   samples from the patients. They were classified into low-risk and
   high-risk groups in such a manner as to ensure proper patient
   distribution between the two groups. Survival of the low-risk and
   high-risk groups were determined for both the TCGA and validation
   cohorts using the Kaplan-Meier method. Differences in survival between
   the risk groups were estimated statistically by log rank test. Survival
   differences between groups was said to be statistically significant if
   p<0.05. Hazard ratios (HR) between risk groups were determined by Cox
   proportional hazards regression model.

Mutivariate survival analysis to assess independent prognostic value

   A multivariate Cox survival model was built using three variables: our
   prognostic index, MGMT promoter methylation, and age. Ages of patients
   at diagnosis were obtained from the clinical file provided by TCGA.
   MGMT promoter methylation status data were obtained from the clinical
   file provided by Brennan et al. [[65]20] The univariate Cox analysis
   was first carried out on each variable followed by multivariate Cox
   analysis on all the variables. The coxph function in the R survival
   package was used for the analysis. Using the median PI value, the
   patients were assigned into low-risk or high-risk groups. Those with PI
   values lower than the median were assigned to low-risk groups, and
   those with PI to high-risk groups. The low-risk and the MGMT methylated
   promoter groups were used as references for prognostic index and MGMT