Abstract
Background
Gene expression can be employed for the discovery of prognostic gene or
multigene signatures cancer. In this study, we assessed the prognostic
value of a 35-gene expression signature selected by pathway and machine
learning based methods in adjuvant therapy-linked glioblastoma
multiforme (GBM) patients from the Cancer Genome Atlas.
Methods
Genes with high expression variance was subjected to pathway enrichment
analysis and those having roles in chemoradioresistance pathways were
used in expression-based feature selection. A modified Support Vector
Machine Recursive Feature Elimination algorithm was employed to select
a subset of these genes that discriminated between rapidly-progressing
and slowly-progressing patients.
Results
Survival analysis on TCGA samples not used in feature selection and
samples from four GBM subclasses, as well as from an entirely
independent study, showed that the 35-gene signature discriminated
between the survival groups in all cases (p<0.05) and could accurately
predict survival irrespective of the subtype. In a multivariate
analysis, the signature predicted progression-free and overall survival
independently of other factors considered.
Conclusion
We propose that the performance of the signature makes it an attractive
candidate for further studies to assess its utility as a clinical
prognostic and predictive biomarker in GBM patients. Additionally, the
signature genes may also be useful therapeutic targets to improve both
progression-free and overall survival in GBM patients.
Electronic supplementary material
The online version of this article (10.1186/s12885-018-4103-5) contains
supplementary material, which is available to authorized users.
Keywords: Glioblastoma multiforme, Prognostic genes, Risk groups,
Chemoradiation resistance pathways
Background
Glioblastoma multiforme (GBM) is the most common and highly aggressive
brain tumour. Patients with GBM have very poor prognosis, with the
median OS time of 14.5 months [[27]1]. Chemotherapy and radiotherapies
are intended to improve patient survival, but are, however, hampered by
development of resistance. Methylation of the promoter of the MGMT
gene, which encodes O-6-methylguanine-DNA methyl-transferase, a
DNA-repair enzyme that removes alkylating groups at the O6 of guanine
residues, is a predictor of treatment response in GBM. Most studies
that considered progression-free survival assessed only the prognostic
value of MGMT promoter methylation [[28]2–[29]4]. Tumours with
hypermethylated MGMT promoters are expected to benefit from
temozolomide, an alkylating agent used for treating GBM, but reports
regarding the prognostic value of this biomarker have been conflicting
[[30]5, [31]6].
Several gene expression prognostic and predictive signatures have been
translated into clinical applications for cancer treatment. Oncotype DX
is a 21-gene qRT-PCR assay used to predict likelihood of recurrence in
women with estrogen receptor positive breast cancer [[32]7, [33]8].
Mammostrat is prognostic immunohistochemical test that uses antibodies
specific for SLC7A5, p53, HTF9C, NDRG1, and CEACAM5 to classify
ER-positive, lymph node negative breast cancer cases into low-,
moderate- or high-risk groups [[34]9, [35]10]. Mammaprint is a 70-gene
microarray-based test for predicting risk of metastasis in breast
cancer [[36]11].
In light of the lack of standardised prognostic biomarkers for GBM, we
aimed to identify a mRNA expression derived prognostic signature using
data from the Cancer Genome Atlas (TCGA -
[37]http://cancergenome.nih.gov/). As current prognostic feature
selection approaches lack reproducibility and do not take
chemoradioresistant pathways into consideration, we used a combination
of pathway enrichment analysis and Support Vector Machine based
Recursive Feature Elimination (SVM-RFE) to ensure that the genes
selected as having predictive potential would also be biologically
relevant to the phenoptype. We here describe a multigene signature that
successfully predicts both progression-free and overall survival in
glioblastoma multiforme.
Methods
Gene-centric expression data
Five hundred fifty eight GBM gene expression profiles generated by the
Cancer Genome Atlas (TCGA) were downloaded from the NCI Genomic Data
Commons Data Portal
([38]https://portal.gdc.cancer.gov/projects/TCGA-GBM). Five hundred
forty eight of the these profiles were obtained from GBM patients, and
ten were from non-neoplastic patients. One profile was selected for
each of the samples profiled two or more times. Five hundred twenty
nine profiles left after removing those of non-neoplastic samples were
used in this study (Additional file [39]1). The expression were
profiled on Affymetrix HT HG-U133A platform. As gene expression of the
TCGA samples was profiled in batches which could introduce bias in
classification analysis [[40]12], the statistical significance of batch
effect was assessed as a function of the selected genes using guided
Principal Component Analysis (gPCA) from the R package gPCA [[41]13].
The approach used by TCGA (2008) [[42]14] and Verhaak et al. (2011)
[[43]15] was employed to generate gene-centric expression data. The
probe sequences of HT HG-U133A downloaded from Affymetrix were mapped
against a database composed of RefSeq version 41 and GenBank 178
complete coding sequences using SpliceMiner [[44]16]. Only perfect
matches were considered and probes mapping to more than one gene were
excluded. The output file from SpliceMiner and the HT HG-U133A chip
definition file (cdf) were passed to the alternate cdf-generating
function makealtcdf of AffyProbeMiner [[45]17]. Probe sets with less
than five probes were excluded from the resulting alternative cdf,
which was then converted to an R package using makecdfenv. The cdf was
used to perform Robust Multi-array Average normalization and
summarization of the gene expression data, resulting in gene-centric
data for 12161 genes.
An independent validation data set ([46]GSE7696) profiled on HG-U133
Plus 2 Affymetrix platform and downloaded from the NCBI Gene Expression
Omnibus
([47]https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE7696) was
equivalently treated. This data set contained gene expression data for
80 GBM and four non-neoplastic samples, and was chosen because of the
availability of patients’ treatment information.
Sample selection
To ensure that treatment did not introduce confounding effects, samples
from patients that received adjuvant chemotherapy and radiation and had
uncensored days to death or progression were selected. Figure [48]1
shows sample selection for the identification of genes with prognostic
value. Four hundred fifteen patients received the standard GBM
treatment. Semantically, tumour progression is a radiologically
documented increase in tumour size after a subtotal surgical excision
[[49]18]. The time for this to occur is known as time to progression,
which is the same as uncensored progression-free survival (PFS)
[[50]19]. Two hundred one patients had associated uncensored
progression-free survival (PFS) times, and 380 had overall survival OS
times (censored or uncensored).
Fig. 1.
Fig. 1
[51]Open in a new tab
Sample selection for the identification of prognostic genes in
glioblastoma multiforme. PFS: progression-free survival (days); OS:
overall survival (days); adjuvant treatment: chemotherapy and radiation
Clinical data for all the patients used in this study were obtained
from TCGA. PFS times for patients who experienced tumour progression
within the follow-up period were obtained from the TCGA file for new
tumour events. The GBM subtypes of samples used in our study were
obtained from the supplementary clinical file provided by Brennan et
al., (2013) [[52]20].
There is no standard for classifying patients as rapid and slow GBM
progressors after standard treatment. While the median PFS after
treatment could be used as a separation point, it does not provide a
’buffer zone’ to filter out borderline samples close to the median that
may fall in the incorrect group due to unknown confounding factors.
Rather than defining an arbitrary exclusion range, we used the first
(Q1) and third (Q3) quartiles, 120 and 341 days respectively, as
boundaries to divide patients into three classes, since they are still
dependent on the median and not influenced by extreme outliers. Class 1
contained 48 patients having PFS times between 6 and 120 days
(rapidly-progressing) and class 2 contained 35 patients having PFS
times between 358 and 720 days (slow progressing). Classes 1 and 2 were
used in feature selection and the 118 remaining samples (Class 3) that
fell within the inter-quartile range were used in PFS and OS analysis.
Selection of genes discriminating between rapidly and slowly progressing GBM
patients
In this present study, genes in the cancer-related pathways were
considered in our feature selection because of their known roles in
chemoradiation resistance, and to reduce the likelihood of selecting
genes related to survival by chance. Studies have identified pathways
and processes that drive resistance to chemotherapy and radiotherapy in
cancer. Several of these genes are found in known cancer pathways
[[53]21–[54]28]. Several genes in the NF- κB and PI3K/Akt signaling
pathways are associated with chemoresistance development in cancer
[[55]29, [56]30]. Also, genes involved in drug inactivation and efflux,
DNA repair, and epithelial-mesenchymal transition have been shown to
enhance drug resistance mechanisms [[57]26, [58]31]. Pathway enrichment
analysis was performed on the genes with high expression variance
(median absolute deviation ≥ 0.5) across the 529 samples using the Set
Analyser web service provided by the Comparative Toxicogenomics
Database [[59]32]. Genes were selected from the pathway categories
related to cancer signaling pathways, reactive oxygen species
metabolism, DNA repair, and drug transport and metabolism. A set of
genes that discriminated between the rapidly-progressing and
slowly-progressing groups were selected using a modified Support Vector
Machine-Recursive Feature Elimination (SVM-RFE). SVM-RFE, proposed by
[[60]33], was modified by introducing 5-fold cross-validation into the
SVM classifier step and capturing the error rate generated at this step
(the figure showing the workflow for SVM-RFE is attached as Additional
file [61]2).
Survival analysis
The 118 Class 3 patients not used in the feature selection step were
used to calculate regression coefficients (β) for the selected genes
using univariate Cox proportional hazards analysis. The β’s were
computed for the genes using coxph from the R survival package.
Prognostic index, PI, was then calculated for each of the patients who
received adjuvant chemotherapy and radiation and had PFS and/or OS data
using the equation
[MATH: PI=β<
mrow>1∗gene
1+β2<
/mn>∗gene2<
mo>+…+β∗geneg<
/mrow> :MATH]
where β[g] and gene[g] are the regression coefficient and the gene
expression value for gene g, respectively. Patients in Class 3 were
classified into low-risk and high-risk groups by choosing a value
between the highest and lowest PI that ensured proper patients
distribution based on PI. Patients with PI scores greater or equal than
the chosen value were assigned to the high-risk group, whereas those
with PI scores less than the value were assigned to the low-risk group.
380 patients with OS times were also classified into low-risk and
high-risk groups in the same way.
Assessment of signature prognostic value in GBM subtypes
Verhaak et al. (2010) [[62]15] identified four subtypes of GBM, namely
proneural, neural, classical and mesenchymal, using gene expression
data from 200 GBM samples. Brennan et al. (2013) [[63]20] assigned
additional 342 TGCA samples into the four subtypes using single-sample
gene set enrichment analysis. A summarised clinical file provided by
the authors was used in our study to assign patients to GBM subtypes.
95, 60, 105 and 120 of the 380 patients with available OS times were
assigned to proneural, neural, classical and mesenchymal subtypes,
respectively. 51, 33, 51 and 66 of the 201 patient group having
associated PFS times were assigned to proneural, neural, classical and
mesenchymal subtypes, respectively. We further categorised patients in
each subtype into low-risk and high-risk groups.
Assessment of signature prognostic value in an independent dataset
The prognostic value of the selected gene signature was validated with
the data from patients in the Murat et al. [[64]34] validation dataset
who had primary tumours and received adjuvant chemo- and radiotherapy.
PI was calculated for the patients using the β’s obtained from the TCGA
training set and the expression values of the selected genes in the
samples from the patients. They were classified into low-risk and
high-risk groups in such a manner as to ensure proper patient
distribution between the two groups. Survival of the low-risk and
high-risk groups were determined for both the TCGA and validation
cohorts using the Kaplan-Meier method. Differences in survival between
the risk groups were estimated statistically by log rank test. Survival
differences between groups was said to be statistically significant if
p<0.05. Hazard ratios (HR) between risk groups were determined by Cox
proportional hazards regression model.
Mutivariate survival analysis to assess independent prognostic value
A multivariate Cox survival model was built using three variables: our
prognostic index, MGMT promoter methylation, and age. Ages of patients
at diagnosis were obtained from the clinical file provided by TCGA.
MGMT promoter methylation status data were obtained from the clinical
file provided by Brennan et al. [[65]20] The univariate Cox analysis
was first carried out on each variable followed by multivariate Cox
analysis on all the variables. The coxph function in the R survival
package was used for the analysis. Using the median PI value, the
patients were assigned into low-risk or high-risk groups. Those with PI
values lower than the median were assigned to low-risk groups, and
those with PI to high-risk groups. The low-risk and the MGMT methylated
promoter groups were used as references for prognostic index and MGMT