Abstract

Background

   Nearly one-third of serous ovarian cancer (OVCA) patients will not
   respond to initial treatment with surgery and chemotherapy and die
   within one year of diagnosis. If patients who are unlikely to respond
   to current standard therapy can be identified up front, enhanced tumor
   analyses and treatment regimens could potentially be offered. Using the
   Cancer Genome Atlas (TCGA) serous OVCA database, we previously
   identified a robust molecular signature of 422-genes associated with
   chemo-response. Our objective was to test whether this signature is an
   accurate and sensitive predictor of chemo-response in serous OVCA.

Methods

   We first constructed prediction models to predict chemo-response using
   our previously described 422-gene signature that was associated with
   response to treatment in serous OVCA. The performance of all prediction
   models was measured by the area under the curve (AUC, a measure of the
   model’s accuracy) and its respective confidence interval (CI). To
   optimize the prediction process, we determined which elements of the
   signature most contributed to chemo-response prediction. All prediction
   models were replicated and validated using six publicly available
   independent gene expression datasets.

Results

   The 422-gene signature prediction models predicted chemo-response with
   AUCs of ~70 %. Optimization of prediction models identified the 34 most
   important genes in chemo-response prediction. These 34-gene models had
   improved performance, with AUCs approaching 80 %. Both 422-gene and
   34-gene prediction models were replicated and validated in six
   independent datasets.

Conclusions

   These prediction models serve as the foundation for the future
   development and implementation of a diagnostic tool to predict response
   to chemotherapy for serous OVCA patients.

Electronic supplementary material

   The online version of this article (doi:10.1186/s12943-016-0548-9)
   contains supplementary material, which is available to authorized
   users.

   Keywords: Ovarian cancer, Chemo-response, Prediction model, Data
   integration, Individualized treatment

Background

   Epithelial ovarian cancer (OVCA) has the highest mortality rate of all
   gynecologic cancers [1]. The most common histological subtype of
   OVCA is serous [2]. The majority of patients present with advanced
   disease at diagnosis and, while some benefit from a treatment combining
   cytoreductive surgery and chemotherapy [3], nearly a third of
   patients with serous OVCA will not respond to this initial treatment
   and die from disease within one year after diagnosis [1, 4].
   Despite significant research directed at understanding the biology of
   OVCA [5, 6], outcomes remain poor for a majority of patients,
   particularly those who do not respond to initial chemotherapy. A major
   limitation is the lack of validated biomarkers that can effectively
   predict response to chemotherapy [7, 8].

   Previous attempts to define predictors of response to treatment have
   been limited by the number of patients included, mixtures of
   histological types and stages, and a lack of validation in independent
   sets [9, 10]. In contrast, breast cancer gene signatures have been
   identified that can accurately predict recurrence [11] and
   chemotherapeutic response [12, 13]. These signatures were
   subsequently validated in independent clinical studies [13–15].
   For example, one of these signatures, OncotypeDx, used 600 cases to
   create an association model and validated it in an additional 400 cases
   [11, 12]. Currently, there is no similar clinically available
   test for OVCA to identify which patients will respond to initial
   treatment [16].

   In recently published studies using the Cancer Genome Atlas (TCGA)
   serous OVCA database [17], we identified a robust molecular
   signature associated with chemo-response by integrating publicly
   available biological and clinical data from 450 serous OVCA patients.
   This yielded a 422-gene molecular signature that was replicated in five
   independent gene expression experiments [18]. The contributing data
   used to identify this signature included gene expression, gene copy
   number alteration, gene mutations, DNA methylation, and miRNA profiles,
   all of which are available in the TCGA dataset for serous OVCA. The
   presence of a strong association between the 422-gene signature and
   chemo-response in our previous work, though, does not imply that the
   signature is also predictive of chemo-response [9].

   Therefore, the main objective of the present study was to determine the
   performance of the 422-gene signature as a predictor of chemo-response
   in serous OVCA. We also optimized the prediction models and determined
   which elements of the signature contributed most to prediction. In this
   process, we identified a smaller set of 34 genes (the “optimized” set)
   from the original 422-gene signature that are predictive of response and
   that replicated the area under the curve (AUC) of the original complete
   gene set. Our data demonstrate that both the complete and the optimized
   models are predictive of outcome and are now replicated and validated
   in independent datasets.

Methods

Patients and data collection for prediction model

   All data collection and processing, including the consenting process,
   were performed after approval by all local institutional review boards
   and in accord with the TCGA Human Subjects Protection and Data Access
   Policies, adopted by the National Cancer Institute (NCI) and the
   National Human Genome Research Institute (NHGRI).

   Patients with serous OVCA in TCGA were utilized to create a prediction
   model in the testing dataset, and were divided into two categories:
   complete responders (CR) and incomplete responders (IR). Clinical
   complete response (CR) was defined as remaining progression-free for at
   least 6 months after the first platinum-based treatment. In patients
   with incomplete response (IR), the disease either did not respond or
   progressed during treatment (refractory), or recurred within 6 months
   of treatment completion (resistant) [4, 19]. Patients defined
   as IR in our study are also clinically referred to as
   ‘platinum-resistant’ [20], with direct implications for treatment
   and prognosis. In the TCGA dataset, there were 292 patients classified
   as CR and 158 classified as IR. Table 1 describes the clinical
   characteristics of these patients. Chemo-response was the most
   significant prognostic factor for survival in multivariable analysis by
   Cox proportional hazards regression (p-value < 10^−14), and patients
   with IR had a significantly decreased median survival compared to CR
   patients (Fig. 1) [18].
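
   The CR/IR definition above can be written as a small decision rule. The
   following is a minimal sketch, not the authors' code: the function name,
   the argument encoding (None meaning no recurrence observed), and the
   strict "less than 6 months" boundary are assumptions made for
   illustration, and it presumes each patient has at least 6 months of
   follow-up.

```python
from typing import Optional

# Cutoff taken from the text; treating exactly 6 months as CR is an assumption.
PLATINUM_FREE_CUTOFF_MONTHS = 6.0


def classify_response(progressed_on_treatment: bool,
                      months_to_recurrence: Optional[float]) -> str:
    """Return 'CR' or 'IR' for one patient.

    months_to_recurrence is counted from completion of first-line
    platinum-based treatment; None means no recurrence was observed
    (hypothetical encoding, not a TCGA field name).
    """
    if progressed_on_treatment:
        return "IR"  # refractory: no response or progression during treatment
    if (months_to_recurrence is not None
            and months_to_recurrence < PLATINUM_FREE_CUTOFF_MONTHS):
        return "IR"  # resistant: recurrence within 6 months of completion
    return "CR"
```

   For example, classify_response(False, 3.0) yields "IR", while a patient
   with no recurrence on record is labeled "CR".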

Table 1.

   Clinical data from TCGA patients
                                CR   IR  p-value*
   Number of Patients           292 158
   Age (Avg.)                   60  59.6 N.S.
   Grade                                 N.S.
   Grade 1                      4   1
   Grade 2                      35  18
   Grade 3                      246 135
   Stage                                 p < 0.01
   Stage I                      10  3
   Stage II                     19  1
   Stage III                    224 123
   Stage IV                     39  29
   Surgical outcome                      N.S.
   Optimal (<1 cm residual)     207 92
   Suboptimal (>1 cm residual)  52  57
   Optimal Treatment                     p < 0.001
   Optimal (Surgery + 6 cycles) 179 66
   Suboptimal                   113 92

   *Multivariable analysis of TCGA clinical variables: Only FIGO stage and
   optimal treatment (including optimal surgery AND 6 cycles of
   platinum-based chemotherapy) were independently associated with
   chemo-response in serous OVCA

Fig. 1.

   Survivorship by chemo-response in serous OVCA TCGA data. Chemo-response
   was the most significant factor in the multivariable analysis for
   survival. Complete responders (CR) have a median survival 2 years
   greater than IR

Gene signature and prediction analysis

   We previously identified a 422-gene signature that is robustly
   associated with chemo-response [18]. To assess predictive
   performance of this signature, we applied the ‘Classification for
   MicroArrays’ (CMA) to TCGA serous OVCA data. CMA is a statistical tool
   designed to construct and evaluate classifiers (or prediction models)
   derived from microarray experiments using a large number of standard
   methods [21] and the R environment for statistical computing
   (www.r-project.org) [22].

   Of the different methods available in the CMA package [21] to
   perform the analysis, nine methods consistently handled missing values
   and small numbers of samples, and computed AUCs without reporting any
   errors: random forest [23], least absolute shrinkage and selection
   operator (Lasso) [24], Elastic Net [24], prediction analysis for
   microarrays (PAM) [25], diagonal discriminant analysis [26],
   partial least squares (PLS) [27], PLS - random forest [27],
   penalized logistic regression [28], and PLS - logistic regression
   [27]. We used these nine methods for the rest of the study to
   compare the predictive performance of all of the different datasets and
   for both the complete and optimized models. Two other available
   methods, linear and quadratic shrinkage, could not compute AUC.
   Fisher’s discriminant analysis could not handle more variables than
   subjects; neural networks were unstable and difficult to tune and
   interpret; k-nearest neighbors and support vector machines could not
   tune and evaluate AUCs.

   Initially, all 422 genes associated with chemo-response in serous
   ovarian cancer [18] were utilized to construct prediction models,
   termed 422-gene prediction models. To assess how accurately the groups
   (CR and IR) were predicted, and to avoid over-fitting, cross-validation
   was used (internal validation of the classifier) [29]. The
   predictive performance was computed with corrections for TCGA
   batch-effect and to account for two other variables independently
   associated with chemo-response in serous OVCA (FIGO stage
   classification and optimal treatment, Table 1) [10].
   Sensitivity, specificity and AUC of the predictor/classifier were also
   calculated. For each of the AUC measurements, we also computed a 95 %
   confidence interval (CI) to compare different models and different
   methods of classification. To illustrate the performance of the
   predictor in classifying chemo-response, a receiver operating
   characteristic (ROC) curve was generated. These analyses also
   facilitated comparison of the performance of the predictor models
   across independent serous OVCA datasets and assessed how consistently
   the models predicted chemo-response in OVCA patients based on
   sensitivity, specificity, misclassification rate, and AUC. Finally, we
   identified which patients were more likely to be misclassified and the
   clinical characteristics that were associated with misclassification.
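
   To make the performance measures above concrete, the sketch below
   computes AUC, sensitivity, specificity, and a 95 % CI for the AUC from
   held-out predictions. It is illustrative only and is not the CMA
   implementation: the text does not state how the CIs were derived, so the
   percentile bootstrap used here is an assumption, as are the function
   names, the 0.5 decision threshold, and the coding of CR as 1 and IR as 0.

```python
import random
from typing import List, Sequence, Tuple


def auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def sensitivity_specificity(labels: Sequence[int], scores: Sequence[float],
                            threshold: float = 0.5) -> Tuple[float, float]:
    """Sensitivity and specificity when score >= threshold predicts class 1."""
    tp = sum(1 for y, s in zip(labels, scores) if y == 1 and s >= threshold)
    fn = sum(1 for y, s in zip(labels, scores) if y == 1 and s < threshold)
    tn = sum(1 for y, s in zip(labels, scores) if y == 0 and s < threshold)
    fp = sum(1 for y, s in zip(labels, scores) if y == 0 and s >= threshold)
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("both classes must be present")
    return tp / (tp + fn), tn / (tn + fp)


def bootstrap_auc_ci(labels: Sequence[int], scores: Sequence[float],
                     n_boot: int = 2000, alpha: float = 0.05,
                     seed: int = 0) -> Tuple[float, float]:
    """Percentile-bootstrap CI for the AUC (assumed method, see lead-in)."""
    rng = random.Random(seed)
    n = len(labels)
    stats: List[float] = []
    while len(stats) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [labels[i] for i in idx]
        if 0 < sum(ys) < n:  # skip resamples that contain only one class
            stats.append(auc(ys, [scores[i] for i in idx]))
    stats.sort()
    lower = stats[int((alpha / 2) * n_boot)]
    upper = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper
```

   In a cross-validated setting, labels and scores would be the pooled
   out-of-fold predictions, so that no patient is scored by a model that
   was trained on that patient.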

Selection of most informative genes of prediction models

   We focused on the selection of informative genes, because the
   composition of prediction models is paramount for their performance
   [9]. The selection process was performed with all available methods
   in the software package: two-sample t-test; Welch modification of the
   t-test; Wilcoxon rank sum test; F-test; Kruskal-Wallis test;
   “moderated” t and F test, respectively, using the package ‘limma’ in R
   statistics; one-step Recursive Feature Elimination (RFE) in combination
   with the linear support vector machines (SVM); random forest variable
   importance measure; least absolute shrinkage and selection operator (or
   Lasso); the regularized regression method or elastic net;
   component-wise boosting; and the ad hoc “Golub” criterion [21]. Using
   the gene selection tool, each gene was ranked depending on its relative
   importance in prediction models. These genes were ordered based on
   their rank and their relative ‘weight’ in the prediction process, and
   the prediction model analysis was applied by including only those genes
   that had been ranked at least once (one ‘hit’) by each method. These
   models, containing only the 34 selected, most informative genes,
   were termed 34-gene prediction models and comprised the optimized gene
   set, as opposed to the complete gene set.
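
   One way to read the consensus step above is as an intersection of the
   top-ranked genes from every selection method. The sketch below follows
   that reading under stated assumptions: the top_k cutoff, the function
   name, and the tie-breaking by mean rank are hypothetical choices, not
   details given in the text, and the gene identifiers in the usage note
   are placeholders.

```python
from typing import Dict, List


def consensus_genes(rankings: Dict[str, List[str]], top_k: int) -> List[str]:
    """Keep genes that appear in the top_k list of every selection method.

    rankings maps a method name to its genes ordered from most to least
    important. The result is sorted by mean rank across methods, with the
    gene name as a deterministic tie-breaker.
    """
    if not rankings:
        return []
    top_lists = [genes[:top_k] for genes in rankings.values()]
    shared = set(top_lists[0]).intersection(*top_lists[1:])

    def mean_rank(gene: str) -> float:
        return sum(lst.index(gene) for lst in top_lists) / len(top_lists)

    return sorted(shared, key=lambda g: (mean_rank(g), g))
```

   With three methods ranking placeholder genes as
   ["GENE_A", "GENE_B", "GENE_C", "GENE_D"],
   ["GENE_B", "GENE_A", "GENE_D", "GENE_C"], and
   ["GENE_A", "GENE_D", "GENE_B", "GENE_C"], a top_k of 3 keeps GENE_A and
   GENE_B, the only genes selected by all three.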

Data retrieval for replication and validation analyses

   Validation and replication of the prediction models were performed
   using datasets in the Gene Expression Omnibus (GEO) and the European
   Bioinformatics Institute, part of the European Molecular Biology
   Laboratory (EMBL-EBI), that contain gene expression paired with
   treatment response data (Table 2). Datasets were downloaded in
   their raw state to maximize platform and annotation information, and
   then data were normalized. Response to therapy variables were coded to
   make outcomes comparable with TCGA: CR and IR. Also, patients who
   underwent optimal debulking (with largest residual disease of <1 cm)
   and completed six cycles of platinum-based therapy were considered to
   have ‘optimal treatment’. Lesser treatments were considered suboptimal.
   This analysis was performed because optimal treatment and FIGO stage of
   disease were also significantly and independently associated with
   chemo-response in TCGA (Table 1). Both clinical variables were
   collected, when available, and assessed for association with
   chemo-response in these new datasets in order to account for them in
   the prediction analysis. Also, batch-effect, if available, was
   accounted for to correct for any bias, as was also performed in the
   initial prediction model using the TCGA dataset.
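
   The ‘optimal treatment’ coding described above reduces to a two-part
   rule, sketched here for illustration. The function and argument names
   are hypothetical; treating more than six completed cycles as optimal and
   returning None when a variable is unavailable are assumptions, since the
   text only states that variables were collected when available.

```python
from typing import Optional


def is_optimal_treatment(residual_cm: Optional[float],
                         platinum_cycles: Optional[int]) -> Optional[bool]:
    """Code the 'optimal treatment' covariate for one patient.

    True requires optimal debulking (largest residual disease < 1 cm) AND
    at least six completed cycles of platinum-based therapy. Returns None
    if either input is missing, so the covariate can be left out for
    datasets that do not report it.
    """
    if residual_cm is None or platinum_cycles is None:
        return None
    return residual_cm < 1.0 and platinum_cycles >= 6
```

   A patient with 0.5 cm residual disease and six cycles is coded True; the
   same patient with only four cycles is coded False (suboptimal).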

Table 2.

   Publicly available GEO datasets of patients with serous OVCA used for
   validation/replication of prediction models
   Repositories Number of patients Study Names References