Abstract
Background
Ovarian cancer (OV) is the most lethal gynecological cancer in women.
We aim to develop a generalized, individualized immune prognostic
signature that can stratify and predict overall survival for ovarian
cancer.
Methods
The gene expression profiles of ovarian cancer tumor tissue samples
were collected from 17 public cohorts, including 2777 cases totally.
Single sample gene set enrichment (ssGSEA) analysis was used for the
immune genes from ImmPort database to develop an immune-based
prognostic score for OV (IPSOV). The signature was trained and
validated in six independent datasets (n = 519, 409, 606, 634, 415,
194).
Findings
The IPSOV significantly stratified patients into low- and high-immune
risk groups in the training set and in the 5 validation sets (HR range:
1.71 [95%CI: 1.32–2.19; P = 4.04 × 10^−5] to 2.86 [95%CI: 1.72–4.74;
P = 4.89 × 10^−5]). Further, we compared IPSOV with nine reported
ovarian cancer prognostic signatures as well as the clinical
characteristics including stage, grade and debulking status. The IPSOV
achieved the highest mean C-index (0.625) compared with the other
signatures (0.516 to 0.602) and clinical characteristics (0.555 to
0.583). Further, we integrated IPSOV with stage, grade and debulking,
which showed improved prognostic accuracy than clinical characteristics
only.
Interpretation
The proposed clinical-immune signature is a promising biomarker for
estimating overall survival in ovarian cancer. Prospective studies are
needed to further validate its analytical accuracy and test the
clinical utility.
Fund
This work was supported by National Key Research and Development
Program of China, National Natural Science Foundation of China and
Natural Science Foundation of the Jiangsu Higher Education Institutions
of China.
Keywords: Ovarian cancer, Immune, Gene expression, Prognostic signature
List of Abbreviations
OV ovarian cancer
IPSOV immune-based prognostic signature for OV
HR hazard ratio
CI confidence interval
GEO Gene Expression Omnibus
TCGA The Cancer Genome Atlas ovarian cancer
FIGO International Federation of Gynecology and Obstetrics
ssGSEA single sample gene set enrichment analysis
KEGG Kyoto Encyclopedia of Genes and Genomes
GO Gene Ontology
BP biological process
MF molecular function
CC cellular component
FDR False-Discovery Rate
C-index concordance statistic
RMS restricted mean survival
[35]Open in a new tab
1. Introduction
Ovarian cancer (OV) is the most lethal gynecological cancer among
women, with >14,000 estimated new deaths in United States, 2018
[[36]1]. Biomarkers especially gene expression in tumor tissues are
reliably related to cancer prognosis and survival [[37][2], [38][3],
[39][4]]. Thus, identification of the subset of patients with worse
survival and higher mortality is needed for additional clinical
therapy. The availability of large-scale public cohorts with gene
expression data and the well-developed biological database bring the
opportunity to identify a more generalized prognostic signature with
biological background for ovarian cancer.
Immune system has been shown to be a determining factor during cancer
initiation and progression [[40]5,[41]6]. Various evidence have proved
that ovarian cancer are immunogenic tumors [[42][7], [43][8], [44][9]]
and immunotherapy is strongly pursued through targeting on the immune
checkpoints [[45]10,[46]11]. In addition, previous study has
preliminary confirmed the prognostic value of immune system in ovarian
cancer [[47]12]. Thus, the immune based prognostic signature remains a
potential to be applied in ovarian cancer.
In this study, we integrated multiple cohorts with gene expression
which contained 2777 cases totally to develop and validate an
individualized gene-set based prognostic signature for ovarian cancer
from immune-related genes. With sufficient validation of 5 independent
datasets, we proved the model stability and reliability. Further, to
leverage the complementary value of clinical characteristics, we
performed an integrated analysis to improve the predicted accuracy for
overall survival.
2. Methods
2.1. Study population and eligibility criteria
We retrospectively collected the ovarian cancer gene expression
profiles of frozen primary tumor tissue samples from 17 public
microarray datasets, including 16 cohorts from the Gene Expression
Omnibus (GEO) repository and 1 from The Cancer Genome Atlas ovarian
cancer (TCGA-OV) cohort. All patients had received primary debulking
surgery. Only patients with available follow-up time, status and gene
expression data were included. The main outcome of our study was
overall survival. Staging was assessed in accordance to the
International Federation of Gynecology and Obstetrics (FIGO) stage
system. Optimal debulking was defined as ≤1 cm of gross residual
disease while sub-optimal debulking was defined as >1 cm of residual
disease. We did not limit histology type, FIGO stage, grade, debulking
and neoadjuvant chemotherapy status for sample collection. The sample
quality control details were described in Table S1.
The study characteristics of the 17 public ovarian cancer datasets were
described in Table S1. Finally, 2777 ovarian cancer cases were included
in our study. The study design and workflow were provided in [48]Fig.
1.
Fig. 1.
[49]Fig. 1
[50]Open in a new tab
Flowchart of the study. 17 public ovarian cancer datasets containing
2777 cases were included and categorized into 6 independent datasets
according to the microarray platform. We developed the IPSOV in the
training set and validated in the other 5 datasets. Further, we
integrated IPSOV with stage, grade and debulking status to improve the
prognostic value.
2.2. Gene expression data preprocessing
For GEO datasets, all microarray data and clinical information were
download from GEO repository ([51]https://www.ncbi.nlm.nih.gov/geo/).
To increase the statistical power and leverage cohorts with small
sample size sufficiently, we combined the cohorts with the same
microarray platform. Only the platforms with sample size ≥ 150 were
included in the study.
For TCGA cohort, due to the samples with available RNA-Sequencing gene
expression quantification were relatively small (n = 374), we collected
Affymetrix Human Genome U133A Array microarray data instead (n = 519).
All level-2 data were download from the Genomic Data Commons Data
Portal ([52]https://portal.gdc.cancer.gov/).
Finally, six independent datasets were included according to the
different platforms from GEO and TCGA ([53]Table 1). Entrez IDs were
used to represent genes across different platform. In the independent
validation phase, if multiple probe sets correspond to the same Entrez
ID in the validation sets, the one with the highest mean signal was
selected as the expression level of the corresponding gene [[54]13].
Table 1.
Summary of the eight datasets included in the study.
Dataset Platform Sample size Included cohorts[55]^a
Training set Affymetrix Human Genome U133A Array 519 TCGA
Validation set 1 Affymetrix Human Genome U133A Array 409 [56]GSE14764,
[57]GSE23554, [58]GSE26712, [59]GSE3149
Validation set 2 Affymetrix Human Genome U133 Plus 2.0 Array 606
[60]GSE18520, [61]GSE19829, [62]GSE26193, [63]GSE30161, [64]GSE63885,
[65]GSE9891
Validation set 3 Agilent-014850 Whole Human Genome Microarray 4x44K
G4112F 634 [66]GSE17260, [67]GSE32062, [68]GSE53963, [69]GSE73614
Validation set 4 Operon human v3 ~35 K 70-mer two-color oligonucleotide
microarrays 415 [70]GSE13876
Validation set 5 ABI Human Genome Survey Microarray Version 2 194
[71]GSE49997
[72]Open in a new tab
^a
We used ComBat to adjust the batch effects between different cohorts
within the same platform. Gene expression values of all probes were
adjusted in each dataset, respectively.
The batch effects between different datasets within the same platform
were adjusted by ComBat method [[73]14]. Gene expression values of all
probes were adjusted in each dataset, respectively. We also adjusted
the batch effects within the TCGA cohort according to the tissue source
site code. All the gene expression levels had been logarithmic
transformed before batch effects adjustment.
2.3. Immune-related genes definition
We constructed a prognostic signature from the immune-related genes,
which were downloaded from the ImmPort database
([74]http://www.immport.org) [[75]15]. It includes 17 immune categories
according to different molecular function, such as antimicrobials,
cytokine, interleukins, T-cell receptor signaling pathway, B-cell
receptor signaling pathway, TNF family receptors.
2.4. Development of the immune-based prognostic signature for OV (IPSOV)
We performed a three-stage strategy to develop the signature. First, we
used the TCGA cohort as the training set to screen the survival related
probes. 1905 microarray probes for 1109 immune genes were included
totally. The Cox proportional-hazards model was used to assess their
association with overall survival with adjustment for age, stage, grade
and debulking status. We excluded the probes with adjusted P values
larger than 0.05.
To estimate the population specific immune infiltration, we used single
sample gene set enrichment analysis (ssGSEA) that define a enrichment
score to represent the degree of absolute enrichment of a gene set in
each sample within a given dataset [[76]16]. Normalized enrichment
scores could be calculated for each immune category. The ssGSEA
analysis were performed in R package GSVA.
In the last step, we develop a prognostic signature named the
immune-based prognostic signature for OV (IPSOV) to combine the effects
of each immune category in the training set. The coefficients of each
category were determined by multivariable Cox regression model:
[MATH: IPSOV=∑i=1
kβiSi :MATH]
, where S[i] is the ssGSEA score for ith immune category.
2.5. Validation of the IPSOV
To get a uniform cutoff value to categorize the patients into low-risk
and high-risk group, we performed a normalization for the gene
expression values in each dataset with mean value = 0 and standard
deviation (SD) = 1. After developing the signature, the prognostic
score of IPSOV was further calculated in the 5 validation datasets,
respectively. In the multivariable Cox regression, age, FIGO stage,
grade, debulking status and study site were included as covariates.
2.6. Pathway enrichment analysis for the molecular function
To further understand the gene function in IPSOV, we performed a
pathway enrichment analysis for the genes based on Kyoto Encyclopedia
of Genes and Genomes (KEGG) and Gene Ontology (GO) database including
biological process (BP), molecular function (MF) and cellular component
(CC). The P values were adjusted by False-Discovery Rate (FDR) method
for multiple comparison. All the analysis were performed by R package
clusterProfiler [[77]17].
2.7. Existing prognostic signatures for comparison
To evaluate the survival classification and prediction ability of
IPSOV, we retrospectively collected 9 published prognostic signatures
for comparison, which included from 7 genes to 300 genes (Table S7).
Continuous prognostic scores were calculated for each signature. The P
values of continuous scores in univariable Cox model and the overall
concordance statistic (C-index) were compared in the six datasets.
2.8. Statistical analysis
Continuous variables were summarized as mean ± SD, and categorized
variables were described by frequency (n) and proportion (%). In the
survival analysis, associations between characteristics and overall
survival were evaluated by Cox proportional hazard models. Kaplan-Meier
survival curves were drawn and compared among subgroups using log-rank
tests. The C-index and restricted mean survival (RMS) curve were
estimated using R package survival and survRM2, while C-index was
compared by R package compareC [[78]18].
Statistical analyses were performed using R version 3.4.0 (The R
Foundation). P values were two-sided, and P < 0.05 was considered
statistically significant.
3. Results
3.1. Development and definition of the IPSOV
A total of 2777 patients with ovarian cancer (mean age ± SD,
59.7 ± 11.8; 12.8% early-stage [I, II], 87.2% late-stage [III, IV];
43.5% optimal debulking, 56.5% sub-optimal debulking) from 6
independent datasets were included in the analysis ([79]Table 2). In
the training set, 156 probes of 129 genes from 15 immune categories
were associated with overall survival among the 1905 immune-related
microarray probes. Further, we used these probes to perform the ssGSEA
analysis for each patient to calculate the immune enrichment score of
each immune category. The IPSOV was defined as the combined effect of
scores in different categories using the coefficients generated from
multivariable Cox regression model (Table S3). The median score in the
training set (−0.23) was used as the cutoff to stratify patients into
low- and high-immune risk group across all the datasets.
Table 2.
Demographic and clinic characteristic descriptions for ovarian cancer
patients in different datasets.
Characteristics[80]^a Training set Validation set 1 Validation set 2
Validation set 3 Validation set 4 Validation set 5
Number of samples 519 409 606 634 415 194
Median survival time (month) (95% CI) 44.5 (41–48.3) 51 (45–65) 41.9
(37.8–47.8) 53 (49–62) 24 (21−30) >49[81]^b
Number of Death (%) 284 (54.9) 232 (56.7) 346 (57.1) 364 (57.1) 302
(72.7) 57 (29.3)
Age (years) 59.8 ± 11.4 – 62.9 ± 10.6 62.6 ± 11.2 57.9 ± 12.2
57.6 ± 11.8
Histology type (%)
Serous 520 (100) 96 (88.8) 516 (85.5) 534 (83.8) 415 (100) 171 (88.1)
Endometroid 0 6 (5.5) 41 (6.8) 66 (10.3) 0 0
Clear cell 0 2 (1.8) 20 (3.3) 37 (5.8) 0 0
Mucinous 0 0 9 (1.4) 0 0 0
Other 0 4 (3.7) 17 (2.8) 0 0 23 (11.8)
FIGO stage (%)
I 15 (2.9) 8 (4.1) 41 (7.4) 32 (5.0) – 0
II 25 (4.8) 2 (1.0) 111 (20.0) 23 (3.6) – 9 (4.6)
III 396 (76.7) 164 (84.1) 364 (65.8) 462 (72.6) – 154 (79.3)
IV 80 (15.5) 21 (10.7) 37 (6.6) 119 (18.7) – 31 (15.9)
Grade (%)
1 5 (0.9) 9 (4.0) 32 (5.8) 26 (4.0) – 0
2 64 (12.5) 86 (38.9) 80 (14.6) 204 (32.0) – 0
3 439 (86.2) 125 (56.5) 394 (72.0) 324 (50.8) – 143 (100)
4 1 (0.1) 1 (0.2) 41 (7.4) 83 (13.0) – 0
Debulking status (%)
Optimal 337 (73.1) 193 (54.0) 209 (66.7) 271 (51.4) – –
Sub-optimal 124 (26.9) 164 (45.9) 104 (33.2) 256 (48.5) – –
[82]Open in a new tab
^a
Sum of frequency numbers may not equal to the total sample size due to
missing values.
^b
The median survival time is incalculable due to the mortality at the
last follow-up time is <50%.
3.2. Validation of the IPSOV
The IPSOV significantly stratified patients into low- and high-immune
risk groups in the training set (hazard ratio [HR] = 1.82; 95%
confidence interval [CI]: 1.43–2.30; P = 9.19 × 10^−7) and in the 5
validation sets (HR range: 1.66 [95%CI: 1.27–2.15; P = 1.84 × 10^−4] to
2.36 [95%CI: 1.38–4.00; P = 0.001]) (Table S4). In addition, it
remained as an independent prognostic factor in the multivariable Cox
model, after adjusting for age, stage, grade and debulking status (HR
range: 1.71 [95%CI: 1.32–2.19; P = 4.04 × 10^−5] to 2.86 [95%CI:
1.72–4.74; P = 4.89 × 10^−5]) ([83]Fig. 2). In the meta-analysis for
all datasets, the high-immune risk group showed a 1.96-fold higher risk
than the low-immune risk group (HR = 1.96; 95%CI: 1.73–2.23;
P = 2.37 × 10^−25) (Fig. S1). The IPSOV distribution with survival
status in the combined dataset was shown in [84]Fig. 3. Significant RMS
time ratio (1.24 to 1.88) was observed in the 6 datasets, respectively
([85]Table 3).
Fig. 2.
[86]Fig. 2
[87]Open in a new tab
Kaplan-Meier survival analyses of IPSOV. (a) Patients are stratified
into low- (red) and high-immune (blue) risk groups with a cutoff of the
median value in the training set. (b-f) Further, the prognostic
signature IPSOV is validated in five independent validation sets. (For
interpretation of the references to color in this figure legend, the