Abstract
Detection of prognostic factors associated with patients’ survival
outcome helps gain insights into a disease and guide treatment
decisions. The rapid advancement of high-throughput technologies has
yielded plentiful genomic biomarkers as candidate prognostic factors,
but most are of limited use in clinical application. As the price of
the technology drops over time, many genomic studies are conducted to
explore a common scientific question in different cohorts to identify
more reproducible and credible biomarkers. However, new challenges
arise from heterogeneity in study populations and designs when jointly
analyzing the multiple studies. For example, patients from different
cohorts show different demographic characteristics and risk profiles.
Existing high-dimensional variable selection methods for survival
analysis, however, are restricted to single-study analysis. We propose
a novel Cox-model-based two-stage variable selection method called
"Cox-TOTEM" to detect survival-associated biomarkers common to multiple
genomic studies. Simulations showed that our method greatly improves the
sensitivity of variable selection as compared to separate applications
of existing methods to each study, especially when the signals are weak
or when the studies are heterogeneous. An application of our method to
TCGA transcriptomic data identified essential survival-associated genes
related to the common disease mechanism of five Pan-Gynecologic cancers.
Subject terms: Statistics, Bioinformatics
Introduction
Many biomedical studies aim to understand the progression of a disease
and identify prognostic factors of patients' survival time after
standard treatment. Traditional well-known clinical prognostic factors
for diseases such as cancer often provide poor prognosis and
prediction^1. The rapid advancement of high-throughput technology in
the past two decades has generated an enormous amount of genomic data
at different levels (e.g. genetic variants, gene expression and DNA
methylation) and enabled the tailoring of medical treatment to the
individual molecular characteristics of each patient^2. Detection of
prognostic genomic biomarkers is useful for selecting patients who will
likely benefit from a specific clinical intervention in precision
medicine. Over the years, various prognostic genomic biomarkers have
been reported in the literature, but a majority of them are of limited
use outside research^3. As the price of the technology becomes more
affordable, many genomic studies are conducted to explore a common
scientific question in different cohorts or populations. For example,
the Pan-Cancer Atlas initiated by The Cancer Genome Atlas (TCGA)
consortium studied multi-platform molecular profiles spanning 33 cancer
types and identified common somatic mutations or other genetic
variations across multiple tumor lineages^4,5. Numerous clinical trials
have been conducted to detect prognostic molecular markers for survival
in different cancer types^6–9. Compared to a single study, discovery of
biomarkers from multiple studies is more reproducible and credible and
shows stronger evidence of a true association with potential for
clinical utility^10. However, the intrinsic population heterogeneity
that exists in different cohorts has created a barrier to an effective
integration of multiple studies. For example, the high inter-tumor and
intra-tumor heterogeneity exhibited in different tumor types presents
heterogeneous patient populations with different risk profiles in
Pan-cancer analysis^11,12. Additional difficulty arises when the
time-to-event outcomes are censored at different time points due to
different study durations. Owing to these challenges, no methods have
been developed to detect survival-associated biomarkers from multiple
studies while accounting for the heterogeneity in study populations and
designs.
Standard models for the analysis of survival data with censoring
include the nonparametric Kaplan–Meier method^13, the semi-parametric
Cox proportional hazards model (abbreviated as the Cox model)^14, and
other parametric regression models^15. Due to the high-dimensional
nature of genomic data (i.e., the number of features greatly exceeds
the sample size), there is a serious collinearity issue in fitting a
prediction model with limited sample size, so these survival models do
not directly apply. Classical model selection methods such as best
subset selection with the AIC^16 or BIC^17 criterion suffer from an
NP-hard combinatorial problem in the presence of a large number of
features. There are two major approaches in modern variable selection
for high-dimensional data. The first approach directly applies popular
regularization methods such as the lasso^18 and elastic net^19. These
regularization methods have been adapted to survival analysis in recent
years based on the Cox model^20–22 or parametric models^23,24 to detect
features associated with the survival outcome. The second approach
implements a two-stage procedure by first reducing the dimension of the
feature space from high to moderate size via sure independence
screening (SIS; sure screening or screening for short) methods and then
applying regularization methods to refine the final pool of features.
First introduced by Fan et al.^25, SIS comprises a series of methods
that select features based on their marginal associations with the
response. Zhao et al.^26 proposed a Cox SIS procedure that fits
marginal Cox models to reduce the dimension in the analysis of censored
data. The two-stage procedure has advantages in computational
efficiency and algorithmic stability, and has been widely applied in
genomic data analysis. However, all the aforementioned variable
selection methods are restricted to single-study analysis.
In this paper, we propose a novel Cox-model-based two-stage variable
selection method for the detection of survival-associated biomarkers
common to multiple genomic studies. In the first stage, we extend the
screening procedure for the linear model in Ma et al.^27 to perform
sure screening with multiple studies in high-dimensional Cox
regression. In the second stage, we penalize the partial
log-likelihoods with a group lasso penalty across multiple studies to
select the final set of features in all studies simultaneously. The
proposed method is statistically attractive in that it allows the
studies to have different signal strengths, baseline hazard rates and
censoring distributions, and it effectively integrates the information
from multiple heterogeneous studies. In addition, the method is
computationally favorable: it implements a fast marginal screening in
the first stage and solves the regularization problem in the second
stage with an efficient ADMM algorithm^28. To the best of our
knowledge, the proposed method is the first to address the
high-dimensional variable selection problem in survival models with
multiple studies. We also extend other regularization and two-stage
methods to multiple-study versions to compare with our method under
various simulation scenarios and demonstrate our method's strength,
especially when the signals are weak or when the studies are
heterogeneous. An application of our method to TCGA Gynecologic and
breast (Pan-Gynecologic or Pan-Gyn) cancer transcriptomic data detects
essential survival-associated genes related to the common disease
mechanism of Pan-Gyn cancers.
The paper is structured as follows. In Sect. 2, we discuss variable
selection in the Cox model with both single and multiple studies, and
introduce our novel two-stage variable selection methodology for
multiple studies. Section 3 evaluates the performance of our method
under four different simulation scenarios. In Sect. 4, we analyze TCGA
Pan-Gyn cancer data with the proposed methodology. Discussion and final
conclusions are presented in Sect. 5.
Methods
Variable selection in Cox model with single study
Suppose there are n subjects with right-censored survival data. The
outcome of the ith ($1 \le i \le n$) subject is $O_i = (Y_i, \Delta_i)$,
where $Y_i$ is the observed survival time and $\Delta_i$ is the
censoring indicator. When right censoring occurs, the censoring time
$C_i$ comes before the true survival time $T_i$; in this case, we only
observe $Y_i = C_i$ and $\Delta_i = 0$. When the subject is not
censored, we observe the true survival time $Y_i = T_i$ with
$\Delta_i = 1$. The Cox model is a popular semi-parametric survival
model to investigate the association between the survival outcome and
explanatory variables X, such as gene expression, through modeling the
hazard function by

$$\lambda(t \mid X) = \lambda_0(t)\exp(X\beta), \qquad (1)$$

where $X = (x_1, \ldots, x_p)$ is the feature vector with p being the
number of features, $\beta = (\beta_1, \ldots, \beta_p)^T$ is the
corresponding regression coefficient vector, and $\lambda_0(t)$ is the
baseline hazard function. In the low-dimensional case where p is
smaller than n, Cox^14 proposed maximizing the following partial
log-likelihood to estimate $\beta$:

$$l(\beta) = \sum_{i=1}^{n} \Delta_i \log \frac{\exp(X_i\beta)}{\sum_{j \in R(Y_i)} \exp(X_j\beta)}, \qquad (2)$$

where the risk set $R(t)$ is the set of indices j of subjects whose
outcome has not yet been observed at time t (i.e., $Y_j \ge t$).
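As a concrete illustration, the partial log-likelihood in Eq. (2) can be evaluated directly from the observed data. The sketch below is a minimal pure-Python implementation with a naive O(n^2) risk-set summation (survival packages use optimized routines; the function name is ours, not from any package):

```python
import math

def cox_partial_loglik(beta, X, time, event):
    """Partial log-likelihood l(beta) of Eq. (2) for a single study.

    beta:  coefficient vector
    X:     list of feature vectors X_i
    time:  observed times Y_i
    event: censoring indicators Delta_i (1 = event observed, 0 = censored)
    """
    n = len(time)
    # Linear predictors eta_i = X_i beta
    eta = [sum(b * x for b, x in zip(beta, X[i])) for i in range(n)]
    ll = 0.0
    for i in range(n):
        if event[i] == 1:
            # Risk set R(Y_i): subjects still under observation at Y_i
            denom = sum(math.exp(eta[j]) for j in range(n) if time[j] >= time[i])
            ll += eta[i] - math.log(denom)
    return ll
```

With beta = 0 every term reduces to minus the log of the risk-set size, which gives a quick sanity check of an implementation.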
When there are far more features than observations, a reasonable
assumption made in the variable selection literature is the sparsity
assumption that only a small set of features are true predictors of the
response; this assumption is also adopted in this paper. The following
penalized partial log-likelihood is maximized to estimate $\beta$:

$$pl(\beta) = l(\beta) - \sum_{j=1}^{p} P_\lambda(|\beta_j|), \qquad (3)$$

where the penalty term $P_\lambda(\cdot)$ can represent different types
of penalties (popular choices include the $L_1$ lasso penalty^20 and
the $L_1/L_2$ elastic net penalty^22). The tuning parameter $\lambda$
controls the balance between the strength of the coefficients and the
penalty level and is usually chosen by cross-validation. The final
model consists of the features with nonzero coefficient estimates.
The above regularization methods are generally sub-optimal when p grows
at an exponential rate of n (commonly seen in most genomic studies) due
to issues of algorithmic stability and accuracy. A common practice is
to perform a two-stage procedure: first reduce the dimension of the
feature set by screening out unimportant features that have no marginal
association with the outcome, and then apply regularization. Such
first-stage methods possess a sure screening property: the features
remaining after screening still include the true set of predictors with
high probability^25. Bühlmann et al.^29 further proposed a partial
faithfulness assumption, which states that a zero marginal correlation
implies a zero regression coefficient in linear models, to support the
use of sure screening methods. In survival analysis, the principled Cox
sure independence screening procedure ("PSIS") was proposed to screen
out covariates that have no association with the survival outcome in
marginal Cox models^26.
Variable selection in the Cox model with multiple studies
Suppose we have data from K genomic studies where each study k
($1 \le k \le K$) has $n_k$ observations. For study k, we assume the
Cox model with conditional hazard function

$$\lambda^{(k)}(t \mid X^{(k)}) = \lambda_0^{(k)}(t)\exp(X^{(k)}\beta^{(k)}), \qquad (4)$$

where $\lambda_0^{(k)}(t)$ is the study-specific baseline hazard
function, $\beta^{(k)} = (\beta_1^{(k)}, \ldots, \beta_p^{(k)})^T$ is
the vector of study-specific regression coefficients, and
$X^{(k)} = (x_1^{(k)}, \ldots, x_p^{(k)})$ is the feature vector
collected from the kth study. The goal of variable selection with
multiple studies is to identify the active predictors $x_j^{(k)}$ with
$j \in M = \{j : \beta_j^{(k)} \ne 0 \text{ for all } k\}$. Here we
assume that all studies share the same sparsity pattern (i.e.,
$\beta_j^{(k)} = 0$ for all k or $\beta_j^{(k)} \ne 0$ for all k).
Under this assumption, we can accumulate potentially weak signals from
each study to yield strong evidence of active predictors. Note that
because one can easily subset to the features common across studies, we
assume that p is common to all studies, so that $x_j^{(k)}$ and
$x_j^{(l)}$ correspond to the same feature j in studies k and l. The
extension to different p in different studies is straightforward and
will be discussed elsewhere.
The first general framework for variable selection with multiple
studies was proposed by Ma et al.^27, where the active predictors are
assumed to be common to all studies, but the signal strengths (i.e.,
the magnitudes of $|\beta_j^{(k)}|$) of those active predictors can
vary among the studies. Combining marginal correlations between an
outcome and each feature across studies, Ma et al.^27 proposed an
extension of sure independence screening to multiple studies in linear
models, known as Two-Step Aggregation Sure Independence Screening or
"TSA-SIS". Within this framework of a shared sparsity pattern with
varied signal strength, the inclusion of multiple studies yields more
evidence for removing unimportant features during screening, since a
zero marginal correlation in any study implies a zero regression
coefficient for that feature by the partial faithfulness assumption.
The same framework has also been adopted for the integrative analysis
of multiple high-dimensional omics data to identify disease-related
biomarkers with consistent effects across multiple studies with
logistic regression^30.
Our proposed methodology is a non-trivial extension of the sure
screening procedure for multiple studies of Ma et al.^27 to multiple
Cox proportional hazards models. Unlike simpler parametric models such
as linear and logistic regression, the Cox model has the baseline
hazard function as an infinite-dimensional functional parameter beyond
the regression coefficients of interest. In linear and logistic
regression, heterogeneity of multiple studies only appears as different
signal strengths in $\beta_j^{(k)}, k = 1, \ldots, K$. In our survival
models, heterogeneity also enters through the different study-specific
baseline hazard functions $\lambda_0^{(k)}(t), k = 1, \ldots, K$, in
addition to the signal strengths of the features. Another source of
heterogeneity is different censoring distributions across studies. When
right censoring occurs, the exact survival time is not observed; it is
only known to be larger than a certain time. In the context of multiple
studies, for example, one study may end at one year while another ends
at five years. In this case, subjects censored at the end of the former
study have survival times of more than one year, while the times to
event of those censored at the end of the latter are only known to be
larger than five years. In this paper, we address the challenging issue
of this additional heterogeneity in the multiple Cox models by the
partial likelihood approach. To this end, we assume that the survival
time and censoring time are conditionally independent given the
covariates. This is the key assumption that enables us to separate the
multiple censoring distributions from the estimation of the
study-specific baseline hazard functions and regression coefficients.
In the medical literature, the conditional independence assumption is
standard for a single study; thus, it is reasonable to assume the same
for each of the multiple studies in our variable selection problem.
In this paper, we propose the Cox-model-based TwO-sTage variable
sElection procedure for Multiple studies, namely "Cox-TOTEM", to detect
survival-associated biomarkers common to multiple genomic studies. In
the first stage, we extend the TSA-SIS procedure for the linear model
to perform sure screening in the Cox models with multiple studies by
utilizing the standardized coefficient estimates from marginal Cox
models. This procedure reduces false negative errors (i.e., missing
important features) and selects the features with nonzero coefficients
in all studies. In addition to this extension of the sure screening
approach, we further refine the pool of features using the penalized
partial likelihood in the second stage. Specifically, we propose a
novel group lasso penalty on the partial log-likelihoods of the Cox
models to obtain the final set of features. It is natural to adopt a
group lasso penalty by treating the coefficients of the same feature
from multiple studies as one group. Such a group lasso penalty achieves
the selection of a feature either in all studies or in none of the
studies ("all-in-or-all-out"), thus identifying a common set of
features across multiple studies. Below, we describe our methodology in
detail.
The “Cox-TOTEM” method
Figure 1 shows a flowchart of the proposed two-stage method Cox-TOTEM.
In the first (screening) stage, we start by fitting a marginal Cox
model with a single covariate $x_j^{(k)}$ to study k (i.e.,
$\lambda^{(k)}(t \mid x_j^{(k)}) = \lambda_0^{(k)}(t)\exp(x_j^{(k)}\beta_j^{(k)})$).
The marginal partial likelihood estimator $\tilde\beta_j^{(k)}$ for the
jth feature in the kth study is a preliminary estimate used for sure
screening in the first stage. Define the observed information to be
$I(\tilde\beta_j^{(k)})$ ($= 1/\widehat{\mathrm{var}}(\tilde\beta_j^{(k)})$).
The corresponding standardized marginal coefficient estimator is
$\tilde z_j^{(k)} = I(\tilde\beta_j^{(k)})^{1/2}\tilde\beta_j^{(k)}$.
By the properties of nonparametric maximum likelihood estimation,
$\tilde z_j^{(k)}$ asymptotically follows the standard normal
distribution when $\beta_j^{(k)} = 0$. As the extension of the TSA-SIS
procedure to the Cox models, we utilize $|\tilde z_j^{(k)}|$ to
identify the studies with potential zero coefficients (i.e., weak
signals or noise with small $|\tilde z_j^{(k)}|$) for each feature j in
the first step:

$$\hat L_j = \{k : |\tilde z_j^{(k)}| \le \Phi^{-1}(1 - \alpha_1/2)\}, \qquad (5)$$

where $\Phi$ is the CDF of the standard normal distribution. The choice
of $\alpha_1$ determines the threshold separating strong signals from
weak signals. This first step does not screen out any features;
instead, it separates potential zero and nonzero coefficients in
preparation for the second step.
Figure 1. Flowchart of the proposed two-stage method Cox-TOTEM. The
screening procedure applies to each feature j ($1 \le j \le p$).
"√" means passed to the next step or stage, while "X" means not passed
to the next step or stage.
Let $\hat\kappa_j = |\hat L_j|$ be the cardinality of $\hat L_j$. The
second step tests whether the aggregate effect of the studies with
potential zero coefficients identified in the set $\hat L_j$ is strong
enough for the jth feature to be retained in the screening stage.
Define the statistic $U_j = \sum_{k \in \hat L_j} \tilde z_j^{(k)2}$,
which approximately follows a $\chi^2_{\hat\kappa_j}$ distribution with
$\hat\kappa_j$ degrees of freedom. We retain the set of features with
large $U_j$ or with $\hat\kappa_j = 0$:

$$\hat M^{[1]} = \{j \in [p] : U_j > \varphi_{\hat\kappa_j}^{-1}(1 - \alpha_2) \text{ or } \hat\kappa_j = 0\}, \qquad (6)$$

where $[p] = \{1, 2, \ldots, p\}$ and $\varphi_{\hat\kappa_j}$ is the
CDF of the chi-square distribution with $\hat\kappa_j$ degrees of
freedom. The key tuning parameter $\alpha_2$ determines how many
features to retain in the screening stage. The second step takes the
sum of squares of the $\tilde z_j^{(k)}$ from the studies with
potential zero coefficients as the test statistic for the jth feature
and performs the actual screening: if the aggregate effect is strong
enough, the feature is kept; otherwise it is screened out. In this way,
the procedure potentially saves important features with weak signals in
individual studies but a strong aggregate effect. Define
$d_1 = |\hat M^{[1]}|$. After the screening stage, $d_1$ features
common to all studies remain and move on to the second stage.
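The two screening steps above can be sketched as follows, assuming the standardized marginal coefficients $\tilde z_j^{(k)}$ have already been computed from the marginal Cox fits. For self-containment, the chi-square quantile uses the Wilson–Hilferty approximation; in practice one would use an exact routine such as scipy.stats.chi2.ppf:

```python
from statistics import NormalDist

def _chi2_quantile(p, df):
    """Wilson-Hilferty approximation to the chi-square quantile
    (adequate for this sketch; scipy.stats.chi2.ppf is the exact choice)."""
    zp = NormalDist().inv_cdf(p)
    return df * (1 - 2 / (9 * df) + zp * (2 / (9 * df)) ** 0.5) ** 3

def screen_features(z, alpha1=1e-4, alpha2=0.05):
    """Two-step screening stage of Cox-TOTEM (Eqs. 5-6).

    z[j][k]: standardized marginal Cox coefficient of feature j in study k.
    Returns the indices of retained features (the set M^[1]).
    """
    c1 = NormalDist().inv_cdf(1 - alpha1 / 2)  # strong/weak signal threshold
    kept = []
    for j, zj in enumerate(z):
        # Step 1: studies where feature j has a potential zero coefficient
        Lj = [k for k, zjk in enumerate(zj) if abs(zjk) <= c1]
        if not Lj:                             # kappa_j = 0: strong everywhere, keep
            kept.append(j)
            continue
        # Step 2: aggregate the potentially weak studies with a chi-square test
        Uj = sum(zj[k] ** 2 for k in Lj)
        if Uj > _chi2_quantile(1 - alpha2, len(Lj)):
            kept.append(j)
    return kept
```

A feature passes either because it is strong in every study ($\hat\kappa_j = 0$) or because its aggregated weak signals exceed the chi-square cutoff, mirroring Eq. (6).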
Following screening, we include a group lasso penalty in the partial
log-likelihoods to select the final set of features common to all
studies in the regularization stage:

$$\hat\beta = \arg\max_\beta \sum_{k=1}^{K} l(\beta_{\hat M^{[1]}}^{(k)}) - \lambda \sum_{j \in \hat M^{[1]}} \|\vec\beta_j\|_2, \qquad (7)$$

where $\vec\beta_j = (\beta_j^{(1)}, \ldots, \beta_j^{(K)})$ and
$\lambda$ is the tuning parameter. Note that this group lasso penalty
is different from the regular group lasso penalty, in which different
features form pre-defined groups (e.g., multiple genes forming a
pathway^31). Here we treat the coefficients of the same feature from
multiple studies as one group to identify a common set of biomarkers
associated with survival. An ADMM algorithm is used to solve the above
optimization problem; detailed steps of the algorithm can be found in
the Supplement. Define the set
$\hat M^{[2]} = \{j \in [p] : \hat{\vec\beta}_j \ne \vec 0\}$ and
$d_2 = |\hat M^{[2]}|$. $\hat M^{[2]}$ is the final set of common
features we select.
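The "all-in-or-all-out" behavior of the group lasso penalty comes from the group soft-thresholding operator, the key proximal step inside ADMM iterations for penalties like that of Eq. (7). This is an illustrative building block only (the authors' full algorithm is in their Supplement):

```python
import math

def group_soft_threshold(beta_j, t):
    """Proximal operator of t * ||.||_2, applied to the coefficient group
    of one feature across the K studies. The whole group is either shrunk
    toward zero or set exactly to zero ("all-in-or-all-out")."""
    norm = math.sqrt(sum(b * b for b in beta_j))
    if norm <= t:
        return [0.0] * len(beta_j)        # feature dropped in every study
    scale = 1 - t / norm
    return [scale * b for b in beta_j]    # feature kept in every study
```

Because the shrinkage acts on the Euclidean norm of the whole group, a feature's coefficients across the K studies reach zero together, which is exactly what yields a common selected set.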
The Cox-TOTEM algorithm is summarized in an algorithm box (shown as a
graphic in the original article).
Selection of tuning parameters
The proposed algorithm involves three tuning parameters: $\alpha_1$ and
$\alpha_2$ in the screening stage and $\lambda$ in the regularization
stage. Since screening is an intermediate step, the choice of
$\alpha_1$ and $\alpha_2$ should be conservative, making as few false
negative errors as possible (i.e., high sensitivity) while retaining
not too many false positives (i.e., not too low specificity). A small
$\alpha_1$ places a stringent cutoff for selecting strong signals, so
more studies are included in the second step, leading to more false
positive errors. On the other hand, a large $\alpha_1$ tends to select
very few studies, causing more false negative errors. In practice, a
small $\alpha_1$ is generally recommended to reduce the more serious
false negative errors. The parameter $\alpha_2$ is the threshold used
in the actual screening step and directly determines how many features
are retained. It serves the same role as the retained model size d in
the original sure independence screening method^25 or the significance
level $\alpha$ in the PC-simple algorithm^29. In practice, we recommend
setting $\alpha_2 = 0.05$, the popular significance level used in
hypothesis testing. We performed a sensitivity analysis using
simulation and showed that $\alpha_1 = 10^{-4}$ and $\alpha_2 = 0.05$
achieved a good balance between sensitivity and specificity in the
screening stage (see Table S1 in the Supplement). When the number of
features p becomes even larger (e.g., $p \sim 10{,}000$ as in our real
data application) and we need to remove more features during screening,
we recommend the more stringent cutoff $\alpha_2 = 0.01$. Since
screening is by nature a fast step, we generally suggest that users
follow the aforementioned guidelines rather than use any
computationally intensive data splitting or resampling procedures for
choosing $\alpha_1$ and $\alpha_2$, unless otherwise specified. To
determine the optimal value of $\lambda$ in the regularization stage,
we propose a multi-study cross-validation procedure that applies K-fold
cross-validation within each study and uses individual survival
prediction as the evaluation criterion. Details of the scheme can be
found in the Supplement.
Methods for comparison
Since no existing methods specifically address the variable selection
problem in survival models with multiple studies, we compare our method
to multiple-study extensions of the Cox-model-specific screening and
regularization methods for a single study. Given the target index set
$M$, we screen out a feature when any study has a zero marginal
coefficient for that feature. Thus, the multiple-study extension of the
original PSIS^26 for sure screening in the Cox model can be regarded as
MinPSIS, which ranks the features by the minimum absolute standardized
marginal coefficient across all studies. Likewise, the multiple-study
extensions of regularization methods such as CoxLasso (lasso penalty in
the Cox model) or CoxNet (elastic net penalty in the Cox model) take
the intersection of the features identified in each study, abbreviated
as InterCoxLasso or InterCoxNet. In the simulation, we compare our
method to a total of four methods: the two-stage methods
"MinPSIS-InterCoxLasso" and "MinPSIS-InterCoxNet", and the direct
regularization methods "InterCoxLasso" and "InterCoxNet". For the
retained model size d in the screening stage of the other two-stage
methods, we suggest the more conservative cutoff $d = n$ to mitigate
the risk of missing true signals. For the regularization methods, we
use cross-validation to select the optimal tuning parameter as
implemented in the R package "glmnet"^22,32.
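The intersection-based extensions (InterCoxLasso, InterCoxNet) reduce to a simple set operation on the per-study selections; a minimal sketch (function name is ours):

```python
def intersect_selection(per_study_selected):
    """Multiple-study extension of a single-study selector: keep only the
    features selected in every study (e.g., InterCoxLasso, InterCoxNet)."""
    sets = [set(s) for s in per_study_selected]
    common = sets[0]
    for s in sets[1:]:
        common &= s                     # intersect across studies
    return sorted(common)
```

This construction makes clear why such extensions lose sensitivity: a true feature that a single study fails to select is discarded everywhere, whereas Cox-TOTEM aggregates evidence across studies before deciding.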
Simulation
Simulation setting
We conducted simulations to demonstrate the strength of Cox-TOTEM as
compared to other methods under various scenarios.
For each study k, we generated $n = 100$ observations, each with a
survival outcome, consisting of an observed censored survival time
$Y_i^{(k)}$ and a censoring indicator $\Delta_i^{(k)}$, and a vector of
$p = 2000$ features $X_i^{(k)} = (x_{i1}^{(k)}, \ldots, x_{ip}^{(k)})$.
$X_i^{(k)}$ was randomly sampled from a multivariate normal
distribution $N_{2000}(0, \Sigma^{(k)})$, where $\Sigma_{jj}^{(k)} = 1$
and $\Sigma_{jj'}^{(k)} = 0.5^{|j-j'|}$ for $1 \le j \ne j' \le p$. The
true survival time $T_i^{(k)}$ for each individual i was generated from
the Cox model with study-specific baseline hazard function
$\lambda_0^{(k)}$ and regression coefficients $\beta^{(k)}$ as
described below. The censoring time $C_i^{(k)}$ was generated from the
exponential distribution $C_i^{(k)} \sim \exp(r^{(k)})$ with the
study-specific parameter $r^{(k)}$ described below. We assumed only
right censoring, so the observed censored survival time $Y_i^{(k)}$ was
the minimum of the true survival time and the censoring time,
$Y_i^{(k)} = \min(T_i^{(k)}, C_i^{(k)})$, with censoring indicator
$\Delta_i^{(k)} = 1$ when $C_i^{(k)} > T_i^{(k)}$ and 0 otherwise. The
above scheme was applied to all $K = 5$ studies, and the simulations
were repeated $B = 100$ times.
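Under a constant baseline hazard, survival times following the Cox model can be drawn by inverse-transform sampling, $T = -\log(U)/(\lambda_0 \exp(X\beta))$ with $U \sim \mathrm{Uniform}(0,1)$. The sketch below generates one simulated study under the stated scheme (function name and defaults are illustrative, not the authors' code):

```python
import math
import random

def simulate_study(n, beta, baseline=1.0, cens_rate=0.2, rho=0.5, seed=0):
    """One simulated study: AR(1)-correlated Gaussian features, Cox-model
    survival times with constant baseline hazard, exponential censoring.
    Returns a list of (features, observed_time, event_indicator)."""
    rng = random.Random(seed)
    p = len(beta)
    data = []
    for _ in range(n):
        # AR(1) Gaussian features with unit variance: cor(x_j, x_j') = rho^|j-j'|
        x = [rng.gauss(0, 1)]
        for _ in range(p - 1):
            x.append(rho * x[-1] + math.sqrt(1 - rho ** 2) * rng.gauss(0, 1))
        eta = sum(b * xi for b, xi in zip(beta, x))
        # Inverse-transform sampling from the Cox model hazard
        t_true = -math.log(1.0 - rng.random()) / (baseline * math.exp(eta))
        c = -math.log(1.0 - rng.random()) / cens_rate  # censoring ~ Exp(r)
        y, delta = (t_true, 1) if t_true <= c else (c, 0)
        data.append((x, y, delta))
    return data
```

The observed time is the minimum of the true and censoring times, and the event indicator matches the paper's convention ($\Delta = 1$ when the event is observed).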
We assumed that only $s = 10$ features were true predictors with
nonzero coefficients; all other features had zero coefficients in all
studies. The ten features with nonzero coefficients were evenly spaced
among the p variables. We considered the following four scenarios for a
complete evaluation of our method:
* Homogeneous strong (Homo-S): we assumed strong signals in all
studies. Let $\mu_j = -1$ for the first five features with nonzero
coefficients, $\mu_j = 1$ for the next five features, and $\mu_j = 0$
for the remaining features. For homogeneous study effects, we set
$\beta_j^{(1)} = \beta_j^{(2)} = \cdots = \beta_j^{(K)} = \mu_j$ for
$1 \le j \le p$. We also set the study-specific baseline hazard
functions $\lambda_0^{(1)} = \lambda_0^{(2)} = \cdots = \lambda_0^{(K)} = 1$
and $r^{(1)} = r^{(2)} = \cdots = r^{(K)} = 0.2$ to be consistent
across all studies.
* Homogeneous weak (Homo-W): the setting is the same as in Homo-S,
except that we assumed weak signals in all studies: $\mu_j = -0.5$ for
the first five features with nonzero coefficients, $\mu_j = 0.5$ for
the next five features, and $\mu_j = 0$ for the remaining features.
* Heterogeneous strong (Hetero-S): we let $\mu_j = -1$ for the first
five features with nonzero coefficients, $\mu_j = 1$ for the next five
features, and $\mu_j = 0$ for the remaining features. To allow for
between-study variation in the magnitudes of the coefficients but not
in the sparsity pattern, we randomly sampled $\beta_j^{(k)}$ from
$N(\mu_j, 0.2^2)$ for $1 \le k \le K$ for the features with nonzero
coefficients and set
$\beta_j^{(1)} = \beta_j^{(2)} = \cdots = \beta_j^{(K)} = \mu_j = 0$
for the other features. In addition, we allowed for variation in the
baseline hazard functions and censoring distributions among studies:
we randomly drew $\lambda_0^{(k)}$ from $\exp(1)$ and $r^{(k)}$ from
$\{0.1, 0.3, 0.5\}$ for the $1 \le k \le K$ studies.
* Heterogeneous weak (Hetero-W): the setting is the same as in
Hetero-S, except that we assumed weak signals with $\mu_j = -0.5$ for
the first five features with nonzero coefficients and $\mu_j = 0.5$
for the next five features.
In addition to the case with $n = 100$ and $p = 2000$, we also
conducted simulations of higher dimension with $n = 200$ and
$p = 10{,}000$ and slightly weaker signals ($|\mu_j| = 0.7$ for the
true predictors in the strong scenarios and $|\mu_j| = 0.35$ in the
weak scenarios) and evaluated the performance under the same four
scenarios. In all simulations, we used $\alpha_1 = 10^{-4}$ and
$\alpha_2 = 0.05$ (and, for $p = 10{,}000$, $\alpha_2 = 0.01$) in the
screening stage and chose an optimal $\lambda$ via the proposed
cross-validation scheme in the regularization stage. To benchmark the
performance of variable selection, we evaluated the sensitivity and
specificity as well as the average number of features selected. In
addition, we also included a plot of sensitivity and specificity over
varying values of the tuning parameters for all methods, for a fair
comparison independent of tuning parameter selection. After model
selection, we fit a Cox model to all remaining features in each study
and computed the sum of squared errors (SSE),
$\sum_{j=1}^{p}\sum_{k=1}^{K}(\hat\beta_j^{(k)} - \beta_j^{(k)})^2$,
where $\hat\beta_j^{(k)}$ is the Cox model coefficient estimate for the
jth feature in the kth study, to assess parameter estimation. A smaller
SSE indicates more accurate parameter estimation.
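The evaluation metrics above are straightforward to compute from the selected set and the coefficient estimates; a minimal sketch (helper names are ours):

```python
def selection_metrics(selected, true_set, p):
    """Sensitivity and specificity of a selected feature set against the
    true predictor set, with p features in total."""
    selected, true_set = set(selected), set(true_set)
    tp = len(selected & true_set)       # true signals recovered
    fp = len(selected - true_set)       # noise features selected
    sensitivity = tp / len(true_set)
    specificity = 1 - fp / (p - len(true_set))
    return sensitivity, specificity

def sse(beta_hat, beta_true):
    """Sum of squared errors over all features (rows) and studies (columns)."""
    return sum((bh - bt) ** 2
               for row_h, row_t in zip(beta_hat, beta_true)
               for bh, bt in zip(row_h, row_t))
```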
Simulation results
Table 1 shows the variable selection and parameter estimation
performance with $n = 100$ and $p = 2000$. As compared to the other
four methods, Cox-TOTEM greatly improves the sensitivity with almost no
decrease in specificity in all four scenarios. The gain in sensitivity
is most noticeable when the signals are weak or when the studies are
more heterogeneous. This shows the advantage of our method in
aggregating the information of multiple studies and saving features
with weak signals in one or more studies. In most cases, Cox-TOTEM
retains a majority of the true signals and has more accurate parameter
estimation than the other methods. Figure 2 shows the plots of mean
sensitivity (left) and mean specificity (right) against the scaled
$\lambda$ values for all methods in the four scenarios. Increasing the
$\lambda$ value decreases the sensitivity and increases the specificity
for all methods. Over the range of $\lambda$ values, the curve of
Cox-TOTEM sits above those of the other methods in the sensitivity plot
while being almost indistinguishable from the others in the specificity
plot, which shows the advantage of our method regardless of the
selection of the best tuning parameter. Note that, in general, the
two-stage methods ("MinPSIS-InterCoxLasso" and "MinPSIS-InterCoxNet")
outperform the one-stage regularization methods ("InterCoxLasso" and
"InterCoxNet") as expected, encouraging the use of two-stage variable
selection methods in practice. The results are very consistent in the
case of $n = 200$ and $p = 10{,}000$, where Cox-TOTEM has the greatest
sensitivity among all methods in the heterogeneous and weak scenarios
(see Table S2 in the Supplement).
Table 1. Comparison of variable selection and parameter estimation
under the four different scenarios with $n = 100$, $p = 2000$ and
$s = 10$ true predictors.

Scenario | Method | Sensitivity | Specificity | Avg. # features selected | SSE
Homo-S | Cox-TOTEM | 0.985 (0.004) | 0.999 (0) | 12.26 (0.194) | 4.227 (0.269)
Homo-S | MinPSIS-InterCoxLasso | 0.856 (0.018) | 1 (0) | 9.4 (0.173) | 11.36 (1.02)
Homo-S | MinPSIS-InterCoxNet | 0.881 (0.013) | 0.997 (0.001) | 14.36 (0.244) | 10.79 (0.772)
Homo-S | InterCoxLasso | 0.37 (0.018) | 1 (0) | 3.7 (0.183) | 37.421 (0.775)
Homo-S | InterCoxNet | 0.53 (0.014) | 1 (0) | 5.31 (0.143) | 30.823 (0.684)
Homo-W | Cox-TOTEM | 0.977 (0.049) | 0.997 (0) | 16.6 (0.385) | 3.321 (0.203)
Homo-W | MinPSIS-InterCoxLasso | 0.464 (0.028) | 0.999 (0) | 5.17 (0.347) | 7.646 (0.311)
Homo-W | MinPSIS-InterCoxNet | 0.618 (0.019) | 0.997 (0) | 13.07 (0.384) | 6.804 (0.214)
Homo-W | InterCoxLasso | 0.049 (0.007) | 1 (0) | 0.49 (0.073) | 12.032 (0.073)
Homo-W | InterCoxNet | 0.143 (0.011) | 1 (0) | 1.43 (0.11) | 11.094 (0.117)
Hetero-S | Cox-TOTEM | 0.968 (0.006) | 0.998 (0) | 13.64 (0.261) | 7.969 (0.437)
Hetero-S | MinPSIS-InterCoxLasso | 0.723 (0.023) | 1 (0) | 7.98 (0.226) | 20.758 (1.27)
Hetero-S | MinPSIS-InterCoxNet | 0.774 (0.015) | 0.997 (0.001) | 13.41 (0.26) | 18.986 (0.943)
Hetero-S | InterCoxLasso | 0.258 (0.017) | 1 (0) | 2.58 (0.167) | 45.579 (0.716)
Hetero-S | InterCoxNet | 0.393 (0.015) | 1 (0) | 3.93 (0.146) | 39.842 (0.698)
Hetero-W | Cox-TOTEM | 0.885 (0.01) | 0.993 (0) | 22.68 (0.66) | 14.163 (0.163)
Hetero-W | MinPSIS-InterCoxLasso | 0.211 (0.017) | 1 (0) | 2.93 (0.241) | 13.118 (0.276)
Hetero-W | MinPSIS-InterCoxNet | 0.367 (0.016) | 0.997 (0) | 10.01 (0.4) | 11.813 (0.251)
Hetero-W | InterCoxLasso | 0.037 (0.006) | 1 (0) | 0.37 (0.06) | 15.738 (0.123)
Hetero-W | InterCoxNet | 0.083 (0.008) | 1 (0) | 0.83 (0.075) | 14.932 (0.154)

Mean results of 100 replications are reported with standard errors
shown in parentheses.
Figure 2. Plot of mean sensitivity (left) and mean specificity (right) against the scaled λ values for the (A) Homo-S, (B) Homo-W, (C) Hetero-S and (D) Hetero-W scenarios with n = 100 and p = 2000. Shades denote the standard deviations. The original λ values are rescaled in all methods for fair comparison.
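For concreteness, the sensitivity and specificity reported in Table 1 and Figure 2 can be computed from a selection result as follows. This is a minimal pure-Python sketch (the helper name `selection_metrics` is ours, not part of the CoxTOTEM package):

```python
def selection_metrics(selected, truth, p):
    """Sensitivity and specificity of a variable selection result.

    selected, truth: sets of indices of selected / truly nonzero features;
    p: total number of candidate features.
    """
    tp = len(selected & truth)   # true signals recovered
    fn = len(truth - selected)   # true signals missed
    fp = len(selected - truth)   # noise features wrongly selected
    tn = p - tp - fn - fp        # noise features correctly dropped
    return tp / (tp + fn), tn / (tn + fp)

# Toy example: s = 10 true signals among p = 2000 features;
# a method recovers 8 of them and adds 2 false positives.
truth = set(range(10))
selected = set(range(8)) | {100, 200}
sens, spec = selection_metrics(selected, truth, p=2000)
```

With only 10 true signals among 2000 features, even many false positives barely move the specificity, which is why the specificity curves in Figure 2 are nearly indistinguishable while the sensitivity curves separate clearly.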
Application to TCGA transcriptomic data
Data description
We then applied our method to transcriptomic data from the TCGA project, studying the gene expression profiles of four gynecological cancer types plus breast cancer (together known as Pan-Gyn cancers). Gynecologic cancers and breast cancer share a variety of characteristics: they arise from similar embryonic origins, and their development is influenced by female hormones, among other commonalities^33. Previous TCGA studies identified shared molecular features, including protein, miRNA, RNA and DNA methylation profiles, among these five cancer types^33,34. The purpose of this application is to identify survival-associated genes common to all five cancer types, which may provide more insight into their commonalities and into the underlying common mechanism of Pan-Gyn cancers in women.
We retrieved both the RNA-seq data and the clinical data containing survival outcomes of the five Pan-Gyn cancer types, namely high-grade serous ovarian cystadenocarcinoma (OV), uterine corpus endometrial carcinoma (UCEC), cervical squamous cell carcinoma and endocervical adenocarcinoma (CESC), uterine carcinosarcoma (UCS), and invasive breast carcinoma (BRCA), from the TCGA data repository on Broad GDAC Firehose (https://gdac.broadinstitute.org/). Table 2 shows the total sample size and the censoring proportion of each cancer type. The censoring proportions vary greatly across the cancer types, reflecting the heterogeneity in censoring patterns among studies that is commonly seen in real datasets. The processed RNA-seq data from TCGA include TPM (Transcripts Per Kilobase Million) values. We first merged the five datasets by matching gene symbols and applied quantile normalization to remove batch effects among the studies. Genes with mean TPM less than or equal to 5 were filtered out, leaving 11,998 genes. The data were then log2 transformed for the rest of the analysis.
Table 2. Overview of the samples from the five Pan-Gyn cancer types.

 | OV | UCEC | CESC | UCS | BRCA
Sample size | 304 | 170 | 301 | 57 | 981
Censoring proportion | 39% | 81% | 76% | 39% | 89%
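The preprocessing steps described above (quantile normalization, mean-TPM filtering, log2 transform) can be sketched in pure Python. The function names, the log2(x + 1) pseudocount, and the lack of tie handling in the normalization are our own simplifying choices, not details taken from the paper:

```python
import math

def quantile_normalize(matrix):
    """Quantile-normalize columns (samples) of a genes x samples matrix.

    Each column is replaced by the across-column mean of the sorted values,
    assigned back in that column's rank order (ties not handled, for brevity).
    """
    n_genes = len(matrix)
    cols = list(zip(*matrix))
    sorted_cols = [sorted(c) for c in cols]
    # Reference distribution: mean of the k-th smallest value over columns.
    ref = [sum(vals) / len(cols) for vals in zip(*sorted_cols)]
    out = [[0.0] * len(cols) for _ in range(n_genes)]
    for j, col in enumerate(cols):
        order = sorted(range(n_genes), key=lambda i: col[i])
        for rank, i in enumerate(order):
            out[i][j] = ref[rank]
    return out

def preprocess(tpm, threshold=5.0):
    """Normalize, drop genes with mean TPM <= threshold, then log2(x + 1)."""
    normed = quantile_normalize(tpm)
    kept = [row for row in normed if sum(row) / len(row) > threshold]
    return [[math.log2(x + 1) for x in row] for row in kept]
```

In the paper the five cohorts are first merged by gene symbol; the sketch assumes that merge has already produced a single genes-by-samples matrix.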
Results
We applied our method, as well as the two-stage methods "MinPSIS-InterCoxLasso" and "MinPSIS-InterCoxNet", to the Pan-Gyn example. Considering the relatively high dimension, we set α2 = 0.01 in the screening stage of our method and retained 268 genes after screening. The regularization stage further refined the pool to a final set of 29 genes. In contrast, "MinPSIS-InterCoxNet" and "MinPSIS-InterCoxLasso" detected only 15 genes and one gene, respectively. Table 3 lists the five genes with the largest absolute coefficient estimates among the 29 genes identified by Cox-TOTEM. None of these genes were selected by the two competing methods. When fitting a Cox model with these genes in each cancer type, the p-values vary across cancer types and are typically small in more than one cancer type, demonstrating the strength of Cox-TOTEM in selecting survival-associated genes even when their signals are weak in some studies. PPAP2C, a member of the phosphatidic acid phosphatase family, has been found to be associated with endometrioid carcinoma and with ovarian cancer metastasis^35,36. SPRR2E is part of the human epidermal differentiation complex, which has been related to more than 20 different cancers including breast and endometrial cancers^37,38. When we split the samples into a high-SPRR2E-expression group (> median) and a low-SPRR2E-expression group (< median) and compare their Kaplan–Meier curves in the five studies (Fig. 3A), we see a clear separation of the two curves in UCEC, CESC and BRCA, but not in the other two studies, consistent with the results in Table 3. A complete list of the 29 genes identified by Cox-TOTEM, together with their marginal Cox model results, can be found in Table S3 of the Supplement.
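The SPRR2E stratification behind Fig. 3A rests on two simple ingredients: a median split of the expression values and the Kaplan–Meier estimator. A self-contained Python sketch (helper names are ours):

```python
def median_split(expr):
    """Indices of samples with expression > median (high) and < median (low)."""
    s = sorted(expr)
    n = len(s)
    med = s[n // 2] if n % 2 else (s[n // 2 - 1] + s[n // 2]) / 2
    high = [i for i, x in enumerate(expr) if x > med]
    low = [i for i, x in enumerate(expr) if x < med]
    return high, low

def kaplan_meier(times, events):
    """Kaplan-Meier curve as a list of (event time, S(t)) pairs.

    times: follow-up times; events: 1 = death observed, 0 = censored.
    """
    data = sorted(zip(times, events))
    n_at_risk = len(data)
    surv, curve, i = 1.0, [], 0
    while i < len(data):
        t = data[i][0]
        deaths = count = 0
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            count += 1
            i += 1
        if deaths:  # censored-only times leave the curve unchanged
            surv *= 1 - deaths / n_at_risk
            curve.append((t, surv))
        n_at_risk -= count
    return curve
```

Applying `kaplan_meier` separately to the high and low groups within each cohort reproduces the kind of per-study curve comparison shown in Fig. 3A.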
Table 3. The five genes with the largest absolute coefficient estimates among the 29 genes selected by Cox-TOTEM, with their coefficient estimates and p-values (in parentheses) when fitting a Cox model in each cancer type.

Gene | OV | UCEC | CESC | UCS | BRCA | MinPSIS-InterCoxLasso | MinPSIS-InterCoxNet
PPAP2C | −.214 (.007**) | −.291 (.131·) | −.130 (.244) | .155 (.455) | −.074 (.594) | N.S. | N.S.
SLC19A1 | .240 (.006**) | .180 (.368) | −.232 (.089*) | .158 (.521) | .115 (.330) | N.S. | N.S.
SPRR2E | −.058 (.619) | 2.002 (.009**) | −.300 (.095*) | .072 (.654) | .128 (.079*) | N.S. | N.S.
EIF4E3 | −.177 (.041*) | −.079 (.632) | −.234 (.119·) | −.240 (.215) | −.201 (.074*) | N.S. | N.S.
IFRD2 | −.133 (.118·) | −.274 (.237) | .349 (.029*) | −.188 (.473) | .129 (.197) | N.S. | N.S.

None of these genes were selected by MinPSIS-InterCoxLasso or MinPSIS-InterCoxNet. Significance codes: ** p < 0.01; * p < 0.1; · p < 0.15. N.S.: not selected.
Figure 3. (A) Comparison of Kaplan–Meier curves between samples with high SPRR2E expression (> median) and those with low SPRR2E expression (< median) in the five studies. (B) Top pathways enriched with the 29 survival-associated genes identified by Cox-TOTEM; x-axis: −log10(p-value), y-axis: pathway names.
Since the underlying truth is unknown for real data, we also performed a pathway enrichment analysis of the 29 Pan-Gyn cancer survival-associated genes identified by Cox-TOTEM using the GO, KEGG^39,40 and Reactome databases for more biological insight. Figure 3B shows the top 12 enriched pathways, sorted by their −log10 p-values from Fisher's exact test. Among these top pathways, we found important biological processes specific to female physiology, such as GO:BP ovulation cycle and GO:BP embryo implantation. The enrichment in these GO pathways may reflect similar origins and developmental mechanisms among the five Pan-Gyn cancer types and supports our gene findings^33,34.
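Each enrichment p-value in Fig. 3B comes from a one-sided Fisher's exact test, which for a single pathway reduces to a hypergeometric tail probability. A minimal sketch (the function name and argument layout are ours):

```python
from math import comb

def fisher_enrichment_p(k, m, n, p_total):
    """One-sided Fisher's exact p-value for pathway enrichment.

    k: selected genes falling in the pathway; m: selected genes overall
    (here 29); n: pathway size; p_total: total genes tested (here 11,998).
    Returns the probability of observing k or more pathway hits by chance.
    """
    upper = min(m, n)  # cannot hit the pathway more times than either set allows
    return sum(comb(n, i) * comb(p_total - n, m - i)
               for i in range(k, upper + 1)) / comb(p_total, m)
```

Ranking pathways by −log10 of this p-value, as in Fig. 3B, is then a one-liner over the pathway collection.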
Discussion
In this paper, we proposed a Cox model based two-stage variable selection method, Cox-TOTEM, to identify survival-associated biomarkers from multiple genomic studies. In the first stage, we applied a two-step aggregation screening procedure that utilizes the standardized coefficient estimates from marginal Cox models in each study and tests whether each feature's aggregate effect across the studies is strong enough for the feature to be retained. In the second stage, we started with the set of features that passed screening and applied a group lasso penalty to the Cox partial log-likelihoods to select features simultaneously in all studies. From a meta-analytic perspective, the method improves the accuracy of variable selection in high dimensions by borrowing information from multiple studies. A similar advantage has been reported in the literature when combining multiple studies for more accurate and robust classification and clustering^41,42. In addition, the method is computationally favorable: it greatly reduces the dimension via fast screening in the first stage and applies the efficient ADMM algorithm in the second stage. Simulations with four scenarios describing cross-study patterns showed that the method greatly improves sensitivity, especially when the signal strengths are weak or the studies are heterogeneous. A real application of the method to RNA-seq data from TCGA identified important genes associated with survival in five Pan-Gyn cancer types. These genes were enriched in pathways specific to female physiology, suggesting a common disease mechanism behind these cancer types.
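The first-stage idea of aggregating per-study evidence can be illustrated with a simplified sketch. The paper's actual screening statistic is developed in its Methods section and is not reproduced here; as a stand-in, this sketch combines per-study two-sided p-values (from the standardized marginal Cox coefficients) with Fisher's method, which is our own assumption:

```python
import math

def normal_sf2(z):
    """Two-sided p-value of a standard normal z-score."""
    return math.erfc(abs(z) / math.sqrt(2.0))

def chi2_sf_even(x, df):
    """Survival function of a chi-square variable with even df (closed form)."""
    k = df // 2
    term, total = 1.0, 1.0
    for i in range(1, k):
        term *= (x / 2.0) / i
        total += term
    return math.exp(-x / 2.0) * total

def aggregate_screen(z_by_feature, alpha=0.01):
    """Retain features whose per-study z-scores are jointly significant.

    z_by_feature: for each feature, the standardized marginal Cox
    coefficients (z-scores) from the K studies. Per-study p-values are
    combined with Fisher's method: -2 * sum(log p) ~ chi-square with 2K df.
    NOTE: illustrative stand-in, not the paper's screening statistic.
    """
    kept = []
    for j, zs in enumerate(z_by_feature):
        stat = -2.0 * sum(math.log(max(normal_sf2(z), 1e-300)) for z in zs)
        if chi2_sf_even(stat, 2 * len(zs)) < alpha:
            kept.append(j)
    return kept
```

The point of any such aggregation is the one the simulations make: a feature with moderate z-scores in every study can pass a joint test even though no single study's p-value would survive a per-study threshold.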
Few methods have addressed survival analysis when combining multiple high-dimensional data sets; our proposed method is one of the first to target this significant but overlooked problem with a focus on variable selection. One important merit of our method is that we allow the studies to have different signal strengths for the same features, as well as different baseline hazard rates and censoring distributions. Such study heterogeneity is commonly seen in the biomedical literature when researchers perform survival analysis in multiple cohorts using Cox models^43,44. In this paper, we only considered the selection of continuous covariates (e.g. gene expression) in the simulations and real data analysis; for a more general application with multi-omics data, the model can also include ordinal variables (e.g. genotype) or binary variables (e.g. DNA methylation). In cases where the proportional hazards assumption does not hold in some studies, the method can be readily extended to perform variable selection in other parametric or nonparametric survival models. The performance of such extensions needs to be further explored in both simulations and real data applications.
Other meta-analysis methods for the differential expression analysis of transcriptomic studies also consider heterogeneous sparsity patterns across studies, i.e. the set of associated biomarkers differs across genomic studies^45. When biomarkers uniquely associated with survival in specific studies are of interest, alternative regularization methods, e.g. the sparse group lasso^46, can be used in the second stage to encourage the selection of study-specific survival-associated biomarkers. In addition, the directionality of effect sizes across studies is another major practical consideration in meta-analysis: for the same feature, some studies may have positive coefficients, and thus an increased risk of failure, while others show a decreased risk. This phenomenon was also observed in our real data example. The method can be modified to accommodate such scenarios; the details are beyond the scope of this paper and left for future work.
The current method uses the MLE of the coefficients from the marginal Cox model in the screening stage and applies a group lasso penalty to the partial log-likelihoods in the regularization stage. Because some signals may be jointly, but not marginally, associated with the survival outcome, one could instead use the MLE from a full model with a sparsity restriction to account for the joint effects of features in the screening process^47. In addition to the group lasso, other types of group penalties such as the group bridge or group SCAD^48 can also be applied. An R package, CoxTOTEM, implementing our method is publicly available at https://github.com/kehongjie/CoxTOTEM.
Supplementary Information
Supplementary Information 1 (198.4 KB, pdf)
Acknowledgements