Abstract

Detection of prognostic factors associated with patients’ survival outcomes helps gain insight into a disease and guide treatment decisions. The rapid advancement of high-throughput technologies has yielded plentiful genomic biomarkers as candidate prognostic factors, but most are of limited use in clinical application. As the price of the technology drops over time, many genomic studies are conducted to explore a common scientific question in different cohorts to identify more reproducible and credible biomarkers. However, new challenges arise from heterogeneity in study populations and designs when jointly analyzing the multiple studies. For example, patients from different cohorts show different demographic characteristics and risk profiles. Existing high-dimensional variable selection methods for survival analysis, however, are restricted to single-study analysis. We propose a novel Cox model based two-stage variable selection method called “Cox-TOTEM” to detect survival-associated biomarkers common to multiple genomic studies. Simulations showed that our method greatly improved the sensitivity of variable selection compared to separate applications of existing methods to each study, especially when the signals are weak or the studies are heterogeneous. An application of our method to TCGA transcriptomic data identified essential survival-associated genes related to the common disease mechanism of five Pan-Gynecologic cancers.

Subject terms: Statistics, Bioinformatics

Introduction

Many biomedical studies aim to understand the progression of a disease and identify prognostic factors of patients’ survival time after standard treatment. Traditional well-known clinical prognostic factors for diseases such as cancer often have limited prognostic and predictive power^[36]1. The rapid advancement of high-throughput technology in the past two decades has generated an enormous amount of genomic data at different levels (e.g.
genetic variants, gene expression and DNA methylation) and enabled the tailoring of medical treatment to the individual molecular characteristics of each patient^[37]2. Detection of prognostic genomic biomarkers is useful for selecting patients who will likely benefit from a specific clinical intervention in precision medicine. Over the years, various prognostic genomic biomarkers have been reported in the literature, but a majority of them are of limited use outside research^[38]3. As the cost of the technology becomes more affordable, many genomic studies are conducted to explore a common scientific question in different cohorts or populations. For example, the Pan-Cancer Atlas initiated by The Cancer Genome Atlas (TCGA) consortium studied multi-platform molecular profiles spanning 33 cancer types and identified common somatic mutations and other genetic variations across multiple tumor lineages^[39]4,[40]5. Numerous clinical trials have been conducted to detect prognostic molecular markers for survival in different cancer types^[41]6–[42]9. Compared to a single study, discovery of biomarkers from multiple studies is more reproducible and credible, showing stronger evidence of a true association with potential for clinical utility^[43]10. However, the intrinsic population heterogeneity that exists in different cohorts creates a barrier to an effective integration of multiple studies. For example, the high inter-tumor and intra-tumor heterogeneity exhibited across tumor types presents heterogeneous patient populations with different risk profiles in Pan-cancer analyses^[44]11,[45]12. Additional difficulty arises when the time-to-event outcomes are censored at different time points due to different study durations. Because of these challenges, no methods have been developed to detect survival-associated biomarkers from multiple studies while accounting for the heterogeneity in study populations and designs.
Standard models for the analysis of survival data with censoring include the nonparametric Kaplan–Meier method^[46]13, the semi-parametric Cox proportional hazards model, abbreviated as the Cox model^[47]14, and other parametric regression models^[48]15. Due to the high-dimensional nature of genomic data (i.e., the number of features greatly exceeds the sample size), there is a serious collinearity issue in fitting a prediction model with a limited sample size, so these survival models do not directly apply. Classical model selection methods, such as best subset selection with the AIC^[49]16 or BIC^[50]17 criterion, suffer from an NP-hard combinatorial problem in the presence of a large number of features. There are two major approaches in modern variable selection for high-dimensional data. The first approach directly applies popular regularization methods such as the Lasso^[51]18 and the elastic net^[52]19. These regularization methods have been adapted to survival analysis in recent years based on the Cox model^[53]20–[54]22 or parametric models^[55]23,[56]24 to detect features associated with the survival outcome. The second approach implements a two-stage procedure: first reduce the dimension of the feature space from high to moderate size via sure independence screening (SIS; sure screening or screening for short), and then apply regularization methods to refine the final pool of features. First introduced by Fan et al.^[57]25, SIS comprises a series of methods that select features based on their marginal associations with the response. Zhao et al.^[58]26 proposed a Cox SIS procedure that fits marginal Cox models to reduce the dimension in the analysis of censored data. The two-stage procedure has advantages in computational efficiency and algorithmic stability and has been widely applied in genomic data analysis. However, all the aforementioned variable selection methods are restricted to single-study analysis.
In this paper, we propose a novel Cox model based two-stage variable selection method for the detection of survival-associated biomarkers common to multiple genomic studies. In the first stage, we extend the screening procedure for the linear model in Ma et al.^[59]27 to perform sure screening with multiple studies in high-dimensional Cox regression. In the second stage, we penalize the partial log-likelihoods with a group lasso penalty across multiple studies to select the final set of features in all studies simultaneously. The proposed method is statistically attractive in that it allows the studies to have different signal strengths, baseline hazard rates and censoring distributions, and it effectively integrates the information from multiple heterogeneous studies. The method is also computationally favorable: it implements a fast marginal screening in the first stage and solves the regularization problem in the second stage with an efficient ADMM algorithm^[60]28. To the best of our knowledge, the proposed method is the first to address the high-dimensional variable selection problem in survival models with multiple studies. We also extend other regularization and two-stage methods to multiple-study versions to compare with our method under various simulation scenarios and to demonstrate our method’s strength, especially when the signals are weak or the studies are heterogeneous. An application of our method to TCGA Gynecologic and breast (Pan-Gynecologic or Pan-Gyn) cancer transcriptomic data detects essential survival-associated genes related to the common disease mechanism of Pan-Gyn cancers. The paper is structured as follows. In Sect. [61]2, we discuss variable selection in the Cox model with both single and multiple studies and introduce our novel two-stage variable selection methodology for multiple studies. Section [62]3 evaluates the performance of our method under four different simulation scenarios. In Sect.
[63]4, we analyze TCGA Pan-Gyn cancer data with the proposed methodology. Discussion and final conclusions are presented in Sect. [64]5.

Methods

Variable selection in Cox model with single study

Suppose there are n subjects with right-censored survival data. The outcome of the ith ( [MATH: 1 \le i \le n :MATH] ) subject is [MATH: O_i = (Y_i, \Delta_i) :MATH], where [MATH: Y_i :MATH] is the observed survival time and [MATH: \Delta_i :MATH] is the censoring indicator. When right censoring occurs, the censoring time [MATH: C_i :MATH] comes before the true survival time [MATH: T_i :MATH]; in this case, we observe only [MATH: Y_i = C_i :MATH] and [MATH: \Delta_i = 0 :MATH]. When the subject is not censored, we observe the true survival time [MATH: Y_i = T_i :MATH] with [MATH: \Delta_i = 1 :MATH]. The Cox model is a popular semi-parametric survival model for investigating the association between the survival outcome and explanatory variables X, such as gene expression, by modeling the hazard function as

[MATH: \lambda(t|X) = \lambda_0(t)\exp(X\beta), :MATH] (1)

where [MATH: X = (x_1, \ldots, x_p) :MATH] is the feature vector with p the number of features, [MATH: \beta = (\beta_1, \ldots, \beta_p)^T :MATH] is the corresponding regression coefficient vector, and [MATH: \lambda_0(t) :MATH] is the baseline hazard function. In the low-dimensional case where p is smaller than n, Cox^[65]14 proposed maximizing the following partial log-likelihood to estimate [MATH: \beta :MATH]:

[MATH: l(\beta) = \sum_{i=1}^{n} \Delta_i \log \frac{\exp(X_i\beta)}{\sum_{j \in R(Y_i)} \exp(X_j\beta)}, :MATH] (2)

where the risk set R(t) is the set of indices j of subjects whose outcome has not yet been observed by time t (i.e., [MATH: Y_j \ge t :MATH] ). When there are far more features than observations, a reasonable assumption made in the variable selection literature, and also adopted in this paper, is the sparsity assumption that only a small set of features are true predictors of the response.
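As a concrete illustration of Eq. (2) (our own sketch, not the authors’ implementation), the partial log-likelihood can be evaluated directly with NumPy. The function name `cox_partial_loglik` is hypothetical, and the sketch assumes no tied event times:

```python
import numpy as np

def cox_partial_loglik(beta, X, y, delta):
    """Cox partial log-likelihood l(beta) of Eq. (2).

    beta : (p,) coefficient vector
    X    : (n, p) feature matrix
    y    : (n,) observed times Y_i
    delta: (n,) event indicators (1 = event observed, 0 = censored)
    """
    eta = X @ beta                 # linear predictors X_i * beta
    ll = 0.0
    for i in range(len(y)):
        if delta[i] == 1:
            at_risk = y >= y[i]    # risk set R(Y_i): subjects with Y_j >= Y_i
            ll += eta[i] - np.log(np.sum(np.exp(eta[at_risk])))
    return ll
```

A handy sanity check: with beta set to zero, each observed event contributes minus the log of its risk-set size.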
The following penalized partial log-likelihood is maximized to estimate [MATH: \beta :MATH]:

[MATH: pl(\beta) = l(\beta) - \sum_{j=1}^{p} P_\lambda(|\beta_j|), :MATH] (3)

where the penalty term [MATH: P_\lambda(\cdot) :MATH] can take different forms (popular choices include the [MATH: L_1 :MATH] lasso penalty^[66]20 and the [MATH: L_1/L_2 :MATH] elastic net penalty^[67]22). The tuning parameter [MATH: \lambda :MATH] controls the balance between the strength of the coefficients and the penalty level and is usually chosen by cross-validation. The final model consists of the features with nonzero coefficient estimates. These regularization methods are generally sub-optimal when p grows at an exponential rate of n (common in most genomic studies) because of algorithmic stability and accuracy issues. A common practice is a two-stage procedure: first reduce the dimension of the feature set by screening out unimportant features that have no marginal association with the outcome, and then apply regularization. The screening methods used in the first stage possess the sure screening property that, with high probability, the features remaining after screening still include the true set of predictors^[68]25. Bühlmann et al.^[69]29 further proposed a partial faithfulness assumption, which states that a zero marginal correlation implies a zero regression coefficient in linear models, to support the use of sure screening methods. In survival analysis, the principled Cox sure independence screening procedure (“PSIS”) was proposed to screen out covariates that have no association with the survival outcome in marginal Cox models^[70]26.

Variable selection in the Cox model with multiple studies

Suppose we have data from K genomic studies where each study k ( [MATH: 1 \le k \le K :MATH] ) has [MATH: n_k :MATH] observations.
For study k, we assume the Cox model with conditional hazard function

[MATH: \lambda^{(k)}(t|X^{(k)}) = \lambda_0^{(k)}(t)\exp(X^{(k)}\beta^{(k)}), :MATH] (4)

where [MATH: \lambda_0^{(k)}(t) :MATH] is the study-specific baseline hazard function, [MATH: \beta^{(k)} = (\beta_1^{(k)}, \ldots, \beta_p^{(k)})^T :MATH] is the vector of study-specific regression coefficients, and [MATH: X^{(k)} = (x_1^{(k)}, \ldots, x_p^{(k)}) :MATH] contains the features collected in the kth study. The goal of variable selection with multiple studies is to identify the active predictors [MATH: x_j^{(k)} :MATH] with [MATH: j \in M = \{j : \beta_j^{(k)} \ne 0 \text{ for all } k\} :MATH]. Here we assume that all studies share the same sparsity pattern (i.e., [MATH: \beta_j^{(k)} = 0 :MATH] for all k or [MATH: \beta_j^{(k)} \ne 0 :MATH] for all k). Under this assumption, we can accumulate potentially weak signals from each study to yield strong evidence of active predictors. Note that because one can easily select the common features across studies, we assume that p is common to all studies, so that [MATH: x_j^{(k)} :MATH] and [MATH: x_j^{(l)} :MATH] correspond to the same feature j in studies k and l. The extension to different p in different studies is straightforward and will be discussed elsewhere. The first general framework for variable selection with multiple studies was proposed by Ma et al.^[71]27, where the active predictors are assumed to be common to all studies but the signal strengths (i.e., the magnitudes of [MATH: |\beta_j^{(k)}| :MATH] ) of those active predictors can vary among studies. By combining the marginal correlations between the outcome and each feature across studies, Ma et al.^[71]27 extended sure independence screening to multiple studies in linear models, a method known as Two-Step Aggregation Sure Independence Screening (“TSA-SIS”).
Within this framework of a common sparsity pattern with varying signal strengths, the inclusion of multiple studies yields more evidence for removing unimportant features during screening, since a zero marginal correlation in any study implies a zero regression coefficient for that feature by the partial faithfulness assumption. The same framework has also been adopted for the integrative analysis of multiple high-dimensional -omics datasets to identify disease-related biomarkers with consistent effects across studies using logistic regression^[72]30. Our proposed methodology is a non-trivial extension of the multiple-study sure screening procedure of Ma et al.^[73]27 to multiple Cox proportional hazards models. Unlike simpler parametric models such as linear and logistic regression, the Cox model has the baseline hazard function as an infinite-dimensional functional parameter beyond the regression coefficients of interest. In linear and logistic regression, heterogeneity across multiple studies appears only as different signal strengths in [MATH: \beta_j^{(k)}, k = 1, \ldots, K :MATH]. In our survival models, heterogeneity also enters through the study-specific baseline hazard functions [MATH: \lambda_0^{(k)}(t), k = 1, \ldots, K :MATH], in addition to the signal strengths of the features. Another source of heterogeneity is the different censoring distributions across studies. When right censoring occurs, the exact survival time is not observed but is only known to exceed a certain time. In the context of multiple studies, for example, one study may end at one year while another ends at five years; censored subjects in the former then have survival times known only to exceed one year, while the times to event in the latter are known only to exceed five years. In this paper, we address this challenging issue of additional heterogeneity in the multiple Cox models via the partial likelihood approach.
To this end, we assume that the survival time and censoring time are conditionally independent given the covariates. This key assumption enables us to separate the multiple censoring distributions from the estimation of the study-specific baseline hazard functions and regression coefficients. In the medical literature, the conditional independence assumption is standard for a single study; it is thus reasonable to assume the same for each of the multiple studies in our variable selection problem. In this paper, we propose the Cox model based TwO-sTage variable sElection procedure for Multiple studies, namely “Cox-TOTEM”, to detect survival-associated biomarkers common to multiple genomic studies. In the first stage, we extend the TSA-SIS procedure for the linear model to perform sure screening in the Cox models with multiple studies by utilizing the standardized coefficient estimates from marginal Cox models. This procedure reduces the false negative errors (i.e., missing important features) and selects the features with nonzero coefficients in all studies. Beyond this extension of the sure screening approach, we further refine the pool of features using the penalized partial likelihood in the second stage. Specifically, we propose a novel group lasso penalty on the partial log-likelihoods of the Cox models to obtain the final set of features. It is natural to adopt a group lasso penalty by treating the coefficients of the same feature from multiple studies as one group. This group lasso penalty achieves the selection of a feature either in all studies or in none of them (“all-in-or-all-out”), thus identifying a common set of features across multiple studies. Below, we describe our methodology in detail.

The “Cox-TOTEM” method

Figure [74]1 shows a flowchart of the proposed two-stage method Cox-TOTEM.
In the first screening stage, we start by fitting a marginal Cox model with the single covariate [MATH: x_j^{(k)} :MATH] to study k (i.e., [MATH: \lambda^{(k)}(t|x_j^{(k)}) = \lambda_0^{(k)}(t)\exp(x_j^{(k)}\beta_j^{(k)}) :MATH] ). The marginal partial likelihood estimator [MATH: \tilde{\beta}_j^{(k)} :MATH] for the jth feature in the kth study is a preliminary estimate used for sure screening in the first stage. Define the observed information to be [MATH: I(\tilde{\beta}_j^{(k)}) :MATH] ( [MATH: = 1/\widehat{\mathrm{var}}(\tilde{\beta}_j^{(k)}) :MATH] ). The corresponding standardized marginal coefficient estimator is [MATH: \tilde{z}_j^{(k)} = I(\tilde{\beta}_j^{(k)})^{1/2}\tilde{\beta}_j^{(k)} :MATH]. By standard asymptotic theory for the partial likelihood estimator, [MATH: \tilde{z}_j^{(k)} :MATH] asymptotically follows the standard normal distribution when [MATH: \beta_j^{(k)} = 0 :MATH]. Extending the TSA-SIS procedure to the Cox models, we use [MATH: |\tilde{z}_j^{(k)}| :MATH] to identify the studies with potential zero coefficients (i.e., weak signals or noise with small [MATH: |\tilde{z}_j^{(k)}| :MATH] ) for each feature j in the first step:

[MATH: \hat{L}_j = \{k : |\tilde{z}_j^{(k)}| \le \Phi^{-1}(1 - \alpha_1/2)\}, :MATH] (5)

where [MATH: \Phi :MATH] is the CDF of the standard normal distribution. The choice of [MATH: \alpha_1 :MATH] determines the threshold separating strong signals from weak signals. This first step does not screen out any features; instead, it separates potential zero and nonzero coefficients in preparation for the second step.

Figure 1. Flowchart of the proposed two-stage method Cox-TOTEM. The screening procedure applies to each feature j ( [MATH: 1 \le j \le p :MATH] ). “✓” means passed to the next step or stage, while “X” means not passed to the next step or stage.

Let [MATH: \hat{\kappa}_j = |\hat{L}_j| :MATH] be the cardinality of [MATH: \hat{L}_j :MATH]. The second step tests whether the aggregate effect of the studies with potential zero coefficients identified in [MATH: \hat{L}_j :MATH] is strong enough for the jth feature to be retained in the screening stage.
Define the statistic [MATH: U_j = \sum_{k \in \hat{L}_j} (\tilde{z}_j^{(k)})^2 :MATH], which approximately follows a [MATH: \chi^2_{\hat{\kappa}_j} :MATH] distribution with [MATH: \hat{\kappa}_j :MATH] degrees of freedom. We retain the set of features with large [MATH: U_j :MATH] or with [MATH: \hat{\kappa}_j = 0 :MATH]:

[MATH: \hat{M}^{[1]} = \{j \in [p] : U_j > \varphi_{\hat{\kappa}_j}^{-1}(1 - \alpha_2) \text{ or } \hat{\kappa}_j = 0\}, :MATH] (6)

where [MATH: [p] = \{1, 2, \ldots, p\} :MATH] and [MATH: \varphi_{\hat{\kappa}_j} :MATH] is the CDF of the chi-square distribution with [MATH: \hat{\kappa}_j :MATH] degrees of freedom. The key tuning parameter [MATH: \alpha_2 :MATH] determines how many features are retained in the screening stage. The second step takes the sum of squares of the [MATH: \tilde{z}_j^{(k)} :MATH] from the studies with potential zero coefficients as the test statistic for the jth feature and performs the actual screening: if the aggregate effect is strong enough, we keep the feature, and we screen it out otherwise. In this way, the procedure potentially saves important features that have weak signals in individual studies but a strong aggregate effect. Define [MATH: d_1 = |\hat{M}^{[1]}| :MATH]; after the screening stage, [MATH: d_1 :MATH] features common to all studies remain and move on to the second stage. Following screening, we add a group lasso penalty to the partial log-likelihoods to select the final set of features common to all studies in the regularization stage:

[MATH: \hat{\beta} = \arg\max_{\beta} \sum_{k=1}^{K} l(\beta_{\hat{M}^{[1]}}^{(k)}) - \lambda \sum_{j \in \hat{M}^{[1]}} \|\beta_j\|_2, :MATH] (7)

where [MATH: \beta_j = (\beta_j^{(1)}, \ldots, \beta_j^{(K)}) :MATH] and [MATH: \lambda :MATH] is the tuning parameter. Note that this group lasso penalty differs from the regular group lasso penalty, in which different features form pre-defined groups (e.g., multiple genes forming a pathway^[77]31). Here we treat the coefficients of the same feature from multiple studies as one group to identify a common set of biomarkers associated with survival. An ADMM algorithm is used to solve the above optimization problem; detailed steps of the algorithm can be found in the Supplement. Define the set [MATH: \hat{M}^{[2]} = \{j \in [p] : \hat{\beta}_j \ne 0\} :MATH] and [MATH: d_2 = |\hat{M}^{[2]}| :MATH].
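To make the two screening steps (Eqs. (5) and (6)) concrete, the sketch below (our illustration, not the authors’ released code; `totem_screen` is a hypothetical name) applies them to a matrix of standardized marginal coefficients:

```python
import numpy as np
from scipy import stats

def totem_screen(z, alpha1=1e-4, alpha2=0.05):
    """Screening-stage sketch of Cox-TOTEM.

    z: (p, K) matrix of standardized marginal Cox coefficients
       z-tilde_j^(k), one row per feature, one column per study.
    Returns a boolean mask of the features retained in M-hat^[1].
    """
    cut = stats.norm.ppf(1 - alpha1 / 2)        # Phi^{-1}(1 - alpha1/2)
    weak = np.abs(z) <= cut                     # step 1: studies in L-hat_j
    kappa = weak.sum(axis=1)                    # kappa-hat_j = |L-hat_j|
    U = np.where(weak, z**2, 0.0).sum(axis=1)   # U_j over potential zeros
    # step 2: keep a feature if kappa_j = 0 (all studies strong) or if
    # the aggregate effect U_j exceeds the chi-square(kappa_j) quantile
    thresh = stats.chi2.ppf(1 - alpha2, df=np.maximum(kappa, 1))
    return (kappa == 0) | (U > thresh)
```

In this sketch a feature passes either because every study individually shows a strong signal, or because the pooled evidence from its weak-signal studies is jointly significant.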
[MATH: \hat{M}^{[2]} :MATH] is the final set of common features we select. The Cox-TOTEM algorithm can be summarized as below. (The algorithm box appears as a figure in the original article.)

Selection of tuning parameters

The proposed algorithm involves three tuning parameters: [MATH: \alpha_1 :MATH] and [MATH: \alpha_2 :MATH] in the screening stage and [MATH: \lambda :MATH] in the regularization stage. Since screening is an intermediate step, the choice of [MATH: \alpha_1 :MATH] and [MATH: \alpha_2 :MATH] should be conservative, making as few false negative errors as possible (i.e., high sensitivity) while not retaining too many false positives (i.e., not too low specificity). A small [MATH: \alpha_1 :MATH] places a stringent cutoff on selecting strong signals, so more studies are included in the second step, leading to more false positive errors. On the other hand, a large [MATH: \alpha_1 :MATH] tends to select very few studies, causing more false negative errors. In practice, a small [MATH: \alpha_1 :MATH] is generally recommended to reduce the more serious false negative errors. The parameter [MATH: \alpha_2 :MATH] is the threshold used in the actual screening step, which directly determines how many features are retained. It serves the same role as the retained model size d in the original sure independence screening method^[79]25 or the significance level [MATH: \alpha :MATH] in the PC-simple algorithm^[80]29. In practice, we recommend setting [MATH: \alpha_2 = 0.05 :MATH], the popular significance level used in hypothesis testing. We performed a sensitivity analysis using simulation and showed that [MATH: \alpha_1 = 10^{-4} :MATH] and [MATH: \alpha_2 = 0.05 :MATH] achieved a good balance between sensitivity and specificity in the screening stage (see Table [81]S1 in the Supplement). When the number of features p becomes even larger (e.g., [MATH: p \ge 10{,}000 :MATH], as in our real data application) and more features need to be removed during screening, we recommend the more stringent cutoff [MATH: \alpha_2 = 0.01 :MATH].
Since screening is by nature a fast step, we generally suggest that users follow the aforementioned guidelines rather than use computationally intensive data-splitting or resampling procedures to choose [MATH: \alpha_1 :MATH] and [MATH: \alpha_2 :MATH], unless otherwise specified. To determine the optimal value of [MATH: \lambda :MATH] in the regularization stage, we propose a multi-study cross-validation procedure, applying K-fold cross-validation within each study and using individual survival prediction as the evaluation criterion. Details of the scheme can be found in the Supplement.

Methods for comparison

Since no existing methods specifically address the variable selection problem in survival models with multiple studies, we compare our method to multiple-study extensions of the Cox-model-specific screening and regularization methods for a single study. With the target index set [MATH: M :MATH], we screen out a feature when any study has a zero marginal coefficient for that feature. Thus, the multiple-study extension of the original PSIS^[82]26 for sure screening in the Cox model, which ranks the features by the minimum absolute standardized marginal coefficient across studies, can be regarded as MinPSIS. Likewise, the multiple-study extensions of regularization methods such as CoxLasso (lasso penalty in the Cox model) and CoxNet (elastic net penalty in the Cox model) take the intersection of the features identified in each study, abbreviated as InterCoxLasso and InterCoxNet. In the simulations, we compare our method to a total of four methods, including the two-stage methods “MinPSIS-InterCoxLasso” and “MinPSIS-InterCoxNet”, as well as the direct regularization methods “InterCoxLasso” and “InterCoxNet”. For the retained model size d in the screening stage of the other two-stage methods, we suggest the more conservative cutoff of [MATH: d = n :MATH] to mitigate missing true signals.
For regularization methods, we use cross-validation to select the optimal tuning parameter as implemented in the R package “glmnet”^[83]22,[84]32.

Simulation

Simulation setting

We conducted simulations to demonstrate the strength of Cox-TOTEM compared to other methods under various scenarios. For each study k, we generated [MATH: n = 100 :MATH] observations, each with a survival outcome, consisting of an observed censored survival time [MATH: Y_i^{(k)} :MATH] and a censoring indicator [MATH: \Delta_i^{(k)} :MATH], and a vector of [MATH: p = 2000 :MATH] features [MATH: X_i^{(k)} = (x_{i1}^{(k)}, \ldots, x_{ip}^{(k)}) :MATH]. [MATH: X_i^{(k)} :MATH] was randomly sampled from the multivariate normal distribution [MATH: N_{2000}(0, \Sigma^{(k)}) :MATH], where [MATH: \Sigma_{jj}^{(k)} = 1 :MATH] and [MATH: \Sigma_{jj'}^{(k)} = 0.5^{|j - j'|} :MATH] for [MATH: 1 \le j \ne j' \le p :MATH]. The true survival time [MATH: T_i^{(k)} :MATH] for each individual i was generated from the Cox model with study-specific baseline hazard function [MATH: \lambda_0^{(k)} :MATH] and regression coefficients [MATH: \beta^{(k)} :MATH] described below. The censoring time [MATH: C_i^{(k)} :MATH] was generated from the exponential distribution [MATH: C_i^{(k)} \sim \mathrm{Exp}(r^{(k)}) :MATH] with the study-specific parameter [MATH: r^{(k)} :MATH] described below. We assumed only right censoring, so the observed censored survival time was the minimum of the true survival time and the censoring time, [MATH: Y_i^{(k)} = \min(T_i^{(k)}, C_i^{(k)}) :MATH], with censoring indicator [MATH: \Delta_i^{(k)} = 1 :MATH] when [MATH: C_i^{(k)} > T_i^{(k)} :MATH] and 0 otherwise. The above scheme was applied to all [MATH: K = 5 :MATH] studies and the simulations were repeated [MATH: B = 100 :MATH] times. We assumed that only [MATH: s = 10 :MATH] features were true predictors with nonzero coefficients and that all other features had zero coefficients in all studies. The ten features with nonzero coefficients were evenly spaced among the p variables.
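The data-generating scheme above can be sketched as follows (an illustrative reconstruction, not the authors’ code; `simulate_study` is a hypothetical name). With a constant baseline hazard, the Cox survival time reduces to an exponential draw with a subject-specific rate:

```python
import numpy as np

def simulate_study(n, p, beta, lam0, r, rng):
    """Generate one study's right-censored data under a Cox model
    with constant baseline hazard lam0 and Exp(r) censoring."""
    # AR(1)-type covariance: Sigma_{jj'} = 0.5^{|j - j'|}
    idx = np.arange(p)
    Sigma = 0.5 ** np.abs(idx[:, None] - idx[None, :])
    X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)
    rate = lam0 * np.exp(X @ beta)        # subject-specific hazard
    T = rng.exponential(1.0 / rate)       # true survival times
    C = rng.exponential(1.0 / r, size=n)  # censoring times ~ Exp(rate r)
    Y = np.minimum(T, C)                  # observed censored times
    delta = (C > T).astype(int)           # 1 = event observed, 0 = censored
    return X, Y, delta
```

NumPy parameterizes the exponential by its scale (1/rate), hence the reciprocals above.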
We considered the following four scenarios for a complete evaluation of our method:

* Homogeneous strong (Homo-S): we assumed strong signals in all studies. Let [MATH: \mu_j = -1 :MATH] for the first five features with nonzero coefficients, [MATH: \mu_j = 1 :MATH] for the next five, and [MATH: \mu_j = 0 :MATH] for the remaining features. For homogeneous study effects, we set [MATH: \beta_j^{(1)} = \beta_j^{(2)} = \cdots = \beta_j^{(K)} = \mu_j :MATH] for [MATH: 1 \le j \le p :MATH]. We also set the study-specific baseline hazard functions [MATH: \lambda_0^{(1)} = \lambda_0^{(2)} = \cdots = \lambda_0^{(K)} = 1 :MATH] and [MATH: r^{(1)} = r^{(2)} = \cdots = r^{(K)} = 0.2 :MATH] to be identical across all studies.

* Homogeneous weak (Homo-W): the setting is the same as Homo-S, except that now we assumed weak signals in all studies: [MATH: \mu_j = -0.5 :MATH] for the first five features with nonzero coefficients, [MATH: \mu_j = 0.5 :MATH] for the next five, and [MATH: \mu_j = 0 :MATH] for the rest.

* Heterogeneous strong (Hetero-S): we let [MATH: \mu_j = -1 :MATH] for the first five features with nonzero coefficients, [MATH: \mu_j = 1 :MATH] for the next five, and [MATH: \mu_j = 0 :MATH] for the rest. To allow between-study variation in the magnitudes of the coefficients but not in the sparsity pattern, we randomly sampled [MATH: \beta_j^{(k)} :MATH] from [MATH: N(\mu_j, 0.2^2) :MATH] for [MATH: 1 \le k \le K :MATH] for the features with nonzero coefficients, and set [MATH: \beta_j^{(1)} = \beta_j^{(2)} = \cdots = \beta_j^{(K)} = \mu_j = 0 :MATH] for the other features. In addition, we allowed variation in the baseline hazard functions and censoring distributions among studies: we randomly drew [MATH: \lambda_0^{(k)} :MATH] from [MATH: \mathrm{Exp}(1) :MATH] and [MATH: r^{(k)} :MATH] from [MATH: \{0.1, 0.3, 0.5\} :MATH] for the [MATH: 1 \le k \le K :MATH] studies.

* Heterogeneous weak (Hetero-W): the setting is the same as Hetero-S, except that we assumed weak signals with [MATH: \mu_j = -0.5 :MATH] for the first five features with nonzero coefficients and [MATH: \mu_j = 0.5 :MATH] for the next five.
In addition to the case with [MATH: n = 100 :MATH] and [MATH: p = 2000 :MATH], we also conducted simulations of higher dimension with [MATH: n = 200 :MATH] and [MATH: p = 10{,}000 :MATH] and slightly weaker signals ( [MATH: |\mu_j| = 0.7 :MATH] for the true predictors in the strong scenarios and [MATH: |\mu_j| = 0.35 :MATH] in the weak scenarios) and evaluated the performance under the same four scenarios. In all simulations, we used [MATH: \alpha_1 = 10^{-4} :MATH] and [MATH: \alpha_2 = 0.05 :MATH] (for [MATH: p = 10{,}000 :MATH], [MATH: \alpha_2 = 0.01 :MATH] ) in the screening stage and chose an optimal [MATH: \lambda :MATH] via the proposed cross-validation scheme in the regularization stage. To benchmark variable selection performance, we evaluated the sensitivity and specificity as well as the average number of features selected. In addition, we included a plot of sensitivity and specificity over varying values of the tuning parameters of all methods for a fair comparison independent of tuning parameter selection. After model selection, we fit a Cox model to all remaining features in each study and computed the sum of squared errors (SSE) [MATH: \sum_{j=1}^{p} \sum_{k=1}^{K} (\hat{\beta}_j^{(k)} - \beta_j^{(k)})^2 :MATH], where [MATH: \hat{\beta}_j^{(k)} :MATH] is the Cox model coefficient estimate for the jth feature in the kth study, to assess parameter estimation. A smaller SSE indicates more accurate parameter estimation.

Simulation results

Table [85]1 shows the variable selection and parameter estimation performance with [MATH: n = 100 :MATH] and [MATH: p = 2000 :MATH]. Compared to the other four methods, Cox-TOTEM greatly improves the sensitivity with almost no decrease in specificity in all four scenarios. The gain in sensitivity is most noticeable when the signals are weak or when the studies are more heterogeneous. This shows the advantage of our method in aggregating information across multiple studies and saving features with weak signals in one or more studies.
In most cases, Cox-TOTEM retains a majority of the true signals and has more accurate parameter estimation than the other methods. Figure [86]2 shows plots of the mean sensitivity (left) and mean specificity (right) against the scaled [MATH: \lambda :MATH] values for all methods in the four scenarios. Increasing the [MATH: \lambda :MATH] value decreases the sensitivity and increases the specificity for all methods. Across the varying [MATH: \lambda :MATH] values, the curve of Cox-TOTEM sits above those of the other methods in the sensitivity plot while remaining almost indistinguishable from the others in the specificity plot, which shows the advantage of our method regardless of the selection of the best tuning parameter. Note that, in general, the two-stage methods (“MinPSIS-InterCoxLasso” and “MinPSIS-InterCoxNet”) outperform the one-stage regularization methods (“InterCoxLasso” and “InterCoxNet”) as expected, encouraging the use of two-stage variable selection methods in practice. The results are very consistent in the case of [MATH: n = 200 :MATH] and [MATH: p = 10{,}000 :MATH], where Cox-TOTEM has the greatest sensitivity of all methods in the heterogeneous and weak scenarios (see Table [87]S2 in the Supplement).

Table 1. Comparison of variable selection and parameter estimation under the four different scenarios with [MATH: n = 100 :MATH], [MATH: p = 2000 :MATH] and [MATH: s = 10 :MATH] true predictors.
Scenario | Method | Sensitivity | Specificity | Average number of features selected | SSE
Homo-S | Cox-TOTEM | 0.985 (0.004) | 0.999 (0) | 12.26 (0.194) | 4.227 (0.269)
Homo-S | MinPSIS-InterCoxLasso | 0.856 (0.018) | 1 (0) | 9.4 (0.173) | 11.36 (1.02)
Homo-S | MinPSIS-InterCoxNet | 0.881 (0.013) | 0.997 (0.001) | 14.36 (0.244) | 10.79 (0.772)
Homo-S | InterCoxLasso | 0.37 (0.018) | 1 (0) | 3.7 (0.183) | 37.421 (0.775)
Homo-S | InterCoxNet | 0.53 (0.014) | 1 (0) | 5.31 (0.143) | 30.823 (0.684)
Homo-W | Cox-TOTEM | 0.977 (0.049) | 0.997 (0) | 16.6 (0.385) | 3.321 (0.203)
Homo-W | MinPSIS-InterCoxLasso | 0.464 (0.028) | 0.999 (0) | 5.17 (0.347) | 7.646 (0.311)
Homo-W | MinPSIS-InterCoxNet | 0.618 (0.019) | 0.997 (0) | 13.07 (0.384) | 6.804 (0.214)
Homo-W | InterCoxLasso | 0.049 (0.007) | 1 (0) | 0.49 (0.073) | 12.032 (0.073)
Homo-W | InterCoxNet | 0.143 (0.011) | 1 (0) | 1.43 (0.11) | 11.094 (0.117)
Hetero-S | Cox-TOTEM | 0.968 (0.006) | 0.998 (0) | 13.64 (0.261) | 7.969 (0.437)
Hetero-S | MinPSIS-InterCoxLasso | 0.723 (0.023) | 1 (0) | 7.98 (0.226) | 20.758 (1.27)
Hetero-S | MinPSIS-InterCoxNet | 0.774 (0.015) | 0.997 (0.001) | 13.41 (0.26) | 18.986 (0.943)
Hetero-S | InterCoxLasso | 0.258 (0.017) | 1 (0) | 2.58 (0.167) | 45.579 (0.716)
Hetero-S | InterCoxNet | 0.393 (0.015) | 1 (0) | 3.93 (0.146) | 39.842 (0.698)
Hetero-W | Cox-TOTEM | 0.885 (0.01) | 0.993 (0) | 22.68 (0.66) | 14.163 (0.163)
Hetero-W | MinPSIS-InterCoxLasso | 0.211 (0.017) | 1 (0) | 2.93 (0.241) | 13.118 (0.276)
Hetero-W | MinPSIS-InterCoxNet | 0.367 (0.016) | 0.997 (0) | 10.01 (0.4) | 11.813 (0.251)
Hetero-W | InterCoxLasso | 0.037 (0.006) | 1 (0) | 0.37 (0.06) | 15.738 (0.123)
Hetero-W | InterCoxNet | 0.083 (0.008) | 1 (0) | 0.83 (0.075) | 14.932 (0.154)

Mean results of 100 replications are reported with standard errors shown in the parentheses.

Figure 2. Plot of mean sensitivity (left) and mean specificity (right) against the scaled [MATH: \lambda :MATH] values for (A) Homo-S; (B) Homo-W; (C) Hetero-S; (D) Hetero-W scenarios with [MATH: n = 100 :MATH] and [MATH: p = 2000 :MATH]. Shaded areas denote the standard deviations. The original [MATH: \lambda :MATH] values are rescaled in all methods for fair comparison.
Application to TCGA transcriptomic data

Data description

We then applied our method to transcriptomic data from the TCGA project, studying the gene expression profiles of four gynecological cancer types plus breast cancer (together known as the Pan-Gyn cancers). Gynecologic cancers and breast cancer share a variety of characteristics: they arise from similar embryonic origins and their development is influenced by female hormones, among other commonalities^[90]33. Previous TCGA studies identified shared molecular features at the protein, miRNA, RNA and DNA methylation levels among these five cancer types^[91]33,[92]34. The purpose of this application is to identify survival-associated genes common to all five cancer types, which may provide more insight into their commonalities and the underlying common mechanism of Pan-Gyn cancers in women. We retrieved both the RNA-seq data and the clinical data containing survival outcomes for the five Pan-Gyn cancer types, namely high-grade serous ovarian cystadenocarcinoma (OV), uterine corpus endometrial carcinoma (UCEC), cervical squamous cell carcinoma and endocervical adenocarcinoma (CESC), uterine carcinosarcoma (UCS), and invasive breast carcinoma (BRCA), from the TCGA data repository on Broad GDAC Firehose ([93]https://gdac.broadinstitute.org/). Table [94]2 shows the total sample size and the censoring proportion of each cancer type. The censoring distributions vary greatly across the cancer types, reflecting the heterogeneity in censoring patterns among studies commonly seen in real datasets. The processed RNA-seq data from TCGA include TPM (transcripts per million) values. We first merged the five datasets by matching gene symbols and applied quantile normalization to mitigate batch effects among the different studies. Genes with mean TPM less than or equal to 5 were filtered out, leaving 11,998 genes. The data were then log2-transformed for the remainder of the analysis. Table 2.
Overview of the samples from the five Pan-Gyn cancer types.

                      OV    UCEC  CESC  UCS   BRCA
Sample size           304   170   301   57    981
Censoring proportion  39%   81%   76%   39%   89%

Results

We applied our method, as well as the two-stage methods “MinPSIS-InterCoxLasso” and “MinPSIS-InterCoxNet”, to the Pan-Gyn example. Considering the relatively high dimension, we set α2 = 0.01 in the screening stage of our method and retained 268 genes after screening. The regularization stage further refined the pool to a final set of 29 genes. In contrast, “MinPSIS-InterCoxNet” and “MinPSIS-InterCoxLasso” detected only 15 genes and one gene, respectively. Table [96]3 lists the five genes with the largest absolute coefficient estimates among the 29 genes identified by Cox-TOTEM. None of these genes were selected by the two competing methods. When fitting a Cox model with these genes in each cancer type, the p-values vary across cancer types and are typically small in more than one cancer type. This demonstrates the strength of Cox-TOTEM in selecting survival-associated genes whose signals may be weak in some studies. PPAP2C, a member of the phosphatidic acid phosphatase family, has been found to be associated with endometrioid carcinoma and ovarian cancer metastasis^[97]35,[98]36. SPRR2E is part of the human epidermal differentiation complex, which has been related to more than 20 different cancers including breast and endometrial cancers^[99]37,[100]38. When we split the samples into a group with high SPRR2E expression (> median) and a group with low SPRR2E expression (< median) and compare their Kaplan–Meier curves in the five studies (Fig. [101]3A), we see a clear separation of the two curves in UCEC, CESC and BRCA, but not in the other two studies, consistent with the results presented in Table [102]3.
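The median-split comparison above relies on the Kaplan–Meier estimator of each group’s survival curve. A minimal self-contained sketch of the estimator (with toy survival times, not the TCGA values):

```python
def kaplan_meier(times, events):
    """Kaplan-Meier survival estimates S(t) at each distinct event time.

    times: follow-up times; events: 1 = death observed, 0 = censored.
    Returns a list of (time, survival probability) pairs.
    """
    order = sorted(range(len(times)), key=lambda i: times[i])
    at_risk = len(times)
    surv, curve = 1.0, []
    i = 0
    while i < len(order):
        t = times[order[i]]
        deaths = ties = 0
        while i < len(order) and times[order[i]] == t:
            deaths += events[order[i]]  # count deaths at time t
            ties += 1                   # count everyone leaving the risk set
            i += 1
        if deaths:
            surv *= 1 - deaths / at_risk
            curve.append((t, surv))
        at_risk -= ties
    return curve

# Toy group of 6 samples; censored subjects (event = 0) leave the risk
# set without producing a step in the curve.
times  = [5, 8, 12, 20, 25, 30]
events = [1, 1, 0, 1, 0, 1]
print(kaplan_meier(times, events))
```

In the Fig. 3A analysis, one would run this once per expression group and study; a formal comparison of the two curves would use a log-rank test.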
A complete list of the 29 genes identified by Cox-TOTEM, together with their marginal Cox model results, can be found in Table [103]S3 of the Supplement.

Table 3. The five genes with the largest absolute coefficient estimates among the 29 genes selected by Cox-TOTEM, with their Cox model coefficient estimates and p-values (in parentheses) in each cancer type.

Gene     OV              UCEC            CESC            UCS            BRCA            MinPSIS-InterCoxLasso  MinPSIS-InterCoxNet
PPAP2C   −.214 (.007**)  −.291 (.131·)   −.130 (.244)    .155 (.455)    −.074 (.594)    N.S.                   N.S.
SLC19A1  .240 (.006**)   .180 (.368)     −.232 (.089*)   .158 (.521)    .115 (.330)     N.S.                   N.S.
SPRR2E   −.058 (.619)    2.002 (.009**)  −.300 (.095*)   .072 (.654)    .128 (.079*)    N.S.                   N.S.
EIF4E3   −.177 (.041*)   −.079 (.632)    −.234 (.119·)   −.240 (.215)   −.201 (.074*)   N.S.                   N.S.
IFRD2    −.133 (.118·)   −.274 (.237)    .349 (.029*)    −.188 (.473)   .129 (.197)     N.S.                   N.S.

None of these genes were selected by MinPSIS-InterCoxLasso or MinPSIS-InterCoxNet. Significance codes: ** < 0.01, * < 0.1, · < 0.15; N.S.: not selected.

Figure 3. (A) Comparison of Kaplan–Meier curves between samples with high SPRR2E expression (> median) and those with low SPRR2E expression (< median) in the five studies. (B) Top pathways enriched with the 29 survival-associated genes identified by Cox-TOTEM; x-axis: −log10(p-value); y-axis: pathway names.
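Enrichment p-values like those plotted in Fig. 3B are one-sided Fisher’s exact (hypergeometric tail) probabilities of observing at least the realized overlap between the selected gene list and a pathway. A minimal sketch with made-up counts (the pathway size and overlap below are illustrative, not the paper’s):

```python
from math import comb

def enrichment_pvalue(k, n, K, N):
    """One-sided Fisher's exact (hypergeometric tail) p-value:
    probability of drawing >= k pathway genes when n genes are
    selected from a universe of N genes, K of which are in the pathway."""
    total = comb(N, n)
    return sum(comb(K, i) * comb(N - K, n - i)
               for i in range(k, min(K, n) + 1)) / total

# Illustrative numbers: 29 selected genes out of a universe of 11,998
# (the post-filtering gene count), and a hypothetical 100-gene pathway
# containing 4 of the 29 selected genes.
p = enrichment_pvalue(4, 29, 100, 11998)
print(p < 0.001)  # → True: far more overlap than expected by chance
```

The expected overlap here is only about 29 × 100 / 11998 ≈ 0.24 genes, so an overlap of 4 yields a very small p-value, i.e. a tall bar on the −log10(p-value) axis.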
Since the underlying truth is unknown for real data, we also performed a pathway enrichment analysis of the 29 Pan-Gyn cancer survival-associated genes identified by Cox-TOTEM, using the GO, KEGG^[107]39,[108]40 and Reactome databases, for more biological insight. Figure [109]3B shows the top 12 enriched pathways, sorted by their −log10 p-values from Fisher’s exact test. Among these top pathways, we found pathways of important biological processes specific to female physiology, such as GO:BP ovulation cycle and GO:BP embryo implantation. The enrichment in these GO pathways may reflect the similar origins and developmental mechanisms shared by the five Pan-Gyn cancer types and supports our gene findings^[110]33,[111]34.

Discussion

In this paper, we proposed a Cox model based two-stage variable selection method called Cox-TOTEM to identify survival-associated biomarkers from multiple genomic studies. In the first stage, we apply a two-step aggregation screening procedure that utilizes the standardized coefficient estimates from marginal Cox models in each study and tests whether each feature’s aggregate effect across the studies is strong enough for the feature to be retained. In the second stage, we start from the set of features surviving the screening and apply a group lasso penalty to the Cox partial log-likelihoods to select features simultaneously in all studies. From a meta-analytic perspective, the method improves the accuracy of variable selection in high dimensions by borrowing information across studies. A similar advantage has been reported in the literature when combining multiple studies for more accurate and robust classification and clustering^[112]41,[113]42. In addition, the method is computationally efficient: the fast screening in the first stage greatly reduces the dimension, and the second stage applies an efficient ADMM algorithm.
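As a rough illustration of the first-stage idea, standardized study-level effects can be aggregated per feature and compared against a chi-square-type cutoff. The sketch below is a simplified stand-in for the two-step aggregation screening, not the exact Cox-TOTEM statistic: it sums squared study-level z-scores and calibrates the cutoff by Monte Carlo under the global null:

```python
import random

def screen_features(z, alpha=0.05, n_draws=200_000, seed=0):
    """Keep feature j if T_j = sum_s z[j][s]**2 exceeds the empirical
    (1 - alpha) quantile of a sum of S squared standard normals
    (i.e. a Monte Carlo chi-square cutoff with S degrees of freedom).

    z: one row per feature, one column per study, holding standardized
    marginal Cox coefficient estimates. Illustrative only.
    """
    S = len(z[0])
    rng = random.Random(seed)
    null = sorted(sum(rng.gauss(0, 1) ** 2 for _ in range(S))
                  for _ in range(n_draws))
    cutoff = null[int((1 - alpha) * n_draws)]
    return [j for j, row in enumerate(z)
            if sum(v * v for v in row) > cutoff]

# Feature 0 carries signal in all three studies; feature 1 does not.
z = [[3.1, 2.8, 3.4],   # aggregate T ≈ 29, well above the cutoff (≈ 7.8)
     [0.2, -0.5, 0.1]]  # aggregate T ≈ 0.3
print(screen_features(z))  # → [0]
```

Aggregating across studies is what lets weak per-study signals (as in the Homo-W and Hetero-W scenarios) survive screening even when no single study’s z-score is large.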
Simulations under four scenarios describing cross-study patterns showed that the method greatly improves sensitivity, especially when the signal strengths are weak or the studies are heterogeneous. A real application of the method to RNA-seq data from TCGA identified important genes associated with survival in five Pan-Gyn cancer types. These genes were enriched in pathways specific to female physiology, suggesting a common disease mechanism behind these cancer types. Few methods have addressed survival analysis when combining multiple high-dimensional datasets; our proposed method is among the first to target this significant but overlooked problem with a focus on variable selection. One important merit of our method is that we allow the studies to have different signal strengths for the same features, as well as different baseline hazard rates and censoring distributions. Such study heterogeneity is commonly seen in the biomedical literature when researchers perform survival analysis in multiple cohorts using Cox models^[114]43,[115]44. In this paper, we only considered the selection of continuous covariates (e.g. gene expression) in the simulations and real data analysis; for more general applications with multi-omics data, the model can also include ordinal variables (e.g. genotype) or binary variables (e.g. DNA methylation status). In cases where the proportional hazards assumption does not hold in some studies, the method can be readily extended to perform variable selection in other parametric or nonparametric survival models. The performance of such extensions needs to be explored further in both simulations and real data applications. Other meta-analysis methods for the differential expression analysis of transcriptomic studies also consider heterogeneous sparsity patterns across studies, i.e. the set of associated biomarkers differs across genomic studies^[116]45.
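The group lasso penalty used in the second stage ties each feature’s coefficients across studies into one group, so a feature is kept or dropped in all studies simultaneously; in ADMM-type algorithms the corresponding coefficient update reduces to group soft-thresholding. A generic sketch of that operator (a standard textbook form, not the paper’s exact update):

```python
import math

def group_soft_threshold(beta_group, lam):
    """Proximal operator of the group lasso penalty lam * ||beta||_2:
    shrink the whole group toward zero, and set it exactly to zero
    when its Euclidean norm falls below lam."""
    norm = math.sqrt(sum(b * b for b in beta_group))
    if norm <= lam:
        return [0.0] * len(beta_group)  # feature dropped in every study
    scale = 1 - lam / norm
    return [scale * b for b in beta_group]

# One feature's coefficients across S = 3 studies: a strong group is
# shrunk but retained, while a weak group is zeroed out entirely.
print(group_soft_threshold([3.0, 4.0, 0.0], 1.0))   # norm 5 -> scale 0.8
print(group_soft_threshold([0.1, -0.2, 0.1], 1.0))  # norm < 1 -> all zeros
```

This all-or-nothing behavior is exactly what enforces a common selected set across studies; a sparse group lasso, as suggested above, would add a within-group penalty so that study-specific zeros become possible.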
In cases where biomarkers uniquely associated with survival in specific studies are of interest, alternative regularization methods, e.g. the sparse group lasso^[117]46, can be used in the second stage to encourage the selection of study-specific survival-associated biomarkers. In addition, the directionality of effect sizes across studies is another major practical consideration in meta-analysis: for the same feature, some studies may have positive coefficients (increasing the risk of failure) while others have negative coefficients (decreasing the risk). This phenomenon was also observed in our real data example. The method can be modified to accommodate such scenarios; the details are beyond the scope of this paper and are left for future work. The current method uses the maximum likelihood estimates of the coefficients from marginal Cox models in the screening stage and applies a group lasso penalty to the partial log-likelihoods in the regularization stage. Since some signals may be jointly, but not marginally, associated with the survival outcome, one may instead use the MLE of a full model with a sparsity restriction to account for joint effects of features in the screening process^[118]47. In addition to the group lasso, other types of group penalties such as the group bridge or group SCAD^[119]48 can also be applied. An R package, CoxTOTEM, implementing our method is publicly available at [120]https://github.com/kehongjie/CoxTOTEM.

Supplementary Information

[121]Supplementary Information 1 (198.4 KB, PDF)

Acknowledgements