Abstract Cancer prognosis is of essential interest, and extensive research has been conducted searching for biomarkers with prognostic power. Recent studies have shown that both omics profiles and histopathological imaging features have prognostic power. There are also studies exploring integrating the two types of measurements for prognosis modeling. However, there is a lack of study rigorously examining whether omics measurements have independent prognostic power conditional on histopathological imaging features, and vice versa. In this article, we adopt a rigorous statistical testing framework and test whether an individual gene expression measurement can improve prognosis modeling conditional on high-dimensional imaging features, and a parallel analysis is conducted reversing the roles of gene expressions and imaging features. In the analysis of The Cancer Genome Atlas (TCGA) lung adenocarcinoma and liver hepatocellular carcinoma data, it is found that multiple individual genes, conditional on imaging features, can lead to significant improvement in prognosis modeling; however, individual imaging features, conditional on gene expressions, only offer limited prognostic power. Being among the first to examine the independent prognostic power, this study may assist better understanding the “connectedness” between omics profiles and histopathological imaging features and provide important insights for data integration in cancer modeling. Keywords: cancer prognosis, independent prognostic power, omics profiles, histopathological imaging features 1. Introduction In cancer research and practice, prognosis is of essential interest. Extensive statistical modeling has been conducted, and yet there is still much room for additional research [[28]1,[29]2]. Multiple families of biomarkers have been used in cancer prognosis modeling. In the past decades, with the development of profiling techniques, omics data have been extensively used and shown to be effective. For example, in [[30]3], a signature composed of 21 gene expressions is used for modeling breast cancer prognosis. In [[31]4], hsa-mir-155 and hsa-let-7a-2 microRNAs are found as prognostic for lung cancer. In [[32]5], the prognostic power of methylated RASSF1A and/or APC serum DNA for breast cancer is identified. Such findings are biologically highly sensible as cancer is a genetic disease and the development and progression of cancer are strongly affected by molecular changes. Imaging techniques have been extensively used in cancer practice. In particular, for definitive diagnosis, biopsies are usually ordered, and pathologists review the slides of representative sections of tissues. The histopathological imaging data generated in this process directly reflect the histological organization and morphological characteristics of tumor cells and their surrounding tumor microenvironment. More recently, beyond diagnosis, histopathological imaging data have also been used for modeling other cancer outcomes/phenotypes. For example, Harpole et al. [[33]6] showed that the levels of tumor cell dedifferentiation are associated with survival outcomes. Automated histopathological imaging analysis systems have been shown to be effective in the prognostic determination of various malignancies, including breast cancer [[34]7], neuroblastoma [[35]8], lymphoma [[36]9], nonsmall cell lung cancer [[37]10], precancerous lesions in the esophagus [[38]11], and others. Omics profiles and histopathological images are biologically connected and contain overlapping information. In particular, properties of tumors and their microenvironment, as described in histopathological images, are highly regulated by molecular changes. Multiple statistical modelings have been conducted on their interconnections. For example, Yu et al. [[39]10] found the correlation of quantitative histopathological features with TP53 mutation in lung adenocarcinoma. Cooper et al. [[40]12] demonstrated that PDGFRA, EGFR, and MDM2 amplifications are associated with imaging features such as greater minor axis length, area, and circularity in glioblastoma. On the other hand, it is also possible that omics profiles and histopathological images may contain independent information. Tissues used to generate omics measurements are usually heterogeneous and mixtures of different components, making the observed omics measurements represent an aggregation of distinct cells [[41]12]. Histopathological images, with extremely high spatial resolutions, also contain information on tissue relationships and characteristics of the spatial organizations of tumor cells. Features of histopathological images, beyond regulated by molecular changes, may also be affected by personal immune system, microenvironment, and other factors. There are a few recent studies integrating omics measurements and imaging features and showing that such a data integration can lead to improved prognosis modeling for breast cancer [[42]13], brain tumor [[43]14], and nonsmall cell lung cancer [[44]15]. Such studies seem to suggest that there is independent information in omics and histopathological imaging data. Our literature review suggests that there is a lack of study rigorously quantifying whether omics profiles contain independent information, conditional on histopathological images, on cancer prognosis, and vice versa. It is noted that the aforementioned and other data integration studies do not suggest which type of measurement has independent information. In addition, they are mostly estimation-based and do not have a rigorous statistical inference framework. Some studies rely on prescreening to accommodate high data dimensionality and can be statistically limited. This study is conducted to directly tackle these problems. It can advance from the existing literature in multiple aspects. Specifically, it delivers an analysis pipeline which has a rigorous statistical inference framework and can show whether individuals of a type of measurement have additional prognostic power conditional on the other type of measurement. It can provide solid evidence on whether data integration is needed in modeling and clinical practice. In addition, data analysis may also provide further insights into two deadly cancers. 2. Data The Cancer Genome Atlas (TCGA) is one of the largest and most comprehensive cancer projects, and is organized by the National Cancer Institute (NCI) and National Human Genome Research Institute (NHGRI) [[45]16]. It has published high quality omics data and histopathological images for 33 types of cancer. In this study, we analyze data on lung adenocarcinoma (LUAD) and liver hepatocellular carcinoma (LIHC), both of which have high mortality rates and pose increasing public concerns. They also have relatively larger sample sizes, which is critical for making reliable findings. The proposed analysis can be extended to other cancer types straightforwardly. For omics measurements, we consider gene expressions, which have a central role in cancer prognosis modeling. The response of interest is overall survival time, which is right censored. It is noted that other types of omics measurements, for example protein expressions and microRNAs, and other prognosis outcomes can be analyzed in a similar manner. 2.1. mRNA Gene Expression Measurements We download the mRNA gene expression data from cBioPortal ([46]http://cbioportal.org), which have been measured using the Illumina Hiseq2000 RNA Sequencing Version 2 analysis platform, and processed and normalized using the RSEM software. We refer to the literature [[47]17,[48]18] and TCGA website for more detailed information on data collection and processing. Data are available on 25,031 gene expression measurements. As the number of cancer-related genes is not expected to be large, we conduct a simple prescreening to reduce the dimensionality and also improve the stability of estimation. Specifically, the top 5000 genes with the largest variances are selected for downstream analysis. 2.2. Histopathological Imaging Features We download the whole-slide histopathology images directly from the TCGA website ([49]https://portal.gdc.cancer.gov), which are in the svs format. These tissue slides are formalin-fixed, paraffin-embedded slides with which the cell morphology is well-preserved and thus appropriate for image feature recognition. They are captured at 20× or 40× magnification by the Aperio medical scanner. To extract imaging features, we conduct the following three-step preprocessing. The overall flowchart is provided in [50]Figure 1. Figure 1. [51]Figure 1 [52]Open in a new tab Flowchart of extracting imaging features. Step 1: whole-slide histopathology images are cropped into small subimages of 500 × 500 pixels, and 20 subimages are then randomly selected. Step 2: Imaging features are extracted using CellProfiler [[53]19] for each subimage. Step 3: For each patient, features are averaged. In the first step, we process the downloaded svs files using the Openslide Python library [[54]20]. Specifically, to make data in an appropriate format for morphological feature extraction, each svs image is cropped into small subimages of 500 × 500 pixels, and each subimage is then saved in the tiff format. Among them, subimages that contain mostly (>50%) background white space are filtered out. From the remaining subimages, twenty representative ones are randomly selected for each svs image to avoid potential subjective bias and reduce computational cost [[55]21]. In the second step, we use CellProfiler [[56]19] to extract features from each subimage. CellProfiler is an open source cell image analysis platform developed by the Broad Institute. Specifically, image colors are separated based on hematoxylin staining and eosin staining, and converted to grayscale to extract regional features, such as cell and tissue texture and granularity. These are “classic” image processing features, which have been examined in a large number of imaging studies. It is noted that they are not expected to be specific histopathological features, and cannot be recognized by pathologist. More specifically, texture describes a set of metrics calculated in CellProfiler to quantify the perceived texture of histopathological images, and includes information on the spatial arrangement of color or intensities in images. Granularity describes the size of how well the structure element fits in images. Then, cell nuclei are identified and segmented to collect the cell-level features, such as cell size, shape, distribution, and texture of nuclei. Other cell features such as regional occupation, area fraction, and neighboring architecture are also captured. The above process generates 832 features for each subimage. A further screening is conducted to exclude irrelevant features such as file size and execution information. Finally, for each subimage, a total of 772 features are available for analysis. In the third step, for each slide, the average feature values over twenty representative subimages are computed. When a subject has multiple slides, the average values over multiple slides are further computed for downstream analysis. The above preprocessing is applied to both LUAD and LIHC data. There are missing values in imaging feature data. Subjects with more than 25% missing imaging features are excluded. The remaining missing values are imputed using sample medians. Remark 1. We adopt CellProfiler to extract imaging features, which is a popular choice in recent literature. The feature names are automatically provided by CellProfiler. The extracted features represent objective attributes of histopathological images, including the area and perimeter of nucleus and cytoplasm, mean and standard deviation of these measures, and other general image attributes. For each patient, multiple slides and subimages may be available. To extract as much information as possible, we consider multiple slides and subimages simultaneously, and the average values are used to summarize information. A closer examination suggests that they are not explicitly associated with specific histopathological findings, such as cellularity, atypia, anaplasia, nuclear pleomorphism, and some others. However, these attributes have been shown to be associated with pathological stages [[57]7,[58]10] and prognosis [[59]21], and also been used in multiple existing cancer modeling studies [[60]13,[61]15]. For example, Yu et al. [[62]22] show that some of these features are associated with specific subtypes of lung cancer. In this study, our goal is to rigorously quantify independent information in omics profiles and histopathological images for cancer prognosis, as opposed to focusing on specific interpretation of imaging features. As such, these features are sensible for our analysis, although they may not have simple, explicit biological interpretations. 2.3. Available Data After data matching, we obtain records on 316 and 358 subjects for LUAD and LIHC, respectively, with 5000 gene expression measurements, 772 imaging features, and survival time. For LUAD, there are 103 deaths during follow-up, with survival times ranging from 0 to 214.77 months (median 6.03 months). For LIHC, there are 121 deaths during follow-up, with survival times ranging from 0 to 120.73 months (median 19.25 months). Summary information of the analyzed subjects is provided in [63]Table 1. Table 1. Sample characteristics. Characteristic LUAD LIHC Sample size 316 358 Age at diagnosis: median (range) 66 (39–88) 61 (16–90) Follow-up: median (range) 6.03 (0–214.77) 19.25 (0–120.73) Vital status: n (%) Alive 213 (67.4%) 233 (65.0%) Deceased 103 (32.6%) 125 (35.0%) Sex: n (%) Male 144 (45.6%) 242 (67.6%) Female 172 (54.4%) 116 (32.4%) Cancer stage: n (%) I 180 (57.0%) 166 (46.4%) II 69 (18.7%) 82 (22.9%) III 41 (13%) 83 (23.2%) IV 21 (6.6%) 5 (1.4%) NA 5 (1.6%) 22 (6.1%) [64]Open in a new tab 3. Methods For describing cancer survival, we use the accelerated failure time (AFT) model. This model has been extensively adopted in cancer studies with high-dimensional variables because of its lucid interpretations and low computational cost [[65]23]. For each gene expression, we consider the prognosis model with its effect along with all imaging features. A statistically rigorous test is then conducted on the gene expression’s effect, which can suggest whether this particular gene has independent information for prognosis conditional on the imaging features. Then a parallel analysis is conducted, reversing the roles of gene expressions and imaging features. With a special emphasis on omics and imaging features, clinical factors are not included in the prognosis models. Consider n independent subjects. For the ith subject, denote [MATH: xi=(< /mo>xi1, xx2 , , xip) :MATH] and [MATH: zi=(< /mo>zi1, zi2 , , ziq) :MATH] as the p- and q-dimensional vectors of gene expressions and imaging features, [MATH: Ti :MATH] and [MATH: Ci :MATH] as the logarithm of the event and censoring times. With right censoring, we observe [MATH: (yi =min(Ti,Ci),δi=I(Ti< /mi>Ci) ,xi,zi) :MATH] . First consider the analysis with one gene expression and all imaging features. The analysis with one imaging feature and all gene expressions can be conducted in a similar manner and will not be reiterated. For the jth gene expression, consider the following AFT model: [MATH: Ti=α+< /mo>xijβj+ziθj+εi :MATH] (1) where [MATH: α :MATH] is the intercept, [MATH: βj :MATH] and [MATH: θj= (θj1 ,,θjq)< /mrow> :MATH] are the unknown coefficients for the jth gene expression and all imaging features, and [MATH: εi :MATH] is the random error. A statistical test [MATH: H0: < mi>βj=0 vs. H1: < mi>βj0 :MATH] can reveal whether [MATH: xij :MATH] is independently associated with [MATH: Ti :MATH] given [MATH: zi :MATH] . Here, loosely speaking, a smaller p value can indicate a stronger association/more prognostic power. The analysis is challenging with the high dimensionality of imaging features, which makes the “standard” estimation and inference techniques not applicable. To tackle this problem, we consider a high-dimensional inference approach [[66]24] recently developed under a related but simpler context. Specifically, for estimation, the weighted least squares approach is adopted. Assume that data have been sorted according to [MATH: yi :MATH] ’s from the smallest to the largest. Then we have the Kaplan–Meier weights defined as [MATH: w1=δ1n, wi=yini+1j=1< /mn>i1< mrow>(nj< /mrow>nj+1)δj,i=2,,n, :MATH] where a further normalization is conducted to make [MATH: i=1< /mn>nwi=n :MATH] . Let [MATH: W=diag{ w1 ,w< /mi>2,w3,,wn} :MATH] , [MATH: x·j :MATH] , [MATH: z·k :MATH] , and [MATH: y :MATH] denote the vectors composed of [MATH: xij :MATH] ’s, [MATH: zik :MATH] ’s, and [MATH: yi :MATH] ’s, which are weighted normalized such that [MATH: 1Wx·j=0, 1Wz·k=0, and 1Wy=0 :MATH] , and [MATH: Z=(< mi mathvariant="bold-italic">z·1,,z·q) :MATH] . With the high data dimensionality, regularized estimation is needed. In addition, not all imaging features are expected to be associated with survival, posing a variable selection problem. Consider the weighted penalized estimation: [MATH: {(β˜j*,θ˜j)=argmin< msub>βj,θR1+q{12n< msup>W12(yx·jβjZθj)22+λ< /mi>0k=1< /mn>q|θjk|}b˜j =argminbjRq< /mi>{1< mn>2nW12(x·jZbj)22+λ< /mi>1k=1< /mn>q|bjk|}, :MATH] (2) where [MATH: ·2 :MATH] denotes the L[2] norm, [MATH: bj= (bj1,,bjq) :MATH] is the unknown coefficient vector, and [MATH: λ0 :MATH] , [MATH: λ1 :MATH] > 0 are data-dependent tuning parameters for Lasso penalties. With [MATH: θ˜j :MATH] and [MATH: b˜j :MATH] , the final estimate [MATH: β˜j :MATH] of [MATH: βj :MATH] is defined as [MATH: β˜j=(x˜.j Wx.j)1x˜.j W(yZθ˜j), :MATH] (3) where [MATH: x˜.j =x·jZb˜j :MATH] . It has been shown in [[67]24] that [MATH: n(β˜jβj) :MATH] is asymptotically normal with variance defined as [MATH: Σ˜(< mover accent="true">β˜j) =1n(x˜.j Wx.j)1Σ1˜(< mi mathvariant="bold-italic">x.jWx˜.j )1 , :MATH] where [MATH: Σ1˜ :MATH] is the sample variance based on [MATH: x˜ij(yixijβ˜j+ziθ˜j) :MATH] . With this asymptotic normal distribution, the test statistic for [MATH: H0: βj=0 :MATH] can be defined as [MATH: β˜j/Σ˜(< mover accent="true">β˜j) /n, :MATH] which follows Student’s T distribution. The unadjusted p value can then be obtained. When all gene expressions are considered together, to accommodate multiple comparisons, p values are adjusted using the voxel-level false discovery rate approach [[68]25]. An advantage of the above analysis is that it can be realized via simple coding. The most challenging step is the estimation in (2), which can be realized using the R function glmnet. The tuning parameters [MATH: λ0 :MATH] and [MATH: λ1 :MATH] are selected using the EBIC approach [[69]26]. To facilitate data analysis in and beyond this study, we have developed R code implementing the proposed approach. To illustrate its usage, we have also provided an example R file with the LUAD data. The code and data are publicly available at [70]http://www.github.com/shuanggema/TestLDHD as well as in [71]Supplementary Materials. Remark 2. The effectiveness of the AFT model for cancer prognosis modeling has been well tested. Penalization has been shown effective for screening out irrelevant variables and accommodating high dimensionality. As shown in [[72]24], the estimation (2) can effectively “single out” the effect of the one gene expression. It is noted that, as the gene expression effect is of particular interest, its coefficient is not subject to penalization in estimation. A “byproduct” of penalized estimation is that imaging features associated with prognosis, conditional on the one gene expression effect, are identified, which may assist understanding the associations between imaging features and prognosis as well as the associations between imaging features and gene expressions. The statistical distribution result for the test statistic has been rigorously proved in [[73]24]. The T distribution makes inference very easy. 4. Results 4.1. Identification of Gene Expressions with Independent Prognostic Power Conditional on Imaging Features We first apply the analysis approach described above to identify individual gene expressions that have independent prognostic power conditional on the high dimensional histopathological imaging features. With significance level 0.05 as cutoff, 85 genes are identified as significantly associated with prognosis for LUAD. Detailed estimation results are provided in the [74]Table S1. A quick literature search of PubMed suggests that many of the findings have sound biological basis. For top ten representative genes, we provide brief information and references on their associations with lung cancer prognosis in