Abstract mRNA expression profiles provide important insights on a diversity of biological processes involved in rectal carcinoma (RC). Our aim was to comprehensively map complex interactions between the mRNA expression patterns and the clinical traits of RC. We employed the integrated analysis of five microarray datasets and The Cancer Genome Atlas rectal adenocarcinoma database to identify 2118 consensual differentially expressed genes (DEGs) in RC and adjacent normal tissue samples, and then applied weighted gene co-expression network analysis to parse DEGs and eight clinical traits in 66 eligible RC samples. A total of 16 co-expressed gene modules were identified. The green-yellow and salmon modules were most appropriate to the pathological stage (R = 0.36) and the overall survival (HR =13.534, P = 0.014), respectively. A diagnostic model of the five pathological stage hub genes (SCG3, SYP, CDK5R2, AP3B2, and RUNDC3A) provided a powerful classification accuracy between localized RC and non-localized RC. We also found increased Secretogranin III (SCG3) expression with higher pathological stage and poorer prognosis in the test and validation set. The increased Homer scaffolding protein 2 (HOMER2) expression with the favorable survival prediction efficiency significantly correlated with the markedly reduced overall survival of RC patients and the higher pathological stage during the test and validation set. Our findings indicate that the SCG3 and HOMER2 mRNA levels should be further evaluated as predictors of pathological stage and survival in patients with RC. Keywords: rectal cancer, The Cancer Genome Atlas, weighted gene co-expression network analysis, prognosis, disease progression INTRODUCTION Accurate staging of rectal cancer (RC) is essential for carrying out precise therapies that increase the survival rate of patients [[30]1]. Previous studies have found no differences in specific clinicopathological risk factors between patients with a high or low risk of RC progression [[31]2]. Recent technological breakthroughs in genome-wide sequencing have shed new insights on deregulated mRNAs that have been identified and characterized in the past few years [[32]3]. Genetic biomarkers for RC, especially for modules having a strong correlation between genes with similar expression patterns, to predict the pathological stage and survival outcome, which could help to understand disease pathogenesis and provide personalized treatment, have been rarely reported. However, previous research based on examining genetic mutations and different expression patterns associating with colorectal carcinogenesis has largely ignored the relationship between the genes and clinical characteristics [[33]4, [34]5]. In addition, RC progression involves several critical stages, although most studies have only evaluated the differences between RC and adjacent normal tissue (ANT), regardless of the intermediate stages [[35]6]. Therefore, there is an urgency to include stage and prognostic predictive modules to the current staging system, which could be achieved by combining information on the clinical characteristics and validated gene-specific biomarkers. For these reasons, we aimed to establish comprehensive mRNA expression patterns of the module genes and clinical traits, especially those relating to multi-stage disease progression which directly affects the prognosis through weighted gene co-expression network analysis (WGCNA) [[36]7]. WGCNA results in the construction of free-scale gene co-expression networks that explore the relationships between the gene sets and clinical features [[37]8]. WGCNA groups prerogative genes into modules based on their co-expression similarities and analogous functions across a population of samples [[38]9–[39]11]. In the present work, we carried out the pooled analysis of RC mRNA raw microarrays and the network level analysis of The Cancer Genome Atlas (TCGA) rectal adenocarcinoma (READ) data to decipher the relationships between the module genes and clinical traits. To the best of our knowledge, we are the first group to use meta-analysis and WGCNA to identify modules displaying a nominal evidence association with RC clinical traits. By characterizing module content and topology, we identified clinical traits, modules, and network concepts that play important roles in the regulation of RC at the level of the differentially expressed genes (DEGs). Moreover, to define the prognostic value of the cancer-specific module, which was related to tumor progression, further analysis was performed to validate candidate markers by combining survival analysis with an independent validation cohort. RESULTS Identification of consensual DEGs in training and test sets in rectal cancer patients The flow chart outlining the methods used in this study is shown in [40]Supplementary Figure 1. In contrast to colon cancer, there was very limited mRNA expression data relating to rectal cancer. We investigated and manually curated four public datasets ([41]GSE12225, [42]GSE34472, [43]GSE35982, and [44]GSE75548) as the training set ([45]Supplementary Table 1) [[46]12–[47]15]. The training set calculated the six quantitative quality control (QC) measures by standardized mean ranks and principal component analysis (PCA) biplots within MetaQC (Table [48]1 and Figure [49]1A) [[50]16]. [51]GSE34472 was detected in low quality RC samples and was omitted from the training set [[52]17]. The training set finally consisted of 65 RC samples and 42 ANTs, and it had 8153 gene symbols in common. A total of 6603 gene symbol sets passed the filtering criteria. We identified 4091 mRNAs that were consistently DEGs using the moderated t test, Fisher’s method by summarizing -log(p-value) across studies and running 300 permutations for the meta-analysis to infer the P-values (Figure [53]1B) [[54]18]. As expected, hierarchical clustering of the three datasets using the remaining 4091 DEGs distinguished RC from ANT samples (Figure [55]2A). Table 1. Quality control results of RC in the different datasets. No. Study IQC EQC CQCg CQCp AQCg AQCp Rank 1 [56]GSE75548 3.92 2 1.8* 2.23 1.9* 1.13* 1.83 2 [57]GSE12225 4.74 2 0.68* 1.66* 1.08* 2.38 2.00 3 [58]GSE35982 5.62 2 0.02* 10.03 0* 0.07* 2.67 4 [59]GSE34472 1.3* 0.4* 0.24* 0.36* 0.56* 0.11* 3.50 [60]Open in a new tab RC, rectal cancer; GSE, GEO dataset; IQC, internal quality control indexes; EQC, external quality control indexes; CQCg and CQCp, consistency of differential expression quality control indexes for genes and pathways; AQCg and AQCp, accuracy quality control indexes for genes and pathways. * P-value not significant after Bonferroni correction. Figure 1. Meta-analysis of differentially expressed genes involved in rectal cancer by combining P-values. [61]Figure 1 [62]Open in a new tab (A) Principal component analysis (PCA) biplot of quality control measures in four RC studies. (B) The number of differentially expressed genes plotted as a function of false discovery rate (FDR) in the analysis of four different datasets and four different meta-analysis algorithms (maxP, Fisher roP and adaptively weighted statistic). Figure 2. Identification of consensus DEGs in the training and test sets of rectal cancer patients. [63]Figure 2 [64]Open in a new tab (A) Heat map and two-way hierarchical clustering based on 4091 DEGs that were differentially expressed between RC and ANT samples of the training set. ANT (green label) and RC (red label) samples fell into separate clusters. (B) The 2177 DEGs RC (red label) vs. ANT (green label) of the TCGA-READ test set. Each column represents a sample, whereas each row represents the mRNA expression level. The color scale represents the raw Z-score ranging from blue (low expression) to red (high expression). Dendrograms beside each heat map correspond to the hierarchical clustering of the 2177 DEGs by the expression level. (C) PCA plot showing complete, unsupervised separation of the 105 array samples into 95 RC (red) and 10 ANT (green) samples. (D) A Venn diagram showing the overlap of DEGs detected by the training and test sets. The raw level-3 RNAseq data from 20,501 mRNAs in 95 RC samples and 10 ANT samples were downloaded from TCGA [[65]1, [66]8, [67]10]. A total of 2177 DEGs were identified by linear models within the microarray analysis (LIMMA), among which 1092 were up-regulated and 1085 were down-regulated in RC versus ANT ([68]Supplementary Table 2). This mRNA signature allowed for the separation of RC samples from ANT samples in the 2-way hierarchical cluster (Figure [69]2B) and in the PCA plot (Figure [70]2C). A total of 2118 overlapping DEGs were chosen by both training and test sets (Figure [71]2D). Co-expression network construction and module preservation analysis We included 2118 DEGs from 70 RC patients with complete clinical traits and prognostic information of the TCGA-READ test set to construct co-expression networks via WGCNA. After four outlier samples were discarded, the connections between the genes in the gene network were in line with a scale-free network distribution when the soft threshold power β was set at 3 ([72]Supplementary Figure 2A-2D). The dynamic tree cut method identified modules with similar expression profiles (Figure [73]3A, 3B). After highly similar modules were merged (Figure [74]3B), a total of 16 co-expressed modules were identified, ranging in size from 11 to 623 genes, whereas the “grey” module was reserved for genes that were not co-expressed (Figure [75]3C). Figure 3. Network construction of the weighted co-expressed genes and their associations with clinical traits. [76]Figure 3 [77]Open in a new tab (A) Hierarchical clustering tree of the TCGA-READ samples based on the DEGs. Dendrogram tips are labeled with the TCGA-READ unique name and experiment identifier. In the hierarchical dendrogram, lower branches correspond to higher co-expression (height = Euclidean distance). Identical colors in the eight bands below the dendrogram depict the TCGA-READ clinical traits. (B) Heat map view of topological overlap of co-expressed genes in different modules. The heat map was generated from the topological overlap values between genes. The genes were grouped into modules labelled by a color code, which are given under the gene dendrogram on both sides. The topological overlap was high among genes of same module. (C) Module-trait relationships for age at diagnosis, gender, histological type, lymphatic invasion and pathologic stage. Numbers shown represent Pearson correlations between the modules and traits. P-values are in parentheses. Numbers on the color bar refer to the strength of the correlation in the table (red = 1, green = -1). (T: extent of the tumor; N: extent of spread to the lymph nodes; M: presence of metastasi). After comparing the TCGA-READ test set with the validation set ([78]GSE29621) [[79]19], the summary preservation statistics used in determining whether a reference network is found in another test network were visualized [[80]20]. The green-yellow and salmon modules were well preserved, with low median Rank statistics and Z-summary statistics larger than 10 ([81]Supplementary Figure 3) [[82]7, [83]21]. Identification of clinicopathological modules It is important to identify modules that have the most significant associations with clinical features. We sought to explore whether any of the groups of genes from each of the identified modules were correlated with the clinical variables of RC. The list of genes in each module is presented in [84]Supplementary Table 3. The magenta (94 genes) module yielded significant Pearson’s correlation coefficient (PCC) with the pathological stage (R = 0.32, P = 0.003) and the pathology N stage (R = 0.32, P = 0.008). The salmon (41 genes) module yielded significant PCC with the pathological stage (R = 0.25, P = 0.04) and pathology T stage (R = 0.33, P = 0.007). The red (126 genes) module yielded significant PCC with the pathological stage (R = 0.37, P = 0.002), the pathology T stage (R = 0.29, P = 0.02) and pathology N stage (R = 0.31, P = 0.01). The tan (44 genes) module yielded significant PCC with the pathological stage (R = 0.31, P = 0.01), the pathology T stage (R = 0.26, P = 0.03), pathology N stage (R = 0.38, P = 0.002), pathology M stage (R = 0.32, P = 0.009) and lymphatic invasion (R = 0.27, P = 0.03). The green-yellow (54 genes) module yielded significant PCC with the pathological stage (R = 0.38, P = 0.002), the pathology T stage (R = 0.34, P = 0.006), pathology N stage (R = 0.36, P = 0.006) and lymphatic invasion (R = 0.26, P = 0.04) (Figure [85]3C), which was used as the pathological stage module in subsequent analyses. These results suggested that the highly co-expressed genes in the same module have potential biological significance [[86]6]. Each module might represent specific clinical features of RC patients [[87]3, [88]22]. Identification of overall survival modules We also explored the significant associations of these modules to overall survival given their biological importance. We applied the Cox regression model in 66 RC patients with complete survival data to calculate the HRs and corresponding P-values for each dichotomized module (Table [89]2). The salmon module, which was defined as the overall survival module within the sixteen merged modules, was shown to have significant associations with the prognosis of RC patients. The upregulated expression of the genes within the salmon module indicated poor outcomes of overall survival (HR = 13.534, P = 0.014). From the Kaplan-Meier curves, we also found that the decreased expression of genes within the salmon module indicated better outcomes of RC patients in the salmon module (P = 0.019) (Figure [90]4A). Table 2. Correlation between gene co-expression modules and overall survival. Overall survival HR CI (95% Cl) P-value ME black 0.258 0.054-1.220 0.087 ME blue 2.075 0.596-7.224 0.252 ME brown 0.448 0.115-1.747 0.248 ME cyan 1.245 0.375-4.146 0.721 ME green 0.736 0.206-2.628 0.637 ME green-yellow 2.307 0.605-8.791 0.221 ME grey 1.232 0.376-4.041 0.731 ME magenta 1.080 0.324-3.600 0.900 ME midnight-blue 1.868 0.545-6.396 0.320 ME pink 0.913 0.274-3.046 0.883 ME purple 0.234 0.049-1.106 0.067 ME red 2.802 0.742-10.579 0.129 ME salmon 13.534 1.712-106.958 0.014 ME tan 4.242 0.912-19.725 0.065 ME turquoise 2.487 0.658-9.400 0.180 ME yellow 1.440 0.433-4.790 0.552 [91]Open in a new tab Figure 4. Enrichment analysis and sub-network genes of the green-yellow and salmon modules. [92]Figure 4 [93]Open in a new tab (A) Survival analysis based on the gene expression pattern in the salmon module. Impact of the expressed genes in the salmon module on the overall survival. (B) GO enrichment and KEGG analysis for the 54 module genes included in the green-yellow module. The original significance was transformed to “ – log[10](P)” to plot the curve. (C) Visualization of the network connections among the most connected genes in the green-yellow module. Lines connecting two genes represent an association between the genes, and the color of the connecting line represents the weighted value of the two genes in the network. The colors of the nodes corresponded to the weighted value of the modular assignments. Elliptic and sexangular nodes indicate upregulated or downregulated sub-network genes in the modules. The nodes with a bold circle represent network hub genes identified by further analysis. (D) GO enrichment and KEGG analysis for the 41 module genes included in the salmon module. (E) Visualization of the network connections among the most connected genes in the salmon module. The results of the survival analysis of the salmon module were consistent with their biological significance found by PCC. The salmon module was previously shown to correlate with the pathological stage (PCC = 0.25, P = 0.04) and the pathological T stage (PCC = 0.33, P = 0.007). Thus, the high expression of genes within the salmon module might represent a higher pathological stage and the T stage of RC with poor prognosis. Enrichment analysis and sub-network genes of green-yellow and salmon modules We focused on the green-yellow module and carried out enrichment analysis because this module correlated strongly with the clinical features of RC patients, especially with the pathological stage. However, other modules showed weaker correlations with the phenotypic characteristics of RC. Interestingly, the green-yellow module was significantly enriched for cell adhesion for Gene Ontology (GO) and Kyoto Encyclopedia of Genes and Genomes (KEGG) (Figure [94]4B). We used the molecular complex detection (MCODE) algorithm ([95]http://apps.cytoscape.org/apps/mcode) to analyze a subset of the co-expression network (threshold = 0.4). When the node density cut-off was set at 2, the node score cut-off at 0.2, the k-core at 2, and the maximum depth at 100, the rank 1 cluster was identified. We found that there were 15 stage sub-network genes in this cluster (CHGB, SCG3, SYP, SNAP25, SCN3B, C19orf30, LRRC55, FMO2, CDK5R2, AP3B2, FAM123C, CTXN2, RUNDC3A, IL1F5, and MATN3) (Figure [96]4C). The GO and KEGG pathway enrichment analysis of the genes within the salmon module showed that cancer (P = 0.046) and calcium-mediated signaling (P = 0.039) pathways were significantly affected when the condition of P < 0.05 was applied (Figure [97]4D). We utilized the MCODE algorithm to obtain seven sub-network genes which were defined as overall survival hub genes in the rank 1 cluster (GLT25D2, LRRTM1, ARMC4, CSMD3, SUSD4, HOMER2, and PTPRZ1) (Figure [98]4E). Further screening and identification of pathological stage candidate biomarkers by ANOVA, survival and ROC curve analysis ANOVA was carried out to determine the 15 pathological stage sub-network genes that were expressed separately and differentially. Meanwhile, the difference between localized RC (pathological stages I and II) and non-localized RC (pathological stages III and IV) by an independent t test for each of the 15 sub-network hub genes was determined. As a result, seven pathological stage hub genes (SCG3, SYP, SNAP25, CDK5R2, AP3B2, FAM123C, and RUNDC3A) that had a P-value less than 0.05 were designated to be significantly expressed ([99]Supplementary Figure 4). To further investigate whether the seven pathological stage hub genes correlated with the survival of RC patients, Kaplan-Meier analysis revealed that the five pathological stage hub genes (SCG3, SYP, CDK5R2, AP3B2, and RUNDC3A) with the highest levels significantly correlated with the markedly reduced overall survival of RC patients (Figure [100]5A), suggesting the important roles of five pathological stage hub genes in progress and prognosis of RC patients. Five pathological stage hub genes provided a high classification accuracy between localized RC and non-localized RC, which was estimated using receiver operating characteristic (ROC) curve analysis. AUC values for five pathological stage hub genes were greater than 0.64 in the TCGA-READ test set (Figure [101]5B). We could attain a best performance on accuracy by a univariate linear regression model built on a panel of the combined 5 pathological stage hub genes (AUC = 0.744): risk score = 0.369 × SCG3 + 0.296 × SYP + 0.128 × RUNDC3A -0.073 × CDK5R2 + 0.115 × AP3B2 -3.55 (Figure [102]5B). Figure 5. Further screening and validation of pathologic stage candidate biomarkers by survival and ROC curve analyses. [103]Figure 5 [104]Open in a new tab (A) Survival analysis of the five hub genes in the TCGA-READ test set. For the survival curves of the patients in different groups, solid red lines represent the high expression of hub genes and solid black lines represent the low expression of hub genes. (B) ROC analysis of the five hub genes in the TCGA- READ test set. Receiver operating characteristic (ROC) curve and area under the curve (AUC) statistics to evaluate the diagnostic efficiency of the hub genes in the TCGA-READ test set, which distinguished between localized and non-localized RC patients. (C) Boxplots of SCG3 mRNA expression across different pathologic stages in the validation set. (D) Survival analysis of SCG3 mRNA expression in the validation set. (E) ROC analysis of SCG3 mRNA expression in the independent validation set. Thus, five pathological stage hub genes were selected as the candidate genes for further validation. ANOVA, survival and ROC curve analysis were conducted to validate the five pathological stage hub genes in the validation set ([105]GSE29621). Only SCG3 with a P-value less than 0.05 were declared to be significantly expressed in four pathological stages of RC (ANOVA test, P = 0.043) (Figure [106]5C). In addition, we carried out a single-gene survival analysis in order to reveal which genes were most significantly associated with overall survival. Unfortunately, we only found an increased expression of the SCG3 (HR 2.381 [95 % CI 1.01 – 5.88], P = 0.043) with poor prognosis (Figure [107]5D). SCG3, which was chosen as the pathological stage candidate biomarker, provided a high classification accuracy between localized RC and non-localized RC, which was estimated using ROC curve analysis (Figure [108]5E). Further screen and validation of novel overall survival candidate biomarkers To further narrow the overall survival candidate biomarkers that harbor great significance among the declared seven overall survival hub genes, we chose to use the survival and ROC curve analysis to summarize the expression patterns of the hub genes. Survival analysis was performed for each overall survival hub gene. The patients were dichotomized into two equal groups by the expression level of the overall survival hub genes. Survival analysis for the seven overall survival hub genes in the salmon module was conducted for 93 RC patients, and we found that only HOMER2 significantly associated with patient overall survival in the TCGA-READ set ([109]Supplementary Figure 5). Kaplan-Meier analysis revealed that the high HOMER2 level significantly correlated with the markedly reduced overall survival of RC patients (HR 7 [95% CI 1.03 − 52.82], P = 0.028) (Figure [110]6A), suggesting the important roles of HOMER2 in the prognosis of RC patients. The sensitivity and specificity of the HOMER2 expression level on the survival outcome was assessed by ROC curve analysis, and the area under the ROC curve was used to evaluate survival prediction efficiency of HOMER2. The AUC value of HOMER2 was 0.673, which was close to 0.7 (Figure [111]6B). Thus, HOMER2 was chosen as the overall survival candidate biomarker for further validation to determine whether there was a significant correlation between HOMER2 and the prognosis of RC patients. ANOVA, survival and ROC curve analysis were carried out to confirm HOMER2 in the validation set ([112]GSE29621). Only HOMER2 had a P-value less than 0.05 and was determined to be significantly differentially expressed in the four pathological stages of RC (ANOVA test, P = 0.008) (Figure [113]6C). In addition, we carried out a single-gene survival analysis to reveal which genes were most significantly associated with overall survival. Unfortunately, we only found the increased expression of HOMER2 (HR 2.632 [95 % CI 1.163 – 6.840], P = 0.015) to associate with poor prognosis (Figure [114]6D). Figure 6. Further screening and validation of novel overall survival candidate biomarkers by survival and ROC curve analyses. [115]Figure 6 [116]Open in a new tab (A) Survival analysis of HOMER2 mRNA expression in the TCGA-READ test set. For survival curves of patients in different groups, solid red lines represent high expression and solid black lines represent low expression. (B) ROC analysis of HOMER2 mRNA expression in the TCGA-READ test set. Receiver operating characteristic (ROC) curve and area under the curve (AUC) statistics to evaluate the diagnostic efficiency of HOMER2 mRNA expression in the TCGA-READ test set, which distinguished between high risk and low risk RC patients. (C) Boxplots of HOMER2 mRNA expression across different pathologic stages in the validation set. (D) Survival analysis of HOMER2 mRNA expression in the validation set. (E) ROC analysis of HOMER2 mRNA expression in the independent validation set to distinguish between high risk and low risk RC patients. The sensitivity and specificity of the expression level of HOMER2 on the survival outcome were assessed by ROC curve analysis, and AUC was used to evaluate survival prediction efficiency of HOMER2. The AUC value of HOMER2 was 0.685 (Figure [117]6E). DISCUSSION Gene signatures identified from genome-based assays are known to contribute to RC stratification [[118]23]. Numerous studies have defined in part the gene signatures predicting the survival outcome or recurrence of RC [[119]24–[120]34]. In this study, six mRNA datasets were subjected to meta-analysis and WGCNA to identify hub genes associating with clinical characteristics as well as RC progression and prognosis [[121]4, [122]5, [123]35]. We identified sixteen distinct modules from 2118 consensus genes that passed the meta-analysis filtering criteria and TCGA-READ validation for WGCNA [[124]36]. Among the identified modules, the magenta, red, tan, and green-yellow modules associated with the pathological stage, especially the green-yellow module which had the strongest correlation with the pathological stage [[125]12, [126]37, [127]38]. In addition, the brown module correlated with the age of diagnosis. Correlations were also found between the red, tan, and green-yellow modules and pathological stages T and N, suggesting that the highly co-expressed genes within the same module were of similar biological significance. Given that the biological significance of these modules might be related to the potential clinical manifestations, we performed Cox regression analysis for each module to determine their survival outcomes [[128]1, [129]11, [130]39, [131]40]. Poor outcomes were found for the high expression group of the salmon module, which was consistent with previous results indicating that the high expression of the salmon module might represent a higher pathological stage and the T stage [[132]2]. The green-yellow and salmon modules were found to correlate with the pathological stage and overall survival, respectively. Thus, we selected the green-yellow module for the subsequent analysis because this module is likely to represent tumor staging characteristics more accurately [[133]41]. Cell adhesion pathways suggested by GO and KEGG were over-represented in the green-yellow module. To some extent, these results also partially explained the increasing stages of RC patients. By means of one-way ANOVA and an independent sample t test, seven pathological stage hub genes (SCG3, SYP, SNAP25, CDK5R2, AP3B2, FAM123C, and RUNDC3A) could effectively distinguish non-localized RC from localized RC. Furthermore, the expression levels of five pathological stage hub genes (SCG3, SYP, CDK5R2, AP3B2, RUNDC3A) significantly associated with both the pathological stage and overall survival, but only SCG3 mRNA expression was replicated in the validation set. The diagnosis model of the combined five pathological stage hub genes, which provided a high classification accuracy, might be good biomarkers for distinguishing between localized and non-localized RC (Figure [134]5B). SCG3, a member of the multifunctional granin family, played a key role in secretory granule biogenesis, which involves the cellular uptake of endogenous and exogenous toxins [[135]42]. SCG3 was identified as the most sensitive and specific marker for circulating tumor cells in small cell lung cancer and was indicative of a worse survival outcome. As is well known, response to standard chemotherapy is important in determining survival. SCG3 was also evident in patients with poor response to standard chemotherapy [[136]24]. There is potential mechanistic association between SCG3 expression in tumors and response to platinum-based therapy or topoisomerase II inhibitors [[137]24]. As RE-1 silencing transcription factor is a transcriptional repressor in cancer, higher expression of SCG3 mRNA may increase the aggressive potential of the tumor or reduce the drug sensitivity of RE-1 silencing transcription factor depleted tumors. For the first time, we show SCG3 as a biomarker of RC pathological stage and prognosis. Further analysis identified the overall survival candidate biomarkers from the salmon module, thus demonstrating the significant association with survival in the test and validation sets. Among the seven hub genes related to overall survival, increased HOMER2 expression associated with the increased pathological stage and poor survival of RC patients. Moreover, such expression may be useful in evaluating the survival risk because the AUC, which was used to evaluate the survival prediction efficiency of HOMER2, was 0.673 and 0.685 in the test and validation set, respectively. HOMER2 gene encodes a member of the homer family of dendritic proteins and regulates group 1 metabotrophic glutamate receptor function. HOMER2 is also a promising biomarker for cancer prognosis. HOMER2 is known to associate with overall survival and disease-free survival in early stage non-small cell lung cancer [[138]43]. HOMER2 which was identified as a binding partner of MYO18B, interacted with the C-terminal region of MYO18B, a candidate tumor suppressor gene involved in the pathogenesis of human cancers including colorectal cancer [[139]44]. Additionally, methylation of HOMER2 was reported to be a valuable biomarker which significantly discriminated CRC patients from controls [[140]45]. In addition, we tentatively estimated between-studies heterogeneity in effect sizes of twelve hub genes ([141]Supplementary Figure 6), especially SCG3 and HOMER2 (Figure [142]7A, 7B). Only one gene (SUSD4) had significant between-studies heterogeneity (I^2 = 67.32%) and was ejected because we did not have sufficient information available to explore heterogeneity ([143]Supplementary Table 1). Figure 7. SCG3 and HOMER2 are overexpressed in rectal cancer. [144]Figure 7 [145]Open in a new tab (A) Forest plot of SCG3 expression across all training and test pooled analysis datasets. (B) Forest plot of HOMER2 expression across all training and test pooled analysis datasets. The x-axis is the standardized mean difference between rectal cancer (RC) and adjacent normal tissue (ANT) on a log2 scale. Thus, a value of 1 signifies a 2-fold difference in gene expression between cancer and normal. To our knowledge, this is the first study investigating the relationship between the co-expression network of clinical traits and genes in patients with RC [[146]46]. In summary, we identified sixteen gene co-expression modules from five RC datasets using meta-analysis and WGCNA [[147]47]. We associated a number of these network modules to RC clinicopathological variables, as well as to overall survival, and uncovered the gene expression signature associated with RC pathology stage and overall survival. We identified several significant pathways, as well as five potential pathological stage hub genes and seven overall survival hub genes using MCODE. Utilizing the TCGA-READ dataset, we constructed a diagnosis model of five mRNA signatures, and the validation cohort confirmed that the panel may be a useful biomarker for prognosis in RC. Of greatest interest, we found that the increased SCG3 and HOMER2 expression associated with the increased pathological stage and poor survival in the test and validation set among RC patients, providing two useful markers of RC and suggesting that it may help to identify those with more aggressive disease. Nevertheless, multicenter randomized controlled studies and in vivo and in vitro experiments are still required to evaluate the possible application of the molecular signatures for survival prediction and to characterize the hub genes functionally for the application of this approach in specific clinical settings. MATERIALS AND METHODS Data collection, preprocessing, and normalization A public microarray repository was curated to search through PubMed, Gene Expression Omnibus (GEO) (accession numbers [148]GSE75548, [149]GSE34472, [150]GSE35982, [151]GSE12225, and [152]GSE29621), ArrayExpress (accession number E-GEOD-34472), and TCGA-READ datasets up to September 2016. Only initial experimental studies that screened different mRNAs from RC and adjacent normal tissue samples in humans were included. After writing off the duplicated datasets, the combined datasets ([153]GSE75548, [154]GSE34472, [155]GSE35982, [156]GSE12225, and TCGA-READ) containing 168 RC and 60 normal rectal tissue samples were generated ([157]Supplementary Table 1). The raw datasets were preprocessed individually using the LIMMA software package with log[2] transformation and annotated by converting different probe IDs to the respective gene symbols. Duplicate gene expression values were averaged. Integrated analysis of the gene expression datasets of the training set To make the five microarray datasets derived from the five different platforms compatible for data analysis, we used the MetaOmics software package ([158]http://www.pitt.edu/∼tsengweb/MetaOmicsHome.htm) to integrate and analyze the GEO datasets [[159]48]. Firstly, the MetaQC software package, which provided a quantitative and objective quality control tool for determining the inclusion/exclusion criteria for the meta-analysis, eliminated the [160]GSE34472 dataset [[161]48, [162]49]. Thus, the training set included three GEO datasets ([163]GSE75548, [164]GSE35982, and [165]GSE12225) after Quality Control. Secondly, to distinguish the DEGs between RC and ANT, the MetaDE software package limited the mean and standard deviation (SD) filter thresholds, which were set at 10% to filter minor changes in gene expression levels [[166]50]. Considering the different stringencies of the various methods, Fisher’s method was favored for the meta-analysis (Figure [167]1B). For Fisher’s method, the modified t test and the permutation method (nPermutations = 300) were used to extrapolate the P-values [[168]51]. P-values less than 0.05 were considered statistically significant for the DEGs. The heatmaps illustrating the DEG patterns were also generated [[169]50]. Integrated-signature gene analysis of TCGA-READ as the test set and consensus screening of the DEGs To further limit analyses to genes common to all datasets, the results of the DEG training set were validated in the TCGA-READ dataset, which was considered as the test set. The TCGA-READ mRNA and clinical data (level 3) of the corresponding patients (RC and ANT) were downloaded from the TCGA data portal (up to May 20, 2016). The TCGA-READ DEGs were analyzed using an empirical Bayes approach within the LIMMA software package. The DEGs of the test set with a |log[2] fold change (FC)| ≥ 0.5 and an adjusted P-value less than 0.05 were selected for subsequent analysis. A total of 2177 gene symbols of the test set passed the filtering criteria. We then created an overlapping gene set by selecting common official gene symbols in both training and test sets, resulting in a total of 2118 intersecting genes. The consensus DEGs, which ensured that RC and ANT samples were well characterized, were chosen for WGCNA [[170]10, [171]37, [172]52]. Preprocessing of clinical information Clinical information obtained from the TCGA-READ dataset (American Joint Committee on Cancer pathological TNM stage, gender, age at initial pathological diagnosis and histological type (mucinous adenocarcinoma or adenocarcinoma), especially vital status and time to last follow-up) for 70 patients was used after eliminating incomplete clinical traits and gene expression. Weighted gene co-expression network construction and module preservation analysis The TCGA-READ dataset with intact clinical features and prognostic information was selected for constructing the scale-free gene co-expression networks within the WGCNA software package [[173]8]. Firstly, the appropriate soft threshold power was automatically estimated and generated as described for the standard scale-free networks. In this case, the power of β, which was set at 3 (scale-free R^2 = 0.89), was auto-selected. Moreover, the weighted adjacency matrix was constructed using the power function ADJ[mn] = |COR[mn]|^β (COR[mn] = Pearson’s correlation between gene m and gene n; ADJ[mn] = adjacency between gene m and gene n). β was the soft thresholding parameter, which was used to transform adjacencies and correlations into a Topological Overlap Matrix (TOM), and then the corresponding dissimilarity (1-TOM) was calculated. Finally, module identification was carried out with the dynamic tree cut method by hierarchically clustering the genes using 1-TOM as the distance measure with a deep split value of 2 and a minimum size cut-off of 30 for the resulting dendrogram. Highly similar modules were marked by clustering and merged with a height cut-off of 0.25. Module preservation and quality statistics were computed using the module Preservation function (nPermutations = 200) within the WGCNA software package between the TCGA-READ test set and the [174]GSE29621 validation set ([175]Supplementary Table 1) [[176]20]. Identification of clinical feature modules, survival analysis and efficacy evaluation of pathological stage hub genes Module eigengenes (MEs), which are the first principal components in the PCA for each gene module, summarized the expression patterns of all genes into a single characteristic expression profile within a given module. The dynamic decision-making tree, node splitting method and cluster analysis of the squared Euclidean distance were used to identify MEs related to these clinical features, especially those involved in the progression and prognosis of RC. Spearman’s correlation analysis was carried out to confirm the object module, which was the most relevant module between the MEs and clinical traits. Depending on these, the module that had the highest Spearman’s correlation coefficient for the pathological stage and MEs in the object module was defined as the pathological stage module. Hub genes that had been selected in the pathological stage module were obtained using the MCODE algorithm plugin within the Cytoscape software package (version 3.4.0) [[177]53]. Survival analysis of individual modules and efficacy evaluation for survival hub genes For the single module survival analysis, the TCGA-READ test data was dichotomized around the median expression of each gene module. The overall survival was used as the survival endpoint, which was determined via the “survival” R software package. Cox regression analysis was performed to evaluate the hazard ratio (HR). Survival curves of the significant objected module, which was defined as the survival module, were constructed by the Kaplan-Meier method and compared by the log-rank test. Survival hub genes were obtained by MCODE. Survival curves and ROC analysis were designed to identify candidate markers. The results of the survival curves and ROC analysis were next verified on an independent validation set ([178]GSE29621) using the survival hub genes as the candidate marker input to predict the classes of prognosis in RC. Functional annotation and network visualization within pathological stage and survival modules The pathological stage and survival modules were functionally annotated based on the analytical results of their gene compositions. GO functional-related to specific Biological Processes and KEGG pathway enrichment analyses were performed for the object module by using the Database for Annotation, Visualization, and Integrated Discovery (DAVID) [[179]54]. P-values less than 0.05 were used as the cut-off value. Network visualization of the stage and survival modules were carried out within the Cytoscape software package. Moreover, there were strong correlations between the co-expressed module genes and their functions, and MCODE was utilized to provide deeper insights for hub genes. Statistical analysis Significance differences between stage groups were determined by analysis of variance (ANOVA) or the Student’s t test. P-values less than 0.05 were considered statistically significant. The preliminary relationships between the hub genes of stage modules were demonstrated by boxplot graphs. Survival curves were constructed by the Kaplan-Meier method and compared by the log-rank test for the expression level with the stage and survival hub genes for calculating the overall survival. All Cox regression models were tested based on Schoenfeld residuals to evaluate the hazard ratio (HR) for each module. Statistical analysis was performed using the R software package (version 3.2.3). ROC analysis was used to evaluate the diagnostic value on the outcome for the expression level of each stage and survival hub genes in RC via the code of Mihaly ([180]Supplementary File 2) [[181]55]. Using univariate linear logistic regression of the hub genes, a classification model of the combined hub genes was built to evaluate the discriminatory capacity of localized RC and non-localized RC. Finally, the results of the survival curves and ROC analysis were validated on an independent validation set ([182]GSE29621) using the pathological stage and survival hub genes as candidate biomarker input to predict the classes of pathological stage as well as survival of neoplastic progression and prognosis in RC patients. Because of the heterogeneity of hub genes within and among samples and datasets, the between-study heterogeneity was examined by the Cochran’s Q-test and I^2 statistic, with P values for heterogeneity by the I^2 value > 50% indicating substantial heterogeneity [[183]56]. We used 300 Monte Carlo permutation tests after combining P values ([184]Supplementary Figure 1) [[185]57], and a DerSimonian and Laird random effects model for meta-analysis by combing effect size to select hub genes in RC. SUPPLEMENTARY MATERIALS FIGURES AND TABLES [186]oncotarget-08-69594-s001.pdf^ (4.4MB, pdf) [187]oncotarget-08-69594-s002.docx^ (325.3KB, docx) [188]oncotarget-08-69594-s003.docx^ (20KB, docx) Acknowledgments