Abstract

   The current criteria for defining the recurrence risks of stage II
   colorectal cancer (CRC) are not robust; therefore, we aimed to explore
   novel gene signatures to predict recurrence risks and to reveal the
   underlying mechanisms of stage II CRC. First, the gene expression
   profiles of 124 patients with stage II CRC from The Cancer Genome Atlas
   (TCGA) database were obtained to screen differentially expressed genes
   (DEGs). A total of 202 DEGs, including 128 upregulated and 74
   downregulated, were identified in the recurrence group (n = 24)
   compared to the nonrecurrence group (n = 100). Furthermore, the top 5
   DEGs (ZNF561, WFS1, SLC2A1, MFI2, and PTGR1) were identified by random
   forest variable hunting, and four (ZNF561, WFS1, SLC2A1, and PTGR1)
   were selected to create a four‐gene recurrent model (GRM), with an area
   under the curve (AUC) of 0.882 according to the receiver operating
   characteristic curve, and the robust diagnostic effectiveness of the
   GRM was further validated with another gene expression profiling
   dataset ([42]GSE12032), with an AUC of 0.943. The diagnostic
   effectiveness of the GRM regarding recurrence was associated with poor
   disease‐free survival in all stages of CRC. In addition, gene ontology
   functional annotation and Kyoto Encyclopedia of Genes and Genomes
   pathway enrichment analyses revealed 18 enriched functions and 6
   enriched pathways. Four genes, ABCG2, CACNA1F, CYP19A1, and TF, were
   identified as hub genes by the protein‐protein interaction network,
   which further validated that these genes were correlated with a poor
   pathologic stage and overall survival in all stages of CRC. In
   conclusion, the GRM can effectively classify stage II CRC into groups
   of high and low risks of recurrence, thereby making up for the
   prognostic value of the traditional clinicopathological risk factors
   defined by the National Comprehensive Cancer Network guidelines. The
   hub genes may be useful therapeutic targets for recurrence. Thus, the
   GRM and hub genes could offer clinical value in directing
   individualized and precision therapeutic regimens for stage II CRC
   patients.

   Keywords: bioinformatics analysis, colorectal cancer, recurrence
   mechanisms, recurrence risks
     __________________________________________________________________

   This study finds indicators for predicting postoperative recurrence in
   patients with stage II colorectal cancer (CRC). Moreover, the gene
   recurrent model and hub genes potentially offer clinical values in
   directing individualized and precision therapeutic regimen be selection
   for stage II CRC.

   graphic file with name CAM4-9-179-g007.jpg

1. INTRODUCTION

   Colorectal cancer (CRC) is the third most frequent malignant tumor and
   the fourth leading cause of cancer‐related death worldwide.[43]1
   Survival and treatment in CRC are primarily dependent on the tumor
   stage at diagnosis. Radical resection of the tumor lesion is the
   foundation treatment for stage II CRC. However, postsurgery, 25%‐30% of
   stage II CRC patients develop recurrence within 5 years,[44]2
   contributing to mortality. To improve the survival of these patients,
   the American Society of Clinical Oncology, the National Comprehensive
   Cancer Network (NCCN) and the European Society for Medical Oncology
   recommend adjuvant chemotherapy for high‐risk patients with stage II
   CRC, and the risks are primarily defined by clinicopathological
   features, such as tumor size, the number of lymph nodes investigated,
   poorly differentiated histology, tumor perforation (T4), bowel
   obstruction and perforation, positive resection margins, and lymphatic
   and venous invasion.[45]3

   Unfortunately, using these abovementioned criteria, we found that a
   portion of low‐risk patients also experienced recurrence after the
   operation.[46]4 On the other hand, is there a portion of patients who
   were overtreated according to the abovementioned criteria? This concern
   highlights the lack of available biomarkers that can help detect the
   genuine high‐risk factors of recurrence for stage II CRC, which can
   improve the treatment accuracy of these patients.

   In recent years, rapid technological breakthroughs of genome‐wide
   sequencing have provided researchers with large expression datasets.
   With the popularization of big data analysis and the progress of
   bioinformatics technology, a series of biomarkers was identified for
   predicting the recurrence, metastasis, chemosensitivity, and prognosis
   of CRC. Moreover, the complex recurrence and metastatic processes of
   polygenic network collaboration were also analyzed by bioinformatics
   tools.[47]5, [48]6, [49]7 Practice suggested that the prediction value
   of a single biomarker was usually unfavorable, and multiple biomarkers
   jointly confirmed to be efficient and reliable.[50]8, [51]9

   In this study, we performed bioinformatical analyses based on
   high‐throughput RNA sequencing of CRC from the The Cancer Genome Atlas
   (TCGA) ([52]http://cancergenom-e.nih.gov/) to gain a panoramic view of
   expression patterns between recurrence and nonrecurrence patients with
   stage II CRC. Furthermore, differentially expressed genes (DEGs) were
   used to establish a model to predict the recurrence risk (gene
   recurrent model [GRM]) by random forest sequencing, and excellent
   diagnostic effectiveness was shown by receiver operating characteristic
   (ROC) curve analysis. Then, another gene expression profiling dataset
   was extracted from the GEO ([53]GSE12032) to further validate the
   robust diagnostic effectiveness of the GRM. In addition, gene ontology
   (GO), Kyoto Encyclopedia of Genes and Genomes (KEGG), protein‐protein
   interaction (PPI) network analyses, and hub gene selection were adopted
   to jointly analyze the underlying mechanism of recurrence. Finally, the
   results we obtained might be meaningful in guiding clinical practice
   and understanding the recurrence mechanisms of stage II CRC.

2. MATERIALS AND METHODS

2.1. Data collection

   The RNA‐Seq dataset of CRC, which includes the whole human
   transcriptome sequencing dataset and corresponding survival profiles,
   was downloaded from the TCGA database. All the data in the dataset used
   were pathological stage II (T[3‐4]N[0]M[0]) without postoperative
   adjuvant therapy and were followed up for at least 2 years. According
   to recurrence, the samples were divided into a recurrence group and a
   nonrecurrence group.

2.2. Data preprocessing and the identification of DEGs

   edgeR is an R package used for the analysis of DEGs[54]10 and was used
   in our study to screen the DEGs between the recurrence and
   nonrecurrence groups. The DEGs were identified with the following
   criteria: fold change |(FC)| ≥2 and P value <.01.

2.3. Random forest sequencing

   Random forest is a popular classification and regression method that
   has proven powerful for various prediction problems in biological
   studies. The mean decrease in gini (MDG), which is involved in the
   random forest algorithm, is used to rank the important indexes with
   DEGs. The MDG provides ways to quantify which index contributes most to
   classification accuracy. A higher MDG indicates that the degree of
   impurity arising from the category could be reduced farthest by one
   variable and thus suggests an important associated index. We divided
   our data into training (66%, n = 82) and testing (34%, n = 42) datasets
   by using the randomForest package of R software
   ([55]http://www.r-project.org).[56]11 We used the training dataset to
   develop the random forest model and then tested the model's performance
   with the testing dataset. The specific random forest model parameters
   were as follows: max features: auto, n estimators: 5000, min sample
   leaf: 1, and number of variables tried at each split: 2.

2.4. ROC analysis of the select recurrence‐related genes

   Receiver operating characteristic analyses are commonly used to
   evaluate the performance of disease diagnosis. In our study, the four
   genes selected from the top 5 genes ranked by the MDG were used as
   biomarkers for detecting recurrence and for constructing a recurrence
   risk model. The area under the curve (AUC) was used to demonstrate the
   accuracy of an individual gene and joint genes for predicting
   recurrence.

2.5. External exploration of the diagnostic effectiveness of the GRM

   To further validate the diagnostic effectiveness of the GRM, another
   gene expression profiling dataset was extracted from the GEO
   ([57]GSE12032) for analysis. GEO2R, a web tool that was applied to
   screen the DEGs by comparing two groups of samples in a GEO series, was
   applied to identify the DEGs between the recurrence and nonrecurrence
   groups of stage II CRC patients following the criteria fold change
   |(FC)| ≥1.5 and P value <.05. The patients with missing values were
   excluded. Furthermore, the identified DEGs were ranked by the MDG in
   the random forest algorithm, and the top 10 genes were selected. In
   addition, the diagnostic effectiveness of the GRM was validated by ROC
   analyses, and the AUC was used to demonstrate the accuracy of the GRM
   for predicting recurrence.

2.6. Further exploration of the clinical values of the GRM

   As the four genes in the GRM achieved robust predictive values of
   recurrence in stage II CRC and the risk of recurrence always affected
   disease‐free survival (DFS), we further validated the four genes by
   investigating their expression and relevance to DFS in all stages of
   CRC between the recurrence and nonrecurrence groups. Kaplan‐Meier
   survival curves were constructed using the expressions of the four
   genes from the TCGA transcriptional profiles as a threshold and
   compared by log‐rank analysis. All analyses were performed with
   GraphPad Prism 5.0 and SPSS version 19.0 (IBM), and P < .05 was
   considered statistically significant.

2.7. Functional annotation and pathway enrichment analyses

   The Database for Annotation, Visualization, and Integrated Discovery
   (DAVID) v.6.8 ([58]https://david.ncifcrf.gov)[59]12 was used to perform
   GO[60]13 functional annotation and KEGG[61]14 pathway enrichment
   analyses. The human genome was selected as the background list
   parameter, and a P value <.05 was set as statistically significant.

2.8. PPI network analysis and hub gene selection

   The Search Tool for the Retrieval of Interacting Genes (STRING,
   [62]http://www.string-db.org/) database was used to construct a PPI
   network. An interaction score >0.4 was considered statistically
   significant. Furthermore, the result of the PPI network was imported
   into the Cytoscape plugin to create network visualizations and
   subjected to centrality values analysis with CentiScaPe 2.2.

2.9. Further exploration of the clinical values of the four hub genes

   As the four hub genes that promote recurrence always highly correlated
   with pathologic stage and affected overall survival (OS), we further
   validated the four hub genes by investigating their expression in
   different pathologic stages and relevance to OS in all stages of CRC.
   Kaplan‐Meier survival curves were constructed using the expressions of
   the four hub genes from the TCGA transcriptional profiles as a
   threshold and compared by log‐rank analysis. All analyses were
   performed with GraphPad Prism 5.0 and SPSS version 19.0, and P < .05
   was considered statistically significant.

3. RESULTS

3.1. Identification of the DEGs

   Information on the 124 patients who met our research criteria was
   obtained from the TCGA database. After a 2‐year follow‐up, 24 patients
   experienced recurrence, and 100 patients did not experience recurrence.
   Moreover, 202 DEGs, including 128 upregulated and 74 downregulated,
   were identified in the recurrence group compared to the nonrecurrence
   group (Figure [63]1).

Figure 1.

   Figure 1
   [64]Open in a new tab

   Volcano plot of the DEGs between recurrence and nonrecurrence groups of
   stage II CRC. The x‐axis indicates the log2 fold change in gene
   expression, which was defined as the ratio of normalized value of gene
   expression detected in stage II CRC between recurrence and
   nonrecurrence groups. The y‐axis indicates the adjusted P values
   plotted in −log10. Red dots highlight genes upregulated in recurrence
   group (fold change >2, P value <.01). Green dots highlight genes
   downregulated in nonrecurrence group (fold change >2, P value <.01).
   CRC, colorectal cancer; DEGs, differentially expressed genes

3.2. Acquisition of the classified DEGs by the random forest classifier

   The randomForest package in R, which performs well in predicting
   whether variables are noise or not and evaluates the importance of the
   variable, was used. The 202 DEGs were ranked by the MDG with the random
   forest method, and the top 5 DEGs (ZNF561, WFS1, SLC2A1, MFI2, and
   PTGR1) were selected for construction of the ROC curve (Table [65]1).

Table 1.

   The TCGA top 5 DEGs for ROC construction were ranked by MDG with random
   forest method
   Gene   Description                             Mean Decrease Gini (MDG)
   ZNF561 Zinc finger protein 561                          17.608
   WFS1   Wolframin ER transmembrane glycoprotein          13.193
   SLC2A1 Solute carrier family 2 member 1                 11.726
   MFI2   Melanotransferrin                                9.878
   PTGR1  Prostaglandin reductase 1                        9.510
   [66]Open in a new tab

   Abbreviations: DEGs, differentially expressed genes; ROC, receiver
   operating characteristic; TCGA, The Cancer Genome Atlas.

3.3. Construction of a four‐GRM and further validation

   The ROC curve defined an optimal threshold to predict the recurrence
   risk of stage II CRC, and the AUC values of the ROC for ZNF561, WFS1,
   SLC2A1, MFI2, and PTGR1 were 0.7404, 0.7283, 0.6238, 0.6738, and
   0.7454, respectively (Figure [67]2A). To elevate the prediction
   efficiency, we explored the combination of two, three, four and five
   DEGs. The AUC values of the ROC curve with the combination of two DEGs
   for ZNF561+WFS1, ZNF561+SLC2A1, ZNF561+PTGR1, ZNF561+MFI2, WFS1+SLC2A1,
   WFS1+PTGR1, WFS1+MFI2, SLC2A1+PTGR1, SLC2A1+MFI2, and PTGR1+MFI2 were
   0.813, 0.680, 0.707, 0.643, 0.737, 0.731, 0.647, 0.661, 0.577, and
   0.628, respectively (Figure [68]2B). The AUC values of the ROC curve
   with the combination of three DEGs for ZNF561+WFS1+SLC2A1,
   ZNF561+WFS1+PTGR1, ZNF561+WFS1+MFI2, ZNF561+SLC2A1+PTGR1,
   ZNF561+SLC2A1+MFI2, ZNF561+PTGR1+MFI2, WFS1+SLC2A1+PTGR1,
   WFS1+SLC2A1+MFI2, WFS1+PTGR1+MFI2, and SLC2A1+PTGR1+MFI2 were 0.721,
   0.768, 0.778, 0.702, 0.663, 0.725, 0.813, 0.592, 0.788, and 0.625,
   respectively (Figure [69]2C). The AUC values of the ROC curve with the
   combination of four DEGs for ZNF561+WFS1+SLC2A1+MFI2,
   ZNF561+WFS1+SLC2A1+PTGR1, ZNF561+WFS1+PTGR1+MFI2, and
   WFS1+SLC2A1+PTGR1+MFI2 were 0.704, 0.882, 0.665, and 0.870,
   respectively (Figure [70]2D). The AUC value of the ROC curve with the
   combination of five DEGs for ZNF561+WFS1+SLC2A1+PTGR1+MFI2 was 0.652
   (Figure [71]2D). Among them, the four‐gene signature (ZNF561, WFS1,
   SLC2A1, and PTGR1), with an AUC of 0.882, exhibited the best
   performance for predicting recurrence and showed remarkable sensitivity
   and specificity when the cutoff value was 0.593 (Figure [72]2D).

Figure 2.

   Figure 2
   [73]Open in a new tab

   ROC curves of the top 5 DEGs sorted by AUC. Red line represents the
   sensitive curve, while black represents the identify line. The x‐axis
   indicates false positive rate, which is presented as “Specificity
   (1 − Sensitivity)”. The y‐axis indicates true positive rate, which is
   presented as “Sensitivity”. A, The individual diagnostic efficiency of
   ZNF561, WFS1, SLC2A1, MFI2, and PTGR1. B, The joint diagnostic
   efficiency of the combinations of the two DEGs in the top 5. C, The
   joint diagnostic efficiency of the combinations of the three DEGs in
   the top 5. D, The joint diagnostic efficiency of the combinations of
   the four or five DEGs in the top 5. E, Heatmap of the DEGs between
   recurrence and nonrecurrence groups of stage II CRC. Each column showed
   patient samples. N represented nonrecurrence. R represented recurrence.
   A hierarchical clustering analysis was performed, and patient
   information based on the expression of DEGs was mapped. Red represented
   upregulated genes. Green represented downregulated genes. F, The
   diagnostic efficiency of GRM in GEO date. AUC, area under the curve;
   CRC, colorectal cancer; DEGs, differentially expressed genes; GRM, gene
   recurrent model; ROC, receiver operating characteristic

   The transcriptome profiling data of 92 patients with stage II CRC in
   the [74]GSE12032 dataset, which includes 30 patients with recurrence
   and 62 patients without recurrence, were collected. After excluding the
   patients with missing values, we obtained 60 patients, including 16
   patients with recurrence and 44 patients without recurrence.
   Furthermore, 113 DEGs (63 upregulated DEGs and 50 downregulated DEGs)
   were identified in the recurrence group compared to the nonrecurrence
   group (Figure [75]2E). In addition, all the genes in the GRM were
   included in the top 10 DEGs that were ranked by the MDG with the random
   forest classifier (Table [76]2). These genes exhibited robust
   performance for predicting recurrence, with an AUC of 0.943, and showed
   remarkable sensitivity and specificity when the cutoff value was 0.619
   (Figure [77]2F).

Table 2.

   The top 10 DEGs of [78]GSE12032 were ranked by MDG with random forest
   method
   Gene     Description                               Mean Decrease Gini (MDG)
   ABCB5    Zinc finger protein 561                            1.405
   SLC2A1   Solute carrier family 2 member 1                   1.300
   CLDN6    Claudin 6                                          0.924
   ZNF561   Zinc finger protein 561                            0.808
   HIST1H4H Histone cluster 1 H4 family member h               0.773
   TMEM63A  Transmembrane protein 63A                          0.725
   PTGR1    Prostaglandin reductase 1                          0.615
   BCAR3    BCAR3 adaptor protein, NSP family member           0.551
   WFS1     Wolframin ER transmembrane glycoprotein            0.517
   C3orf42  Long intergenic nonprotein coding RNA 852          0.475
   [79]Open in a new tab

   Abbreviation: DEGs, differentially expressed genes.

3.4. The clinical values of the four genes in the GRM for all stages of CRC

   The expression levels of the four genes in the GRM between the
   recurrence and nonrecurrence groups were significantly different, and
   the P values for ZNF561, WFS1, SLC2A1, and PTGR1 were .004, <.0001,
   .0002, and .0002, respectively (Figure [80]3A). The survival analysis
   indicated that low ZNF561, PTGR1 expression and high WFS1, SLC2A1
   expression were associated with poor DFS, with P values of <.01,
   <.05,<.05, and <.05, respectively (Figure [81]3B).

Figure 3.

   Figure 3
   [82]Open in a new tab

   The clinical value of the top 4 genes in all stage CRC. A, Validation
   of the gene expression levels of ZNF561, WFS1, SLC2A1, and PTGR1
   between recurrence and nonrecurrence patients. B, Disease‐free survival
   analysis of the top 4 genes. CRC, colorectal cancer

3.5. Functional annotation and pathway enrichment analyses

   Based on the DAVID software, a total of 18 GO functions were enriched.
   Concerning the Molecular Function terms, the 202 DEGs were mostly
   enriched in oxygen binding, iron ion binding, heme binding,
   monooxygenase activity, oxidoreductase activity, steroid hydroxylase
   activity, G protein‐coupled receptor (GPCR) activity, oxygen
   transporter activity, structure of the cytoskeleton, and aromatase
   activity. Concerning the Biological Process terms, the DEGs were
   significantly enriched in the steroid metabolic process, oxygen
   transport, the GPCR signaling pathway, the sensory perception of smell,
   the drug metabolic process, and epidermis development. Concerning the
   Cellular Component terms, the DEGs were significantly enriched in
   hemoglobin complex and organelle membrane (Figure [83]4A; Table [84]3).

Figure 4.

   Figure 4
   [85]Open in a new tab

   DAVID analysis of DEGs: A, GO functional annotation of top 18
   enrichment terms. B, KEGG pathway enrichment analysis of top 6
   enrichment terms. The count of genes enriched in terms is indicated by
   the node size; the P value is shown by the color, the redder the color,
   the more significant it is. DAVID, Database for Annotation,
   Visualization, and Integrated Discovery; DEGs, differentially expressed
   genes; GO, gene ontology; KEGG, Kyoto Encyclopedia of Genes and Genomes

Table 3.

   The 18 most significant enriched gene sets for recurrence features of
   stage II CRC from BP, MF, CC
   ID         Description                                  Count P‐Value
   Molecular Function
   GO:0019825 Oxygen binding                                 7   6.14E‐06
   GO:0005506 Iron ion binding                              10   1.96E‐05
   GO:0020037 Heme binding                                   9   6.04E‐05
   GO:0004497 Monooxygenase activity                         6   2.53E‐04
   GO:0016712 Oxidoreductase activity                        4   3.83E‐04
   GO:0008395 Steroid hydroxylase activity                   4   .00202095
   GO:0004930 G‐protein coupled receptor activity           16   .00415544
   GO:0005344 Oxygen transporter activity                    3   .00799984
   GO:0005200 Structural constituent of cytoskeleton         5   .02284408
   GO:0070330 Aromatase activity                             3   .02840488
   Biological Process
   GO:0008202 Steroid metabolic process                      5   7.68E‐04
   GO:0015671 Oxygen transport                               3   .00894403
   GO:0007186 G protein‐coupled receptor signaling pathway  17   .01335430
   GO:0007608 Sensory perception of smell                    6   .01900123
   GO:0017144 Drug metabolic process                         3   .02772735
   GO:0008544 Epidermis development                          4   .04905401
   Cellular Component
   GO:0005833 Hemoglobin complex                             3   .00574369
   GO:0031090 Organelle membrane                             5   .01005061
   [86]Open in a new tab

   Abbreviations: CRC, colorectal cancer; GO, gene ontology.

   In KEGG pathway enrichment analysis, the 202 DEGs were significantly
   enriched in six signaling pathways: steroid hormone biosynthesis,
   retinol metabolism, xenobiotic metabolism‐CYP450, chemical
   carcinogenesis, drug metabolism‐other enzyme, and drug
   metabolism‐CYP450 (Figure [87]4B; Table [88]4).

Table 4.

   The 6 most significant enriched pathways for recurrence features of
   stage II CRC from KEGG
   KEGG pathway id Description                                  Count P‐Value
   hsa00982        Drug metabolism—cytochrome P450              4     .04376667
   hsa00983        Drug metabolism—other enzymes                4     .01587008
   hsa05204        Chemical carcinogenesis                      5     .01355137
   hsa00980        Metabolism of xenobiotics by cytochrome P450 5     .0103847
   hsa00830        Retinol metabolism                           5     .00660703
   hsa00140        Steroid hormone biosynthesis                 8     4.00E‐06
   [89]Open in a new tab

   Abbreviations: CRC, colorectal cancer; KEGG, Kyoto Encyclopedia of
   Genes and Genomes.

3.6. PPI network construction and hub gene identification

   Based on the STRING database, a PPI network was constructed (Figure
   [90]5A). The network contained 63 nodes and 100 edges that were
   subjected to hub gene analysis with CentiScaPe 2.2. Four hub genes
   evaluated by the degree (≥3.14) and betweenness (≥223) were identified:
   ABCG2, CACNA1F, CYP19A1, and TF (Figure [91]5B; Table [92]5).

Figure 5.

   Figure 5
   [93]Open in a new tab

   PPI Network of DEGs in recurrence compared nonrecurrence. The nodes
   indicate the DEGs and the edges indicate the interactions between two
   genes. The yellow nodes indicating important were selected as hub
   genes. A, The hub genes identified and visualized by degree and
   betweenness. B, Centrality statistics for hub genes: ABCG2, CACNA1F,
   CYP19A1, TF. DEGs, differentially expressed genes; PPI, protein‐protein
   interaction

Table 5.

   The four hub genes in the protein‐protein interaction network
   Gene    Description                                    Degree Betweenness
   ABCG2   ATP ‐binding cassette subfamily G member 2      4.0      477.4
   CACNA1F Calcium voltage‐gated channel subunit alpha1 F  4.0      242.0
   CYP19A1 Cytochrome P450 family 19 subfamily A member 1  10.0     899.3
   TF      Transferrin                                     6.0      680.8
   [94]Open in a new tab

3.7. The clinical value of the four hub genes in all stages of CRC

   Using the TCGA transcriptional profiles in all stages of CRC, the
   expression levels of the four hub genes with different TNM stages were
   significantly different, and the high expression of ABCG2, CACNA1F,
   CYP19A1, and TF was associated with a poor TNM stage (Figure [95]6A).
   For the T stage (T[1+2] vs T[3+4]), the P values of ABCG2, CACNA1F,
   CYP19A1, and TF were <.05, <.05, .006, and <.05, respectively. For the
   N stage (N[0] vs N[1+2]), the P values of ABCG2, CACNA1F, CYP19A1, and
   TF were <.01, <.05, .003, and <.01, respectively. For the M stage (M[0]
   vs M[1]), the P values of ABCG2, CACNA1F, CYP19A1, and TF were <.01,
   <.01, <.01, and <.05, respectively. The high expression of ABCG2,
   CACNA1F, CYP19A1, and TF was also associated with poor OS (Figure
   [96]6B), with P values of <.05, <.05, <.01, and <.05, respectively.

Figure 6.

   Figure 6
   [97]Open in a new tab

   Clinical values of the four hub genes in all stage CRC. A, Validation
   of the gene expression levels of ABCG2, CACNA1F, CYP19A1, TF in
   different TNM stages patients. B, OS analysis of the four hub genes.
   CRC, colorectal cancer; OS, overall survival

4. DISCUSSION

   In our quest to develop a robust recurrence model of stage II CRC, we
   successfully developed the GRM, which achieved excellent predictive
   values of recurrence in stage II CRC. The underlying mechanisms of
   recurrence were also explored.

   Current high‐risk criteria for defining the recurrence of stage II CRC,
   which were addressed in the introduction, depend largely on
   clinicopathologic factors, leading to limitations in their prognostic
   abilities for this highly heterogeneous tumor, and the improvement in
   stage II CRC patient survival following adjuvant chemotherapy is less
   than 5% at 5 years.[98]15 This means that the administration of
   adjuvant chemotherapy to all patients in stage II is approximately 75%
   unnecessary and harmful. This narrow therapeutic index underscores the
   importance of identifying more appropriate biomarkers to detect the
   genuine high‐risk factors of recurrence for stage II CRC.

   Single biomarkers, such as CEA, CA199, miR‐21, miR‐181c, MTA3, S100A2,
   and ezrin,[99]5, [100]16, [101]17, [102]18, [103]20, [104]21 for the
   prognosis of recurrence in stage II CRC patients have been reported
   before. However, given that CRC tissues show complicated molecular and
   cellular heterogeneity, a single biomarker failed to reflect the
   genomic heterogeneity of the tumor; therefore, their prediction
   efficiency was unpowerful. Although multigene expression signatures
   have been reported, the number of genes that need to be tested is 13,
   31, or 120, which is uneconomical, and the signatures demonstrated poor
   feasibility and poor specificity.[105]22, [106]23, [107]24 One
   previously reported 8‐miRNA recurrence classifier is superior to
   currently used clinicopathological features, as well as NCCN criteria.
   Another prognostic mutation panel comprising five prognostic genes
   (APAF1, DIAPH2, NTNG1, USP7, and VAV2)[108]25 also showed superior
   prognostic accuracy to that of the American Joint Commission on Cancer
   classification (concordance index: 0.70 vs 0.54, respectively).
   However, the authors showed only the HR; therefore, its specific
   diagnostic performance is unclear. Miyake et al constructed a
   discriminator gene set that included 30 genes based on the expression
   data of 92 stage II CRC patients; however, despite its reported
   diagnostic effectiveness, its prediction accuracy was only
   77.4%.[109]26

   Under these circumstances, we explored the prediction efficiency of
   expression datasets from the TCGA database of stage II CRC patients who
   did not undergo postoperative adjuvant therapy and were followed up for
   at least 2 years. Using random forest variable hunting, we identified
   the top 5 DEGs. As the results revealed, different combinations of the
   top 5 DEGs showed different AUC values. The combination of the
   four‐gene signature (ZNF561, WFS1, SLC2A1, and PTGR1), with an AUC of
   0.882, exhibited the best performance for predicting recurrence and
   showed remarkable sensitivity and specificity when the cutoff value was
   0.593. Therefore, these four genes were selected to create the GRM and
   achieved excellent predictive values of recurrence, with an AUC of
   0.882 in stage II CRC. To further test and verify the diagnostic
   effectiveness of the GRM, we extracted another gene expression
   profiling dataset from the GEO ([110]GSE12032) for further validation.
   To our excitement, all the genes in the GRM were included in the top 10
   DEGs ranked by the MDG with the random forest classifier (Table [111]2)
   from the GEO database and exhibited better performance for predicting
   recurrence, with an AUC of 0.943, than the TCGA database. Furthermore,
   we and Miyake et al used the same gene expression profiling dataset.
   Our GRM contains only four genes but exhibited robust performance for
   predicting recurrence, with an AUC of 0.943, while the prediction
   accuracy of their discriminator gene set, which contains 30 genes, was
   only 77.4%. However, the other genes selected before by different teams
   for the prediction of recurrence in stage II CRC were obtained using
   different methods; therefore, their diagnostic performance cannot be
   directly compared with ours. However, the main methods we used to
   select the DEGs to establish the GRM were advanced, such as the random
   forest method, which takes advantage of two powerful machine‐learning
   techniques: bagging and random feature selection. As one of the most
   important advantages, accuracy mainly benefits from the complementation
   of the training set and the test set and is relatively robust to
   outliers and noise, which contribute to its superior performance over
   many other methods, and it is commonly used in genomic data analyses as
   an effective prediction tool.[112]27, [113]28

   The robust diagnostic effectiveness showed by the GRM in two
   independent databases highly emphasized its stability. Stability mainly
   depends on the high efficiency, robustness and reliability of the
   analytical methods we used, such as the random forest method and ROC
   curve analysis. Furthermore, the robust diagnostic effectiveness also
   encouraged us to further apply it in clinical practice, and the
   personalized prediction of recurrence by the GRM will help avoid
   undertreatment or overtreatment.

   Further exploration of the clinical values of the four genes in the GRM
   validated their effectiveness in the diagnosis of recurrence in all
   stages of CRC. The low expression of ZNF561 and PTGR1 and the high
   expression of WFS1 and SLC2A1 were associated with poor DFS. All of the
   above findings strongly prove the high recurrence diagnostic efficiency
   of our GRM.

   To explore the mechanism of recurrence in stage II CRC, we used an
   integrated bioinformatics analysis. In GO analysis, 18 functions were
   enriched and support evidence for the important roles of oxygen,
   epidermis development, hemoglobin, and GPCR.

   Hypoxia (low oxygen concentration) and ischemia (low hemoglobin
   concentration), which are caused by insufficient vascularization, are
   hallmarks of solid tumors (including CRC).[114]29, [115]30 Hypoxia acts
   as an off switch for the expression of several genes, such as vascular
   endothelial growth factor (VEGF) and epidermal growth factor receptor
   (EGFR),[116]31, [117]32 and could facilitate tumor development by
   epithelial‐to‐mesenchymal transition[118]33, [119]34; moreover,
   intratumoral hypoxia is associated with therapy resistance, metastasis,
   and a poor clinical outcome in CRC.[120]34 Under conditions of hypoxia,
   the transcription factor hypoxia‐inducible factor (HIF)[121]35 is
   activated immediately and strongly adapts to the environment by
   regulating transcriptional programs in erythropoiesis, angiogenesis and
   metabolism.[122]36 Then, it secretes a large number of
   angiogenesis‐related molecules, such as VEGF, to actualize
   angiogenesis, cell proliferation and protection against ischemic
   injury.[123]37

   Heme has important functions in transportation, catalysis, and electron
   and signaling transfer and serves as a prosthetic group in heme
   binding. Heme binding is widely observed in tumor lesions and cancer
   cells, and heme‐binding ability is necessary for DNA damage
   resistance[124]38 and represents an important biomarker of the
   proliferative status of cancers.[125]39 Hemoglobin, which is the most
   abundant heme‐binding protein associated with oxygen
   transportation,[126]40 indicates hypoxia, and the heme‐binding ability
   and hemoglobin yield exhibit correlative dependence and interplay.

   G protein‐coupled receptors comprise the largest superfamily of
   receptors involved in transmembrane‐initiated transduction pathways and
   can crosstalk (or transactivate) with EGFR, insulin/insulin‐like growth
   factor 1 receptors and other cell surface receptors[127]41 to mediate
   the proliferation,[128]42 angiogenesis,[129]43 and metastasis[130]44 of
   CRC.

   Among the mechanisms involved in the above biological processes, oxygen
   binding,[131]45 iron ion binding,[132]45 heme binding, oxygen
   transporter activity, oxygen transport, the hemoglobin complex,
   epidermis development, and the GPCR signaling pathway,[133]46 which
   were highlighted in the results of the GO functional annotation, are
   crucial in tumor development; therefore, targeting HIF, VEGF, iron ion
   binding, and GPCR may be good options for preventing the recurrence of
   stage II CRC.

   In KEGG analysis, the main enrichment pathways were xenobiotic
   metabolism (including drug and retinol)‐CYP450/other enzymes, chemical
   carcinogenesis and steroid hormone biosynthesis.

   Xenobiotic metabolism involves the metabolism of potentially harmful
   compounds that can enter the body together with food, environmental
   components, drugs or food additives. Xenobiotic metabolism enzymes
   include cytochrome P450 (CYP), the glutathione S‐transferase (GST)
   family, the uridine 50‐diphospho–glucuronosyltransferase
   (UDP‐glucuronosyltransferase‐UGT) superfamily, alcohol‐metabolizing
   enzymes, sulfotransferases, etc. Under normal circumstances, xenobiotic
   metabolism enzymes play the role of detoxification. However, they can
   also convert certain chemicals into highly toxic metabolites to trigger
   chemical carcinogenesis, which is known as ‘‘bioactivation’’.[134]47
   Different alleles of enzymes involved in xenobiotic metabolism
   contribute to CRC susceptibility[135]48; for example, CYP450 can
   increase the metabolic activity of procarcinogens, which include
   polycyclic aromatic hydrocarbons and heterocyclic amines, resulting in
   the production of potential carcinogens and eventually the development
   of CRC.[136]49 Another metabolic enzyme is GST, which plays a major
   role in detoxification and steroid hormone biosynthesis.[137]50

   Steroid hormone and its receptor have protective functions against the
   progression of CRC[138]51; therefore, the expression of steroid hormone
   in colon cancer tissues is lower than that in normal tissues,[139]52
   and the downregulation of steroid hormone expression in colorectal
   tissues is a cancer signal.

   Retinol inhibits CRC cell proliferation,[140]53 even inhibiting
   invasion through a retinoic acid receptor‐independent mechanism.[141]54
   In vivo, an impairment in hepatic and intestinal cytosolic retinol
   oxidation and retinoic acid formation by alcohol abuse can increase the
   risk of developing CRC, and retinol metabolism plays a very important
   role in this process.[142]55

   Based on the above mechanisms of the enriched pathways, we hypothesized
   that the recurrence of stage II CRC is likely due to “the breach of
   duty” of the xenobiotic metabolism enzymes, leading to the
   decompensation of retinal metabolism and the inhibition of steroid
   biosynthesis, and ultimately the development of recurrence. The
   metabolism of other potentially harmful compounds may also be involved
   in regulating xenobiotic metabolism enzymes while avoiding the invasion
   of xenobiotics and supplementing retinal and steroid hormones, which
   may be good options for preventing the recurrence of stage II CRC.

   To further highlight the hub genes that play the most important roles
   in recurrence and to explore their interactions, a PPI network was
   constructed, and four hub genes were selected. ABCG2, which is a
   membrane‐associated protein, can relieve oxidative stress and the
   inflammatory response by inhibiting the NF‐κB signaling pathway, and
   the high expression of ABCG2 in CRC tissues may represent feedback of
   the overoxidative reaction, which is associated with a poor
   prognosis.[143]56, [144]57 Most research has reported that inhibiting
   ABCG2, as a marker of chemoresistance in CRC, could enhance the
   efficacy of Hypericin‐mediated photodynamic therapy. CACNA1F is a
   voltage‐gated calcium channel that is mainly expressed in the human
   retina, but it has also been reported to be widely distributed outside
   the retina, including in the immune system.[145]58 It is well
   documented that CACNA1F plays roles in cell proliferation, migration,
   and apoptosis,[146]59 but it is rarely reported in CRC; therefore,
   further research on CACNA1F is needed. CYP19A1 is a member of the
   CYP450 superfamily that, as a monooxygenase, catalyzes drug metabolism
   and the synthesis of cholesterol, steroids and other lipids. Many
   researchers suggest that polymorphisms in CYP19A1 are related to CRC
   risk and may be influenced by estrogen through an inflammation‐related
   mechanism.[147]60, [148]61 The main function of the TF protein is to
   transport iron from the intestine, reticuloendothelial system, and
   liver parenchymal cells to all proliferating cells in the body. TF is
   an iron‐transporting protein that can transfer the iron absorbed by the
   intestinal mucosa to the bone marrow for hemoglobin formation in
   normoblasts.[149]62 A large number of studies indicate that TF is also
   a growth factor of all proliferated and cultured cells.[150]63
   Moreover, TF is synthesized for its own specific proliferation and
   differentiation in tumor tissue. Based on the biological properties of
   TF, TF in the feces is used as a blood marker for CRC screening, with a
   sensitivity and specificity of 92% and 72.0%, respectively.[151]64,
   [152]65

   However, our current study has the following limitations: (a) Because
   we were unable to obtain complete clinical data, we were unable to
   implement a comparison of predictive effectiveness between our GRM and
   high‐risk factors defined by the NCCN guidelines. (b) Our study is
   based on a mRNA evaluation from the TCGA and GEO databases; therefore,
   it is not as persuasive as the level of protein expression, and we did
   not use clinical samples to further verify its authenticity. (c) It is
   a retrospective study; therefore, the evidence level is imperfect. With
   regard to potential limitations, our GRM and hub genes relying on the
   advantages of excellent predictive values and reasonable statements
   should be validated in future, prospective, multicenter clinical
   trials. In addition, biomarkers that show promising predictive value
   for the survival benefit of adjuvant chemotherapy in stage II CRC
   patients should be discovered at the same time.

   In summary, our findings showed that the GRM can effectively classify
   stage II CRC patients into groups with high and low risks of
   recurrence, thereby making up for the prognostic value of the
   traditional clinicopathological risk factors defined by the NCCN
   guidelines. Moreover, various pathways and hub genes involved in the
   recurrence progression of stage II CRC were revealed, which may be
   useful therapeutic targets. Thus, the GRM and hub genes could offer
   clinical value in directing individualized and precision therapeutic
   regimens for stage II CRC patients.

5. CONCLUSION

   In conclusion, the GRM we established using stage II CRC data from the
   TCGA by random forest variable hunting showed robust diagnostic
   effectiveness and was further validated with GEO data, supporting its
   robust ability in the personalized prediction of recurrence. In
   addition, GO, KEGG, PPI network analyses, and hub gene selection
   revealed the underlying mechanism of recurrence to a certain extent.
   Thus, the GRM and hub genes could offer clinical value in directing
   individualized and precise therapeutic regimens for stage II CRC
   patients.

CONFLICT OF INTEREST

   The authors report no conflict of interest in this work.

ACKNOWLEDGMENTS