Abstract

Purpose

   This study aims to systematically analyze multi-omics data to explore
   new prognosis biomarkers in colon adenocarcinoma (COAD).

Materials and Methods

   Multi-omics data of COAD and clinical information were obtained from
   The Cancer Genome Atlas (TCGA). Univariate Cox analysis was used to
   select genes which significantly related to the overall survival.
   GISTIC 2.0 software was used to identify significant amplification or
   deletion. Mutsig 2.0 software was used to identify significant mutation
   genes. The 9-gene signature was screened by random forest algorithm and
   Cox regression analysis. [40]GSE17538 dataset was used as an external
   dataset to verify the predictive ability of 9-gene signature. qPCR was
   used to detect the expression of 9 genes in clinical specimens.

Results

   A total of 71 candidate genes are obtained by integrating genomic
   variation, mutation and prognostic data. Then, 9-gene signature was
   established, which includes HOXD12, RNF25, CBLN3, DOCK3, DNAJB13,
   PYGO2, CTNNA1, PTPRK, and NAT1. The 9-gene signature is an independent
   prognostic risk factor for COAD patients. In addition, the signature
   shows good predicting performance and clinical practicality in training
   set, testing set and external verification set. The results of qPCR
   based on clinical samples showed that the expression of HOXD12, RNF25,
   CBLN3, DOCK3, DNAJB13, and PYGO2 was increased in colon cancer tissues
   and the expression of CTNNA1, PTPRK, NAT1 was decreased in colon cancer
   tissues.

Conclusion

   In this study, 9-gene signature is constructed as a new prognostic
   marker to predict the survival of COAD patients.

   Keywords: COAD, multi-omics, 9-gene signature, prognosis biomarkers

Introduction

   Colon adenocarcinoma (COAD) is a common malignant tumor of the
   digestive system. It is the fourth leading cause of cancer-related
   death in the world.[41]^1^,[42]^2 Although much progress has been made
   in surgical and complementary therapy of COAD in recent years, the
   prognosis of patients with COAD is still poor.[43]^3 The main reason is
   that the pathogenesis of COAD is complex. Most patients are in the
   advanced stage at the time of diagnosis, losing the opportunity of
   surgical treatment.[44]^4 Therefore, in-depth research on COAD
   prognostic markers and potential drug targets to provide new means of
   diagnosis and treatment will greatly improve patients’ chances of
   survival.

   The formation and development of COAD are accompanied by complex and
   varied genetic molecular mechanisms. COADs with the same pathological
   characteristics may have different molecular pathogenesis. The
   prognosis of patients is also quite different from each other.
   Therefore, molecular markers show potential value in COAD diagnosis and
   prognosis. A variety of markers have been used to screen and diagnose
   COAD. For example, the level of carcinoembryonic antigen is closely
   related to lymph node metastasis and tumor stage in cancer. It can be
   used as a screening marker for COAD patients.[45]^5 Epidermal growth
   factor receptor (EGFR) and vascular endothelial growth factor (VEGF)
   are significantly correlated with KI-67 and prognosis in COAD.[46]^6 In
   addition, as far as miRNA biomarkers are concerned, the combination of
   miR-378, miR-199a and miR-92a in plasma is considered as an effective
   way to distinguish COAD cases from normal controls.[47]^7 However,
   these indicators are not suitable for clinical practice because of
   their low specificity and lack of clinical evidence. Therefore, there
   is an urgent need to accurately classify COAD patients and explore its
   molecular mechanism in depth to screen specific molecular markers. This
   requires a broader understanding of the heterogeneity of genome and
   transcriptome level.

   There had been plenty of previous studies trying to screen and
   construct a prognostic marker model for colon cancer. Smith et al[48]^8
   identified a metastasis gene expression profile derived from experiment
   to predict colon cancer recurrence and death. Gao et al[49]^9
   constructed cancer hallmark-based gene signature to predict recurrence
   and chemotherapy benefit of colorectal cancer patients. In addition,
   some recent studies have shown that VEGF could be used as a prognostic
   indicator, but cannot predict the response of advanced CRC to VEGF
   targeted therapy.[50]^10^,[51]^11 In addition, CEA level is an
   independent prognostic factor for colon cancer, and can be used with
   TNM staging of colon cancer.[52]^12^,[53]^13 As far as miRNA prognosis
   signatures are concerned, Rong et al identified 6-miRNA signature to
   predict overall survival;[54]^14 another study constructed 4
   lncRNA-miRNA prognostic signature for stage II colon cancer patients by
   integrating miRNA and lncRNA data15. However, no studies construct
   colon cancer prognostic signature using multiple omics data.

   The study establishes 9-gene signature by integrating genomic copy
   number variation, mutation and prognostic data. It has stable
   robustness in internal and external validation sets. Moreover, 9-gene
   signature is found to be involved in the important pathway and
   independent prognostic risk factors in COAD patients. In conclusion,
   the 9-gene signature constructed in this study can be used as a
   potential target for COAD treatment.

Materials and Methods

Data Acquisition and Processing

   Level-3 RNA-sequencing data, the clinicopathological、 SNP 、 copy number
   segments and survival data of patients with COAD were downloaded from
   the UCSC Xena browser ([55]https://xenabrowser.net/). The mutation
   Annotation Format (MAF) is downloaded from GDC client. The [56]GSE17538
   expression profile data and clinical follow-up information are
   downloaded from the GEO database. The samples of TCGA are randomly
   divided into two groups. The TCGA training set contains 226 samples and
   the test set contains 227 samples. As an external validation set, the
   [57]GSE17538 data set contains a total of 244 samples, including 6
   mouse samples, while among the 238 human samples, 38 samples recorded
   the survival status of NA, and finally used for follow-up analysis.

Univariate Cox Proportional Risk Regression Analysis

   As Guo, J et al[58]^16 univariate Cox proportional risk regression
   analysis is performed for each gene in this paper. The genes
   significantly related to the overall survival (OS) of patients in the
   training cohort are identified. p < 0.05 is selected as the threshold.

Copy Number Variation Data Analysis

   GISTIC is widely used to detect both broad and focal (potentially
   overlapping) recurring events. GISTIC 2.0[59]^17 software is used to
   identify the genes with significant amplification or deletion, with
   parameter thresholds of amplification, and deletion length greater than
   0.1 and p<0.05.

Gene Mutation Analysis

   Mutsig 2.0 software is used to identify significant mutation genes in
   the maf file of TCGA mutation data. The threshold is p < 0.05.

Construction of Prognostic Gene Signature

   In this study, prognostic genes with copy number amplification/deletion
   and mutation are selected. Random survival forest algorithm is used to
   sequence the importance of common genes.[60]^18^,[61]^19 The number of
   Monte Carlo iterations is set to 100 and the number of steps forward is
   set to 5. The genes with relative importance greater than 0.4 are
   identified as feature genes. Multivariate Cox regression analysis is
   further carried out to construct the following risk scoring model:
   RiskScore = ∑Expk * eHRk k=2. In the model, N represents the number of
   prognostic genes. Expk represents the expression value of prognostic
   genes. eHRk represents the estimated regression coefficient of genes in
   the multivariate Cox regression analysis.

Functional Enrichment Analysis

   Gene Ontology (GO) and Kyoto Encyclopedia of Genes and Genomes (KEGG)
   pathway enrichment analysis is performed using the R package
   clusterprofiler[62]^20 to identify over-represented GO terms in KEGG
   pathway and three categories (biological processes, molecular function
   and cellular component). For this analysis, a false discovery rate
   (FDR) <0 0.05 is considered to denote statistical significance.

Gene Set Enrichment Analysis (GSEA)

   GSEA is performed by the JAVA program
   ([63]http://software.broadinstitute.org/gsea/downloads.jsp) using the
   MSigDB C2 Canonical pathways gene set collection, which contains 1320
   gene sets. Gene sets with a FDR value less than 0.05 after performing
   1000 permutations are considered to be significantly enriched.

Quantitative PCR (qPCR)

   Twenty specimens were obtained from colon cancer patients which had
   undergone surgery. Normal tissues were mean paired with tumor biopsies
   from the same patient. Among the 20 patients, 10 patients were in stage
   1, 6 patients were in stage 2, and 4 patients were in stage 3. None of
   these patients had undergone any therapy include chemotherapy or
   radiotherapy before surgery. We got the informed consent of every
   patient, and was approved by The Human Ethics Review Committee of the
   Shengjing hospital of China Medical University. This study is in
   accordance with the Declaration of Helsinki and got written informed
   consent from patients. And, this is a retrospective analysis, and has
   not been obtained prospectively and consecutively. Total RNA was
   extracted using TRIzol reagent (Invitrogen, Carlsbad, CA, USA)
   following the manufacturer’s instructions. Complementary DNA (cDNA) was
   synthesized from high-quality total RNA using PrimeScript™ RT Master
   Mix (No. RR036A, Takara Bio USA, Mountain View, CA, USA). Real-time
   qPCR was performed to validate gene expression using Power SYBR™ Green
   PCR Master Mix (No. A25742, Thermo Fisher Scientific, Waltham, MA, USA)
   on the 7900HT Fast Real-Time PCR System (Applied Biosystems, Foster
   City, CA, USA). Relative expression was calculated based on 2-ΔΔCt
   method.

Statistical Analysis

   A Kaplan-Meier (KM) curve is drawn when using the median risk score in
   each dataset as a cutoff to compare the survival risk between high-risk
   and low-risk groups. Multivariate Cox regression analysis is performed
   to test whether genetic markers are independent prognostic factors.
   Significance is defined as P < 0.05. All of these analyses are carried
   out in R 3.4.3.

Results

Preliminary Analysis of Multi-Omics Data to Obtain Prognostic Genes

Preprocessing Data Statistics

   The specific distribution of including variables is shown in [64]Table
   1. Univariate cox analysis was used to identify 1639 prognostic genes
   with p < 0.05 in TCGA training set ([65]Supplement Table 1). The
   information of the top 20 genes of the 1639 genes is shown in [66]Table
   2.

Table 1.

   Clinical Information Statistics
   Characteristics TCGA Training Datasets (n=226) TCGA Test Datasets
   (n=227) p value [67]GSE17538 (n=238)
   Age (years) ≤50 31 29 0.875 32
   >50 195 198 206
   Survival times (years) Median 1.81 1.97 – 2.38
   Mean 2.25 2.44 3.12
   Survival status Living 178 176 0.839 145
   Dead 48 51 55
   Gender Female 104 109 0.739 114
   Male 122 118 124
   Lymphatic invasion NO 125 128 0.695 –
   YES 81 75 –
   pathologic_T T1 6 5 0.769 –
   T2 35 43 –
   T3 157 153 –
   T4 28 25 –
   pathologic_N N0 127 139 0.454 –
   N1 54 52 –
   N2 45 36 –
   N3 0 0 –
   pathologic_M M0 165 168 0.956 –
   M1/MX 57 56 –
   AJCC stage Stage Ⅰ 34 42 0.332 28
   Stage Ⅱ 82 92 72
   Stage III 73 57 76
   Stage Ⅳ 30 32 56
   [68]Open in a new tab

Table 2.

   Information on Top20 Prognosis-Related Genes
       ENGSID      Genesymbol  HR   Coefficient z score p value
   ENSG00000241697 TMEFF1     1.552 0.440       4.737   2.17E-06
   ENSG00000160117 ANKLE1     1.525 0.422       4.691   2.72E-06
   ENSG00000214128 TMEM213    1.427 0.355       4.559   5.13E-06
   ENSG00000090932 DLL3       1.622 0.483       4.528   5.95E-06
   ENSG00000268940 CT45A1     1.361 0.308       4.522   6.11E-06
   ENSG00000159556 ISL2       1.528 0.424       4.502   6.72E-06
   ENSG00000205456 TP53TG3D   1.474 0.388       4.407   1.05E-05
   ENSG00000100156 SLC16A8    1.473 0.387       4.381   1.18E-05
   ENSG00000182759 MAFA       1.471 0.386       4.321   1.56E-05
   ENSG00000106689 LHX2       1.472 0.387       4.284   1.84E-05
   ENSG00000148331 ASB6       2.035 0.711       4.275   1.91E-05
   ENSG00000269437 NXF2B      1.425 0.354       4.269   1.96E-05
   ENSG00000124260 MAGEA10    1.366 0.312       4.262   2.02E-05
   ENSG00000126890 CTAG2      1.346 0.297       4.210   2.56E-05
   ENSG00000184029 DSCR4      1.414 0.347       4.172   3.02E-05
   ENSG00000187730 GABRD      1.480 0.392       4.153   3.29E-05
   ENSG00000149927 DOC2A      1.600 0.470       4.127   3.68E-05
   ENSG00000075043 KCNQ2      1.428 0.356       4.126   3.69E-05
   ENSG00000110148 CCKBR      1.404 0.339       4.113   3.91E-05
   ENSG00000121905 HPCA       1.492 0.400       4.107   4.01E-05
   [69]Open in a new tab

Identification of 398 Genes with Copy Number Variation

   The parameter threshold is a fragment with an amplification or deletion
   length greater than 0.1 and p < 0.05. [70]Figure 1A shows a
   significantly amplified fragment in the COAD genome. [71]Supplement
   Table 2 records the genes that have been significantly amplified on
   each fragment. For example, CCND2 significantly amplifies on the
   12p13.32 fragment (q value = 1.38E-07). CCND3 significantly amplifies
   on the 6p21.1 fragment (q value = 0.0002804). VEGFA significantly
   amplifies on the 6p21.1 fragment (q value = 0.0002804). A total of 137
   genes are amplified. [72]Figure 1B shows fragments with significant
   deletions in the COAD genome. [73]Supplement Table 3 records the genes
   that have significant deletion on each fragment. For example, RBFOX1
   has significantly deletion on the 16p13.3 fragment (q value =
   4.02E-44). SMAD4 has significant deletion on the 18q21.2 fragment (q
   value = 9.94E-15). APC has significant deletion on the 5q22.2 fragment
   (q value = 1.48E-05). A total of 261 genes have deletion.

Figure 1.

   [74]Figure 1
   [75]Open in a new tab

   (A) Fragments with significant amplification in the COAD genome. (B)
   Fragments with significant deletion in the COAD genome. q represents
   the long arm of the chromosome, p represents the broken arm.

Identification of 486 Mutant Genes

   A total of 486 genes with significant mutation frequency are obtained
   with a threshold of p<0.05. [76]Figure 2 shows the distribution of
   synonymous mutations, missense mutations, frame insertion or deletion,
   frame movement, nonsense mutations, cleavage sites and other
   non-synonymous mutations of 50 significant genes in TCGA COAD cohort.
   Among the identified 486 genes, some genes are closely related to the
   occurrence and development of cancer, such as KRAS, TP53, APC, PIK3CA,
   and FBXW7. By integrating copy number mutant genes and mutant genes, we
   identified 839 common genes, which are involved in tumor-related
   pathways and biological functions ([77]Supplementary Figure 1).

Figure 2.

   [78]Figure 2
   [79]Open in a new tab

   Distribution of top 50 genes with the most significant P value in COAD
   patients. The column chart on the top shows the total number of
   synonymous and non-synonymous mutations in 50 genes per patient. The
   column chart on the right shows the number of samples in which 50 genes
   have mutations in all samples.

Identification of 9 Hub Genes Using Random Forest Algorithm

   Among the 1639 candidate prognostic genes, 71 genes are identified to
   have copy number variation and mutation. The importance of prognostic
   genes is sequenced using R package random survival forest. Parameters
   are nrep = 100 and nstep = 5, which represent the number of Monte Carlo
   iterations is 100 and the number of steps forward is 5 (11),
   respectively. Genes with relative importance greater than 0.3 are
   identified as the final signature. [80]Figure 3A shows the relationship
   between error rate and the number of classification trees. [81]Figure
   3B shows the importance sequencing of the first 9 genes out-of-bag.

Figure 3.

   [82]Figure 3
   [83]Open in a new tab

   (A) Relationship between error rate and number of classification trees.
   The axe means number of trees. (B) Importance sequencing of the first
   five genes out-of-bag. The axe means variable importance.

Establishing 9-Gene Signature to Divide Samples in TCGA Training Cohort

   As for the identified 9-gene signature, the importance and relative
   importance of HR, Z score, p value, and out-of-bag for the univariate
   regression of the 9 genes are shown in [84]Table 3. Then, the 9-gene
   signature is established using multivariate COX analysis. The signature
   is as follows ([85]Supplement Table 4):

Table 3.

   9- Genes Significantly Associated with the Overall Survival in the
   Training Dataset
   Ensembl Gene ID Symbol HR Z-score P value Importance Relative
   Importance
   ENSG00000170178 HOXD12 1.30 2.349091 1.88E-02 0.0096 1
   ENSG00000163481 RNF25 1.52 2.540626 1.11E-02 0.0094 0.9848
   ENSG00000139899 CBLN3 1.39 2.249482 2.45E-02 0.0088 0.9242
   ENSG00000088538 DOCK3 1.52 3.782743 1.55E-04 0.0078 0.8182
   ENSG00000044115 CTNNA1 0.73 −2.399278 1.64E-02 0.0064 0.6667
   ENSG00000152894 PTPRK 0.71 −2.371469 1.77E-02 0.0049 0.5152
   ENSG00000187726 DNAJB13 1.28 2.639895 8.29E-03 0.0046 0.4848
   ENSG00000171428 NAT1 0.70 −2.402113 1.63E-02 0.0045 0.4697
   ENSG00000163348 PYGO2 1.47 2.390166 1.68E-02 0.0032 0.3333
   [86]Open in a new tab

   Risk[6] = 0.2929859 * HOXD12 + 0.1081464 * RNF25 + 0.06077754 * CBLN3 +
   0.312247 * DOCK3 – 0.1765895 * CTNNA1 – 0.3442616 * PTPRK + 0.1121588 *
   DNAJB13 – 0.3093265 * NAT1 + 0.2425542 * PYGO2

   The scoring formula for each sample is the addition of the above gene
   expression value * ordinal number. Then the sample score median
   −0.1002146 is selected as cutoff to divide samples into high-risk group
   and low-risk group. [87]Figure 4 shows the classification effect in the
   TCGA training set. In [88]Figure 4A, 113 patients are divided into the
   low-risk group and 113 patients are divided into the high-risk group.
   There are significant differences between the two groups. log-rank
   p=1.288372e-05. [89]Figure 4B shows the ROC curve, where 5-year AUC is
   0.76. [90]Figure 4C shows that as the patient’s risk score increases,
   the patient’s survival time decreases. According to the expression
   changes of 9-gene signature, HOXD12, RNF25, CBLN3, DOCK3, DNAJB13, and
   PYGO2 are identified as risk factors, with high expression associated
   with high risk. CTNNA1, PTPRK, and NAT1 are identified as protective
   factors, with high expression associated with low risk.

Figure 4.

   [91]Figure 4
   [92]Open in a new tab

   (A) KM survival curve of 9-gene signature distribution in TCGA training
   set. (B) ROC curve and AUC of 9-gene signature classification. (C) Risk
   score, survival time, survival status and expression of 9 genes in TCGA
   training set.

Robustness Detection of 9-Gene Signature in TCGA Testing Cohort

   In order to determine the robustness of the model, the same model and
   the same cutoff are used as the TCGA training set, which is verified in
   the TCGA testing set as well. In [93]Figure 5A, 101 patients are
   divided into the low-risk group and 126 patients are divided into the
   high-risk group. There are significant differences between the two
   groups (log-rank p= 0.006360314). [94]Figure 5B shows the 5-year ROC
   was 0.64. [95]Figure 5C produces results similar to those of the TCGA
   training set. The survival time of death samples decreases
   significantly as the risk score increases.

Figure 5.

   [96]Figure 5
   [97]Open in a new tab

   (A) KM survival curve of 9-gene signature distribution in TCGA testing
   set. (B) ROC curve and AUC of 9-gene signature classification. (C) Risk
   score, survival time, survival status and expression of 9 genes in TCGA
   training set.

Validation of 9-Gene Signature in External Cohorts

   External independent dataset [98]GSE17538, [99]GSE39582 and
   [100]GSE24551 were used to determine the stable performance in
   predicting the prognosis. In [101]Figure 6A, 79 patients are divided
   into the low-risk group and 121 patients are divided into the high-risk
   group. There are significant differences between the two groups
   (log-rank p= 0.027). [102]Figure 6B shows the ROC curve with an average
   5-year AUC is 0.60. [103]Figure 6C shows that as the patient’s risk
   score increases, the patient’s survival time decreases. In the
   [104]GSE39582 and [105]GSE24551 cohorts, we found that the 9-gene
   signature can still significantly divide the two data sets into high
   and low-risk groups significantly. Among them, the 1-year AUC of
   [106]GSE39582 is 0.68, and the 5-year AUC of [107]GSE24551 is 0.63. Our
   research confirmed that our model has stable prediction performance and
   robustness on different platforms, and the external verification of
   multiple data sets minimizes the statistical deviation
   ([108]Supplementary Figure 2).

Figure 6.

   [109]Figure 6
   [110]Open in a new tab

   (A) KM survival curve of 9-gene signature distribution in
   [111]GSE17538. (B) ROC curve and AUC of 9-gene signature
   classification. (C) Risk score, survival time, survival status and
   expression of 9 genes in [112]GSE17538.

Univariate and Multivariate Cox Regression Analysis

   In order to identify the prognostic independence of 9-gene signature,
   univariate and multivariate COX regression were analyzed in TCGA
   training and testing cohort ([113]Table 4). It showed that high-risk
   score (HR= 3.05, 95% CI = 1.36–6.84, p= 0.007), age >50 (HR= 1.04, 95%
   CI = 1.007–1.07, p= 0.016), AJCC Stage III/IV (HR= 37.58, 95% CI =
   1.42–990.93, p= 0.030) were independent risk factors in multivariable
   cox analysis.

Table 4.

   Univariate and Multivariate Cox Regression Analyses
   Variables Univariate Analysis Multivariable Analysis
   HR 95% CI of HR P value HR 95% CI of HR P value
   TCGA training datasets
   9-gene risk score
   Low-risk score 1 (Ref) 1 (Ref)
   High-risk score 3.97 2.04–7.73 4.9E-05 3.05 1.36–6.84 0.007
   Age(≤50/>50) 1.02 0.99–1.05 6.0E-02 1.04 1.00–1.07 0.016
   Gender female 1 (Ref) 1 (Ref)
   Gender male 1.05 0.59–1.86 0.87 0.66 0.31–1.37 0.265
   Lymphatic Invasion NO 1 (Ref) 1 (Ref)
   Lymphatic Invasion YES 2.58 1.35–4.91 0.00 1.94 0.66–5.68 0.227
   Pathologic T 1/ T 2 1 (Ref) 1 (Ref)
   Pathologic T 3 1.55 0.54–4.43 0.412 0.15 0.01–1.34 0.091
   Pathologic T 4 5.15 1.66–16 0.005 0.43 0.04–4.13 0.461
   Pathologic N 0 1 (Ref) 1 (Ref)
   Pathologic N 1 1.38 0.63–2.96 0.413 0.24 0.04–1.56 0.136
   Pathologic N 2 3.00 1.58–5.66 0.001 0.49 0.07–3.21 0.456
   Pathologic M 0 1 (Ref) 1 (Ref)
   Pathologic M 1 4.52 2.33–8.73 7.42E-06 2.28 0.89–5.79 0.080
   Pathologic M X 1.82 0.77–4.26 0.168 1.56 0.50–4.84 0.440
   AJCC Stage Ⅰ 1 (Ref) 1 (Ref)
   AJCC Stage Ⅱ 1.94 0.43–8.65 0.3854 13.42 0.68–262.17 0.087
   AJCC Stage III/Ⅳ 4.28 1.01–17.98 0.047 37.58 1.42–990.93 0.030
   [114]Open in a new tab

   We performed KM curve of different stages in the training cohort, and
   found that our model can significantly divide Stage II and Stage III
   patients into high- and low-risk groups, but no significance was found
   in Stage II and Stage IV because of a small number of samples
   ([115]Supplementary Figure 3)

Analysis of Pathways Enriched in High-Risk and Low-Risk Groups Using GSEA

   In TCGA training data set, GSEA is used to analyze the pathways
   enriched in high-risk and low-risk groups ([116]Table 5). Due to the
   large number of KEGG pathways included in the analysis, and the FDR q
   values are not significant, so here we choose the first four results
   with the most significant p value to plot. As shown in [117]Figure 7,
   KEGG ERBB SIGNALING PATHWAY, KEGG COLORECTAL CANCER, KEGG p53 SIGNALING
   PATHWAY, and KEGG TGF BETA SIGNALING PATHWAY, which are significantly
   enriched in both high-risk and low-risk groups, are closely related to
   the occurrence, development and metastasis of COAD ([118]Figure 7).

Table 5.

   GSEA Analysis of 9-Gene Signature
   Name SIZE ES NES NOM p-val FDR q-val
   KEGG_O_GLYCAN_BIOSYNTHESIS 26 −0.650 −1.907 0.000 0.355
   KEGG_UBIQUITIN_MEDIATED_PROTEOLYSIS 133 −0.502 −1.882 0.006 0.245
   KEGG_ERBB_SIGNALING_PATHWAY 87 −0.455 −1.772 0.008 0.494
   KEGG_ENDOMETRIAL_CANCER 52 −0.493 −1.716 0.008 0.586
   KEGG_OOCYTE_MEIOSIS 110 −0.455 −1.716 0.008 0.471
   KEGG_THYROID_CANCER 29 −0.511 −1.709 0.006 0.415
   KEGG_COLORECTAL_CANCER 62 −0.470 −1.701 0.012 0.377
   KEGG_VALINE_LEUCINE_AND_ISOLEUCINE_DEGRADATION 44 −0.584 −1.690 0.056
   0.356
   KEGG_PROSTATE_CANCER 89 −0.445 −1.689 0.019 0.321
   KEGG_APOPTOSIS 86 −0.438 −1.680 0.021 0.312
   KEGG_INSULIN_SIGNALING_PATHWAY 136 −0.393 −1.675 0.010 0.292
   KEGG_BASAL_TRANSCRIPTION_FACTORS 34 −0.546 −1.657 0.025 0.306
   KEGG_RENAL_CELL_CARCINOMA 70 −0.448 −1.602 0.037 0.413
   KEGG_ADHERENS_JUNCTION 73 −0.455 −1.596 0.046 0.400
   KEGG_DORSO_VENTRAL_AXIS_FORMATION 24 −0.511 −1.585 0.035 0.398
   KEGG_NEUROTROPHIN_SIGNALING_PATHWAY 125 −0.399 −1.585 0.040 0.374
   KEGG_RENIN_ANGIOTENSIN_SYSTEM 17 −0.527 −1.583 0.035 0.356
   KEGG_ENDOCYTOSIS 177 −0.379 −1.570 0.042 0.365
   KEGG_PEROXISOME 78 −0.470 −1.568 0.063 0.351
   KEGG_AMINO_SUGAR_AND_NUCLEOTIDE_SUGAR_METABOLISM 44 −0.451 −1.550 0.054
   0.373
   KEGG_PROPANOATE_METABOLISM 32 −0.525 −1.544 0.064 0.370
   KEGG_LONG_TERM_POTENTIATION 68 −0.402 −1.534 0.023 0.375
   KEGG_FATTY_ACID_METABOLISM 41 −0.484 −1.533 0.100 0.360
   KEGG_STARCH_AND_SUCROSE_METABOLISM 50 −0.441 −1.517 0.056 0.380
   KEGG_CHRONIC_MYELOID_LEUKEMIA 73 −0.417 −1.507 0.086 0.387
   KEGG_P53_SIGNALING_PATHWAY 66 −0.420 −1.502 0.056 0.383
   KEGG_BIOSYNTHESIS_OF_UNSATURATED_FATTY_ACIDS 20 −0.506 −1.496 0.069
   0.383
   KEGG_TGF_BETA_SIGNALING_PATHWAY 85 −0.399 −1.493 0.059 0.374
   KEGG_NICOTINATE_AND_NICOTINAMIDE_METABOLISM 22 −0.423 −1.467 0.0391
   0.4133
   [119]Open in a new tab

Figure 7.

   [120]Figure 7
   [121]Open in a new tab

   Pathways enriched in high-risk and low-risk groups obtained by 9-gene
   signature.

Comparison with Other Prognostic Signature

   In order to validate the superiority of our model, we found two
   published prognostic signature of colorectal cancer to make comparison,
   such as Zuo[122]^21 and Kim.[123]^22 To make it comparable, we used the
   corresponding genes in these two signatures. The risk score of each
   sample in TCGA was calculated using the same method, and the ROC and KM
   curve of each signature was evaluated. According to the median risk
   score value, the sample was divided into Risk-H and Risk-L groups, and
   the difference of prognosis between the two groups was calculated
   ([124]Figure 8A and B). We further analyzed the restricted mean
   survival curves of these signatures, we can see that our model has the
   highest C-index (0.72), which has more advantages in the long-term
   survival prediction ([125]Figure 8C).

Figure 8.

   [126]Figure 8
   [127]Open in a new tab

   Comparative analysis of 9-gene signature and others. (A) The AUC and KM
   curves of Zuo’s model. (B) The AUC and KM curves of Kim’s model. (C)
   RMS curves and C-index of three signatures.

Data Analysis Flowchart

   In order to make the reader easier to understand, we have drawn a
   flowchart for the data analysis part of the paper ([128]Figure 9).

Figure 9.

   [129]Figure 9
   [130]Open in a new tab

   Data analysis flowchart.

Expression Levels of 9 Genes

   Based on the bioinformatics analysis results, the expression of 9 genes
   was verified in twenty normal tissues and colon cancer tissues. The
   results in [131]Figure 10 showed that the mRNA expression of HOXD12,
   RNF25, CBLN3, DOCK3, DNAJB13, and PYGO2 were increased in colon cancer
   tissues and the mRNA expression of CTNNA1, PTPRK, NAT1were decreased in
   colon cancer tissues (p<0.05). It was consistent with that analyzed
   using bioinformatic analysis.

Figure 10.

   [132]Figure 10
   [133]Open in a new tab

   Expression levels of 9 genes quantified using qPCR in twenty paired
   normal tissues and colon cancer tissues. *P<0.05.

Discussion

   Colon adenocarcinoma (COAD) is a highly heterogeneous disease. Its
   occurrence and development process changes with the change of genetic
   and epigenetic factors.[134]^23 Therefore, COAD patients with the same
   pathological features may have different prognostic and therapeutic
   response to certain drugs. Screening prognostic molecular markers that
   fully reflect the biological characteristics of COAD is greatly
   significant for individualized treatment. Several recent studies have
   shown that genomics, epigenomics and transcriptomics play a vital role
   in tumor development and progression, helping predict the prognosis of
   patients.[135]^24^,[136]^25 Therefore, multi-omics studies can help
   determine tumor heterogeneity and screen therapeutic targets and tumor
   biomarkers, which have greater advantages.[137]^26 The study screens
   and identifies 9-gene signature associated with COAD prognosis by
   analyzing multi-omics data, including transcriptome data, copying
   number variation data and mutation data. The 9-gene signature
   established by screening has strong robustness and stable prediction
   performance in both internal verification set and external verification
   set. It enables stable prediction performance in data sets of different
   platforms. In addition, the clinical information in TCGA and
   [138]GSE19234 are analyzed systematically by COX regression. The
   results show that 9-gene signature is an independent prognostic factor,
   which maintains stable clinical independence under the influence of
   many clinical factors, including training set, TCGA internal
   verification set and [139]GSE19234 external verification set.

   Several studies have shown that multi-omics have been used in clinical
   prediction of prognosis and therapeutic responses. To be more specific,
   Oncotype DX used for breast cancer recurrence score contains 21
   genes.[140]^27^,[141]^28 Another Mammaprint™, which contains 70 genes,
   is used to assess the risk of metastasis in breast
   cancer.[142]^29^,[143]^30 In the study of COAD, ColoPrint, which
   contains 18 genes, is used to predict the risk of prognosis and
   recurrence in patients.[144]^31^–[145]^33 These results show that great
   potential of multi-omics screening in clinical application through gene
   expression profile. In this study, the AUC of 9-gene signature screened
   by multi-omics in the training set and validation set for five years is
   more than 0.64, which is more effective in predicting the prognosis of
   patients. Furthermore, the multi-omics in this study contains only 9
   genes, making it easier to apply clinically.

   In 9-gene signature identified and verified by multi-omics data,
   HOXD12, RNF25, CBLN3, DOCK3, DNAJB13, and PYGO2 are risk factors, and
   CTNNA1, PTPRK, NAT1 are protective factors. Consistently, our results
   based on clinical samples showed the mRNA expression of HOXD12, RNF25,
   CBLN3, DOCK3, DNAJB13, and PYGO2 were increased and the mRNA expression
   of CTNNA1, PTPRK, NAT1were decreased in colon cancer tissues. These
   results suggested these 9-genes may play an important role in the
   progression of COAD. There have been reported that HOXD12 has high
   expression in progesterone receptor positive breast cancer
   tissues.[146]^34 RNF25 is an E3 ubiquitin ligase. Studies have
   confirmed that RNF25 promotes gefitinib resistance by mediating
   crosstalk between the mediated NF-κB and ERK pathways.[147]^35 DOCK3 is
   shown to be involved in regulating tumor cell
   metastasis.[148]^36^,[149]^37 PYG02 is closely related to the prognosis
   of glioma, esophageal squamous cell carcinoma, hepatocellular carcinoma
   and other tumors.[150]^38^–[151]^40 CTNNA1 is closely related to the
   prognosis of invasive breast cancer and renal cell
   carcinoma.[152]^41^,[153]^42 PTPRK is a marker of breast
   cancer.[154]^43 NAT1 is significantly correlated with the increase of
   the overall survival time of breast cancer patients.[155]^44 In brief,
   these genes are closely related to the prognosis of tumors. In
   addition, CBLN3 and DNAJB13 have not been reported to be associated
   with tumors. They are found to be prognostic markers of COAD for the
   first time in this study. In addition, GSEA enrichment analysis results
   also show that 9-gene signature is closely related to the progression
   and metastasis of COAD. Examples include KEGG ERBB SIGNALING PATHWAY,
   KEGG COLORECTAL CANCER, KEGG p53 SIGNALING PATHWAY, and KEGG TGF BETA
   SIGNALING PATHWAY. The above results show that the model has potential
   clinical application value and can provide potential drug targets for
   patients with COAD.

   Although this study screens and verifies the potential prognostic
   markers of COAD based on large sample multi-omics data, this study
   still has some limitations. The conclusions of this study are mainly
   based on bioinformatics analysis, which still need to be further
   verified by experiments in vitro and in vivo. In addition, all the
   samples involved in this study are retrospective studies. The clinical
   application still requires comprehensive and in-depth research.
   Finally, because our study based on public cohorts, there are also have
   limitations such as the therapeutic effects of these patients, which
   are not found in the original research.

   To sum up, our research results indicate that the 9 gene prognostic
   signature is a reliable tool for predicting the OS of COAD patients.

Funding Statement

   This work was supported by 345 Talent Project of Shengjing hospital of
   China Medical University.

Disclosure

   The authors report no conflicts of interest for this work.

References