Abstract

   The prognosis of colon adenocarcinoma (COAD) remains poor. However, the
   specific and sensitive biomarkers for diagnosis and prognosis of COAD
   are absent. Transcription factors (TFs) are involved in many biological
   processes in cells. As the molecule of the signal pathway of the
   terminal effectors, TFs play important roles in tumorigenesis and
   development. A growing body of research suggests that aberrant TFs
   contribute to the development of COAD, as well as to its
   clinicopathological features and prognosis. In consequence, a few
   studies have investigated the relationship between the TF-related risk
   model and the prognosis of COAD. Therefore, in this article, we hope to
   develop a prognostic risk model based on TFs to predict the prognosis
   of patients with COAD. The mRNA transcription data and corresponding
   clinical data were downloaded from TCGA and GEO. Then, 141
   differentially expressed genes, validated by the GEPIA2 database, were
   identified by differential expression analysis between normal and tumor
   samples. Univariate, multivariate and Lasso Cox regression analysis
   were performed to identify seven prognostic genes (E2F3, ETS2, HLF,
   HSF4, KLF4, MEIS2, and TCF7L1). The Kaplan–Meier curve and the receiver
   operating characteristic curve (ROC, 1-year AUC: 0.723, 3-year AUC:
   0.775, 5-year AUC: 0.786) showed that our model could be used to
   predict the prognosis of patients with COAD. Multivariate Cox analysis
   also reported that the risk model is an independent prognostic factor
   of COAD. The external cohort ([32]GSE17536 and [33]GSE39582) was used
   to validate our risk model, which indicated that our risk model may be
   a reliable predictive model for COAD patients. Finally, based on the
   model and the clinicopathological factors, we constructed a nomogram
   with a C-index of 0.802. In conclusion, we emphasize the clinical
   significance of TFs in COAD and construct a prognostic model of TFs,
   which could provide a novel and reliable model for the prognosis of
   COAD.

   Keywords: transcription factors, colon adenocarcinoma, risk score,
   bioinformatics, nomogram

Introduction

   Colon cancer is one of the most common malignant tumors and the fifth
   leading cause of cancer-related death worldwide ([34]Bray et al.,
   2018). Colon adenocarcinoma (COAD) is the most common pathological type
   of colon cancer ([35]Fleming et al., 2012). Currently, the AJCC
   (American Joint Committee on Cancer) TNM staging system, together with
   age and gender, are the main prognostic indicators for COAD patients
   ([36]Kim et al., 2015; [37]Bruni et al., 2020). However, the
   significant differences in survival outcomes of COAD patients with the
   same clinicopathologic characteristics still exist, which signifies
   that a prognostic model based solely on clinicopathologic
   characteristics is of limited value ([38]Nagtegaal et al., 2011;
   [39]Bruni et al., 2020; [40]Wang T. et al., 2020). Therefore, searching
   for highly specific and sensitive methods to predict the prognosis of
   COAD patients precisely is the key point of developing individualized
   treatment strategies for COAD patients.

   Transcription factors (TFs) are vital cellular proteins for
   transcription of genes in human cells, which bind to target genes by
   recognizing promoters or enhancers of DNA sequences and affect many
   cellular functions, such as cell cycle and cell metabolism
   ([41]Vaquerizas et al., 2009; [42]Fitzgerald et al., 2014; [43]Ehsani
   et al., 2016). As the terminal effectors of signaling pathways in
   cells, normal TFs play important roles in overall gene expression
   profiles ([44]Xu H. et al., 2021). Recent studies have revealed the
   significance of deregulation TFs in COAD development and demonstrated
   that TFs associate closely with clinicopathological features and
   prognosis of COAD ([45]Zhou et al., 2020; [46]Xu H. et al., 2021).
   Meanwhile, several significant TFs, such as nuclear factor κB (NF-κB)
   and cAMP-response element-binding protein (CREB), have been found that
   aberrantly express in COAD and promote the development of COAD
   ([47]Rayet & Gelinas, 1999; [48]Hui et al., 2014). Hence, TFs may be
   important biomarkers for predicting prognosis, as well as potential
   targets for the treatment of COAD patients. However, recent studies
   mainly focused on the predicting value of TF-related genes; the present
   study constructed an original risk score model based on TF-related
   genes with remarkable predicting efficiency in COAD patients.

   Though the efficacy of colon cancer treatment has improved in recent
   years with the advance in surgical methods and follow-up treatments,
   the mortality in patients with tumor infiltration in situ and distant
   metastasis remains high ([49]Xu et al., 2020). In addition,
   overtreatment in patients may lead to adverse reactions related to
   chemotherapy and immunotherapy ([50]Meyerhardt & Mayer, 2005).
   Therefore, there is an urgent need to build superior or comprehensive
   models to predict the overall survival (OS) of COAD patients to judge
   who are high-risk or low-risk patients so that the toxic harm
   associated with overtreatment is reduced and novel measures for
   treatment will be provided. In the present study, a risk model based on
   TF-related genes for COAD patients was constructed using data
   downloaded from The Cancer Genome Atlas (TCGA) database and validated
   in the Gene Expression Omnibus (GEO) database. By using multivariable
   Cox regression and survival analysis, we demonstrated that our risk
   model can be regarded as an independent prognostic factor of COAD. In
   general, our findings indicated that the TF-related risk model is of
   great predicting value in COAD patients and provides novel potential
   targets for COAD treatment.

Methods

Data Collection

   The mRNA expression data (Workflow Type: HT Seq-FPKM) and clinical
   information pertaining to survival time for COAD patients downloaded
   from TCGA website ([51]https://portal.gdc.cancer.gov/repository) were
   used as the training set. 437 samples (39 normal tissues, 398 COAD
   tissues), and 384 cases of COAD patients were collected. As well, data
   from the GEO ([52]http://www.ncbi.nlm.nih.gov/geo/) database were used
   as the validation set. There are 177 and 579 COAD patients in the
   [53]GSE17536 and [54]GSE39582, respectively. The GEO samples were
   analyzed by Affymetrix Human Genome U133 Plus 2.0 Array platform. All
   cases from TCGA or GEO databases that miss the information were
   excluded from the analysis. The clinical characteristics of the
   patients (age, gender, stage, T stage, N stage, and M stage) were
   recorded. Unknown clinical characteristics were deleted. As for cases
   from TCGA, we removed one patient (TCGA-AZ-4323) whose risk score was
   significantly abnormal in our follow-up risk assessment.

Identification of Differentially Expressed Genes and Prognostic Genes

   Complete data of TF-related gene symbols were download from the Human
   Transcription Factor database
   ([55]http://bioinfo.life.hust.edu.cn/HumanTFDB#!/). Differentially
   expressed genes were identified by the linear models for microarray
   data (limma) R package based on TCGA–COAD data. The Gene Expression
   Profiling Interactive Analysis (GEPIA2,
   [56]http://gepia2.cancer-pku.cn/#general) database, which can analyze
   the difference of gene expression between tumor and normal tissue
   ([57]Tang et al., 2017), was used to further screen for differentially
   expressed transcription factor-related genes. Cutoff values were set at
   a p value < 0.05 and |log2 fold change| >1. Differentially expressed
   genes related to the OS of patients were analyzed by univariable Cox
   analysis with “Survival” R package. A value of p < 0.05 was considered
   to have a statistically significant difference.

Construction of Risk Score Model by Multivariate Cox and Lasso Regression

   Multivariate Cox regression analysis with the “survival” R package and
   Lasso analysis with the “glmnet” R package were used to determine which
   genes can be used as predictive genes. The TF-related risk model was
   constructed by the result of multivariate Cox regression analysis, and
   each COAD patient risk score was calculated by the risk model to be
   divided into two groups (high- and low-risk subgroups) based on the
   median value. The risk score formula was as follows:
   [MATH:
   <mrow><mo> </mo><mtext>Risk Score </mtext><mo>=</mo><mtext> </mtext><ms
   tyle
   displaystyle="true"><msubsup><mo>∑</mo><mi>i</mi><mn>7</mn></msubsup><m
   row><mi mathvariant="italic">Xi</mi><mo>∗</mo><mi
   mathvariant="italic">Yi</mi></mrow></mstyle><mo> </mo></mrow> :MATH]
   (X: coefficients, Y: gene expression level). Receiver operating
   characteristic (ROC) curves and Kaplan–Meier curves (K-M curves) were
   used to evaluate the ability of this risk model. Meanwhile, univariate
   and multivariate Cox regression analyses were also used to evaluate the
   prognostic efficiency of different clinicopathological features and our
   risk model.

Alteration of Seven Genes in the Model and Protein–Protein Interaction
Network

   The cBioPortal dataset ([58]https://www.cbioportal.org/), which
   contains genomic data from a variety of tumors, was used to study the
   genetic variability of transcription factor-related genes in our model.
   The STRING database ([59]https://string-db.org/) is set for searching
   online for known protein interoperability relationships. We used this
   database to analyze and predict the functional relationship among TFs,
   and the cytoHubba plug-in in the software of Cytoscape (version3.8.1)
   was used to show the correlation between TFs in our risk model and
   other TFs.

The Construction of a Nomogram to Estimate the Clinical Outcome of Colon
Adenocarcinoma Patients

   We used the “rms” R package to construct a nomogram. Subsequently,
   calibration curves were used to test the association between the
   predicted outcome and the actual situation in 1, 3, and 5 years. The
   [60]GSE39582 was used as the validation dataset for the nomogram.

Gene Set Enrichment Analysis

   Gene set enrichment analysis (GSEA), as a computational method, can be
   used to detect significantly enriched and depleted group of genes
   ([61]Subramanian et al., 2005). In this study, GSEA was regarded as an
   approach to explore potential molecular mechanisms underlying the
   expressions of TFs in our risk model. The group was divided into
   high-expression and low-expression groups according to the median value
   of each transcription factor expression level. Necessary files were
   finished and uploaded into the GSEA 4.1 software. The gene set database
   was “c2. cp.kegg.v7.2. symbols.gmt [Curated]” and “c5. go.v7.4.
   symbols.gmt [Gene ontology].” The number of permutations were conducted
   1,000 times in the analysis. The phenotype labels were high and low.
   Normal p-value <0.05 were enriched.

Correlation of the Genes and Risk Score With Clinicopathologic Features and
Immune Cells

   Clinical association analysis was used to assess the relationship
   between risk model and clinicopathologic features by using the
   “beeswarm” R package. Tumor immune microenvironment is crucial for the
   antitumor immunity in cancer, so we analyzed the relationship between
   our risk model and immune cell infiltration by getting the matrix of
   immune cells from the Tumor Immune Estimation Resource (TIMER)
   ([62]Bhattacharya et al., 2018), which is a database that provides the
   systematic analysis of immune infiltrates in cancer. In this database,
   users can predict the abundance and proportion of six immune cell
   subsets (B cells, CD4^+ T cells, CD8^+ T cells, dendritic cells,
   macrophages, and neutrophils) in tumor samples. The relationship
   between the risk model and immune cell infiltration was constructed by
   the Pearson’s correlation test in the R software.

Statistical Analysis

   R version 4.0.5 and Perl version 5.28 were used to perform statistical
   analysis. Excel office 2019 was used to organize data from TCGA and GEO
   database. A value of p < 0.05 was regarded as significant.

Results

Overall Design of the Study

   The flow chart for this study is shown in [63]Figure 1. Gene expression
   data and clinical data of COAD patients were downloaded from TCGA
   database. Some of the clinical information of 384 patients from TCGA
   are shown in [64]Table 1. According to the Human Transcription Factor
   database and differential expression analysis, 344 differentially
   expressed TF-related genes in patients with COAD were screened. Then,
   the GEPIA2 database was used to validate 344 differentially expressed
   genes, and 141 differentially expressed genes were screened finally.
   Lasso and multivariate Cox regression analysis were used to construct a
   prognostic risk model for patients with COAD. Then the K-M and ROC
   curves were used to assess the seven TF-related risk models. Two data
   sets of the GEO database were used as external cohort to validate the
   model. Finally, the nomogram of COAD was constructed by using TCGA
   data. Calibration and C-index were used to assess the nomogram.

FIGURE 1.

   FIGURE 1
   [65]Open in a new tab

   The workflow to construct the transcription factor (TF)-related risk
   model in COAD patients. TCGA, The Cancer Genome Atlas; COAD, colon
   adenocarcinoma; DEG, differentially expressed gene; GEPIA2, Gene
   Expression Profiling Interactive Analysis 2.

TABLE 1.

   The clinical information of colon adenocarcinoma (COAD) patients from
   The Cancer Genome Atlas (TCGA).
           Patient n = 384
   Age    ≤65           159
          >65           225
   Gender Female        180
          Male          204
   Stage  Stages I–II   216
          Stages III–IV 157
          Unknown       11
   T      T1–2           77
          T3–4          306
          Tis           1
   M      M0            285
          M1            54
          Mx            39
          Unknown       6
   N      N0            230
          N1–2          154
   [66]Open in a new tab

   Note. T, T stage; N, N stage; M, M stage; Stage, TNM stage.

Differential Gene Analysis

   Three hundred forty-four differentially expressed TF-related genes,
   which were analyzed from TCGA database, are shown in [67]Figure 2A.
   Then, 141 differentially expressed TF-related genes were screened from
   344 differentially expressed TF-related genes by using the GEPIA2
   database, including 70 downregulated genes and 71 upregulated genes
   ([68]Figure 2B). The expression information of 141 genes in COAD were
   also shown by a heatmap ([69]Figure 2C) between normal and tumor
   tissues, which was more obvious. Gene Ontology (GO) analysis showed 141
   genes were mainly concentrated on pattern specification process and
   embryonic organ development, and other significantly differentially
   expressed gene GO terms were exhibited in a circle graph ([70]Figure
   2D). In the Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway
   enrichment analysis, the result is shown by a bar plot in [71]Figure
   2E. These results suggested that the genes screened in this study are
   related to tumorigenesis and the development pathway.

FIGURE 2.

   [72]FIGURE 2
   [73]Open in a new tab

   The results of differential gene analysis. (A) The volcano plot of
   differentially expressed TF-related genes based on TCGA. The sky blue
   points are the downregulated genes, the orange points are the
   upregulated genes, representing p < 0.05, |log[2]FC| > 1. (B) The Venn
   plot to show the same differentially expressed TF-related genes between
   TCGA and GEPIA2 database. (C) The heatmap of differentially expressed
   TF-related genes. The vertical axis refers to genes, the horizontal
   axis refers to differences in gene expression between tissues, the
   orange means high expression, and the sky blue means low expression.
   (D) Gene Ontology (GO) circle graph of the top 10 GO terms with the
   most enriched genes. (E) Bar graph of the top 24 Kyoto Encyclopedia of
   Genes and Genomes (KEGG) pathway with the most enriched genes; the
   vertical axis refers to names of pathways, and the horizontal axis
   refers to the number of genes.

Construction of Risk Score Model by Univariate, Multivariate, and Lasso Cox
Regression Analysis

   In TCGA set, 141 TF-related genes were considered in the univariable
   Cox regression analysis with a p-value <0.05 as the threshold to
   distinguish which genes were related to the prognosis of COAD patients
   ([74]Figure 3A). There were 11 genes associated with the prognosis of
   COAD. Then, the Lasso regression ([75]Supplementary Figures S1A,B) and
   multivariable Cox analysis ([76]Table 2) were used to do further
   identification of TF-related genes associated with the prognosis of
   COAD patients, and the coefficient of each TF-related gene was
   calculated in order to obtain the risk score. After analyzing, seven
   genes were selected from 11 genes to construct a risk score model. The
   minimum Akaike information criterion (AIC) of the risk model was
   683.96. Furthermore, the median value of the risk score from training
   and validation set was used as the cutoff value to divide patients into
   low-risk and high-risk groups.

FIGURE 3.

   [77]FIGURE 3
   [78]Open in a new tab

   Construction of prognostic model for COAD. (A) Hazard ratio of
   univariate Cox analysis for DEGs. (B) Survival analysis to verify the
   prognostic model. (C) ROC curve to evaluate the predictive efficacy of
   the risk model. (D) Distribution of risk scores of each COAD patient.
   (E) Correlation between survival time and survival status of each
   patient. (F) The expression pattern of seven TF-related genes. DEGs,
   differentially expressed genes. ROC, receiver operating characteristic.

TABLE 2.

   The results of multivariate Cox analysis in COAD.
   ID     Coefficients    HR      HR.95L    HR.95H   p-Value
   E2F3     0.735185   2.085867   1.1849   3.671907 0.010836
   ETS2     −0.29493   0.744583  0.504,494 1.098931 0.137551
   HLF      −2.29211   0.101,053  0.02909  0.351037 0.000309
   HSF4     0.341165   1.406,585 0.970863  2.03786  0.071289
   KLF4     −0.21757   0.804475  0.608,272 1.063964 0.127,187
   MEIS2    0.91836    2.505179  1.296842  4.839387 0.006262
   TCF7L1   0.551359   1.735609  0.954879  3.154683 0.070526
   [79]Open in a new tab

   Note. HR, hazard ratio.

Risk Score Model

   Seven genes (E2F3, ETS2, HLF, HSF4, KLF4, MEIS2, and TCF7L1) were taken
   into our risk model. Based on the coefficient of each gene from
   multivariable Cox regression analysis, the risk score can be calculated
   by the formula shown in the method with the coefficients in [80]Table
   2.

   The K-M curves show that the high-risk group performed with
   significantly poorer prognostic outcomes ([81]Figure 3B) than the
   low-risk group. Multi-indicator ROC curves were plotted, and the AUCs
   of the risk model were as follows: 1-year AUC: 0.723, 3-year AUC:
   0.775, 5-year AUC: 0.797 ([82]Figure 3C). The risk score, survival
   time, and gene expression of the seven genes in every COAD patient are
   vividly shown in [83]Figures 3D–F.

   Univariable and multivariable Cox analyses were used to identify
   whether risk score and clinicopathologic features (age, gender, AJCC
   stage, T stage, N stage, and M stage) could be the independent
   prognostic indicator. The results of univariable Cox analysis show that
   age, AJCC stage, T stage, N stage, M stage, and risk score were
   significantly correlated with OS (p < 0.05; [84]Figure 4A). Meanwhile,
   the results of multivariable Cox analysis show that age and risk score
   were still significantly associated with OS (p < 0.05; [85]Figure 4B).
   These results suggested that the risk score model could be the
   independent prognostic indicator for COAD.

FIGURE 4.

   [86]FIGURE 4
   [87]Open in a new tab

   Analysis of clinical independence for riskScore. (A) Univariable Cox
   regression analysis for clinical characters and risk score. (B)
   Multivariable Cox regression analysis for clinical characters and risk
   score. T, T stage; N, N stage; M, M stage; stage, TNM stage; riskScore,
   risk score model.

   To better understand COAD patients who can get more benefit from our
   risk score model, we divided these patients into several subgroups: age
   ≤65 and age >65, female and male, Stages I–II and Stages III–IV, T1–T2
   and T3–T4, N0 and N1–2, and M0 and M1. In our K-M curves, age ≤65,
   male, Stages III–IV, T3–T4, N1–2, and M0 seem more suitable for our
   risk score model ([88]Supplementary Figure S2A–L). These results
   suggest that our risk score model was strongly related to tumor
   progression.

Validation of Transcription Factor-Related Risk Score Model in Gene
Expression Omnibus Datasets

   In [89]GSE17536, survival data from 177 patients were used for external
   validation of our risk model. The risk score of each patient was
   calculated using the coefficients from TCGA data, and the patients were
   divided into high-risk group and low-risk group according to the median
   of the risk score. K-M curve analysis demonstrated better survival
   outcome in the low-risk group ([90]Figure 5A). Similarly, in
   [91]GSE39582, 579 patients were used as external validation of our risk
   model. The results also showed significant differences in OS among
   patients at different risk groups in this dataset ([92]Figure 5B).

FIGURE 5.

   [93]FIGURE 5
   [94]Open in a new tab

   Validation of the risk model in the GEO dataset and nomogram for
   predicting survival rate in COAD patients. (A) [95]GSE17536 and (B)
   [96]GSE39582 dataset. (C) The nomogram for predicting the 1, 3, and
   5-year survival rate by age, stage, riskScore. (D–F) The 1, 3, and
   5-year calibration curves of TCGA dataset. (G–I) The 1, 3, and 5-year
   calibration curves of Gene Expression Omnibus (GEO) dataset.

Genetic Alternation of Seven Transcription Factor-Related Genes and
Protein–Protein Interaction Network

   The cBioPortal database was used to analyze the mutations of seven
   genes, which showed that E2F3, ETS2, HLF, HSF4, KLF4, MEIS2, and TCF7L1
   were altered in 8, 8, 6, 6, 5, 9, and 6% of 524 colorectal
   adenocarcinoma patients separately ([97]Supplementary Figure S3A). As
   shown in [98]Supplementary Figure S3B, genes were altered in 34.94% of
   332 COAD patients and 36.03% of 136 rectal adenocarcinoma patients. The
   change in the expression of mRNA in cancer was the main change type for
   these genes. The PPI network based on the STRING database and Cytoscape
   software showed that KLF4, MEIS2, and TCF7L1 interacted more with other
   transcription factors ([99]Supplementary Figure S3C). In total, our
   results suggest that the establishment of a transcriptional-related
   gene model makes up for the fact that a single transcriptional gene is
   not a good indicator of the condition of a patient.

Construction and Validation of Nomogram Model

   Age, TNM stage, and risk score of TCGA dataset were taken into
   consideration to establish the nomogram to forecast the survival
   probabilities of 1, 3, and 5-year overall survival time ([100]Figure
   5C). In our nomogram, each factor had its score according to their
   contribution in the risk of survival. The calibration curves were
   plotted to judge whether the actual survival time was in line with the
   predicted survival rate in 1, 3, and 5 years ([101]Figures 5D–F). The
   C-index of this nomogram in predicting overall survival time was 0.802.
   The result of [102]GSE39582, which was the validation dataset of the
   nomogram is shown in [103]Figures 5G–I. These results show that the
   nomogram in our study can predict the actual survival outcome well.

Gene Set Enrichment Analysis Revealed Seven Transcription Factor-Related Gene
Signaling Pathways

   GSEA was used to detect the potential signaling pathways and gene
   function of seven TF-related genes in COAD between low- and
   high-expression datasets. The signaling pathways and gene function, of
   which the nominal p-value <0.05 in the enrichment of “c2. cp.kegg.v7.2.
   symbols.gmt [Curated]” and “c5. go.v7.4. symbols.gmt [Gene ontology],”
   were taken into consideration. The results are shown in
   [104]Supplementary Figures S4A–G, which suggested that most of genes
   were associated with tumors, and their abnormal expression may
   influence the development of related tumors. These genes were also
   involved in some tumor-related signaling pathways, such as TGF beta,
   P53, and MAPK. In addition, the genes in the model also can affect the
   signal pathways of apoptosis, immunity, and metabolism. As for GO
   analysis, the results analyzed by GSEA show that the seven genes may
   participate in cell metabolism and cell adhesion. In total, the GSEA
   results suggested that our risk model may participate in tumor-related,
   immune and metabolic signaling pathway, and related genes can affect
   cell growth and adhesion.

Correlation of the Risk Score With Clinicopathologic Features and Immune
Cells

   The risk score was significantly correlated with the
   clinicopathological features of TCGA-COAD (p < 0.05; [105]Figures
   6A–D). High risk score is associated with Stages III–IV, T3–4, N1–2,
   and M1, which meant that the TF-related risk model can reflect the
   progression of the COAD. Meanwhile, because the immune microenvironment
   is important for the tumors, we also analyzed the correlation between
   risk score and immune cells, which is shown in [106]Figures 6E,F. The
   results showed that the risk score was related to the CD4 T cell and
   macrophage. These results suggested that the risk score based on TFs
   may reflect the tumor progression and the infiltration of immune cells
   in tumor microenvironment.

FIGURE 6.

   [107]FIGURE 6
   [108]Open in a new tab

   Analysis of clinical and immunological relevance for riskScore. (A–D)
   Analysis of relationship between clinicopathological factors and
   riskScore. (E,F) Analysis of relationship between immune cells and
   riskScore.

Discussion

   The present study indicated that the prognostic model based on TFs can
   predict the prognosis of COAD well. Meanwhile, the correlation between
   risk score and clinicopathologic factors as well as immune cells showed
   that the model was associated with poor clinicopathologic factors and
   immune cells. In GSEA analysis, seven TFs were found to be involved in
   signaling pathways associated with tumor development.

   TFs are important components of cells that participate in the
   transcription of genes and many other biological processes
   ([109]Dynlacht, 1997; [110]Accili & Arden, 2004). About one-third of
   human developmental disorders are associated with abnormal TFs
   ([111]Boyadjiev & Jabs, 2000). Many studies have investigated the role
   of TFs in tumors and the corresponding mechanisms. Dong et al. found
   that the transcription factor can support the development of breast
   cancer ([112]Dong et al., 2020). Zhan et al. reported that
   transcription factors play a key role in the development of
   hepatoblastoma ([113]Zhan & Zhao, 2020). Meanwhile, studies have also
   reported the role of TFs in the development of COAD ([114]Rayet &
   Gelinas, 1999; [115]Hui et al., 2014). However, the study of prognostic
   transcription factors is still lacking. As the terminal effector
   molecule of the cell signal pathway, TFs play important roles in the
   development of tumor, which makes it very important to study their
   function in predicting the prognosis of patients. The bioinformatics
   was used as a research tool to analyze data from a database to build
   and validate a prognostic model and to analyze its association with
   clinical features and immunity. In our study, data from TCGA were used
   as training set, and data from GEO were used as validation set to
   construct a seven-TF-related risk model by using multivariable Cox and
   Lasso regression. The prognostic model can predict the prognosis of
   patients well, which will be helpful in clinical evaluation and provide
   new therapeutic targets.

   Seven TF-related genes were identified in this study, which have been
   reported separately regarding their roles in tumor. E2F transcription
   factors (E2F3), which can interact with histone acetyltransferase and
   induce cells to enter the cell cycle, participated in the development
   and progression of many tumors, such as pancreatic cancer and ovarian
   cancer ([116]Kurtyka et al., 2014; [117]Rotgers et al., 2014; [118]Park
   et al., 2015; [119]Pengcheng et al., 2021; [120]Xu D. et al., 2021). In
   COAD, E2F3 played a key role in carcinogenesis by promoting the
   expression of cyclin D1 and CDK2 (B. [121]Yang et al., 2020). V-ets
   erythroblastosis virus E26 oncogene homolog 2 (ETS2) can act as both a
   tumor suppressor and an oncogene (X. [122]Ma et al., 2019). In COAD,
   ETS2 can induce oxaliplatin resistance and promote the malignant
   behavior of tumor cells ([123]Wang H. et al., 2020). Hepatic leukemia
   factor (HLF), a kind of leukemia zipper transcription factor, can
   regulate circadian rhythms ([124]Inaba et al., 1994). A study of
   hepatocellular carcinoma suggested that the role of HLF in inducing
   sorafenib resistance and in promoting tumor progression by activating
   c-Jun may help to discover new targets for cancer therapy ([125]Xiang
   et al., 2019). Heat shock factor 4 (HSF4) can regulate cellular
   proliferation and differentiation ([126]Jin et al., 2012). In cancer
   research, HSF4 can promote EMT in liver cancer cells to stimulate cell
   proliferation and invasion and was associated with poor prognosis in
   COAD patients (P. [127]Ma et al., 2020; Y. [128]Yang et al., 2017).
   Krüppel-like factor 4 (KLF4) can inhibit the development of tumor by
   associating with the non-Warburg metabolic behaviors ([129]Blum et al.,
   2021). Low expression of KLF4 may lead to poor prognosis and link to
   the progression of COAD ([130]Taracha-Wisniewska et al., 2020). T-cell
   factor 7-like 1 (TCF7L1) has been found to have a high expression in
   many cancers, such as breast cancer and skin squamous cell carcinoma
   ([131]Slyper et al., 2012; [132]Ku et al., 2017). It is the suppressor
   of the Wnt target gene expression and cell circle is the method for it
   to promote the development of COAD ([133]Murphy et al., 2016;
   [134]Eshelman et al., 2017). These studies indicate that seven
   transcription factors are associated with tumor. Our risk model can be
   well applied to evaluate the prognosis of patients with COAD.

   Survival analysis based on clinical feature groups showed that our risk
   model was better able to predict the prognosis of young men with COAD
   at the advanced TNM stage. In addition, our risk model associated with
   advanced TNM stage and immune cells (CD4 T cell and macrophage). These
   results indicated that our risk model is related to the tumor
   microenvironment and affects the prognosis of patients with COAD
   ([135]Diederichsen et al., 2003; [136]Ye et al., 2019).

   The present study, similar to another study using TF-related genes to
   construct a model of COAD, reveals the role of TFs in predicting the
   prognosis of COAD ([137]Liu et al., 2020). However, when validated with
   the GEPIA2 database, the expression of TF-related genes in another
   article does not show the difference between normal and tumor samples.
   In contrast, the TF-related genes used in our article showed
   significant differences between normal and tumor samples in the
   validation of the GEPIA2 database. At the same time, our study goes
   further to combine clinical factors to construct a nomogram in order to
   improve the prognostic ability in COAD. A study that constructed a
   prognostic model based on immune-related genes reveal that TFs may
   regulate the expression of immune-related genes ([138]Sun et al.,
   2020). By combining the results in our study, it was shown that TFs may
   participate in the tumor immune microenvironment. However, the model
   based on immune-related genes show that ROC was only 0.719, which was
   lower than that of our model. Therefore, our study showed that the
   genes in our model have significant difference between normal and tumor
   tissues, and our risk score model shows better ability in predicting
   the prognosis of COAD patients.

   Though our model has shown remarkable ability in predicting prognosis
   of COAD patients, there are still a few limitations. This prognostic
   model still requires data of patients from other large cohorts to
   validate and not just take advantage of the data on the network
   database. Some of the TF-related genes in our model, which are not
   studied for mechanisms in COAD, still need to be revealed in our future
   studies.

Conclusion

   We identified seven TF-related genes, including E2F3, ETS2, HLF, HSF4,
   KLF4, MEIS2, and TCF7L1, and constructed a risk score model, which can
   predict the prognosis of COAD well. Moreover, these genes were
   associated with the prognosis of COAD patients and related to the
   development of cancer. Therefore, we thought our finding could help
   distinguish the COAD patients in the clinic, and the seven TF-related
   genes can become biological targets to treat COAD patients.

Acknowledgments