Abstract

Objective

   This study aimed to comprehensively investigate the expression patterns
   of CCNA2 and MAD2L1 in esophageal squamous cell carcinoma using
   bioinformatics methods.

Methods

   Weighted gene co-expression network analysis (WGCNA) was combined with
   analyses of gene mutation, methylation level distribution, and mRNA
   expression of ESCC-related genes in public databases to investigate
   potential prognostic biomarkers for esophageal squamous cell carcinoma
   (ESCC). Finally, qRT-PCR and immunohistochemistry were performed for
   validation.

Results

   Four hub genes were ultimately identified: CDK1, CCNA2, TOP2A, and
   MAD2L1. Bioinformatics analysis showed high expression of these four
   genes in ESCC (P < 0.05). CCNA2 and MAD2L1 were selected for subsequent
   analysis based on the literature. Single-gene enrichment analysis
   revealed significant enrichment of CCNA2 and MAD2L1 in pathways related
   to the spliceosome, bladder cancer, non-homologous end joining,
   homologous recombination, glycosaminoglycan biosynthesis (chondroitin
   sulfate), progesterone-mediated oocyte maturation, and mismatch repair.
   The PASTAA database indicated the involvement of transcription factors
   such as Roralpha1, Pou6f1, Roralpha2, Atf-1, Pax-3, C/ebpalpha, and
   Nkx2-1 in the regulation of CCNA2, while no transcription factors were
   predicted for MAD2L1. Immune infiltration analysis revealed a close
   association between ESCC and plasma cells, CD8+ T cells, monocytes, M0
   macrophages, M1 macrophages, dendritic cells, and resting mast cells.
   Drug prediction for CCNA2 yielded 7 drugs, including ETHINYL ESTRADIOL,
   Seliciclib, and TAMOXIFEN, while no drugs were predicted for MAD2L1.
   qRT-PCR and immunohistochemistry demonstrated high expression of CCNA2
   in ESCC, while MAD2L1 showed no significant difference between ESCC and
   normal esophageal squamous epithelial tissues.

Conclusion

   CCNA2 and MAD2L1 may be potential biomarkers for ESCC, providing a
   novel basis for understanding the molecular mechanisms underlying ESCC
   pathogenesis. Additionally, the potential drugs predicted for CCNA2 may
   offer new treatment options for ESCC patients in the future.

   Keywords: Esophageal squamous cell carcinoma, Bioinformatics, CCNA2,
   MAD2L1

Introduction

   Esophageal cancer is a malignant tumor originating in the esophagus,
   characterized by its high malignancy and poor prognosis. It is one of
   the most common malignancies worldwide, ranking 7th in incidence and
   6th in mortality among all cancers [1]. China is a high-risk area
   for esophageal cancer, with esophageal squamous cell carcinoma (ESCC)
   being the predominant pathological type, accounting for approximately
   84% of clinical diagnoses [2]. Currently, the treatment efficacy
   for ESCC is poor, with a median overall survival period of 10–12 months
   [3–5]. The etiology of ESCC is complex and not yet fully
   understood. In recent years, there has been a proliferation of reports
   and applications related to ESCC biomarker research. Traditional
   experimental methods, such as immunohistochemistry, reverse
   transcription-polymerase chain reaction, and serum proteomics, have
   historically been used in ESCC tumor marker studies. With the
   popularization of gene chip technology and high-throughput sequencing,
   coupled with proteomics and bioinformatics, an increasing focus has
   been placed on the exploration of ESCC biomarkers. However, there is
   still a scarcity of truly clinically applicable markers to date.
   Therefore, the exploration of effective ESCC-related tumor markers
   plays a crucial role in the diagnosis of early-stage ESCC, guiding
   treatment, evaluating prognosis, monitoring recurrence, and
   implementing comprehensive multidisciplinary treatment of ESCC
   patients.

   Esophageal squamous cell carcinoma (ESCC) is a malignancy that
   originates from the epithelial tissue of the esophagus, often
   presenting with symptoms such as dysphagia and painful swallowing [6].
   Globally, ESCC is among the common malignant tumors, with notably high
   incidence in parts of Asia, including China, Japan, and Iran. The
   etiology of ESCC involves various factors, including genetic and
   environmental influences and dietary habits [7]. Most patients are
   diagnosed at an advanced stage, leading to significant treatment
   challenges and poorer prognosis. While conventional treatment methods
   such as surgery, radiotherapy, and chemotherapy have shown some
   efficacy in cancer management, they have encountered limitations,
   particularly in systemic therapy, which relies primarily on
   chemotherapy. Over the years, the clinical effectiveness of these
   traditional treatments has shown limited improvement, so enhancing
   treatment outcomes has become a crucial challenge in current research.
   Personalized and targeted therapies for ESCC, however, are still
   emerging.

   Advancements in the biomedical field, including the widespread
   application of technologies such as genomics, proteomics, and
   transcriptomics, have accelerated in-depth research into the mechanisms
   of tumor occurrence and development, offering new perspectives and
   opportunities for clinical treatment. In this context, the integration
   of powerful bioinformatics analysis tools can comprehensively and
   systematically explore the potential biomarkers of ESCC, offering
   support for personalized treatment and precision medicine.

   With the continuous progress of molecular biology and genomics
   technologies, understanding of the molecular mechanisms of ESCC has
   deepened, including the close relationship between variations in genes
   such as p53 [8], EGFR [10], and MYC [10–12] and the occurrence and
   development of tumors.
   Simultaneously, the application of new technologies such as liquid
   biopsy, high-throughput sequencing, and single-cell genomics has
   brought about breakthroughs in the early diagnosis and prognostic
   assessment of ESCC.

   Overall, ESCC research still faces numerous challenges, including
   difficulties in early diagnosis, poor treatment outcomes, and a high
   recurrence rate. The utilization of bioinformatics analysis techniques,
   combined with comprehensive analysis of multi-omics data such as
   genomics and transcriptomics, offers new perspectives and methods to
   address these challenges, opening new pathways for improving the
   survival rates and quality of life of ESCC patients.

   This study aims to explore genes with aberrant expression in ESCC
   patients, elucidate potential molecular mechanisms underlying ESCC
   pathogenesis, and provide fresh insights for molecular targeted therapy
   in ESCC.

Materials and methods

Source of data

   All eligible datasets were downloaded from the Gene Expression Omnibus
   (GEO) database. The key phrase “esophageal squamous cell carcinoma” was
   used to search for relevant gene expression datasets, limited to the
   species “Homo sapiens”. Inclusion criteria were as follows: (1)
   datasets contained at least ESCC and control groups; (2) the sample
   type was mRNA; (3) there were no duplicate data within the included
   datasets. Ultimately, three datasets were included: GSE38129
   (n = 60, 30 ESCC and 30 control samples) and GSE20347 (n = 34, 17
   ESCC and 17 control samples) were used as the training sets, with
   GSE23400 (n = 106, 53 ESCC and 53 control samples) serving as the
   validation set. For the GSE38129 and GSE20347 datasets, the
   inSilicoMerging R package was utilized to merge the datasets and
   subsequently remove batch effects, yielding a merged expression matrix
   with batch effects removed.

Dataset integration details

   For dataset integration, we first performed quality control by removing
   probes with missing values exceeding 20% across samples. The remaining
   missing values were imputed using the k-nearest neighbor (k = 10)
   algorithm. Each dataset was independently preprocessed using robust
   multi-array average (RMA) normalization. For cross-platform
   integration, we mapped probe IDs to Entrez Gene IDs using the
   corresponding annotation packages (hgu133plus2.db and hgu133a.db). Only
   genes commonly detected across all platforms were retained for
   subsequent analysis.
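
   The probe filtering and gene-matching steps described above can be
   sketched as follows. This is a minimal illustration in Python rather
   than the R workflow used in the study; the function names, the
   dictionary-based matrix layout, and the toy identifiers are assumptions
   made for the example, and k-nearest neighbor imputation and RMA
   normalization are not reproduced here.

```python
from typing import Dict, Iterable, List, Optional, Set

# Hypothetical layout: probe ID -> one value per sample (None = missing).
Matrix = Dict[str, List[Optional[float]]]


def filter_probes_by_missingness(expr: Matrix,
                                 max_missing_fraction: float = 0.20) -> Matrix:
    """Keep probes whose fraction of missing values is at most the threshold."""
    kept: Matrix = {}
    for probe, values in expr.items():
        n_missing = sum(v is None for v in values)
        if n_missing / len(values) <= max_missing_fraction:
            kept[probe] = values
    return kept


def common_genes(*gene_lists: Iterable[str]) -> Set[str]:
    """Genes detected on every platform (plain set intersection)."""
    return set.intersection(*(set(genes) for genes in gene_lists))
```

   Applying the missingness filter before imputation keeps the imputed
   fraction of any retained probe at or below the stated 20% limit.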

   The datasets were merged using the inSilicoMerging R package (v1.26.0)
   with the following parameters: method = "COMBAT" for batch effect
   removal, standardization = TRUE, plot = FALSE. We confirmed successful
   batch effect removal using principal component analysis and UMAP
   visualization, as shown in Fig. 1.

Fig. 1.

   The distribution of data before and after batch effect removal. (A)
   Density plots: Left panel shows data distribution before batch effect
   removal, with notable differences between datasets; Right panel shows
   converged distributions after batch effect removal. (B) UMAP plots:
   Left panel demonstrates dataset-specific clustering before batch effect
   removal; Right panel shows intermingled samples across datasets after
   batch effect removal, indicating successful correction

Screening and functional annotation of DEGs

   Differential expression analysis of the merged ESCC dataset was
   performed using the R package limma (v3.40.6). Prior to analysis,
   low-expression genes (genes with counts below 10 in more than 50% of
   samples) were filtered out to improve statistical power. For the merged
   expression matrix, we fitted a linear model using the lmFit function
   with a design matrix specifying the tumor and normal groups. Empirical
   Bayes moderation was applied using the eBayes function with parameters
   trend = TRUE and robust = TRUE to account for heteroscedasticity.
   P-values were adjusted for multiple testing using the
   Benjamini-Hochberg method to control the false discovery rate (FDR).
   Genes with |log2FC| ≥ 2 and adjusted P < 0.05 were considered
   significantly differentially expressed genes (DEGs). The DEGs between
   the ESCC and control groups were visualized as volcano plots using the
   EnhancedVolcano package (v1.8.0) with parameters: pCutoff = 0.05,
   FCcutoff = 2, pointSize = 2, labSize = 3.
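
   To make the selection rule concrete, the sketch below applies the
   Benjamini-Hochberg adjustment and the |log2FC| ≥ 2, adjusted P < 0.05
   thresholds to a small list of genes. It is an illustrative Python
   re-implementation, not the limma code used in the study; the function
   names and the example inputs are assumptions, and the moderated
   statistics produced by lmFit and eBayes are taken as given.

```python
from typing import List, Sequence


def benjamini_hochberg(p_values: Sequence[float]) -> List[float]:
    """Return BH-adjusted p-values in the original input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value to the smallest so that adjusted
    # values stay monotone in the rank.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


def select_degs(genes: Sequence[str],
                log2fc: Sequence[float],
                p_values: Sequence[float],
                lfc_cutoff: float = 2.0,
                alpha: float = 0.05) -> List[str]:
    """Genes with |log2FC| >= lfc_cutoff and BH-adjusted p < alpha."""
    adjusted = benjamini_hochberg(p_values)
    return [g for g, lfc, q in zip(genes, log2fc, adjusted)
            if abs(lfc) >= lfc_cutoff and q < alpha]
```

   Note that a |log2FC| cutoff of 2 corresponds to at least a four-fold
   change on the linear scale.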

Screening of key module genes by WGCNA

   In the training cohort, the co-expression network was created using
   WGCNA (v 1.7.0). First, the samples were clustered to remove
   outliers and ensure the accuracy of the analysis. Next,
   the optimal soft-threshold power (β) was selected so that the network
   approximated a scale-free topology. The cluster dendrogram was then
   generated by calculating adjacency and similarity, and the modules
   were partitioned using the dynamic tree cutting algorithm.
   Subsequently, we assessed the correlation between each module and ESCC
   and chose the module with the highest relevance to ESCC as the key
   module. The genes in the key module were designated key module genes
   for subsequent analysis. Finally, the key module genes were analyzed
   for functional enrichment.
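
   The role of the soft-threshold power can be illustrated with a short
   sketch. It assumes an unsigned network, in which the adjacency between
   two genes is the absolute Pearson correlation raised to the power β;
   the study does not state the network type, and the function names and
   toy expression profiles are hypothetical. Module detection and the
   scale-free fit are not reproduced.

```python
import math
from typing import Dict, List, Tuple


def pearson(x: List[float], y: List[float]) -> float:
    """Pearson correlation of two equal-length profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def soft_threshold_adjacency(profiles: Dict[str, List[float]],
                             beta: int = 8) -> Dict[Tuple[str, str], float]:
    """Unsigned adjacency a_ij = |cor(x_i, x_j)| ** beta for all gene pairs."""
    genes = list(profiles)
    adjacency = {}
    for g in genes:
        for h in genes:
            if g != h:
                r = pearson(profiles[g], profiles[h])
                adjacency[(g, h)] = abs(r) ** beta
    return adjacency
```

   Raising correlations to a power greater than 1 suppresses weak
   correlations far more than strong ones: a correlation of 0.8 becomes an
   adjacency of about 0.17 at β = 8, while a correlation of 0.3 becomes
   effectively zero.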

Screening of candidate genes

   The candidate genes were screened by intersecting key module genes and
   DEGs for subsequent analysis.

GO functional enrichment analysis and KEGG pathway analysis of candidate
genes

   For the functional enrichment analysis of candidate genes, we utilized
   the gene-GO annotations from the R package org.Hs.eg.db (version 3.1.0)
   and retrieved the latest gene annotations for KEGG Pathways from the
   KEGG REST API. Subsequently, the R package clusterProfiler (version
   3.14.3) was employed to conduct enrichment analysis, enabling the
   acquisition of enriched gene results. The Gene Ontology (GO) system
   encompasses three main components: biological processes, molecular
   functions, and cellular components.

Construction of protein-protein interaction regulatory network

   The candidate genes were uploaded to the STRING online database to
   generate a protein-protein interaction (PPI) network. Cytoscape
   software was then used to analyze the interactions among the candidate
   proteins. Subsequently, the network was screened and scored using the
   MNC, DEGREE, and EPC algorithms to identify the top 10 genes ranked by
   each algorithm, and the intersection of these three sets yielded the
   hub genes.
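
   The hub-gene selection logic (rank by several scores, keep the top k
   from each, intersect) can be sketched as below. Only the degree score
   is computed here; MNC and EPC scores are assumed to be supplied by the
   network tool as gene-to-score tables. All names and the toy network are
   hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, Set, Tuple


def degree_scores(edges: Iterable[Tuple[str, str]]) -> Dict[str, int]:
    """Number of interaction partners per node in an undirected edge list."""
    degree = Counter()
    for a, b in edges:
        degree[a] += 1
        degree[b] += 1
    return dict(degree)


def top_k(scores: Dict[str, float], k: int = 10) -> Set[str]:
    """Top-k genes by score; ties are broken alphabetically for determinism."""
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return {gene for gene, _ in ranked[:k]}


def hub_genes(*score_tables: Dict[str, float], k: int = 10) -> Set[str]:
    """Genes appearing in the top k of every scoring table."""
    return set.intersection(*(top_k(table, k) for table in score_tables))
```

   Requiring agreement across several ranking methods trades sensitivity
   for robustness: a gene must be central under every criterion to be
   retained.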

Expression analysis and validation of hub genes through ROC analysis

   Based on the validation set GSE23400, the hub genes were subjected
   to ROC analysis using the R software package pROC (version 1.17.0.1) to
   obtain the AUC, thereby validating the efficacy of the hub genes. To
   verify the expression of these biomarkers in ESCC and control tissues,
   we also conducted an expression analysis in the validation cohort.
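
   For a single gene, the AUC has a simple interpretation: it is the
   probability that a randomly chosen ESCC sample has a higher expression
   value than a randomly chosen control sample, with ties counted as one
   half. The sketch below computes it that way; it is a plain Python
   illustration, not the pROC implementation, and the function name is an
   assumption.

```python
from typing import Sequence


def roc_auc(case_scores: Sequence[float],
            control_scores: Sequence[float]) -> float:
    """AUC as the fraction of (case, control) pairs ranked correctly.

    Assumes higher scores indicate disease; ties contribute 0.5.
    """
    wins = 0.0
    for case in case_scores:
        for control in control_scores:
            if case > control:
                wins += 1.0
            elif case == control:
                wins += 0.5
    return wins / (len(case_scores) * len(control_scores))
```

   This pairwise definition is equivalent to the area under the empirical
   ROC curve, and is quadratic in the number of samples, which is
   acceptable for cohorts of this size.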

Gene set enrichment analysis (GSEA)

   GSEA was performed to elucidate the enriched regulatory pathways
   and biological functions of the biomarkers using clusterProfiler (v
   4.0.2), with adjusted P < 0.05 as the significance threshold. The top 5
   most significant KEGG results were visualized.

Transcription factors (TFs) analysis

   The TFs targeting the biomarkers were predicted using the PASTAA
   database. The association between biomarkers and TFs was then assessed
   using p values calculated from hypergeometric distributions.
   Subsequently, the JASPAR database was utilized to predict the DNA
   binding sites of the TFs.
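
   A hypergeometric p-value of the kind mentioned above answers the
   question: if n genes are drawn from a population of N genes, of which K
   are targets of a given TF, how likely is an overlap of at least k
   targets by chance? The sketch below computes this upper-tail
   probability exactly. It illustrates the test only and does not
   reproduce PASTAA's ranking procedure; the function and parameter names
   are assumptions.

```python
from math import comb


def hypergeom_upper_tail(k: int, N: int, K: int, n: int) -> float:
    """P(X >= k) for X ~ Hypergeometric(population N, successes K, draws n)."""
    upper = min(K, n)
    # math.comb returns 0 when more items are requested than available,
    # so impossible configurations contribute nothing to the sum.
    favourable = sum(comb(K, i) * comb(N - K, n - i)
                     for i in range(max(k, 0), upper + 1))
    return favourable / comb(N, n)
```

   For example, with N = 10, K = 4, and n = 3, the probability of
   observing at least 2 targets is 40/120 = 1/3.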

Creation of a CeRNA regulatory network

   The miRWalk and starBase databases were utilized to predict miRNAs
   targeting the biomarkers. The miRNAs common to the two databases were
   retained as co-miRNA prediction results. Targeting relationships
   between lncRNAs and miRNAs were predicted using the starBase and
   miRNet databases. Similarly, the lncRNAs predicted by both databases
   were retained as co-lncRNAs. Finally, the lncRNA-miRNA-mRNA network was
   constructed.
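
   The two-database consensus and the assembly of the network can be
   sketched as follows. The prediction tables are assumed to have been
   exported from the databases already; the function names and the toy
   identifiers are hypothetical.

```python
from typing import Dict, Iterable, Set, Tuple


def consensus(predictions_a: Iterable[str],
              predictions_b: Iterable[str]) -> Set[str]:
    """Items predicted by both databases."""
    return set(predictions_a) & set(predictions_b)


def build_cerna_network(mrna_to_mirnas: Dict[str, Set[str]],
                        mirna_to_lncrnas: Dict[str, Set[str]]
                        ) -> Tuple[Set[str], Set[Tuple[str, str]]]:
    """Return (nodes, edges) of an mRNA-miRNA-lncRNA network.

    Only miRNAs linked to at least one mRNA are included, together with
    the lncRNAs predicted to interact with those miRNAs.
    """
    edges: Set[Tuple[str, str]] = set()
    for mrna, mirnas in mrna_to_mirnas.items():
        for mirna in mirnas:
            edges.add((mrna, mirna))
            for lncrna in mirna_to_lncrnas.get(mirna, set()):
                edges.add((mirna, lncrna))
    nodes = {node for edge in edges for node in edge}
    return nodes, edges
```

   Storing edges as a set of pairs avoids double counting when a miRNA
   targets more than one mRNA.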

Immune-infiltration analysis

   In this study, leveraging ESCC expression profile data, the CIBERSORTx
   method from the R package IOBR was utilized to analyze the scores of 22
   immune infiltrating cells across various samples. This was done to
   establish the immune infiltration levels within the dataset’s samples,
   in order to elucidate the relationship between hub genes and immune
   cells.

Creation of biomarker-drug interaction network

   To uncover new therapeutic targets for ESCC treatment, we
   predicted drugs for the biomarkers. The drugs targeting the biomarkers
   were predicted using the DGIdb database (https://dgidb.org/), and a
   biomarker-drug network was constructed from the predicted results.

RNA isolation and quantitative real-time polymerase chain reaction (qRT-PCR)

   We collected six blood samples (3 ESCC samples and 3 control samples).
   The samples were lysed using TRIzol reagent, and total RNA was
   isolated according to the manufacturer’s instructions. RNA was then
   reverse transcribed into cDNA using the RevertAid Master Mix with
   DNase I (Thermo Scientific, USA). The qRT-PCR reaction comprised 2 µL
   of reverse transcription product, 10 µL of 2× Universal Blue SYBR Green
   qPCR Master Mix, 0.4 µL each of forward and reverse primer, and 7.2 µL
   of DEPC-treated water. All primer sequences are presented in Table 1.
   GAPDH served as the internal reference gene, and the relative
   expression of the biomarkers was determined using the 2^-ΔΔCT method.
   GraphPad Prism 8.0.2 was used to generate the graphs and calculate the
   p-values.
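
   The 2^-ΔΔCT calculation can be written out in a few lines. In the
   sketch below, the target gene would be CCNA2 or MAD2L1 and the
   reference gene GAPDH, as in the study; the function name and the Ct
   values used in the example are hypothetical.

```python
def relative_expression(ct_target_sample: float,
                        ct_reference_sample: float,
                        ct_target_control: float,
                        ct_reference_control: float) -> float:
    """Relative expression by the 2^-ddCt method.

    dCt  = Ct(target) - Ct(reference), computed separately for the
           sample and the control;
    ddCt = dCt(sample) - dCt(control).
    """
    delta_sample = ct_target_sample - ct_reference_sample
    delta_control = ct_target_control - ct_reference_control
    delta_delta = delta_sample - delta_control
    return 2.0 ** (-delta_delta)
```

   With hypothetical Ct values of 24 (target) and 18 (reference) in a
   tumor sample and 26 and 18 in a control, ΔΔCT is -2 and the relative
   expression is 4, that is, a four-fold increase. The method assumes
   amplification efficiencies close to 100% for both genes.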

Table 1.

   Primer sequences
   Gene     Sequence (5'-3')
   CCNA2 F  CGAAGACGAGACGGGTTGC
   CCNA2 R  CATGAATGGTGAACGCAGGC
   MAD2L1 F GCAAAAGATGACAGTGCACCC
   MAD2L1 R ACCGTAGCTGTGATCTGTCTG
   GAPDH F  CAGGAGGCATTGCTGATGAT
   GAPDH R  GAAGGCTGGGGCTCATTT

IHC

   Six tissue samples (3 ESCC samples and 3 control samples) were
   collected. Tissue blocks of appropriate size were cut, placed flat side
   down in plastic embedding cassettes, and subjected to standard
   dehydration, paraffin embedding, and processing procedures. The
   embedded tissue samples were frozen at -20 °C to achieve appropriate
   hardness for sectioning. The section thickness was set at 3 μm to
   ensure firm attachment of the sections and favorable microscopic
   observation. Sections were mounted on slides in a water bath, baked,
   and deparaffinized. The immunohistochemical staining steps then
   comprised hydration, antigen retrieval, blocking, incubation with
   primary and secondary antibodies, color development (DAB staining),
   counterstaining with hematoxylin, dehydration, clearing, and
   coverslipping. Finally, the relevant areas of the samples were imaged
   and analyzed under a microscope. The antibody information is presented
   in Table 2.

Table 2.

   Immunohistochemistry antibody information
   Gene   Species Source   Catalog Number Dilution
   CCNA2  Rabbit  Bioswamp PAB33497       1:200
   MAD2L1 Rabbit  Bioswamp PAB40076       1:100

   IHC slides were independently evaluated by two certified pathologists
   who were blinded to the clinical information. The scoring system was
   based on both staining intensity and percentage of positive cells.
   Staining intensity was scored as: 0 (negative), 1 (weak), 2 (moderate),
   and 3 (strong). The percentage of positive cells was scored as: 0
   (< 5%), 1 (5–25%), 2 (26–50%), 3 (51–75%), and 4 (> 75%). The final IHC
   score was calculated by multiplying the intensity score by the
   percentage score, resulting in a range from 0 to 12. Discrepancies
   between the two pathologists were resolved by consensus.
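
   The scoring scheme above maps directly onto a small function. The
   sketch below is illustrative; the function names are assumptions, and
   non-integer percentages that fall between the stated bins (for example
   25.5%) are assigned to the lower bin, which is a choice made for this
   example rather than something stated in the scoring protocol.

```python
def percentage_score(percent_positive: float) -> int:
    """Score the percentage of positive cells: 0 (<5%) up to 4 (>75%)."""
    if not 0 <= percent_positive <= 100:
        raise ValueError("percent_positive must be between 0 and 100")
    if percent_positive < 5:
        return 0
    if percent_positive < 26:
        return 1
    if percent_positive < 51:
        return 2
    if percent_positive <= 75:
        return 3
    return 4


def ihc_score(intensity: int, percent_positive: float) -> int:
    """Final IHC score = intensity score (0-3) x percentage score (0-4)."""
    if intensity not in (0, 1, 2, 3):
        raise ValueError("intensity must be 0, 1, 2, or 3")
    return intensity * percentage_score(percent_positive)
```

   Strong staining (3) in more than 75% of cells gives the maximum score
   of 12, while any negative component gives 0.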

Statistical analysis

   All bioinformatics analyses were conducted in R. Spearman
   correlation analysis was used to assess correlations, and
   the Wilcoxon test was applied to compare data between
   groups.

Results

Removal of batch effects

   The density plot (Fig. 1A) shows that, prior to the removal of batch
   effects, the distribution of samples differed markedly across the
   datasets, indicating the presence of batch effects. After the removal
   of batch effects, the data distributions across the datasets tended to
   converge, with the means and variances becoming more similar. The UMAP
   plot (Fig. 1B) shows that, prior to the removal of batch effects,
   samples from each dataset clustered separately, again suggesting the
   presence of batch effects. After the removal of batch effects, the
   samples from the datasets became intermingled, indicating successful
   mitigation of batch effects.

   While most samples integrated well after batch correction, a small
   cluster of GSE38129 samples from elderly patients with advanced
   disease remained somewhat distinct, potentially reflecting biological
   heterogeneity within ESCC rather than technical variation.

Identification of candidate genes

   The differential gene analysis identified 860 differentially expressed
   genes between ESCC and the normal group. As shown in the volcano plot
   and heatmap results in Fig. 2A, there were 419 upregulated genes
   and 441 downregulated genes.

Fig. 2.

   Identification of Candidate Genes. (A) Volcano plot of differential
   gene analysis. (B) Soft threshold plot. (C) Sample clustering plot. (D)
   Gene module diagram. (E) Correlation heatmap of gene modules with
   clinical data. (F) Correlation coefficients of gene modules with
   clinical data. (G) Venn diagram for candidate gene identification

   The soft-threshold power was calculated using R software to be β = 8
   (scale-free topology fit index 0.86). The soft-thresholding result is
   displayed in Fig. 2B, while the sample clustering result is presented
   in Fig. 2C. Based on the soft threshold, we constructed gene modules
   (Fig. 2D) and developed a co-expression network. The associations
   between the gene modules and clinical data are depicted as a heatmap
   (Fig. 2E). We chose the blue module, which was significantly
   positively correlated with ESCC, for further analysis. This module
   comprises 1434 genes and is strongly correlated with ESCC (r = 0.82,
   P < 0.01) (Fig. 2F).

   The Venn diagram was used to filter candidate genes, resulting in a set
   of 282 intersecting differentially expressed genes, including
   Cyclin-Dependent Kinase 1 (CDK1), DNA Topoisomerase II Alpha (TOP2A),
   Cyclin A2 (CCNA2), and Mitotic Arrest Deficient 2 Like 1 (MAD2L1), as
   shown in Fig. 2G.

GO functional enrichment analysis and KEGG pathway analysis of candidate
genes

   The results of the functional and pathway enrichment analyses are
   presented in Fig. 3A and B, respectively. Functional enrichment results
   demonstrate that the candidate genes are primarily enriched in
   terms related to the cell cycle, mitotic cell cycle, cell cycle
   process, cell division, and chromosome. The enriched pathways
   primarily include the cell cycle, DNA replication,
   the p53 signaling pathway, the IL-17 signaling pathway, and mismatch
   repair.

Fig. 3.

   Functional Enrichment and Pathway Analysis Results. (A) GO Functional
   Enrichment Analysis. (B) KEGG Pathway Enrichment Analysis

Construction of the protein-protein interaction regulatory network

   The candidate genes were comprehensively analyzed using the STRING
   online database, and the Cytoscape software was utilized to explore the
   interactions among the candidate proteins. This exploration resulted in
   a network graph containing 130 nodes and 1335 connections
   (Fig. 4A). Subsequently, the MNC, DEGREE, and EPC algorithms were
   used to screen and rank the top 10 genes in the network. The
   intersection of the results yielded four hub genes: CDK1, CCNA2, TOP2A,
   and MAD2L1 (Fig. 4B). These four hub genes may play a crucial role
   in the occurrence and development of ESCC.

Fig. 4.

   Candidate genes and intersection Analysis Results. (A) Regulation of
   protein-protein interactions. (B) Venn diagram of algorithm
   intersections

Validation of hub genes through ROC analysis and expression validation of
biomarkers

   The ROC validation analysis demonstrated robust diagnostic potential
   for all four hub genes. CDK1 showed an AUC value of 0.927 (95% CI:
   0.875–0.979, p < 0.001), indicating excellent discriminatory power
   between ESCC and normal tissues. CCNA2 exhibited an AUC of 0.919 (95%
   CI: 0.864–0.974, p < 0.001), while TOP2A and MAD2L1 demonstrated AUC
   values of 0.906 (95% CI: 0.847–0.964, p < 0.001) and 0.888 (95% CI:
   0.824–0.951, p < 0.001), respectively. All four hub genes exhibited AUC
   values greater than 0.85, suggesting their potential utility as
   diagnostic biomarkers for ESCC.

   Expression validation in the GSE23400 dataset revealed significant
   upregulation of all four hub genes in ESCC tissues compared to normal
   tissues (p < 0.001). CDK1 showed a 4.26-fold increase, CCNA2
   demonstrated a 3.87-fold increase, TOP2A exhibited a 4.58-fold
   increase, and MAD2L1 displayed a 2.93-fold increase in expression in
   ESCC tissues relative to normal esophageal tissues. These findings
   corroborate the results from our training cohort, further supporting
   the potential significance of these genes in ESCC pathogenesis.

   The ROC validation results for the hub genes and the relative
   expression levels of the hub genes in the validation set are shown in
   Fig. 5A-D and E, respectively.

Fig. 5.

   ROC validation performance and expression levels of hub genes. (A-D)
   ROC curves for CDK1, CCNA2, TOP2A, and MAD2L1, respectively. (E)
   Relative expression levels of the hub genes in the validation set

GSEA analysis of biomarkers and prediction of TFs

   A review of the literature showed that CDK1 and TOP2A have
   been extensively studied in the context of ESCC, while there is limited
   research on CCNA2 and MAD2L1. Consequently, we focused primarily on
   these two hub genes as target genes. The correlation coefficients of
   all genes with respect to each target gene were calculated and used
   as the ranking criterion for GSEA. The
   GSEA results for CCNA2 and MAD2L1 are depicted in
   Fig. 6A-B. CCNA2 was significantly enriched in the spliceosome, bladder
   cancer, cell cycle, non-homologous end joining, and homologous
   recombination pathways, while MAD2L1 was significantly enriched in the
   progesterone-mediated oocyte maturation, glycosaminoglycan biosynthesis
   (chondroitin sulfate), bladder cancer, spliceosome, and mismatch repair
   pathways.

Fig. 6.

   GSEA results and DNA Binding Sites. (A) CCNA2. (B) MAD2L1. (C) Pou6f1
   family. (D) Atf-1 family

   The PASTAA database was utilized to predict the transcription factors
   for CCNA2 and MAD2L1, yielding the results shown in Table 3. The
   results indicated the involvement of transcription factor families such
   as Roralpha1, Pou6f1, Roralpha2, Atf-1, Pax-3, C/ebpalpha, and Nkx2-1
   in CCNA2 regulation, while no transcription factors were predicted for
   MAD2L1. In addition, the JASPAR database was employed to predict the
   DNA binding sites for transcription factors, as shown in Fig. 6C-D,
   illustrating the DNA binding sites for the Pou6f1 family and the Atf-1
   family. Other families were not retrieved in the JASPAR database.

Table 3.

   Prediction results from PASTAA database
   Rank Matrix        Transcription Factor P-value
   1    RORA1_01      Roralpha1            8.00e-04
   2    POU6F1_01     Pou6f1               1.01e-03
   3    RORA2_01      Roralpha2            1.91e-03
   4    ATF1_Q6       Atf-1                3.39e-03
   5    PAX3_01       Pax-3                4.10e-03
   6    CEBP_C        C/ebpalpha           4.45e-03
   7    TITF1_Q3      Nkx2-1               4.89e-03
   8    MIG1_01       N/A                  5.58e-03
   9    ATF_B         N/A                  5.78e-03
   10   CREBP1CJUN_01 Atf-2, C-jun         6.76e-03

Establishment of ceRNA regulatory networks and immune infiltration analysis

   Potential miRNAs targeting the screened hub genes were predicted using
   miRWalk and, separately, starBase. The miRNAs predicted in both
   databases were selected for further analysis, resulting in a total of
   54 miRNAs (Fig. 7A-B). Based on these 54 miRNAs, starBase and miRNet
   (default parameters) were used to predict the lncRNAs that might
   interact with the miRNAs. Only the lncRNAs predicted by both databases
   were retained, resulting in a total of 589 lncRNAs. A ceRNA
   (mRNA-miRNA-lncRNA) regulatory network comprising 189 nodes and 643
   edges was constructed based on these relationships (one miRNA was
   excluded). The network consists of 2 mRNAs, 53 miRNAs, and 589 lncRNAs,
   as shown in Fig. 7C.

Fig. 7.

   Venn Diagram of miRNA Intersections and immune infiltration analysis.
   (A) CCNA2. (B) MAD2L1. (C) ceRNA Network. (D) Immune Infiltration
   Analysis

   The immune infiltration analysis revealed a close association between
   ESCC and plasma cells, CD8+ T cells, naive CD4+ T cells, monocytes,
   M0 macrophages, M1 macrophages, and resting mast cells. This suggests
   that these seven types of immune cells play a vital role in the
   occurrence and development of ESCC (Fig. 7D).

Prediction of therapeutic agents of biomarkers

   CCNA2 and MAD2L1 were input into the DGIdb database, and the results
   revealed that the CCNA2 gene was predicted to interact with seven
   drugs, including ETHINYL ESTRADIOL, Seliciclib, and TAMOXIFEN. However,
   no drugs were predicted for MAD2L1. The names of the drugs, along with
   their indications and interaction scores, are presented in Table 4.
   The diagnostic performance metrics of the hub genes are summarized in
   Table 5.

Table 4.

   Drug prediction results
   Gene  Drug              Indication                      interaction score
   CCNA2 CORDYCEPIN        /                                  8.425721985
         GENISTEIN         /                                  0.32406623
         ETHINYL ESTRADIOL contraceptive                      0.99126141
         SURAMIN           /                                  0.561714799
         SELICICLIB        antineoplastic agent               1.404286998
         TNF-ALPHA         /                                  0.702143499
         TAMOXIFEN         Hormonal, Antineoplastic Agents    0.27625318

Table 5.

   Diagnostic performance metrics of hub genes
   Gene   AUC (95% CI)        Optimal Cutoff Sensitivity Specificity PPV   NPV   Accuracy
   CDK1   0.927 (0.875–0.979) 8.43           0.868       0.925       0.921 0.874 0.896
   CCNA2  0.919 (0.864–0.974) 7.65           0.849       0.906       0.900 0.857 0.877
   TOP2A  0.906 (0.847–0.964) 9.12           0.830       0.943       0.936 0.847 0.887
   MAD2L1 0.888 (0.824–0.951) 6.38           0.811       0.887       0.878 0.825 0.849

   Note: PPV = Positive Predictive Value; NPV = Negative Predictive Value.
   All metrics were calculated at the optimal cutoff point determined by
   Youden’s J statistic
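
   The metrics in Table 5 follow from a confusion matrix evaluated at the
   cutoff that maximizes Youden's J (sensitivity + specificity - 1). The
   sketch below shows both steps in plain Python. It assumes that samples
   with expression at or above the cutoff are called positive; the
   function names and any counts or scores passed to them are
   hypothetical and are not taken from the study data.

```python
from typing import Dict, Sequence, Tuple


def diagnostic_metrics(tp: int, fn: int, tn: int, fp: int) -> Dict[str, float]:
    """Standard diagnostic metrics from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
        "accuracy": (tp + tn) / (tp + fn + tn + fp),
        "youden_j": sensitivity + specificity - 1.0,
    }


def youden_optimal_cutoff(case_scores: Sequence[float],
                          control_scores: Sequence[float]
                          ) -> Tuple[float, float]:
    """Return (cutoff, J) maximizing Youden's J over observed values.

    A sample is called positive when its score is >= cutoff. If several
    cutoffs tie, the smallest is returned.
    """
    best_cutoff, best_j = None, float("-inf")
    for cutoff in sorted(set(case_scores) | set(control_scores)):
        sensitivity = sum(s >= cutoff for s in case_scores) / len(case_scores)
        specificity = sum(s < cutoff for s in control_scores) / len(control_scores)
        j = sensitivity + specificity - 1.0
        if j > best_j:
            best_cutoff, best_j = cutoff, j
    return best_cutoff, best_j
```

   With equal group sizes, as in a 53 versus 53 design, accuracy is simply
   the mean of sensitivity and specificity, which provides a quick
   consistency check on a reported row.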

Experimental validation of biomarker expression

   To validate the expression of the biomarkers, 3 pairs of ESCC and
   control blood and tissue samples were collected, and qRT-PCR and IHC
   were performed to compare biomarker expression between the ESCC and
   control groups. The qRT-PCR and IHC results demonstrate that the
   expression level of CCNA2 in Eca-109 cells and ESCC tumor tissue
   samples is significantly higher than in the control group,
   consistent with the findings of the bioinformatics analysis. Conversely,
   the expression of MAD2L1 did not show significant differences, which
   differs from the results of the bioinformatics analysis. These results
   are shown in Fig. 8.

Fig. 8.

   Expression validation of biomarkers. (A) qRT-PCR results. (B)
   Box-whisker plots showing IHC scores for CCNA2 and MAD2L1 in ESCC
   (n = 3) and control (n = 3) tissues. Individual sample scores are
   represented as dots. Statistical analysis was performed using the
   Mann-Whitney U test; * represents P < 0.05. (C) Immunohistochemical
   staining of CCNA2 and MAD2L1 in ESCC and normal esophageal epithelial
   tissues. Top two rows show CCNA2 expression in ESCC tissues (Patients
   1–3) and normal tissues (Patients 4–6). Bottom two rows show MAD2L1
   expression in the same patient samples. CCNA2 shows strong nuclear and
   cytoplasmic staining (brown) in ESCC tissues compared to normal
   tissues, while MAD2L1 shows minimal difference in staining intensity
   between ESCC and normal tissues. Patient demographics: Patients 1–3
   (ESCC): ages 56–67, 2 male/1 female, all stage II; Patients 4–6
   (normal): ages 52–65, 2 male/1 female. All images taken at 400×
   magnification

Discussion

   This study analyzed three gene chip datasets containing ESCC and
   normal esophageal epithelium and identified 282 differentially
   expressed genes between the ESCC and normal groups. Gene Ontology (GO)
   and Kyoto Encyclopedia of Genes and Genomes (KEGG) enrichment analyses
   indicated that these differentially expressed genes are primarily
   associated with the cell cycle. Organism stability depends on the
   normal division, proliferation, differentiation, and senescence of
   cells, and aberrations in the cell cycle can disrupt these processes.
   DNA metabolism is commonly regulated by various cytokines, growth
   factors, hormones, and oncogene products, which act by modulating the
   cell cycle; conversely, the expression of many genes is itself
   constrained by the cell cycle.

   A protein-protein interaction network of the candidate genes was
   constructed, and algorithms were used to select and verify hub genes.
   This identified CDK1, CCNA2, TOP2A, and MAD2L1, all of which were
   upregulated in ESCC. CDK1 and TOP2A have previously been reported to
   be involved in the occurrence and development of esophageal cancer
   [[99]13–[100]17].

   CDK1 is a critical regulator of the cell cycle. Ji et al. found that
   CDK1, as a positive regulator of the cell cycle, forms a highly
   expressed CyclinB1/CDK1 complex that accelerates the cell cycle,
   promotes cancer cell proliferation, and participates in the
   occurrence of esophageal cancer. Their study also examined the
   association of CDK1 with tumor suppressor genes and invasion-related
   genes, revealing close links between high expression of the
   CyclinB1/CDK1 complex, loss of tumor suppressor gene expression, and
   high expression of invasion-related genes. This suggests that high
   expression of the CyclinB1/CDK1 complex in esophageal cancer tissues
   may be an important factor promoting cancer cell proliferation and
   invasion. CDK1 also plays an important role in other cancers. In lung
   cancer, CDK1 has been shown to increase the number of stem cells among
   the cancer cells, thereby increasing resistance and the probability of
   recurrence [[101]18]. In patients with breast cancer, high CDK1
   expression is associated with an increased risk of death and
   recurrence. Inhibiting CDK1 significantly suppresses tumor cell
   proliferation, promotes tumor cell apoptosis, and in some cases
   enhances the efficacy of chemotherapy [[102]19].

   TOP2A is an important DNA topoisomerase II protein that catalyzes the
   breakage and rejoining of supercoiled double-stranded DNA. It is
   closely associated with processes such as cell division, chromosome
   separation, transcription, genetic recombination, mitotic chromosome
   pairing, and DNA damage and repair. TOP2A is involved in the
   occurrence and development of various types of cancer. Studies have
   found that TOP2A is highly expressed in esophageal cancer tissues and
   that its expression level is closely related to pathological
   characteristics such as lesion extent, degree of differentiation,
   lymph node metastasis, and clinical stage. In addition, Li et al.
   identified TOP2A expression, degree of differentiation, lymph node
   metastasis, and clinical stage as important risk factors affecting
   the long-term prognosis of esophageal cancer patients.

   Cyclin A2, encoded by the human CCNA2 gene on chromosome 4, is an
   important cell cycle regulatory protein belonging to the highly
   conserved cyclin family. It mainly promotes cell entry into the S
   phase and the G2/M phase and is also related to cytoskeleton dynamics
   and cell movement [[103]20]. In triple-negative breast cancer cells,
   high CCNA2 expression is driven by the transcription factor E2F1,
   which binds the CCNA2 promoter region, and promotes tumor cell
   proliferation, invasion, and metastasis [[104]21]. Jiang et al.
   [[105]22] reported correlations between the expression level and
   genetic alterations of CCNA2 and tumor heredity, progression, drug
   sensitivity, and tumor immunity. CCNA2 can also regulate immune cells
   and is associated with the tumor microenvironment and tumor escape.
   However, no related regulatory mechanisms have been reported in
   esophageal squamous cell carcinoma.

   MAD2L1 is a spindle checkpoint protein that plays an important role in
   mitosis. Wang et al. [[106]23] found that miR-30a-3p negatively
   regulates MAD2L1 expression and inhibits tumor cell proliferation in
   gastric cancer cells. In the present study, however, the qRT-PCR and
   IHC results did not support a close relationship between MAD2L1 and
   ESCC, and further evidence is required to clarify this relationship.

   Combining the transcription factor analysis with literature findings,
   c-Jun appears to be closely associated with CCNA2. Yang et al.
   [[107]24] found that c-Jun binds the AP1 and ATF sites in the CCNA2
   promoter region. Binding of c-Jun to the AP1 site reduces CCNA2
   promoter activity, while binding to the ATF site increases it.
   Tylophorine, which modulates the binding of c-Jun to these sites, can
   therefore affect CCNA2 expression. Specifically, tylophorine
   increases the binding of c-Jun to the AP1 site and reduces its
   binding to the ATF site, thereby reducing CCNA2 expression. These
   results indicate that c-Jun regulates CCNA2 expression by binding to
   the CCNA2 promoter and altering its activity. Another study
   [[108]25] supports this conclusion, demonstrating that c-Jun
   regulates CCNA2 expression by directly binding to the ATF site of the
   CCNA2 promoter, and that CCNA2 expression is one of the factors
   required for c-Jun-induced anchorage-independent growth. The
   cytoplasmic oncogenes Ras and Src also regulate CCNA2 promoter
   activity through the ATF site, and this process depends on the
   presence of c-Jun. This provides a direction for further study, and
   the relationship of other transcription factors with CCNA2 awaits
   further investigation.

   The ceRNA network elucidates the miRNA-lncRNA regulatory network
   involving CCNA2 and MAD2L1. The construction of the ceRNA network
   represents not only a technical approach but also an endeavor to gain a
   deeper understanding of the mechanisms underlying cancer. In future
   studies, the clinical potential of the ceRNA network will be validated
   through further experimental evidence and clinical practice, offering
   new biological targets and directions for the diagnosis, treatment, and
   prognosis of ESCC patients.

   Immune infiltration analysis of the data samples revealed a close
   association between ESCC and plasma cells, CD8+ T cells, monocytes, M0
   macrophages, M1 macrophages, dendritic cells, and resting mast cells.
   These immune cell types may be involved in immune evasion, tumor cell
   elimination, the formation of the tumor microenvironment, promotion of
   tumor invasion and growth, as well as immune surveillance and antitumor
   immune responses in esophageal cancer. Resting mast cells may be
   related to tumor angiogenesis and inflammatory responses. Prior
   research has suggested a favorable prognosis associated with
   significant infiltration of CD8+ effector T cells
   [[109]26–[110]28]. Through single-cell sequencing, Zheng et al. found
   that NK cells, exhausted T cells, alternatively activated macrophages,
   regulatory T cells (Tregs), and tolerogenic dendritic cells play
   dominant roles in the tumor microenvironment (TME). They also
   observed a continuous progression of CD8+ T cells from pre-exhaustion
   to exhaustion [[111]29]. Thus, studying the mechanisms and
   interactions of these immune cell types may contribute to a deeper
   understanding of ESCC formation and treatment. Wang et al. [[112]30]
   found that T cells play a significant role in cellular immunity within
   the ESCC TME. CD4+ Th cells secrete abundant anti-inflammatory
   factors such as IL-10, which may promote the conversion of B cells to
   IgG4-expressing plasma cells. Higher densities of IgG4+ cells have
   been associated with improved patient survival and prognosis.
   Moreover, the inflammatory mediators produced by monocytes and
   macrophages play critical roles in aberrant regulation of oncogenes,
   immune evasion, and the metastatic process.

   Drug prediction analysis identified seven drugs associated with
   CCNA2, including ethinyl estradiol, seliciclib, and tamoxifen.
   Ethinyl estradiol, a synthetic steroidal estrogen, has been shown to
   stimulate liver cell proliferation through the involvement of c-myc
   and cyclin A2 [[113]31].

   Seliciclib, a small molecule compound belonging to the class of
   cyclin-dependent kinase (CDK) inhibitors, primarily regulates the cell
   cycle by inhibiting CDKs, particularly CDK2 and CDK7, to disrupt
   uncontrolled cancer cell growth [[114]32].

   Tamoxifen, used to treat breast cancer, is a selective estrogen
   receptor modulator (SERM) that acts by binding to estrogen receptors in
   breast cancer cells to block the action of estrogen and inhibit cancer
   cell growth [[115]33, [116]34]. These findings suggest that these drugs
   may hold promise for the treatment of ESCC, although further
   exploration is needed to confirm their therapeutic utility in ESCC.

   In this study, the bioinformatics prediction of CCNA2 relative
   expression levels was highly consistent with the experimental
   validation results, whereas the predicted and experimentally measured
   expression of MAD2L1 disagreed. This raises important questions about
   the reliability and accuracy of bioinformatics tools in predicting
   biological outcomes. While bioinformatics methods hold immense
   potential for identifying candidate biomarkers or therapeutic targets
   from large-scale genomic and transcriptomic data, this study
   emphasizes the crucial role of experimental validation in confirming
   and supplementing these predictions.

   Several factors may contribute to inconsistencies between experimental
   results and bioinformatics predictions. Firstly, the limitations of the
   bioinformatics algorithms and databases used for predictions need to be
   considered. These tools typically rely on limited information from
   existing datasets, which may not fully capture the complexity of
   biological systems. Additionally, variations in experimental
   conditions, sample heterogeneity, or technical factors during the
   validation experiments could also contribute to differential results.
   This highlights the complexity of interpreting biological processes and
   underscores the importance of multidisciplinary approaches in
   biological research.

   Future studies will focus on refining experimental design, validating
   the consistency of different bioinformatics tools, and advancing a more
   in-depth understanding of the biological functions of CCNA2 and MAD2L1
   in ESCC, to provide more accurate target information for treatment.

Conclusion

   This study indicates that CCNA2 and MAD2L1 may serve as biomarkers
   for ESCC, providing novel insights into the molecular mechanisms
   underlying ESCC pathogenesis and expanding the options for clinical
   diagnosis and treatment of ESCC. Furthermore, the drugs predicted for
   CCNA2 may offer new hope for ESCC patients in the future.

Author contributions

   Pandeng Wang conceived the study and conducted the experiments; Jianji
   Guo and Chenfan Guo analyzed and interpreted the data; Pandeng Wang,
   Chenfan Guo and Diemei Huang wrote the manuscript and revised it for
   important intellectual content; Tao Liu was responsible for data
   collection and proofread the paper. All authors read and approved the
   final manuscript.

Funding

   This work was supported by the Guangxi Natural Science Foundation
   (2023GXNSFAA026311), the Scientific research project of Guangxi Medical
   and Health Committee (Z-A20220382), the Scientific research project of
   Guangxi administration of traditional Chinese medicine
   (GXZYA20220159), the National Clinical Key Specialty Construction
   Project, the Clinical Key Specialty Construction Project in Guangxi
   Zhuang Autonomous Region, and the Clinical Key Discipline Construction
   Project in Guangxi Zhuang Autonomous Region.

Data availability

   All data used in this work are available from the corresponding
   author upon request.

Declarations

Ethics approval and consent to participate

   The study protocol was approved by the Ethics Committee of Guangxi
   International Zhuang Medicine Hospital (No. 2023-065-01). Informed
   consent was obtained from participants for participation in the
   study, and all methods were carried out in accordance with the
   Declaration of Helsinki.

Consent for publication

   Not applicable.

Competing interests

   The authors declare no competing interests.

Footnotes

   Publisher’s note

   Springer Nature remains neutral with regard to jurisdictional claims in
   published maps and institutional affiliations.

   Chenfan Guo, Pandeng Wang and Diemei Huang contributed equally to this
   work.

References