Abstract

Objective
To comprehensively investigate the expression patterns of CCNA2 and MAD2L1 in esophageal squamous cell carcinoma (ESCC) using bioinformatics methods.

Methods
Weighted gene co-expression network analysis (WGCNA), together with analyses of gene mutation, methylation level distribution, and mRNA expression of ESCC-related genes in public databases, was employed to investigate potential prognostic biomarkers for ESCC. Finally, qRT-PCR and immunohistochemistry were performed for validation.

Results
Four hub genes were ultimately identified: CDK1, CCNA2, TOP2A, and MAD2L1. Bioinformatics analysis showed high expression of these four genes in ESCC (P < 0.05). CCNA2 and MAD2L1 were selected for subsequent analysis based on the literature. Single-gene enrichment analysis revealed significant enrichment of CCNA2 and MAD2L1 in pathways related to splicing, bladder cancer, non-homologous end joining and homologous recombination, glycosaminoglycan biosynthesis (chondroitin sulfate), progesterone-mediated oocyte maturation, and mismatch repair. The PASTAA database indicated the involvement of transcription factors such as Roralpha1, Pou6f1, Roralpha2, Atf-1, Pax-3, C/ebpalpha, and Nkx2-1 in the regulation of CCNA2, while no transcription factors were predicted for MAD2L1. Immune infiltration analysis revealed a close association between ESCC and plasma cells, CD8+ T cells, monocytes, M0 macrophages, M1 macrophages, dendritic cells, and resting mast cells. Drug prediction for CCNA2 yielded 7 drugs, including ETHINYL ESTRADIOL, Seliciclib, and TAMOXIFEN, while no drugs were predicted for MAD2L1. qRT-PCR and immunohistochemistry demonstrated high expression of CCNA2 in ESCC, while MAD2L1 showed no significant difference between ESCC and normal esophageal squamous epithelial tissues.
Conclusion
CCNA2 and MAD2L1 may be potential biomarkers for ESCC, providing a novel basis for understanding the molecular mechanisms underlying ESCC pathogenesis. Additionally, the drugs predicted for CCNA2 may offer new therapeutic options for ESCC patients in the future.

Keywords: Esophageal squamous cell carcinoma, Bioinformatics, CCNA2, MAD2L1

Introduction

Esophageal cancer is a malignant tumor originating in the esophagus, characterized by high malignancy and poor prognosis. It is one of the most common malignancies worldwide, ranking 7th in incidence and 6th in mortality among all cancers [1]. China is a high-risk area for esophageal cancer, with esophageal squamous cell carcinoma (ESCC) being the predominant pathological type, accounting for approximately 84% of clinical diagnoses [2]. Currently, the treatment efficacy for ESCC is poor, with a median overall survival of 10–12 months [3–5]. The etiology of ESCC is complex and not yet fully understood. In recent years, reports and applications related to ESCC biomarker research have proliferated. Traditional experimental methods, such as immunohistochemistry, reverse transcription-polymerase chain reaction, and serum proteomics, have historically been used in ESCC tumor marker studies. With the popularization of gene chip technology and high-throughput sequencing, coupled with proteomics and bioinformatics, increasing attention has been paid to the exploration of ESCC biomarkers. However, truly clinically applicable markers remain scarce. Therefore, the exploration of effective ESCC-related tumor markers plays a crucial role in diagnosing early-stage ESCC, guiding treatment, evaluating prognosis, monitoring recurrence, and implementing comprehensive multidisciplinary treatment of ESCC patients.
Esophageal squamous cell carcinoma (ESCC) is a malignancy that originates from the epithelial tissue of the esophagus, often presenting with symptoms such as dysphagia and pain on swallowing [6]. Globally, ESCC is among the common malignant tumors and is particularly prevalent in parts of Asia, including China, Japan, and Iran, where its incidence is notably high. The etiology of ESCC involves various factors, including genetic factors, environmental influences, and dietary habits [7]. Most patients are diagnosed at an advanced stage, leading to significant treatment challenges and poorer prognosis. While conventional treatments such as surgery, radiotherapy, and chemotherapy have shown some efficacy in cancer management, they have clear limitations, particularly systemic therapy that relies primarily on chemotherapy. Over the years, the clinical effectiveness of these traditional treatments has improved only modestly, so enhancing treatment outcomes has become a crucial challenge in current research. Personalized and targeted therapies for ESCC, however, are still emerging. Advances in the biomedical field, including the widespread application of genomics, proteomics, and transcriptomics, have accelerated in-depth research into the mechanisms of tumor occurrence and development, offering new perspectives and opportunities for clinical treatment. In this context, bioinformatics analysis tools make it possible to explore potential ESCC biomarkers comprehensively and systematically, supporting personalized treatment and precision medicine. With the continuous progress of molecular biology and genomics technologies, the molecular mechanisms of ESCC are increasingly well understood, such as the close relationship between variations in genes like p53 [8], EGFR [10], and MYC [10–12] and the occurrence and development of tumors.
Simultaneously, the application of new technologies such as liquid biopsy, high-throughput sequencing, and single-cell genomics has brought about breakthroughs in the early diagnosis and prognostic assessment of ESCC. Overall, ESCC research still faces numerous challenges, including difficulties in early diagnosis, poor treatment outcomes, and a high recurrence rate. Bioinformatics analysis, combined with comprehensive analysis of multi-omics data such as genomics and transcriptomics, offers new perspectives and methods to address these challenges, opening new pathways for improving the survival and quality of life of ESCC patients. This study aims to explore genes with aberrant expression in ESCC patients, elucidate potential molecular mechanisms underlying ESCC pathogenesis, and provide fresh insights for molecular targeted therapy in ESCC.

Materials and methods

Source of data
All eligible datasets were downloaded from the Gene Expression Omnibus (GEO) database. The key phrase "esophageal squamous cell carcinoma" was used to search for relevant gene expression datasets, limited to the species "Homo sapiens". Inclusion criteria were as follows: (1) the dataset contained at least ESCC and control groups; (2) the sample type was mRNA; (3) there were no duplicate data within the included datasets. Ultimately, three datasets were included: GSE38129 (n = 60, 30 ESCC and 30 control samples) and GSE20347 (n = 34, 17 ESCC and 17 control samples) were used as the training sets, with GSE23400 (n = 106, 53 ESCC and 53 control samples) serving as the validation set. For the GSE38129 and GSE20347 datasets, the inSilicoMerging R package was used to merge the datasets and remove batch effects, yielding a merged, batch-corrected expression matrix.

Dataset integration details
For dataset integration, we first performed quality control by removing probes with missing values in more than 20% of samples. The remaining missing values were imputed using the k-nearest neighbor algorithm (k = 10). Each dataset was independently preprocessed using robust multi-array average (RMA) normalization. For cross-platform integration, we mapped probe IDs to Entrez Gene IDs using the corresponding annotation packages (hgu133plus2.db and hgu133a.db). Only genes detected on all platforms were retained for subsequent analysis. The datasets were merged using the inSilicoMerging R package (v1.26.0) with the following parameters: method = "COMBAT" for batch effect removal, standardization = TRUE, plot = FALSE. We confirmed successful batch effect removal using principal component analysis and UMAP visualization, as shown in Fig. 1.

Fig. 1. The distribution of data before and after batch effect removal. (A) Density plots: the left panel shows the data distribution before batch effect removal, with notable differences between datasets; the right panel shows converged distributions after batch effect removal. (B) UMAP plots: the left panel demonstrates dataset-specific clustering before batch effect removal; the right panel shows intermingled samples across datasets after batch effect removal, indicating successful correction.

Screening and functional annotation of DEGs
Differential gene expression analysis of the merged ESCC dataset was performed using the R package limma (v3.40.6). For the merged expression matrix, we fitted a linear model using the lmFit function, with the design matrix specifying tumor and normal groups. Empirical Bayes moderation was applied using the eBayes function with parameters trend = TRUE and robust = TRUE to account for heteroscedasticity. P-values were adjusted for multiple testing using the Benjamini-Hochberg method to control the false discovery rate (FDR). Genes with |log2FC| ≥ 2 and adjusted P < 0.05 were considered significantly differentially expressed genes (DEGs). Prior to analysis, low-expression genes (genes with counts below 10 in more than 50% of samples) were filtered out to improve statistical power. The DEGs between the ESCC and control groups were visualized as volcano plots using the EnhancedVolcano package (v1.8.0) with parameters pCutoff = 0.05, FCcutoff = 2, pointSize = 2, labSize = 3.

Screening of key module genes by WGCNA
In the training cohort, a co-expression network was created using WGCNA (v 1.7.0). First, the samples were clustered to remove outliers and ensure the accuracy of the analysis. Next, the optimal soft-threshold power (β) was selected so that the network approximated a scale-free distribution. The cluster dendrogram was then generated by calculating adjacency and similarity, and modules were partitioned using the dynamic tree cutting algorithm. Subsequently, we assessed the correlation between each module and ESCC and chose the module with the highest relevance to ESCC as the key module. The genes in the key module were designated key module genes for subsequent analysis. Finally, the key module genes were analyzed for functional enrichment.

Screening of candidate genes
The candidate genes were obtained by intersecting the key module genes with the DEGs and were used for subsequent analysis.
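
The selection rule described above (Benjamini-Hochberg adjustment, |log2FC| ≥ 2 with adjusted P < 0.05, then intersection with the key module genes) can be sketched as follows. This is only an illustrative Python sketch of the filtering logic, not the limma/WGCNA R code used in the study; the function names and the toy gene labels and values (GENE_A to GENE_F) are hypothetical and were chosen purely for demonstration.

```python
from typing import Dict, List, Set, Tuple


def benjamini_hochberg(p_values: List[float]) -> List[float]:
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


def select_degs(
    stats: Dict[str, Tuple[float, float]],
    lfc_cutoff: float = 2.0,
    alpha: float = 0.05,
) -> Set[str]:
    """Keep genes with |log2FC| >= lfc_cutoff and adjusted P < alpha.

    `stats` maps a gene label to (log2 fold change, raw p-value).
    """
    genes = list(stats)
    adj = benjamini_hochberg([stats[g][1] for g in genes])
    return {
        g for g, q in zip(genes, adj)
        if abs(stats[g][0]) >= lfc_cutoff and q < alpha
    }


def candidate_genes(degs: Set[str], module_genes: Set[str]) -> Set[str]:
    """Candidate genes are the intersection of DEGs and key module genes."""
    return degs & module_genes


# Hypothetical toy input: gene label -> (log2FC, raw p-value).
toy_stats = {
    "GENE_A": (2.5, 0.001),
    "GENE_B": (-3.1, 0.004),
    "GENE_C": (1.2, 0.03),   # significant, but |log2FC| < 2
    "GENE_D": (2.2, 0.02),
    "GENE_E": (4.0, 0.6),    # large fold change, but not significant
}
toy_module = {"GENE_B", "GENE_D", "GENE_F"}  # hypothetical key module genes

toy_degs = select_degs(toy_stats)
print(sorted(candidate_genes(toy_degs, toy_module)))  # ['GENE_B', 'GENE_D']
```

In this toy example, GENE_C fails the fold-change cutoff and GENE_E fails the adjusted P cutoff, so only genes passing both filters and belonging to the module remain as candidates.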

GO functional enrichment analysis and KEGG pathway analysis of candidate genes
For the functional enrichment analysis of candidate genes, we used the gene-GO annotations from the R package org.Hs.eg.db (version 3.1.0) and retrieved the latest gene annotations for KEGG pathways from the KEGG REST API. The R package clusterProfiler (version 3.14.3) was then used to conduct the enrichment analysis. The Gene Ontology (GO) system encompasses three main components: biological processes, molecular functions, and cellular components.

Construction of protein-protein interaction regulatory network
The candidate genes were uploaded to the STRING online database to generate a protein-protein interaction network. Cytoscape software was used to analyze the interactions among the candidate proteins. The network was then screened and scored using the MNC, DEGREE, and EPC algorithms to identify the top 10 genes ranked by each algorithm, and the intersection of these three sets yielded the hub genes.

Expression analysis and validation of hub genes through ROC analysis
Based on the validation set GSE23400, the hub genes were subjected to ROC analysis using the R package pROC (version 1.17.0.1) to obtain the AUC, thereby validating the efficacy of the hub genes. To verify the expression of these biomarkers in ESCC and control tissues, we conducted an expression analysis in the validation cohort.

Gene set enrichment analysis (GSEA)
GSEA was performed with clusterProfiler (v 4.0.2) to elucidate the enriched regulatory pathways and biological functions of the biomarkers, with adjusted P < 0.05 considered significant. The top 5 most significant KEGG results were visualized.

Transcription factors (TFs) analysis
The TFs targeting the biomarkers were predicted using the PASTAA database.
Then, the association between biomarkers and TFs was assessed using p-values calculated from hypergeometric distributions. Subsequently, the JASPAR database was used to predict the DNA binding sites of the TFs.

Creation of a ceRNA regulatory network
The miRWalk and starBase databases were used to predict miRNAs targeting the biomarkers. The miRNAs common to the two databases were retained as the co-miRNA prediction results. Targeting relationships between lncRNAs and miRNAs were predicted using the starBase and miRNet databases. Similarly, the lncRNAs predicted by both databases were retained as co-lncRNAs. Finally, the lncRNA-miRNA-mRNA network was constructed.

Immune-infiltration analysis
Using the ESCC expression profile data, the CIBERSORTx method from the R package IOBR was applied to calculate the scores of 22 immune infiltrating cell types in each sample. This established the immune infiltration levels of the samples in the dataset and allowed the relationship between hub genes and immune cells to be examined.

Creation of biomarker-drug interaction network
To uncover new therapeutic targets for ESCC treatment, we predicted drugs for the biomarkers. First, drugs targeting the biomarkers were predicted using the DGIdb database (https://dgidb.org/). A biomarker-drug network was then constructed from the predicted results.

RNA isolation and quantitative real-time polymerase chain reaction (qRT-PCR)
Six blood samples (3 ESCC samples and 3 control samples) were collected. The samples were lysed using TRIzol reagent, and total RNA was isolated according to the manufacturer's instructions. RNA was then reverse transcribed into cDNA using the RevertAid Master Mix with DNase I (Thermo Scientific, USA).
The qRT-PCR reaction comprised 2 µL of reverse transcription product, 10 µL of 2× Universal Blue SYBR Green qPCR Master Mix, 0.4 µL each of forward and reverse primer, and 7.2 µL of DEPC-treated water. All primer sequences are presented in Table 1. GAPDH served as the internal reference gene, and the relative expression of biomarkers was determined using the 2^−ΔΔCT method. GraphPad Prism 8.0.2 was used to generate the graphs and calculate p-values.

Table 1. Primer sequences
Gene | Sequence (5′–3′)
CCNA2 F | CGAAGACGAGACGGGTTGC
CCNA2 R | CATGAATGGTGAACGCAGGC
MAD2L1 F | GCAAAAGATGACAGTGCACCC
MAD2L1 R | ACCGTAGCTGTGATCTGTCTG
GAPDH F | CAGGAGGCATTGCTGATGAT
GAPDH R | GAAGGCTGGGGCTCATTT

IHC
Six tissue samples (3 ESCC samples and 3 control samples) were collected. Tissue blocks of appropriate size were cut, placed with their flat surfaces facing down in plastic embedding cassettes, and subjected to standard dehydration, paraffin embedding, and processing procedures. The embedded tissue samples were frozen at −20 °C to achieve appropriate hardness for sectioning. The section thickness was set at 3 μm to ensure firm attachment of the sections and favorable microscopic observation. Sections were mounted on slides in a water bath, baked, and deparaffinized. Immunohistochemical staining then involved hydration, antigen retrieval, blocking, incubation with primary and secondary antibodies, color development (DAB staining), counterstaining with hematoxylin, dehydration, clearing, and coverslipping. Finally, the relevant areas of the samples were imaged and analyzed using a microscope. The antibody information is presented in Table 2.

Table 2. Immunohistochemistry antibody information
Gene | Species | Source | Catalog Number | Dilution
CCNA2 | Rabbit | Bioswamp | PAB33497 | 1:200
MAD2L1 | Rabbit | Bioswamp | PAB40076 | 1:100

IHC slides were independently evaluated by two certified pathologists who were blinded to the clinical information. The scoring system was based on both staining intensity and percentage of positive cells. Staining intensity was scored as 0 (negative), 1 (weak), 2 (moderate), or 3 (strong). The percentage of positive cells was scored as 0 (< 5%), 1 (5–25%), 2 (26–50%), 3 (51–75%), or 4 (> 75%). The final IHC score was calculated by multiplying the intensity score by the percentage score, giving a range from 0 to 12. Discrepancies between the two pathologists were resolved by consensus.

Statistical analysis
All bioinformatics analyses were conducted in R. Spearman correlation analysis was used for correlation analyses, and the Wilcoxon test was applied to compare data between groups.

Results

Removal of batch effects
The density plot (Fig. 1A) shows that, before the removal of batch effects, the distribution of samples differed markedly across datasets, indicating the presence of batch effects. After removal, the data distributions across the datasets converged, with more similar means and variances. The UMAP plot (Fig. 1B) shows that, before removal, samples from each dataset tended to cluster separately, again suggesting batch effects, whereas after removal the samples from each dataset were intermingled, indicating successful mitigation of batch effects.
While most samples integrated well after batch correction, a small cluster of GSE38129 samples from elderly patients with advanced disease remained somewhat distinct, potentially reflecting biological heterogeneity within ESCC rather than technical variation.

Identification of candidate genes
The differential expression analysis identified 860 differentially expressed genes between the ESCC and normal groups. As shown in the volcano plot in Fig. 2A, there were 419 upregulated genes and 441 downregulated genes.

Fig. 2. Identification of candidate genes. (A) Volcano plot of differential gene analysis. (B) Soft threshold plot. (C) Sample clustering plot. (D) Gene module diagram. (E) Correlation heatmap of gene modules with clinical data. (F) Correlation coefficients of gene modules with clinical data. (G) Venn diagram for candidate gene identification.

The soft threshold calculated in R was β = 8 (scale-free fit index 0.86). The soft thresholding result is displayed in Fig. 2B, and the sample clustering result is presented in Fig. 2C. Based on the soft threshold, we constructed gene modules (Fig. 2D) and built a co-expression network. The association between the gene modules and clinical data is depicted as a heatmap (Fig. 2E). We chose the blue module, which was significantly positively correlated with ESCC, for further analysis. This module comprises 1434 genes and is strongly correlated with ESCC (r = 0.82, P < 0.01) (Fig. 2F). A Venn diagram was used to filter candidate genes, resulting in a set of 282 intersecting differentially expressed genes, including Cyclin-Dependent Kinase 1 (CDK1), DNA Topoisomerase II Alpha (TOP2A), Cyclin A2 (CCNA2), and Mitotic Arrest Deficient 2 Like 1 (MAD2L1), as shown in Fig. 2G.

GO functional enrichment analysis and KEGG pathway analysis of candidate genes
The results of the functional and pathway enrichment analyses are presented in Fig. 3A and B, respectively. The functional enrichment results show that the candidate genes are primarily enriched in terms related to the cell cycle, mitotic cell cycle, cell cycle process, cell division, and chromosome. The pathway enrichment analysis primarily involves the cell cycle, DNA replication, the p53 signaling pathway, the IL-17 signaling pathway, and mismatch repair.

Fig. 3. Functional enrichment and pathway analysis results. (A) GO functional enrichment analysis. (B) KEGG pathway enrichment analysis.

Construction of the protein-protein interaction regulatory network
The candidate genes were analyzed using the STRING online database, and Cytoscape software was used to explore the interactions among the candidate proteins. This yielded a network containing 130 nodes and 1335 edges (Fig. 4A). Subsequently, the MNC, DEGREE, and EPC algorithms were used to screen and rank the top 10 genes in the network. The intersection of the results yielded four hub genes: CDK1, CCNA2, TOP2A, and MAD2L1 (Fig. 4B). These four hub genes may play a crucial role in the occurrence and development of ESCC.

Fig. 4. Candidate genes and intersection analysis results. (A) Protein-protein interaction network. (B) Venn diagram of algorithm intersections.

Validation of hub genes through ROC analysis and expression validation of biomarkers
The ROC validation analysis demonstrated robust diagnostic potential for all four hub genes. CDK1 showed an AUC of 0.927 (95% CI: 0.875–0.979, p < 0.001), indicating excellent discriminatory power between ESCC and normal tissues.
CCNA2 exhibited an AUC of 0.919 (95% CI: 0.864–0.974, p < 0.001), while TOP2A and MAD2L1 demonstrated AUC values of 0.906 (95% CI: 0.847–0.964, p < 0.001) and 0.888 (95% CI: 0.824–0.951, p < 0.001), respectively. All four hub genes exhibited AUC values greater than 0.85, suggesting their potential utility as diagnostic biomarkers for ESCC. Expression validation in the GSE23400 dataset revealed significant upregulation of all four hub genes in ESCC tissues compared to normal tissues (p < 0.001). Relative to normal esophageal tissues, CDK1 showed a 4.26-fold increase, CCNA2 a 3.87-fold increase, TOP2A a 4.58-fold increase, and MAD2L1 a 2.93-fold increase in expression in ESCC tissues. These findings corroborate the results from our training cohort, further supporting the potential significance of these genes in ESCC pathogenesis. The ROC results for the hub genes and their relative expression levels in the validation set are shown in Fig. 5A-D and E, respectively.

Fig. 5. ROC validation performance and expression levels of hub genes. (A-D) represent CDK1, CCNA2, TOP2A, and MAD2L1, respectively.

GSEA analysis of biomarkers and prediction of TFs
A review of the literature showed that CDK1 and TOP2A have been extensively studied in the context of ESCC, while research on CCNA2 and MAD2L1 is limited. Consequently, we focused on these two hub genes as target genes. The correlation coefficients of all genes with each target gene were calculated and used as the ranking criterion for GSEA. The GSEA results for CCNA2 and MAD2L1 are depicted in Fig. 6A-B.
CCNA2 was significantly enriched in the spliceosome, bladder cancer, cell cycle, non-homologous end joining, and homologous recombination pathways, while MAD2L1 was significantly enriched in the progesterone-mediated oocyte maturation, glycosaminoglycan biosynthesis (chondroitin sulfate), bladder cancer, spliceosome, and mismatch repair pathways.

Fig. 6. GSEA results and DNA binding sites. (A) CCNA2. (B) MAD2L1. (C) Pou6f1 family. (D) Atf-1 family.

The PASTAA database was used to predict the transcription factors for CCNA2 and MAD2L1, yielding the results shown in Table 3. The results indicated the involvement of transcription factor families such as Roralpha1, Pou6f1, Roralpha2, Atf-1, Pax-3, C/ebpalpha, and Nkx2-1 in CCNA2 regulation, while no transcription factors were predicted for MAD2L1. In addition, the JASPAR database was used to predict the DNA binding sites of the transcription factors; Fig. 6C-D shows the DNA binding sites for the Pou6f1 family and the Atf-1 family. The other families were not found in the JASPAR database.

Table 3. Prediction results from PASTAA database
Rank | Matrix | Transcription Factor | P-value
1 | RORA1_01 | Roralpha1 | 8.00e−04
2 | POU6F1_01 | Pou6f1 | 1.01e−03
3 | RORA2_01 | Roralpha2 | 1.91e−03
4 | ATF1_Q6 | Atf-1 | 3.39e−03
5 | PAX3_01 | Pax-3 | 4.10e−03
6 | CEBP_C | C/ebpalpha | 4.45e−03
7 | TITF1_Q3 | Nkx2-1 | 4.89e−03
8 | MIG1_01 | N/A | 5.58e−03
9 | ATF_B | N/A | 5.78e−03
10 | CREBP1CJUN_01 | Atf-2, C-jun | 6.76e−03

Establishment of ceRNA regulatory networks and immune infiltration analysis
Using miRWalk, the potential miRNAs associated with the screened hub genes were predicted. Starbase was also used to predict miRNAs for these genes. The miRNAs predicted in both databases were selected for further analysis, giving a total of 54 miRNAs (Fig. 7A-B). Based on the 54 miRNAs, Starbase and miRNet (default parameters) were used to predict the lncRNAs that might interact with the miRNAs. Only the lncRNAs predicted by both databases were retained, resulting in a total of 589 lncRNAs. A ceRNA (mRNA-miRNA-lncRNA) regulatory network comprising 189 nodes and 643 edges was constructed based on these relationships (one miRNA was excluded). The network consists of 2 mRNAs, 53 miRNAs, and 589 lncRNAs, as shown in Fig. 7C.

Fig. 7. Venn diagram of miRNA intersections and immune infiltration analysis. (A) CCNA2. (B) MAD2L1. (C) ceRNA network. (D) Immune infiltration analysis.

The immune infiltration analysis revealed a close association between ESCC and plasma cells, CD8+ T cells, naive CD4+ T cells, monocytes, M0 macrophages, M1 macrophages, and resting mast cells. This suggests that these seven types of immune cells play a vital role in the occurrence and development of ESCC (Fig. 7D).

Prediction of therapeutic agents of biomarkers
CCNA2 and MAD2L1 were input into the DGIdb database. CCNA2 was predicted to interact with seven drugs, including ETHINYL ESTRADIOL, Seliciclib, and TAMOXIFEN, whereas no drugs were predicted for MAD2L1. The names of the drugs, along with their indications and interaction scores, are presented in Table 4; the diagnostic performance metrics of the hub genes are summarized in Table 5.

Table 4. Drug prediction results
Gene | Drug | Indication | Interaction score
CCNA2 | CORDYCEPIN | / | 8.425721985
CCNA2 | GENISTEIN | / | 0.32406623
CCNA2 | ETHINYL ESTRADIOL | contraceptive | 0.99126141
CCNA2 | SURAMIN | / | 0.561714799
CCNA2 | SELICICLIB | antineoplastic agent | 1.404286998
CCNA2 | TNF-ALPHA | / | 0.702143499
CCNA2 | TAMOXIFEN | Hormonal, Antineoplastic Agents | 0.27625318

Table 5. Diagnostic performance metrics of hub genes
Gene | AUC (95% CI) | Optimal Cutoff | Sensitivity | Specificity | PPV | NPV | Accuracy
CDK1 | 0.927 (0.875–0.979) | 8.43 | 0.868 | 0.925 | 0.921 | 0.874 | 0.896
CCNA2 | 0.919 (0.864–0.974) | 7.65 | 0.849 | 0.906 | 0.900 | 0.857 | 0.877
TOP2A | 0.906 (0.847–0.964) | 9.12 | 0.830 | 0.943 | 0.936 | 0.847 | 0.887
MAD2L1 | 0.888 (0.824–0.951) | 6.38 | 0.811 | 0.887 | 0.878 | 0.825 | 0.849
Note: PPV = Positive Predictive Value; NPV = Negative Predictive Value. All metrics were calculated at the optimal cutoff point determined by Youden's J statistic.

Experimental validation of biomarker expression
To validate the expression of the biomarkers, 3 pairs of ESCC and control blood and tissue samples were collected, and qRT-PCR and IHC were performed to assess expression changes between the ESCC and control groups. The qRT-PCR and IHC results demonstrate that the expression level of CCNA2 in Eca-109 cells and ESCC tumor tissue samples is significantly higher than in the control group, consistent with the findings of the bioinformatics analysis. Conversely, the expression of MAD2L1 did not differ significantly, in contrast to the results of the bioinformatics analysis. These results are shown in Fig. 8.

Fig. 8. Expression validation of biomarkers. (A) qRT-PCR results. (B) Box-whisker plots showing IHC scores for CCNA2 and MAD2L1 in ESCC (n = 3) and control (n = 3) tissues. Individual sample scores are represented as dots. Statistical analysis was performed using the Mann-Whitney U test. * represents P < 0.05. (C) Immunohistochemical staining of CCNA2 and MAD2L1 in ESCC and normal esophageal epithelial tissues. Top two rows show CCNA2 expression in ESCC tissues (Patients 1–3) and normal tissues (Patients 4–6). Bottom two rows show MAD2L1 expression in the same patient samples.
CCNA2 shows strong nuclear and cytoplasmic staining (brown) in ESCC tissues compared to normal tissues, while MAD2L1 shows minimal difference in staining intensity between ESCC and normal tissues. Patient demographics: Patients 1–3 (ESCC): ages 56–67, 2 male/1 female, all stage II; Patients 4–6 (normal): ages 52–65, 2 male/1 female. All images taken at 400× magnification.

Discussion

This study analyzed three gene chip datasets containing ESCC and normal esophageal epithelium and identified 282 candidate genes that were differentially expressed between ESCC and the normal group. The Gene Ontology (GO) and Kyoto Encyclopedia of Genes and Genomes (KEGG) enrichment results indicate that these genes are primarily associated with the cell cycle. The stability of an organism depends on the normal division, proliferation, differentiation, and senescence of cells, and aberrations in the cell cycle can disrupt these processes. DNA metabolism is commonly regulated by various cytokines, growth factors, hormones, and oncogene products, which act by modulating the cell cycle; in turn, the expression of many genes is itself constrained by the cell cycle. By constructing a protein-protein interaction network of candidate genes and using algorithms to select and verify hub genes, we identified CDK1, CCNA2, TOP2A, and MAD2L1, all of which were upregulated in ESCC. CDK1 and TOP2A have previously been reported to be involved in the occurrence and development of esophageal cancer [13–17]. CDK1 is a critical regulator of the cell cycle. Ji et al. found that CDK1, as a positive regulator of the cell cycle, forms a highly expressed CyclinB1/CDK1 complex that accelerates the cell cycle, promotes cancer cell proliferation, and participates in the occurrence of esophageal cancer.
Their study also investigated the association of CDK1 with tumor suppressor genes and invasion-related genes, revealing close links between high expression of the CyclinB1/CDK1 complex, loss of tumor suppressor gene expression, and high expression of invasion-related genes. This suggests that high expression of the CyclinB1/CDK1 complex in esophageal cancer tissues may be an important factor in promoting cancer cell proliferation and invasion. CDK1 also plays an important role in various other cancers. Research [18] has shown that CDK1 can increase the number of stem cells among lung cancer cells, thereby increasing drug resistance and the probability of recurrence. In patients with breast cancer, high CDK1 expression is associated with an increased risk of death and recurrence. Inhibiting CDK1 significantly suppresses tumor cell proliferation, promotes tumor cell apoptosis, and in some cases can enhance the efficacy of chemotherapy [19].

TOP2A is an important DNA topoisomerase II protein that catalyzes the breakage and rejoining of supercoiled double-stranded DNA and is closely associated with processes such as cell division, chromosome separation, transcription, genetic recombination, mitotic chromosome pairing, and DNA damage and repair. TOP2A is involved in the occurrence and development of various types of cancer. Studies have found that TOP2A is highly expressed in esophageal cancer tissues and that its expression level is closely related to pathological characteristics such as lesion extent, degree of differentiation, lymph node metastasis, and clinical stage. In addition, Li et al. identified TOP2A expression, degree of differentiation, lymph node metastasis, and clinical stage as important risk factors affecting the long-term prognosis of esophageal cancer patients.
Cyclin A2, encoded by the human CCNA2 gene on chromosome 4, is an important cell cycle regulatory protein belonging to the highly conserved cyclin family. It mainly promotes cell entry into the S phase and G2/M phase and is also related to cytoskeleton dynamics and cell movement [20]. In triple-negative breast cancer cells, high CCNA2 expression is regulated by the closely associated transcription factor E2F1, which binds to the CCNA2 promoter region, promoting tumor cell proliferation, invasion, and metastasis [21]. Research by Jiang et al. [22] revealed correlations between the expression level and genetic alterations of CCNA2 and aspects such as tumor heredity, progression, drug sensitivity, and tumor immunity. Additionally, CCNA2 can regulate immune cells and is associated with the tumor microenvironment and tumor escape; however, no related regulatory mechanisms have been reported in esophageal squamous cell carcinoma. MAD2L1 is a spindle checkpoint protein that plays an important role in mitosis. Wang et al. [23] found that miR-30a-3p negatively regulates MAD2L1 expression and inhibits tumor cell proliferation in gastric cancer cells. However, the qRT-PCR and IHC results of the present study do not indicate a close relationship between MAD2L1 and ESCC, and further evidence is required for confirmation. Combining the transcription factor analysis with findings from the literature, c-Jun was noted to be closely associated with CCNA2. Yang et al. [24] found that c-Jun is associated with the AP1 and ATF binding sites in the CCNA2 promoter region. Binding of c-Jun to the AP1 site reduces CCNA2 promoter activity, while binding to the ATF site increases it. Tylophorine, which regulates the binding of c-Jun to the AP1 and ATF sites, can affect the expression of CCNA2.
Specifically, tylophorine can increase the binding of c-Jun to the AP1 site and reduce the binding of c-Jun to the ATF site, thereby reducing the expression of CCNA2. These results indicate that c-Jun regulates CCNA2 expression by binding to the CCNA2 promoter and modulating its activity. Another study [25] also supports this conclusion, demonstrating that c-Jun regulates the expression of CCNA2 by directly binding to the ATF site of the CCNA2 promoter, and that CCNA2 expression is one of the factors required for c-Jun-induced anchorage-independent growth. The cytoplasmic oncogenes Ras and Src also regulate CCNA2 promoter activity through the ATF site, and this process depends on the presence of c-Jun. This provides a direction for further study, and the relationship of other transcription factors with CCNA2 awaits further investigation. The ceRNA network elucidates the miRNA-lncRNA regulatory network involving CCNA2 and MAD2L1. Constructing the ceRNA network is not only a technical approach but also an effort to gain a deeper understanding of the mechanisms underlying cancer. In future studies, the clinical potential of the ceRNA network will be validated through further experimental evidence and clinical practice, offering new biological targets and directions for the diagnosis, treatment, and prognosis of ESCC patients. Immune infiltration analysis of the data samples revealed a close association between ESCC and plasma cells, CD8+ T cells, monocytes, M0 macrophages, M1 macrophages, dendritic cells, and resting mast cells. These immune cell types may be involved in immune evasion, tumor cell elimination, the formation of the tumor microenvironment, promotion of tumor invasion and growth, as well as immune surveillance and antitumor immune responses in esophageal cancer. Resting mast cells may be related to tumor angiogenesis and inflammatory responses.
Prior research has suggested that significant infiltration of CD8 effector T cells is associated with a favorable prognosis [26–28]. Through single-cell sequencing, Zheng et al. found that NK cells, exhausted T cells, alternatively activated macrophages, regulatory T cells (Tregs), and tolerogenic dendritic cells play dominant roles in the tumor microenvironment (TME). They also discovered a continuous progression of CD8 T cells from pre-exhaustion to exhaustion [29]. Thus, studying the mechanisms and interactions of these immune cell types may contribute to a deeper understanding of ESCC formation and treatment. Wang et al. [30] found that T cells play a significant role in cellular immunity within the ESCC TME. CD4+ Th cells secrete abundant anti-inflammatory factors such as IL-10, which may promote the conversion of B cells to IgG4-expressing plasma cells. Higher densities of IgG4+ cells have been associated with improved patient survival and prognosis. Moreover, the inflammatory mediators produced by monocytes and macrophages also play critical roles in the aberrant regulation of oncogenes, immune evasion, and the metastatic process. Drug prediction analysis revealed a close association between CCNA2 and seven drugs, including ethinyl estradiol, seliciclib, and tamoxifen. Ethinyl estradiol, a synthetic steroidal estrogen, has been shown to stimulate liver cell proliferation through the involvement of c-myc and cyclin A2 [31]. Seliciclib, a small-molecule cyclin-dependent kinase (CDK) inhibitor, primarily regulates the cell cycle by inhibiting CDKs, particularly CDK2 and CDK7, to disrupt uncontrolled cancer cell growth [32]. Tamoxifen, used to treat breast cancer, is a selective estrogen receptor modulator (SERM) that binds to estrogen receptors in breast cancer cells, blocking the action of estrogen and inhibiting cancer cell growth [33, 34].
These findings suggest that these drugs may hold promise for the diagnosis and treatment of ESCC, although further exploration is needed to confirm their utility as diagnostic biomarkers or therapeutic targets in ESCC. In this study, the bioinformatics prediction of CCNA2 relative expression was highly consistent with the experimental validation results, whereas for MAD2L1 the bioinformatics prediction and the experimental validation disagreed. This raises important questions about the reliability and accuracy of bioinformatics tools in predicting biological outcomes. While bioinformatics methods hold immense potential for identifying biomarkers or therapeutic targets from large-scale genomic and transcriptomic data, this study emphasizes the crucial role of experimental validation in confirming and supplementing these predictions. Several factors may contribute to inconsistencies between experimental results and bioinformatics predictions. First, the limitations of the bioinformatics algorithms and databases used for prediction need to be considered. These tools typically rely on limited information from existing datasets, which may not fully capture the complexity of biological systems. Second, variations in experimental conditions, sample heterogeneity, or technical factors during the validation experiments could also contribute to the differing results. This highlights the complexity of interpreting biological processes and underscores the importance of multidisciplinary approaches in biological research. Future studies will focus on refining the experimental design, validating the consistency of different bioinformatics tools, and gaining a more in-depth understanding of the biological functions of CCNA2 and MAD2L1 in ESCC, to provide more accurate target information for treatment.
Conclusion The study indicates that CCNA2 and MAD2L1 may serve as potential biomarkers for ESCC, providing novel insights into the molecular mechanisms underlying ESCC pathogenesis. This also expands the scope for clinical diagnosis and treatment of ESCC. Furthermore, the potential drugs predicted for CCNA2 may offer new hope for ESCC patients in the future.
Author contributions Pandeng Wang conceived the study and conducted the experiments; Jianji Guo and Chenfan Guo analyzed and interpreted the data; Pandeng Wang, Chenfan Guo and Diemei Huang wrote the manuscript and revised it for important intellectual content. Tao Liu was responsible for data collection and proofread the paper. All authors read and approved the final manuscript.
Funding This work was supported by the Guangxi Natural Science Foundation (2023GXNSFAA026311), the Scientific Research Project of Guangxi Medical and Health Committee (Z-A20220382), the Scientific Research Project of Guangxi Administration of Traditional Chinese Medicine (GXZYA20220159), the National Clinical Key Specialty Construction Project, the Clinical Key Specialty Construction Project in Guangxi Zhuang Autonomous Region, and the Clinical Key Discipline Construction Project in Guangxi Zhuang Autonomous Region.
Data availability All data utilized in this work can be obtained from the corresponding author upon request.
Declarations
Ethics approval and consent to participate The study protocol was approved by the Ethics Committee of Guangxi International Zhuang Medicine Hospital (No. 2023-065-01). Informed consent was obtained from participants for participation in the study, and all methods were carried out in accordance with the Declaration of Helsinki.
Consent for publication Not applicable.
Competing interests The authors declare no competing interests.
Footnotes Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Chenfan Guo, Pandeng Wang and Diemei Huang contributed equally to this work. References