Abstract

Gastric cancer is an aggressive malignancy characterized by significant clinical heterogeneity arising from complex genetic and environmental interactions. This study employed single-cell RNA sequencing on the 10x Genomics platform to analyze 262,532 cells from gastric cancer samples, identifying 32 distinct clusters and 10 major cell types, including immune cells (e.g., T cells, monocytes) and epithelial subpopulations. Among 27 epithelial subgroups, five malignant subpopulations were identified, each defined by a unique marker gene expression profile and playing a distinct role in tumor progression. Developmental trajectory analysis revealed potential stem-like characteristics in certain clusters, suggesting their involvement in therapeutic resistance and disease recurrence. Cell–cell communication analysis uncovered a dynamic network of interactions within the tumor microenvironment, potentially influencing tumor growth and metastasis. Differential gene expression analysis identified key genes (LDHA, GPC3, MIF, CD44, and TFF3) that were used to construct a prognostic risk score model. This model showed good predictive performance, achieving AUC values of 0.77, 0.77, and 0.76 for 1-, 3-, and 5-year overall survival in the TCGA training dataset, and was validated in independent cohorts. These findings deepen our understanding of gastric cancer's cellular and molecular heterogeneity, offering insights into potential therapeutic targets and biomarkers. By facilitating the development of targeted therapies and personalized treatment strategies, these results hold promise for improving clinical outcomes in gastric cancer patients.

Keywords: Gastric cancer, Single-cell RNA sequencing, Malignant epithelial cell subpopulations, Tumor microenvironment dynamics, Prognostic biomarkers

Introduction

Gastric cancer remains one of the most prevalent and deadly malignancies worldwide, characterized by its aggressive nature and high mortality rate [1–3].
Despite advancements in diagnostic techniques and treatment modalities, the prognosis for gastric cancer patients, especially those diagnosed at advanced stages, remains dismal [4, 5]. The complexity of gastric cancer, driven by its heterogeneous nature and the intricate interplay of genetic, environmental, and lifestyle factors, poses significant challenges for effective management and treatment [6–8].

The etiology of gastric cancer is multifactorial, with Helicobacter pylori infection recognized as a major risk factor [9, 10]. Other contributing factors include dietary habits, smoking, alcohol consumption, and genetic predispositions [11, 12]. The disease's progression from pre-neoplastic lesions to invasive cancer involves a multitude of genetic alterations and the dysregulation of key signaling pathways [13–15]. Understanding these molecular mechanisms is crucial for the development of targeted therapies and improving patient outcomes [16].

Recent advances in molecular biology and genomics have ushered in a new era of cancer research, enabling the detailed characterization of the genetic and epigenetic alterations underlying various cancers, including gastric cancer [17, 18]. High-throughput sequencing technologies, such as whole-genome sequencing and RNA sequencing, have provided unprecedented insights into the cancer genome, revealing the complexity and diversity of genetic alterations across different cancer types and within individual tumors [19, 20]. The advent of single-cell sequencing technologies has further revolutionized our understanding of the tumor microenvironment, unveiling the remarkable cellular heterogeneity present within tumors [21, 22]. This heterogeneity is a key factor contributing to treatment resistance and cancer relapse, as different cell populations within the tumor may respond differently to therapeutic interventions [23, 24].
Single-cell analysis allows for the identification and characterization of distinct cell populations within tumors, including cancer cells, immune cells, and stromal cells, providing valuable insights into their roles in tumor progression and response to therapy [25].

In light of these advancements, the identification and characterization of malignant epithelial cell subpopulations in gastric cancer represent a critical step toward unraveling the complexity of this disease [26, 27]. By dissecting the cellular composition of gastric cancer and understanding the molecular characteristics of these cell populations, researchers can identify potential therapeutic targets and biomarkers for early detection and personalized treatment. Furthermore, the tumor microenvironment, which includes a complex network of interactions between cancer cells, immune cells, and stromal cells, plays a pivotal role in tumor progression and metastasis [28–30]. Investigating the mechanisms of cell–cell communication within the tumor microenvironment and understanding how these interactions influence cancer progression and treatment response are essential for the development of effective cancer therapies [31].

The metabolic reprogramming of cancer cells is another area of intense research. Alterations in metabolic pathways, such as glycolysis and oxidative phosphorylation, are hallmarks of cancer and are associated with aggressive tumor behavior and poor prognosis [32–34]. The study of cancer metabolism, including the identification of metabolic enzymes and pathways altered in gastric cancer, offers opportunities for the development of novel therapeutic strategies targeting cancer metabolism.

Given the complexity of gastric cancer and the limitations of current treatment options, there is an urgent need for a deeper understanding of the molecular mechanisms driving this disease.
This study leverages single-cell sequencing and advanced bioinformatics to identify differentially expressed genes, analyze their functional roles, and explore immune infiltration within the tumor microenvironment. By characterizing malignant epithelial cell subpopulations, mapping their developmental trajectories, and analyzing cell–cell communication, we aim to uncover novel insights into gastric cancer biology. The development of a risk score model based on these findings could unveil prognostic biomarkers and therapeutic targets, laying the groundwork for personalized medicine strategies. This integrated approach has the potential to improve diagnosis, treatment, and prognosis, ultimately enhancing clinical outcomes and quality of life for patients.

Methods

Data acquisition

Bulk RNA-seq and clinical annotation data for gastric cancer patients were downloaded from The Cancer Genome Atlas (TCGA) cohort (TCGA-STAD). Validation datasets were obtained from the Gene Expression Omnibus (GEO) with accession numbers GSE62254 and GSE15459. Single-cell RNA-seq data were sourced from GEO datasets GSE183904, GSE134520, and GSE167297.

Clustering and dimensionality reduction of scRNA-seq data

Before analysis, scRNA-seq data underwent quality control, including removal of cells in which more than 25% of expression came from mitochondrial genes. To focus on the most informative features, the top 2000 highly variable genes were selected with the variance-stabilizing transformation (VST) method and then scaled using the ScaleData function. Principal component analysis (PCA) was then employed to reduce the dimensionality of the data, with 20 principal components retained for downstream analysis. Cell groups were identified using the FindNeighbors function with k = 20 and the FindClusters function with a resolution parameter set at 0.5.
Differentially expressed genes were identified using the RunDEtest function with a log fold-change threshold of 0.25 and an adjusted p-value cutoff of 0.05. Finally, the UMAP method in Seurat was applied for nonlinear dimensionality reduction, enabling the visualization of cells in a two-dimensional space in which cells with similar gene expression patterns group together and cells with differing profiles separate.

Functional analysis of single-cell subpopulations

Upregulated and downregulated genes in each cell group were identified using the RunDEtest() function of SCP (https://github.com/zhanghao-njmu/SCP). Transcription factors (TF) and cell surface proteins (CSPA annotation) of each cell population were identified, and GO-BP functional enrichment analysis was performed on the specific markers of each cell population using the FeatureHeatmap() function of SCP. GSEA was performed for each cell subpopulation using the RunGSEA() function of SCP. In addition, gene sets were scored in each cell subpopulation using the AddModuleScore() function of Seurat.

Pseudotime analysis

Pseudotime analysis, also known as cell trajectory analysis, is used to predict the progression of apoptosis pathways, the development of different cell subtypes, and the differentiation paths of stem cells during disease progression. In this study, we employed Monocle 2 to conduct pseudotime analysis by examining key gene expression patterns. Monocle modeled gene expression levels as a nonlinear smooth function of pseudotime to illustrate changes in gene expression over time. An FDR threshold of < 0.01 was used to determine statistical significance in pseudotime analysis, balancing sensitivity and specificity. This threshold was chosen based on preliminary analyses indicating robust detection of key gene expression changes.
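The cutoffs quoted above (a log fold-change threshold of 0.25 with an adjusted p-value below 0.05 for marker genes, and an FDR below 0.01 for pseudotime-dependent genes) all depend on a multiple-testing adjustment. The following is a minimal Python sketch of how such a filter could work, assuming a Benjamini–Hochberg adjustment; the text does not name the adjustment method, and some tools default to Bonferroni instead. It is not the RunDEtest or Monocle implementation, and the function names and toy gene labels are hypothetical.

```python
from typing import Dict, List, Tuple


def benjamini_hochberg(p_values: List[float]) -> List[float]:
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


def select_genes(
    stats: Dict[str, Tuple[float, float]],
    lfc_cutoff: float = 0.25,
    fdr_cutoff: float = 0.05,
) -> List[str]:
    """Keep genes passing both cutoffs.

    `stats` maps a gene label to (log fold-change, raw p-value).
    The defaults mirror the marker-gene thresholds quoted in the text.
    """
    genes = list(stats)
    adjusted = benjamini_hochberg([stats[g][1] for g in genes])
    return [
        g
        for g, q in zip(genes, adjusted)
        if abs(stats[g][0]) >= lfc_cutoff and q < fdr_cutoff
    ]


# Invented toy statistics, for illustration only.
toy_stats = {
    "g1": (1.2, 0.005),
    "g2": (0.1, 0.01),
    "g3": (-0.8, 0.03),
    "g4": (0.5, 0.04),
}
markers = select_genes(toy_stats)                      # marker-style cutoffs
strict = select_genes(toy_stats, fdr_cutoff=0.01)      # pseudotime-style FDR
```

Filtering on the absolute log fold-change keeps both upregulated and downregulated genes, which matches the way the text reports genes in both directions.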
Cell–cell interaction analysis

To explore the interactions between immune cell subtypes, we leveraged ligand-receptor information within the single-cell gene expression matrix using CellChat software (http://www.cellchat.org/) with default settings. This approach modeled the probability of communication events and identified significant cell–cell interactions based on established ligand-receptor pairs.

Immune infiltration analysis

Immune cell infiltration was scored using ssGSEA with metagenes specific to gastric cancer, as identified in a previous study [35]. The ESTIMATE package was used to calculate immune scores for each cell population.

Functional enrichment analysis of differentially expressed genes

Differentially expressed genes between cancerous and adjacent non-cancerous tissues in the TCGA-STAD dataset were identified using the limma R package (|logFC| > 1 and adjusted P value < 0.05). GO and KEGG pathway analyses were conducted on these genes using the "clusterProfiler" R package (version 4.0.5), with significant enrichment determined by an FDR of < 0.05.

Construction and validation of a prognostic model

The prognostic model was constructed using LASSO regression with tenfold cross-validation to select the optimal lambda value. Subsequently, multiple Cox proportional hazards models were fitted to identify genes significantly associated with overall survival. The model was validated using independent GEO datasets (GSE62254 and GSE15459), with data preprocessing steps applied consistently across training and validation cohorts. Risk scores for each patient were calculated using the formula: Risk score = Σ_i Coefficient(mRNA_i) × Expression(mRNA_i). The optimal cutoff value was determined by the "surv_cutpoint" function in the "survminer" R package, which uses maximally selected rank statistics to divide patients into high- and low-risk groups.
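The risk score formula above is a weighted sum of gene expression values. It can be written out as a short Python sketch using the five coefficients reported in the Results. The expression values and the cutoff below are invented for illustration. The cutoff selection by maximally selected rank statistics is not reproduced here and is simply passed in as a number.

```python
from typing import Dict

# Coefficients of the five-gene model as reported in the Results.
COEFFICIENTS: Dict[str, float] = {
    "LDHA": 0.16490235,
    "GPC3": 0.14109302,
    "MIF": -0.17954014,
    "CD44": 0.10172204,
    "TFF3": 0.05819914,
}


def risk_score(expression: Dict[str, float],
               coefficients: Dict[str, float] = COEFFICIENTS) -> float:
    """Risk score = sum over genes of coefficient x expression."""
    missing = [gene for gene in coefficients if gene not in expression]
    if missing:
        raise KeyError(f"missing expression values for: {missing}")
    return sum(coef * expression[gene] for gene, coef in coefficients.items())


def assign_group(score: float, cutoff: float) -> str:
    """Label a patient as high or low risk relative to a given cutoff."""
    return "high" if score > cutoff else "low"


# Hypothetical patient; expression values are invented, not from any dataset.
patient = {"LDHA": 7.1, "GPC3": 2.4, "MIF": 6.3, "CD44": 5.0, "TFF3": 4.2}
score = risk_score(patient)
# Hypothetical cutoff; in the study it is chosen from the data by surv_cutpoint.
group = assign_group(score, cutoff=1.0)
```

Because the MIF coefficient is negative, higher MIF expression lowers the score in this model while the other four genes raise it.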
Overall survival (OS) between these groups was compared using the Kaplan–Meier method, and the model's predictive reliability and validity were assessed using time-dependent ROC analysis.

Statistical analysis

All statistical analyses in this study were performed using R software, unless otherwise specified.

Results

Identification of malignant epithelial cell subpopulations in gastric cancer

Using single-cell RNA sequencing, we analyzed 262,532 gastric cancer cells and co-clustered them into 32 distinct cell populations at a resolution of 0.6 (Fig. 1A). Cell type annotation with SingleR identified 10 major cell types: T cells, monocytes, B cells, epithelial cells, smooth muscle cells, endothelial cells, dendritic cells, NK cells, fibroblasts, and erythroblasts (Fig. 1B). Violin plots in Fig. 1C illustrate the expression levels of key marker genes within each subpopulation, where CD3, CD14, CD79A, EPCAM, ACTA2, CD31, CD83, GNLY, DCN, and GYPA were used as markers for T cells, monocytes, B cells, epithelial cells, smooth muscle cells, endothelial cells, dendritic cells, NK cells, fibroblasts, and erythroblasts, respectively.

Fig. 1. Single-cell analysis of gastric cancer cell populations and identification of epithelial subpopulations. (A) UMAP plot showing 32 distinct cell populations derived from the analysis of 262,532 single gastric cancer cells at a resolution of 0.6. (B) Cell type annotation using SingleR identified 10 major cell types: T cells, monocytes, B cells, epithelial cells, smooth muscle cells, endothelial cells, dendritic cells, NK cells, fibroblasts, and erythroblasts. (C) Violin plots displaying expression levels of key marker genes (CD3, CD14, CD79A, EPCAM, ACTA2, CD31, CD83, GNLY, DCN, GYPA) for the corresponding 10 major cell types. (D) UMAP plot showing 27 epithelial cell clusters derived from 66,668 single epithelial cells at a resolution of 0.6.
(E) Distribution of malignant epithelial cell subpopulations. (F) UMAP plot showing 5 malignant epithelial cell groups after further subgrouping of 6461 single malignant epithelial cells at a resolution of 0.01. (G) Heatmap of upregulated and downregulated genes in each malignant epithelial cell group, with significant downregulation observed for CXCL5 and GPC3 in group 0, TM4SF5 and CTSD in group 2, CYC1 and MDK in group 3, and CD24 and SCAND1 in group 4. (H) Heatmap depicting highly expressed genes and their enriched GO-related functional pathways in the 5 malignant epithelial cell groups, with specific expression of S100P and PGC in subgroup 0; IGF2, IFIT2, and SPINK1 in subgroup 1; IGFBP3, KLK1, and CDC42EP1 in subgroup 2; HSPA2, TPT1, and TMEM176A in subgroup 3; and HLA-DPA, IGHA1, and S100A14 in subgroup 4. (I) GSEA analysis showing upregulated pathways in each malignant epithelial subpopulation, including oxidative phosphorylation, regulation of lipid metabolism, antigen processing and presentation, regulation of programmed cell death, and MHC protein complex assembly.

Next, we focused on a subset of 66,668 epithelial cells from the 10 major cell types for further subgrouping. This resulted in 27 epithelial cell clusters at a resolution of 0.6 (Fig. 1D), with the distribution of malignant epithelial cell subpopulations shown in Fig. 1E. Subsequent analysis of 6461 malignant epithelial cells revealed 5 distinct malignant epithelial cell groups at a resolution of 0.01 (Fig. 1F). Figure 1G highlights the upregulated and downregulated genes within each malignant epithelial cell group, showing significant downregulation of CXCL5 and GPC3 in group 0, TM4SF5 and CTSD in group 2, CYC1 and MDK in group 3, and CD24 and SCAND1 in group 4. The heatmap in Fig. 1H displays highly expressed characteristic genes and their associated GO-related functional pathways across the 5 malignant epithelial cell groups.
Notably, S100P and PGC were highly expressed in subgroup 0; IGF2, IFIT2, and SPINK1 in subgroup 1; IGFBP3, KLK1, and CDC42EP1 in subgroup 2; HSPA2, TPT1, and TMEM176A in subgroup 3; and HLA-DPA, IGHA1, and S100A14 in subgroup 4. Additionally, Fig. 1I presents the upregulated GSEA pathways in each malignant epithelial subpopulation, with oxidative phosphorylation, lipid metabolism regulation, antigen processing and presentation, regulation of programmed cell death, and MHC protein complex assembly significantly enriched across the 5 subgroups.

Acquisition and functional enrichment analysis of gastric cancer-related glycometabolic genes

By intersecting the 472 specific marker genes of the 5 malignant epithelial cell subpopulations with 257 glycosyltransferase genes, we identified 20 common genes (Fig. 2A). A subsequent correlation analysis of these 20 genes in 415 tumor tissue samples from the TCGA-STAD dataset revealed strong correlations between TPI1 and each of GAPDH, FKBP4, and GPI, as well as between FKBP4 and both GAPDH and GPI (Fig. 2B). The expression patterns of these 20 common genes were further examined using our integrated dataset of 262,532 single gastric cancer cells. The UMAP plot in Fig. 2C illustrates the expression of these genes at the single-cell level, with genes such as GAPDH, TPI1, ALDOA, and PPIA showing high expression levels, while RAE1, GPC3, TFF3, and CLDN3 exhibited lower expression levels.

Fig. 2. Identification and expression analysis of 20 common genes between glycosyltransferase and malignant epithelial cell marker genes. (A) Venn diagram displaying the overlap of 472 specific marker genes from 5 malignant epithelial cell subpopulations and 257 glycosyltransferase genes, resulting in the identification of 20 common genes.
(B) Correlation analysis of the 20 common genes in 415 tumor tissue samples from the TCGA-STAD dataset, showing significant correlations between TPI1 and each of GAPDH, FKBP4, and GPI, as well as between FKBP4 and both GAPDH and GPI. (C) UMAP plot presenting the single-cell level expression of the 20 common genes in 262,532 single gastric cancer cells. Genes such as GAPDH, TPI1, ALDOA, and PPIA exhibit high expression levels, while RAE1, GPC3, TFF3, and CLDN3 show lower expression levels.

We further scored the activity of the 20 genes within individual cells, dividing 60,000 epithelial cells into high- and low-risk groups based on an established threshold (Fig. 3A, B). Using GSEA analysis, we explored the enrichment of functional pathways between these two groups. As shown in Fig. 3C, key pathways such as ADP metabolic processes, nucleoside diphosphate metabolic processes, and pyruvate metabolic processes were significantly enriched in the high-risk group.

Fig. 3. Risk stratification of epithelial cells and pathway enrichment analysis. (A) Heatmap representing the activity scores of the 20 common genes in 60,000 epithelial cells. (B) Epithelial cells were classified into high- and low-risk groups based on a defined threshold for gene activity. (C) GSEA analysis highlighting significantly enriched functional pathways in the high-risk group, including ADP metabolic processes, nucleoside diphosphate metabolic processes, and pyruvate metabolic processes.

Developmental trajectory analysis of malignant epithelial cell subpopulations

The distribution of the five malignant epithelial cell subpopulations is presented in Fig. 4A. The heatmap in Fig. 4B shows the expression of marker genes for these subpopulations, revealing significant expression of TFF2, LYZ, PI3, and PGC in subgroup 0; APOA2, TTR, IGF2, GSTA1, and CCL20 in subgroup 1; IGFL2, ZG16B, and DEFB1 in subgroup 2; CEACAM5, EFNA1, TFF3, and S100A4 in subgroup 3; and PSCA and HPGD in subgroup 4.
Figure 4C illustrates the single-cell level expression of the 20 common genes, highlighting high expression of ENO1, GAPDH, TPI1, MDH2, MIF, and PPIA in the epithelial subpopulations, while RAE1, GPC3, CD44, and TFF3 show lower expression.

Fig. 4. Distribution, marker gene expression, and developmental trajectory of malignant epithelial subpopulations. (A) UMAP plot showing the distribution of five distinct malignant epithelial cell subpopulations. (B) Heatmap displaying the expression of marker genes for the five epithelial subpopulations. Subgroup 0 is characterized by high expression of TFF2, LYZ, PI3, and PGC; subgroup 1 by APOA2, TTR, IGF2, GSTA1, and CCL20; subgroup 2 by IGFL2, ZG16B, and DEFB1; subgroup 3 by CEACAM5, EFNA1, TFF3, and S100A4; and subgroup 4 by PSCA and HPGD. (C) UMAP plot illustrating the single-cell level expression of 20 common genes in malignant epithelial subpopulations, with high expression of ENO1, GAPDH, TPI1, MDH2, MIF, and PPIA, and low expression of RAE1, GPC3, CD44, and TFF3. (D, E) Developmental trajectory of malignant epithelial subpopulations, indicating that subgroups 3 and 1 may have originated from subgroup 2, which may represent a tumor stem cell population with differentiation potential.

Figures 4D and 4E depict the developmental trajectory of the malignant epithelial subpopulations and suggest that subgroups 3 and 1 may have differentiated from subgroup 2. Subgroup 2 could therefore represent a tumor stem cell population with differentiation potential. Figure 5 presents the expression changes of the 20 common genes along the developmental trajectory over time, showing a gradual decrease in ALDOA, CLDN3, and TFF3 expression, while GPC3, PPIA, and STMN1 exhibit a gradual increase in expression.

Fig. 5. Temporal expression changes of 20 common genes along the developmental trajectory.
The line graph illustrates the expression changes of 20 common genes along the developmental trajectory over time. ALDOA, CLDN3, and TFF3 show a gradual decrease in expression, while GPC3, PPIA, and STMN1 demonstrate a gradual increase as the cells progress along their developmental trajectory.

Cell communication analysis

We conducted an intercellular communication analysis to investigate the interactions between different cell populations. The heatmap in Fig. 6A illustrates both the number and strength of these interactions, highlighting a large number of interactions between monocytes and both endothelial cells and fibroblasts. In terms of interaction strength, monocytes and fibroblasts exhibited the closest connection. To further explore the ligand-receptor pathways driving these interactions, Fig. 6B presents the enrichment of pathways mediating communication between malignant epithelial cell subpopulations and other cell types. The results revealed that MIF-(CD74+CD44) primarily facilitated interactions between malignant epithelial cells and monocytes, MIF-(CD74+CXCR4) mediated interactions with T cells, and PPIA-BSG was the key pathway driving interactions between malignant epithelial cells and smooth muscle cells.

Fig. 6. Intercellular communication analysis of gastric cancer cell populations. (A) Heatmap showing the number and strength of intercellular interactions among cell populations, with monocytes showing a large number of interactions with endothelial cells and fibroblasts, and the strongest interaction strength observed between monocytes and fibroblasts. (B) Ligand-receptor pathway analysis illustrating key pathways mediating communication between malignant epithelial cells and other cell types.
MIF-(CD74+CD44) predominantly mediated interactions between malignant epithelial cells and monocytes, MIF-(CD74+CXCR4) mediated interactions with T cells, and PPIA-BSG mediated interactions between malignant epithelial cells and smooth muscle cells.

Identification and functional analysis of differentially expressed genes in gastric cancer

Figures 7A (volcano plot) and 7B (heatmap) illustrate the expression levels of differentially expressed genes (DEGs) in the TCGA-STAD dataset (|logFC| > 1 and adjusted P value < 0.05). Notably, genes such as CLDN, PKM, STMN1, TXN, and GPI were significantly upregulated in tumor tissues compared to normal tissues, while ALDOA was downregulated. By intersecting the specific marker genes used to define the 10 major single-cell groups in gastric cancer (comprising over 260,000 cells) with these DEGs, we identified 199 common genes (Fig. 7C). A GO functional enrichment analysis was then performed on these 199 common genes. The bubble chart in Fig. 7D shows that these genes were primarily enriched in biological pathways related to chemotaxis, leukocyte migration, regulation of vasculature development, and positive regulation of proteolysis.

Fig. 7. Differentially expressed gene analysis and functional enrichment. (A) Volcano plot displaying the differentially expressed genes (DEGs) in the TCGA-STAD dataset. Genes such as CLDN, PKM, STMN1, TXN, and GPI were significantly upregulated in tumor tissues, while ALDOA was downregulated. (B) Heatmap showing the expression levels of selected DEGs between gastric cancer and normal tissues. (C) Venn diagram illustrating the intersection of marker genes from 10 major single-cell groups in gastric cancer and the identified DEGs, revealing 199 common genes.
(D) Bubble chart of GO-BP/MF/CC functional enrichment analysis for the 199 common genes, showing significant enrichment in pathways related to chemotaxis, leukocyte migration, regulation of vasculature development, and positive regulation of proteolysis.

Immune infiltration analysis

Using the 20 common genes, we performed hierarchical clustering of gastric cancer tissues with ConsensusClusterPlus, identifying two distinct subgroups (C1 and C2) when K = 2 (Fig. 8A–C). Box plots in Fig. 8D show the expression differences of the 20 common genes between these two subgroups. Specifically, higher expression levels of ALDOA, ENO1, FKBP4, GPI, and LDHA were observed in C1 compared to C2, while GPC3 showed lower expression levels in C2. Principal component analysis (PCA) in Fig. 8E further confirmed the clear segregation between the two subgroups, validating the clustering results. Additionally, immune cell infiltration levels were analyzed between the subgroups. Box plots in Fig. 8F, based on the ESTIMATE algorithm, revealed that although the differences were not statistically significant, the immune scores were higher in C2 than in C1. Moreover, Fig. 8G presents ssGSEA-based box plots indicating higher infiltration levels of activated B cells, central memory T cells, effector memory CD4 T cells, macrophages, and mast cells in C2, while activated CD4 T cells and type 17 helper T cells exhibited higher infiltration in C1.

Fig. 8. Hierarchical clustering and immune cell infiltration analysis of gastric cancer tissues based on the 20 common genes. (A–C) ConsensusClusterPlus analysis of gastric cancer tissues based on the 20 common genes identified two subgroups (C1 and C2) when K = 2. (D) Box plots showing expression differences of the 20 common genes between the two subgroups, with higher levels of ALDOA, ENO1, FKBP4, GPI, and LDHA in C1, and lower levels of GPC3 in C2.
(E) PCA plot demonstrating clear separation between the two subgroups, confirming the clustering results. (F) Box plots of immune scores between C1 and C2 based on the ESTIMATE algorithm, showing higher immune scores in C2. (G) ssGSEA-based box plots indicating higher infiltration of activated B cells, central memory T cells, effector memory CD4 T cells, macrophages, and mast cells in C2, while C1 showed higher infiltration of activated CD4 T cells and type 17 helper T cells.

Establishment and evaluation of risk score model

Figure 9A shows the LASSO regression analysis used to screen the 20 common genes, which ultimately selected 5 genes (LDHA, GPC3, MIF, CD44, and TFF3) to construct a risk score model (Fig. 9D). The risk score formula was: LDHA × 0.16490235 + GPC3 × 0.14109302 + MIF × (−0.17954014) + CD44 × 0.10172204 + TFF3 × 0.05819914. KM curves indicated significant differences in prognosis between high- and low-risk groups in the TCGA training dataset, with similar results observed in the GSE15459 and GSE62254 validation datasets (Fig. 9B). Figure 9C presents the AUC values for predicting 1-, 3-, and 5-year OS in patients from the TCGA training dataset and the GSE15459 and GSE62254 validation datasets. The model achieved AUC values of 0.77, 0.77, and 0.76 for 1-, 3-, and 5-year OS in the TCGA dataset, respectively, and also performed well in the validation datasets, indicating good specificity and sensitivity in predicting prognosis. Finally, we combined clinical factors such as age and stage with the risk score to construct a nomogram predicting 1-, 3-, and 5-year survival probabilities for patients (Fig. 9E).

Fig. 9. Development and validation of a risk score model based on 5 selected genes.
(A) LASSO regression analysis was used to screen the 20 common genes, resulting in the selection of 5 key genes (LDHA, GPC3, MIF, CD44, and TFF3) for constructing the risk score model. (B) Kaplan–Meier survival curves comparing overall survival (OS) between high- and low-risk groups in the TCGA training dataset, with significant differences in prognosis. Similar results were observed in the GSE15459 and GSE62254 validation datasets. (C) ROC curves displaying the area under the curve (AUC) values for predicting 1-, 3-, and 5-year OS in the TCGA training dataset and the GSE15459 and GSE62254 validation datasets. The model achieved AUC values of 0.77, 0.77, and 0.76 for 1-, 3-, and 5-year OS in the TCGA dataset, respectively, indicating good sensitivity and specificity in prognosis prediction. (D) The risk score formula: LDHA × 0.16490235 + GPC3 × 0.14109302 + MIF × (−0.17954014) + CD44 × 0.10172204 + TFF3 × 0.05819914. (E) A nomogram combining clinical factors such as age and stage with the risk score to predict 1-, 3-, and 5-year survival probabilities in patients.

Discussion

The findings of this study provide insights into the cellular heterogeneity and molecular mechanisms underlying gastric cancer, contributing to a deeper understanding of its pathogenesis and to the identification of potential therapeutic targets. By integrating single-cell RNA sequencing, gene expression profiling, and functional enrichment analysis, we established a framework in which each analysis supports and cross-checks the others.

Our analysis revealed five distinct malignant epithelial cell subpopulations within gastric cancer, each with unique marker gene expression and developmental pathways. This cellular heterogeneity reflects the intricate tumor microenvironment and the diverse oncogenic pathways at play in gastric cancer.
Similar studies have identified various subpopulations within gastric and other cancers, underscoring the importance of intra-tumoral heterogeneity in understanding tumor biology and treatment resistance. For instance, a study by Zhang et al. identified multiple subtypes of gastric cancer cells with distinct gene expression profiles and survival outcomes, highlighting the heterogeneity within gastric cancer [36]. Our findings align with these observations, further emphasizing the need to consider this heterogeneity in developing targeted therapies.

The developmental trajectory analysis provided insights into the potential origin and evolution of these subpopulations, suggesting that one of the subpopulations might represent a stem-like cell group with significant differentiation potential. This is in line with the cancer stem cell hypothesis, which posits that a subset of cells within tumors possess stem cell properties and drive tumorigenesis, metastasis, and treatment resistance. Studies by Takaishi et al. and others have supported this concept in gastric cancer, identifying stem cell markers and their roles in gastric cancer [37]. Our findings contribute to this body of evidence, suggesting that targeting these stem-like cell populations could be a promising therapeutic strategy.

The cell communication analysis highlighted the complex interplay between different cell types within the tumor microenvironment, particularly between malignant cells and immune cells. This aspect of our study underscores the importance of the tumor microenvironment in cancer progression and the potential for targeting these interactions in cancer therapy. Studies by Quail et al. have shown that targeting the tumor microenvironment can enhance the efficacy of cancer therapies, and our findings support the exploration of similar strategies in gastric cancer [38, 39].
Our identification of differentially expressed genes and the construction of a risk score model based on these genes provide valuable tools for predicting prognosis and identifying potential therapeutic targets. The selected genes—LDHA, GPC3, MIF, CD44, and TFF3—are implicated in various aspects of gastric cancer development and progression. LDHA (lactate dehydrogenase A) plays a critical role in glycolysis and is often upregulated in gastric cancer, promoting tumor growth and metastasis through enhanced aerobic glycolysis (the Warburg effect) [[135]40]. GPC3 (glypican-3) is a heparan sulfate proteoglycan that modulates cell growth and differentiation, and its overexpression has been linked to poor prognosis and increased tumor invasiveness in gastric cancer [[136]41]. MIF (macrophage migration inhibitory factor) is a pro-inflammatory cytokine involved in tumor immune evasion and angiogenesis, contributing to gastric cancer cell proliferation and resistance to apoptosis [[137]42]. CD44, a cell surface glycoprotein, is a well-known marker for cancer stem cells, facilitating tumor invasion, metastasis, and chemoresistance by promoting epithelial-to-mesenchymal transition (EMT) in gastric cancer [[138]43]. Lastly, TFF3 (trefoil factor 3), which is associated with mucosal protection and repair, has been shown to enhance gastric cancer cell migration and invasion, correlating with a more aggressive cancer phenotype [[139]44]. Collectively, these genes contribute to key oncogenic processes in gastric cancer, including metabolic reprogramming, immune modulation, and metastatic potential, making them promising targets for therapeutic intervention. Additionally, our risk score model, validated in multiple cohorts, demonstrates good specificity and sensitivity, suggesting its potential utility in clinical practice. 
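As an illustration of how the five-gene model could be applied to new samples, the following minimal Python sketch computes the risk score from the coefficients reported in the figure legend above. The function names (`risk_score`, `split_by_median`), the example expression values, and the median cutoff used to define high- and low-risk groups are assumptions made for illustration only; the expression normalization and the cutoff actually used are not stated in this section.

```python
from statistics import median

# Coefficients as reported in the risk score formula (figure legend, panel D).
COEFFICIENTS = {
    "LDHA": 0.16490235,
    "GPC3": 0.14109302,
    "MIF": -0.17954014,
    "CD44": 0.10172204,
    "TFF3": 0.05819914,
}


def risk_score(expression):
    """Return the weighted sum of the five genes' expression values.

    `expression` maps gene symbol -> expression value for one patient.
    The expression scale (e.g. log-transformed counts) is an assumption.
    """
    missing = [gene for gene in COEFFICIENTS if gene not in expression]
    if missing:
        raise KeyError(f"missing expression values for: {missing}")
    return sum(coef * expression[gene] for gene, coef in COEFFICIENTS.items())


def split_by_median(scores):
    """Label each score 'high' or 'low' relative to the cohort median.

    Assumption: a median cutoff; the cutoff used in the study is not given here.
    """
    cutoff = median(scores)
    return ["high" if score > cutoff else "low" for score in scores]


if __name__ == "__main__":
    # Hypothetical expression values, for illustration only (not study data).
    cohort = [
        {"LDHA": 6.0, "GPC3": 1.0, "MIF": 7.5, "CD44": 4.0, "TFF3": 3.0},
        {"LDHA": 8.5, "GPC3": 4.0, "MIF": 5.0, "CD44": 6.5, "TFF3": 5.5},
        {"LDHA": 7.0, "GPC3": 2.5, "MIF": 6.0, "CD44": 5.0, "TFF3": 4.0},
    ]
    scores = [risk_score(patient) for patient in cohort]
    for score, group in zip(scores, split_by_median(scores)):
        print(f"risk score = {score:.4f} -> {group} risk")
```

Because MIF carries a negative coefficient in the reported formula, higher MIF expression lowers the score in this sketch, whereas higher expression of the other four genes raises it.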
While this study provides significant insights into the cellular heterogeneity and molecular mechanisms underlying gastric cancer, several limitations must be acknowledged. The inherent technical biases of single-cell RNA sequencing, such as dropouts and batch effects, could affect the robustness of the identified cell subpopulations and pathways. Furthermore, the relatively small sample size may limit the generalizability of the findings across diverse patient populations and clinical settings, necessitating validation in larger, independent cohorts. The complexity of the tumor microenvironment and the dynamic nature of cell–cell interactions further challenge our ability to fully capture and interpret these intricate biological processes. Additionally, potential confounding factors, such as patient heterogeneity in genetic background, disease stage, and treatment history, along with technical variability in data acquisition, may influence the observed results and warrant cautious interpretation. To address these challenges, future research should focus on validating the prognostic risk score model in larger and more diverse datasets, ensuring its applicability across various clinical settings. Integrating the model with existing diagnostic frameworks and developing user-friendly tools for clinical implementation will be critical for its adoption in practice. Moreover, targeting stem-like malignant epithelial subpopulations identified in this study offers a promising therapeutic strategy. Efforts should be directed toward identifying specific molecular targets within these subpopulations and developing therapies, such as small molecule inhibitors or monoclonal antibodies, to selectively eliminate these cells. Combining these novel approaches with existing treatment modalities has the potential to enhance therapeutic efficacy and improve patient outcomes. 
Expanding research into the tumor microenvironment using advanced technologies like spatial transcriptomics could further elucidate the interactions driving tumor progression and resistance, offering additional avenues for intervention.
In conclusion, our study contributes to the evolving landscape of gastric cancer research by providing a detailed analysis of its cellular heterogeneity, molecular characteristics, and the tumor microenvironment. By comparing our findings with existing studies, we highlight the unique contributions of our research while acknowledging the broader context of ongoing efforts to understand and treat gastric cancer. Future research should focus on translating these findings into therapeutic strategies, with an emphasis on targeting the identified cell subpopulations and molecular pathways to improve patient outcomes in gastric cancer.
Acknowledgements