Abstract Lactate metabolism (LM) plays a crucial role in tumor progression and therapy resistance in non-small cell lung cancer (NSCLC). Several methods had been developed for NSCLC prognosis prediction based on lactate metabolism-related information. The existing methods for the construction of prognosis prediction models are mostly based on single models such as linear models, SVM, and decision trees. Prognosis biomarkers and prognosis prediction models based on this kind of methods often have limited prognostic performance. In this study, we proposed a novel methodology for constructing prognosis prediction model and identifying lactate-related prognostic biomarkers in NSCLC. We first screened for lactate metabolism-related malignant genes from the scRNA-Seq data of NSCLC malignant cells. We proposed a Cox elastic-net regression combined with genetic algorithm (GA-EnCox) to predict prognosis and optimize the selection of key biomarkers. We identified five key LM-related genes (LYPD3, KRT8, CCT6A, PSMB7, and HMGA1) that significantly correlated with patient prognosis in LUAD cohorts. The prognostic model constructed with these genes outperformed other currently popular models across multiple datasets, demonstrating stable predictive capability. Survival analysis based on bulk RNA-Seq data demonstrated that the low-risk group had significantly better overall survival compared to the high-risk group. Further analysis revealed that lactate metabolism-related prognosis risk might be associated with monocyte lineages such as macrophages and DC’s infiltration and these prognosis biomarkers may indicate the therapeutic responses of immune checkpoint inhibitors for NSCLC patients. More importantly, we validated HMGA1 and KRT8 at protein level and their association with histologic grades, stages, and clinical outcomes in consistently treated in-house NSCLC cohorts. Finally, we experimentally validated one of the biomarkers, HMGA1, confirming its role in promoting malignant phenotypes of NSCLC. This study provides valuable insights into the role of lactate metabolism-related biomarkers and their impact on patient outcomes, it was expected to provide important reference value for prognosis assessment and personalized treatment decision of NSCLC patients. Subject terms: Tumour biomarkers, Non-small-cell lung cancer, Computational biology and bioinformatics Introduction Lung cancer is the most frequently diagnosed cancer worldwide and the leading cause of cancer-related deaths^[40]1–[41]3. Non-small cell lung cancer (NSCLC) accounts for 80–85% of all lung cancer cases^[42]4. Despite advancements in therapeutic approaches, the prognosis for NSCLC remains poor, with a low five-year survival rate^[43]5. The high heterogeneity and complex molecular mechanisms of NSCLC complicate treatment efforts, necessitating a deeper understanding of its biological characteristics to develop novel therapeutic strategies^[44]6. The significance of glucose metabolism in cancer has been recognized for nearly a century, with studies demonstrating a high dependence of tumor cells on glucose, known as the “Warburg effect”^[45]7. Under this effect, tumor cells preferentially utilize anaerobic glycolysis over oxidative phosphorylation even under aerobic conditions. Although anaerobic glycolysis has a lower ATP yield, its rapid rate provides the necessary energy and metabolic precursors for rapid proliferation^[46]8,[47]9. The substantial lactate production in tumor metabolism is transported out of cells by monocarboxylate transporters MCT1 and MCT4, a process critical for maintaining tumor cell survival and growth^[48]10. Lactate increases the resistance of chemo-therapy such as PI3K inhibitors or tamoxifen in colon cancer and estrogen receptor (ER) positive breast cancer^[49]11–[50]13. Also, metabolic symbiosis mediated by lactate results in drug resistance of TKI inhibitors in gastric cancer^[51]14. Identifying NSCLC-related biomarkers to predict prognosis and immunotherapy responses is essential for developing effective personalized treatment strategies. Several genes, such as LMNB2 and FNDC4, have already been confirmed as prognostic biomarkers for NSCLC. However, lactate metabolism-related prognostic biomarkers for NSCLC have not yet been thoroughly investigated. Although several studies have demonstrated that targeting lactate metabolism-related enzymes and transporters effectively inhibits tumor growth^[52]15–[53]18. The development of inhibitors progressing to clinical application remains limited. The narrow substrate-binding site of LDHA poses significant challenges for its selective and effective inhibition in cancer, and no LDH inhibitors with significant selectivity and in vivo efficacy have been identified thus far^[54]19. Moreover, given the critical role MCTs play in maintaining tissue homeostasis, inhibiting MCTs may lead to toxicities, such as metabolic alterations (ketoacidosis) due to reduced ketone body uptake in MCT1-deficient cells^[55]10. Consequently, exploring alternative lactate metabolism-related targets for indirect regulation of lactate metabolism is warranted. Several sets of lactate-associated tumor biomarkers in NSCLC have been proposed^[56]3,[57]20–[58]25, yet they exhibit notable limitations. The existing methods are mostly based on single models such as linear models, SVM, and decision trees. Prognosis biomarkers selected based on a single model often have limited prognostic performance. Additionally, these methods rarely consider the targetability of the prognostic biomarkers identified by the model and barely optimize the selection of key biomarkers affecting prognostic prediction performance based on prognostic prediction metrics. In this study, we proposed a novel methodology for constructing prognosis prediction model and identifying lactate-related prognostic biomarkers in NSCLC. This approach formulates the prognostic biomarker optimization problem based on the Elastic Cox model, with tumor prognosis indicators as the optimization objective. The genetic algorithm (GA) is employed to solve the above optimization problem for optimizing the selection of key prognosis biomarkers iteratively, resulting in the GA-EnCox prognostic model, which demonstrates superior prognostic prediction performance compared to the existing tumor prognostic models across multiple NSCLC datasets. The overview of this study is illustrated in Fig. [59]1. We first identified lactate metabolism-related malignant genes (LM-related gene) by integrating genes implicated in the lactate metabolism process with malignant cell markers from scRNA-Seq data. Further, using the GA-EnCox method, we selected five biomarkers from LM-related gene, which are highly correlated with NSCLC prognosis, and validated the superior prognostic predictive capacity of the 5-LM-related gene signature across various independent NSCLC gene expression databases. We then analyzed the impact of predicted risk on pathway activity, immune cell infiltration, small molecule drug response, and immunotherapy response. Finally, we validated HMGA1 and KRT8 at protein level and their association with histologic grades, stages, and clinical outcomes in consistently treated in-house NSCLC cohort and validate HGMA1 effect on malignant phenotypes in NSCLC cell-lines. Fig. 1. [60]Fig. 1 [61]Open in a new tab The overview of the development, validation, and clinical utility of core LM-related prognostic biomarkers. Abbreviations: LUAD, Lung adenocarcinoma; NSCLC, non-small-cell lung cancer; LM: lactate metabolism. Result Lactate metabolism-related genes identified from NSCLC scRNA-Seq We obtained a gene set associated with lactate production and transport from MSigDB and retrieved genes related to lactate metabolism from GeneCard, resulting in a basic LM-related gene set comprising 277 genes (Supplementary Table [62]1). We next considered screening tumor-specific lactate-associated genes from multiple independent scRNA-Seq cohorts. In malignant cells in Wu Zhou et al cohorts^[63]26, Goveia et al. cohorts and Maynard et al cohorts^[64]27,[65]28, we selected genes that were significantly positively correlated with the enrichment scores of the basic-LM gene set and were specifically highly expressed in malignant cells. To ensure potential for therapeutic intervention, we excluded marker genes of cells that play a positive role in tumor immunity. Tumor-specific lactate-related genes were identified independently from three distinct scRNA-Seq cohorts. The intersection of these three sets resulted in a final collection of 364 tumor-specific lactate-related genes (Fig. [66]2A). Fig. 2. [67]Fig. 2 [68]Open in a new tab Analysis and prognostic validation of lactate metabolism-related genes identified from single-cell sequencing data. (A) Overlap of lactate metabolism-related genes identified in three independent NSCLC scRNA-Seq cohorts. (B) DimPlot showing malignant cells and lactate metabolism intensity in three cohorts. (C) Violin plot displaying the enrichment of lactate metabolism-related genes across various cell types. (D) Kaplan–Meier survival curves for the TCGA-LUAD, [69]GSE13213, and [70]GSE31210 cohorts comparing survival probabilities between LM-high and LM-low groups. Next, we performed UMI dimensionality reduction clustering analysis on single-cell sequencing data from NSCLC tissues and we assessed lactate metabolism intensity based on 364 LM-related genes by UCell in different cohorts (Fig. [71]2B) and examined the enrichment of lactate metabolism-related genes in various cell types (Fig. [72]2C). The results indicated that, in addition to malignant cells, epithelial cells and stromal cells also exhibited higher LM strength. To further investigate lactate metabolism in NSCLC patients, we applied these 364 LM-related genes to the TCGA-LUAD cohort to classify patients according to the intensity of lactate metabolism. Based on the GSVA levels of LM-related genes, we divided the patients into LM-high and LM-low groups. We analyzed the chromosome effects of lactate metabolism associated groups in three independent cohorts: TCGA-LUAD, [73]GSE13213, and [74]GSE31210. (Fig. [75]2D). In the TCGA-LUAD dataset, survival analysis revealed a significant difference in survival rates between the LM-high and LM-low groups (p = 0.00064), with the LM-low group demonstrating significantly better survival outcomes compared to the LM-high group (HR [95% CI] = 0.595 [0.445, 0.795]). Similarly, survival analysis in the [76]GSE13213 also indicated that patients in the LM-low group had significantly better survival outcomes than those in the LM-high group (p < 0.0001, HR [95% CI] = 0.257 [0.146, 0.453]). Additionally, in the [77]GSE31210, we observed analogous results, with the LM-low group exhibiting significantly higher survival rates compared to the LM-high group (p < 0.0001, HR [95% CI] = 0.178 [0.092, 0.346]). Across multiple independent cohorts, the application of LM-related gene sets identified consistently revealed that lactate metabolism intensity is closely related to the prognosis of NSCLC patients, with the LM-low group demonstrating significantly better survival outcomes compared to the LM-high group. In summary, these data suggest that lactate metabolism intensity may be an important factor influencing the prognosis of NSCLC patients. GA-EnCox identified lactate metabolism-related core biomarkers with stronger predictive ability The intensity of lactate metabolism is closely associated with patient survival prognosis, but the specific LM-related genes that play core roles in this process remain unclear. To investigate further the relationship between lactate metabolism intensity and prognosis in LUAD patients, we developed a tumor prognostic prediction model using genetic algorithm and Elastic Cox, to predict NSCLC patients prognosis and identify the key biomarkers for prognosis, thereby providing more reliable biomarkers for personalized treatment. Therefore, we utilized the GA-EnCox (Cox elastic-net combined with genetic algorithm) algorithm for screening from 364 LM-related genes (Fig. [78]3A). This algorithm aims to combine the global search capability of genetic algorithms with the feature selection advantages of the Cox elastic-net model to find the optimal combination of prognostic biomarker genes. As shown in Fig. [79]3B, the fitness of each chromosome in GA increased gradually and reached stability. After iterating 100 times, the average C-index tends to stabilize, reaching the maximum value at iter = 173, C-index = 0.6542, corresponding to the selection of 190 LM-related genes. Increasing the L1-penalty weight in EnCox for gene selection, five genes with non-zero regression coefficients in EnCox were identified as the final LM-related prognostic genes: LYPD3, KRT8, CCT6A, PSMB7, and HMGA1. The regression coefficients of these genes, as shown in Fig. [80]3C, indicate their significant role in predicting prognosis in LUAD patients. To validate the prognostic value and biological significance of the lactate metabolism-related key genes identified, we further analyzed these five genes. Fig. 3. [81]Fig. 3 [82]Open in a new tab Identification and validation of lactate metabolism-related prognostic genes in LUAD patients. (A) Workflow diagram illustrating the application of the GA-EnCox algorithm to identify optimal lactate metabolism-related genes for prognosis in LUAD patients. (B) Plot of fitness (C-index) across iterations of the GA-EnCox algorithm, showing stabilization after approximately 100 iterations. (C) Bar graph showing the regression coefficients of the five selected lactate metabolism-related genes (LYPD3, KRT8, CCT6A, PSMB7, HMGA1). (D) Kaplan–Meier survival curves stratified by risk groups in the TCGA-LUAD, [83]GSE26939, and [84]GSE30129 datasets. High-risk groups exhibited significantly lower survival rates compared to low-risk groups. To begin, we stratified patients into risk groups based on the expression of five genes in the TCGA-LUAD, [85]GSE26939, and [86]GSE30129 cohorts, followed by the generation of Kaplan–Meier survival curves (Fig. [87]3D). The findings indicated that patients in the low-risk group exhibited significantly higher survival rates compared to those in the high-risk group, with the differences being highly statistically significant. These results validate the stable predictive capability of our selected gene combination for the prognosis of NSCLC patients. Further, we conducted univariate and multivariate Cox regression analyses on TCGA-LUAD, [88]GSE13123, [89]GSE26939 cohorts. As shown in Table [90]1, in the univariate Cox regression analysis, LM risk score (calculated based on the expression of five LM-related genes) demonstrated statistical significance in all three cohorts. For example, in TCGA-LUAD, patients in the high LM risk group exhibited a significantly increased risk of mortality (HR = 1.97, 95% CI 1.46–2.67, p < 0.0001), nearly doubling the risk compared to the low-risk group. In [91]GSE26939, LM risk score is the only statistically significant variable in univariate Cox regression. These findings indicate that the LM risk score effectively predicts patient survival outcomes. In the multivariate Cox regression analysis, as shown in Table [92]1, LM risk score still demonstrated statistical significance and showed the strongest prognostic differentiation significance across all three cohorts. To summarize, whether in univariate Cox analysis or multivariate Cox analysis, the survival risk calculated by the prognostic model consisting of the five prognostic genes outperformed clinical or pathological information in predicting patient prognosis. This suggests that incorporating the LM risk score alongside existing clinical characteristics can enhance the predictive capability for patient prognosis. Table 1. Univariate and multivariate cox regression analysis of prognostic factors in NSCLC cancer patients. Variable Univariable cox Multivariable cox HR p value HR p value TCGA-LUAD Age  > 60/ < 60 1.14 (0.83–1.58) 0.421 1.3 (0.93–1.8) 0.122 M_stage  > M0/M0 1.02 (0.74–1.4) 0.927 1.01 (0.73–1.39) 0.969 T_stage  > T2/ < = T2 1.53 (1.11–2.13) 0.010 0.98 (0.69–1.41) 0.932 N_stage  > N0/N0 2.58 (1.93–3.47)  < 0.001 1.3 (0.81–2.1) 0.277 Grade  > II/ < = II 2.91 (2.14–3.95)  < 0.001 2.03 (1.21–3.4) 0.007 LM_risk High_risk/Low_risk 1.97 (1.46–2.67)  < 0.001 1.69 (1.23–2.31) 0.001 [93]GSE13213 Age  > 60/ < 60 1.04 (0.59–1.83) 0.891 1.42 (0.75–2.69) 0.282 Gender F/M 1.36 (0.77–2.39) 0.286 1.1 (0.58–2.06) 0.777 Stage  > = II/ < II 2.54 (1.45–4.45) 0.001 1.31 (0.43–3.93) 0.636 T_stage  > T1/T1 1.54 (0.87–2.73) 0.142 1.19 (0.65–2.18) 0.577 N_stage  > N0/N0 2.71 (1.53–4.8) 0.001 2.28 (0.73–7.14) 0.157 LM_risk High_risk/Low_risk 3.23 (1.75–5.93)  < 0.001 3.15 (1.64–6.05) 0.001 [94]GSE26939 Age  > 60/ < 60 1.69 (0.93–3.07) 0.086 1.35 (0.7–2.59) 0.375 Gender F/M 1.63 (0.96–2.78) 0.069 1.44 (0.78–2.64) 0.241 Grade III/ < III 1.27 (0.75–2.17) 0.377 1.22 (0.7–2.13) 0.491 Stage  > = II/ < II 1.64 (0.96–2.79) 0.069 1.73 (0.99–3.03) 0.056 Smoke 1/0 1 (0.43–2.34) 1.000 0.79 (0.32–1.98) 0.617 Risk High_risk/Low_risk 1.81 (1.06–3.09) 0.029 1.88 (1.07–3.3) 0.027 [95]Open in a new tab The most significant prognostic factor in are highlighted in bold. We compared the prognostic model constructed based on these five genes with other currently popular prognostic gene models to highlight the performance of the GA-EnCox model. The comparison was conducted across six different NSCLC datasets: TCGA-LUAD, [96]GSE13213, [97]GSE26939, [98]GSE30219, [99]GSE31210, and [100]GSE41271 (Table [101]2). We used the C-index (concordance index) as an indicator of the model’s discriminatory power, ranging from 0.5 to 1, with values closer to 1 indicating stronger predictive ability. Table 2. Comparative performance rankings of prognostic models across multiple NSCLC gene expression datasets. TCGA-LUAD [102]GSE13213 [103]GSE26939 [104]GSE30219 [105]GSE31210 [106]GSE41271 Average_rank ^[107]22 0.539 0.538 0.547 0.558 0.656 0.580 7.67 ^[108]21 0.662 0.594 0.500 0.620 0.691 0.558 6.50 ^[109]29 0.658 0.680 0.647 0.765 0.704 0.685 3.00 ^[110]25 0.668 0.644 0.631 0.701 0.714 0.644 3.50 ^[111]24 0.669 0.573 0.583 0.668 0.720 0.604 4.67 ^[112]20 0.643 0.635 0.625 0.680 0.671 0.630 5.50 ^[113]3 0.660 0.678 0.665 0.654 0.785 0.649 3.17 Ours 0.663 0.690 0.665 0.707 0.724 0.674 2.00 [114]Open in a new tab The best and second best results are highlighted in bold and underlined respectively. The comparison results indicated that our model performed outstandingly across most datasets, achieving an average ranking of 2.00, making it the top-performing model among all. Following closely was the model bulit by^[115]29, with an average ranking of 3.00. Xu et al.^[116]3 and Zhao et al.^[117]25 also showed good performance, with an average ranking of 3.17 and 3.50, respectively. Overall evaluation suggests that the prognostic model constructed based on LYPD3, KRT8, CCT6A, PSMB7, and HMGA1 consistently demonstrated excellent performance across multiple independent validation datasets, with average rankings significantly surpassing those of other models (Table [118]2). These results underscore the accuracy and efficacy of the GA-EnCox algorithm, indicate that the prognostic model composed of these five genes possesses stable predictive capability, providing a potentially powerful tool for prognostic assessment in NSCLC. Exploring differences in gene expression, pathways, and immune infiltration in risk groups in LUAD To explore further the differences between high-risk and low-risk groups in LUAD, we analyzed the differential genes, pathways, tumor purity, and immune infiltration between the high-risk and low-risk groups. Differential expression analysis revealed significant variations in gene expression between the high-risk and low-risk groups, which identified 840 downregulated and 842 upregulated genes in the high-risk group. The distribution of these differentially expressed genes can be visually observed in the volcano plot (Fig. [119]4A). For instance, genes such as KRT81, REG4, LYPD3, ANLN, and UCA1 were notably upregulated in the high-risk group, whereas genes like TAC4, BTG2 and PGC were significantly downregulated in the low-risk group. These differences in gene expression may be closely related to changes in lactate metabolism activity. Pathway enrichment analysis of the differentially expressed genes (Fig. [120]4B) indicated that the high-risk group exhibited significantly elevated activity in pathways related to cell proliferation, such as cell cycle and DNA replication. Conversely, the activity of pathways associated with immune response and cell adhesion, like cell adhesion molecule and cytokine-cytokine receptor interaction, was significantly reduced, indicating a more malignant tumor phenotype and notably weakened tumor immunity in the high-risk group.。 Fig. 4. [121]Fig. 4 [122]Open in a new tab Analysis of risk groups identified by 5-LM-gene prognostic model in lung adenocarcinoma patients. (A) Volcano plot representing the differentially expressed genes between high-risk and low-risk groups. The cutoff for adjusted p-value is 0.01. The plot highlights 840 downregulated genes and 842 upregulated genes in the high-risk group compared to the low-risk group. (B) KEGG pathway enrichment analysis of differentially expressed genes between the high-risk and low-risk groups. Pathways enriched by upregulated genes are shown in red, while those enriched by downregulated genes are shown in blue. (C) Boxplots for ESTIMATE Score, Immune Score, and Stromal Score comparing the tumor microenvironment characteristics between high-risk and low-risk groups. (D) Boxplots showing the proportions of various immune cell types between high-risk and low-risk groups in TCGA-LUAD as assessed by the CIBERSORTx algorithm. (E) Boxplots showing the proportions of monocytes and cDC1 in NSCLC tissues between high-risk and low-risk groups from He Fan et al. scRNA-Seq cohort. (F) Boxplots showing the proportions of monocytes and macrophages in NSCLC tissues between high-risk and low-risk groups from Adams Kaminski et al. scRNA-Seq cohort. Statistical significance was determined using Wilcoxon rank-sum test: *P < 0.05, **P < 0.01, ***P < 0.001, ****P < 0.0001. To explore further the mechanisms underlying this prognostic difference, we performed further analyses on immune infiltration. First, substantial differences were observed between the high-risk group and the low-risk group in terms of tumor malignancy and immune cell infiltration. Evaluation of the tumor purity using the ESTIMATE (Fig. [123]4C) revealed that the ESTIMATE Score, Immune Score, and Stromal Score of the low lactate group were significantly higher than those of the high lactate group. This indicates that the TME of patients in the low lactate group harbors a higher content of immune and stromal cells, contributing to lower tumor malignancy. Further analysis using the CIBERSORTx algorithm to assess the infiltration levels of various immune cells (Fig. [124]4D) showed that the low risk group exhibited higher proportions of most immune cell types, particularly key immune cells such as B cells, T cells, and DCs. The deconvolution results of [125]GSE50081 and [126]GSE68465 (Supplementary Fig. [127]1). Also showed that in the low-risk group, tumor immune-promoting cell infiltration such as DC, memory CD4 + T cells was higher than that in the high-risk group, while macrophage infiltration was higher in the high risk group. To further verify the difference in the immune cell infiltration between the high-risk group and the low-risk group, we analyzed two independent scRNA-Seq cohort from Adams Kaminski et al. 2020^[128]30 and He Fan et al. 2021^[129]31. We calculated potential risk from the pseudo count of two scRNA-Seq cohort using the 5-gene-prognosis model obtained from TCGA-LUAD, and classified patients in them into high-risk and low-risk groups. We found that in two independent scRNA-Seq cohort, the proportion of monocyte and DC germline infiltration in the low-risk group was significantly higher than that in the high-risk group (Figs. [130]4E,F), while the proportion of macrophage infiltration in the low-risk group was significantly lower than that in the high-risk group (Figs. [131]4E,F). These findings further confirmed that lactate metabolism-related prognosis risk was related to monocyte lineages such as macrophages and DC’s infiltration. To explore further the relationship between these 5 lactate metabolism-related genes and immune cell infiltration in the tumor tissues of NSCLC patients, we utilized an independent scRNA-Seq dataset ([132]GSE207422). The distribution of different cell types was visualized using the UMAP dimensionality reduction method (Supplementary Fig. [133]1). Subsequently, we analyzed the correlation between the five genes and the infiltration proportions of different immune cell subsets (Supplementary Fig. [134]1). CCT6A, HMGA1, and PSMB7 were significantly positively correlated with the number of endothelial cells and the infiltration proportion of ALDH3A1 + CD8 + Treg cells, while CCT6A was significantly negatively correlated with neural crest cell infiltration. These findings suggest that the selected lactate metabolism-related genes are closely associated with immunosuppressive cells infiltration, further supporting the immunosuppressive role of these genes in tumor immune regulation and demonstrating the biological significance of the GA-EnCox algorithm. The five proposed prognostic biomarkers and prognostic model guide small molecule therapy and immunotherapy Regarding drug response, we analyzed the relationship between risk and the sensitivity to various drugs (IC50). Results indicated that risk score significantly influenced patient drug response (Supplementary Fig. [135]2). For Thapsigargin_180, risk score was significantly negatively correlated with drug response (r = − 0.593, p = 3.31e−49), suggesting that higher risk score corresponds to greater sensitivity to this drug. Similar negative correlations were observed for Fulvestrant_1414, Gefitinib_1010, GSK1904529A_202, indicating that patients with higher lactate metabolism-related risk score might respond better to these drugs. Conversely, for I.BET.762_275, risk score was significantly positively correlated with drug response (r = 0.404, p = 3.09e-21), suggesting that higher lactate metabolism-related risk score correlates with lower sensitivity to this drug. This implies that high risk score might increase resistance to certain drugs, affecting treatment outcomes. Therefore, five biomarkers-related risk score could serve as a predictor for patient response to specific drugs, providing a basis for personalized treatment strategies. To evaluate the predictive value of risk score for immune therapy response, we conducted a TIDE (Tumor Immune Dysfunction and Exclusion) analysis. The results showed (Fig. [136]5A) that the high-risk group had significantly higher TIDE response score than the low-risk group, indicating that patients with high lactate metabolism-related risk score might be less sensitive to immune checkpoint inhibitors and other immune therapies. Overall, a significant positive correlation was observed between the risk score and the TIDE response score, which further substantiates this conclusion. (Fig. [137]5B). In two independent LUAD immunotherapy response cohorts, the predicted risk scores for treatment responders were significantly lower than those for non-responders, further validating the predictive ability of the risk score for immunotherapy response (Fig. [138]5A). Further analysis revealed that (Supplementary Fig. [139]2) the immune exclusion score was significantly lower in the low-risk group, suggesting that their TME is more conducive to immune cell infiltration and function. Conversely, the high-risk group had significantly higher levels of MDSC, suggesting that high lactate metabolism-related risk score may promote the activity of immunosuppressive cells, inhibiting anti-tumor immune responses. Four of the five prognostic biomarkers (CCT6A, HMGA1, KRT8, LYPD3) exhibited significantly higher expression in the non-responsive group compared to the responsive group. (Fig. [140]5C). We validated the relationship between the expression levels of these five key genes and the efficacy of immune checkpoint inhibitor (aPD1) therapy in NSCLC patients. In three independent PD1 ICB treatment cohorts, CCT6A and HMGA1 were all highly expressed in the MPR (pCR) group compared to the NMPR group, thereby validating them as biomarkers for immunotherapy response, indicative of immunotherapy inhibition (Fig. [141]5D). In summary, patients with low lactate metabolism-related risk score showed advantages in overall survival due to better immune cell infiltration, higher anti-tumor immune activity, and better responses to certain drugs and immune therapies. These findings highlight the potential of lactate metabolism intensity as an important biomarker for predicting prognosis and personalizing treatment in NSCLC patients. Fig. 5. [142]Fig. 5 [143]Open in a new tab Analysis of the risk score and five prognostic biomarkers in relation to immune therapy response and immune cell infiltration. (A) Boxplots illustrating the relationship between risk groups and TIDE response scores generated from TCGA-LUAD and the relationship between PD1 immune checkpoint inhibitor response and risk scores generated from [144]GSE135222 and [145]GSE126044 (DCB: durable clinical benefit, NDB: non-durable clinical benefit). (B) Scatter plot showing the linear correlation between the 5-gene risk score and TIDE response scores. Pearson correlation coefficients (r) and p-values are shown. (C) Boxplot showing the expression level differences of the five LM-related prognostic biomarkers between treatment response groups. (D) Boxplots illustrating the expression levels of the five LM-related prognostic biomarkers in response to immune checkpoint inhibitor (aPD1) therapy in LUAD patients, comparing major pathologic response (pathologic complete response) (MPR(pCR)) and non-major pathologic response (NMPR) groups in [146]GSE207422, [147]GSE135222, [148]GSE126044 bulk RNA-seq data. Statistical significance was determined using Wilcoxon rank-sum test: *P < 0.05, **P < 0.01, ***P < 0.001. In summary, through comprehensive studies on prognosis, survival analysis, immune infiltration, and immunotherapy response, we validated the accuracy and biological significance of the lactate metabolism-related genes screened by the GA-EnCox algorithm. These genes not only effectively predict the survival prognosis of NSCLC patients but are also closely associated with the tumor immune microenvironment and indicate the efficacy of immunotherapy. Verification of the lactate metabolism-related prognostic biomarkers in cell-lines and NSCLC patient samples To validate the differences of lactate metabolism-related prognostic biomarkers between NSCLC patient tissues and normal tissues at protein level, we obtained the immunohistochemical (IHC) quantification results of the five prognostic biomarkers in from the Human Protein Atlas (HPA). As shown in Fig. [149]6A, HMGA1, KRT8, PSMB7, and CCT6A were highly expressed in tumor tissues, with expression levels markedly higher than those in normal lung tissues. Although LYPD3 exhibited relatively low protein expression level in both tumor and normal lung tissues, its expression in tumor tissues was still higher than those in normal lung tissues. Fig. 6. [150]Fig. 6 [151]Open in a new tab Validation of the five prognostic biomarkers using IHC. (A) Validation of the differences of lactate metabolism-related prognostic biomarkers between NSCLC patient tissues and normal tissues at protein level, IHC results and average TPM of the five prognostic biomarkers were obtained from the Human Protein Atlas (HPA). (B) Kaplan–Meier survival curves stratified by HMGA1 expression at transcriptome or protein level in the TCGA-LUAD and in-house NSCLC cohort I. (C) IHC results showing the expression level differences of HMGA1 between high risk and low risk patients from in-house NSCLC cohort I. (D) Boxplots illustrating the protein levels of KRT8 comparing high NSCLC grade and low NSCLC grade in in-house NSCLC cohort II. Statistical significance was determined using t-test: *P < 0.05, **P < 0.01, ***P < 0.001. Furthermore, to validate the prognostic effect of the protein expression of the five lactate metabolism-related biomarkers and their relationship with clinical-pathological variables (e.g., grade and stage) in well-defined, consistently treated in-house cohorts, we collaborated with Xuzhou Central Hospital and obtained 18 tumor tissue samples from NSCLC patients from 2021.1 to 2023.12. We divided the patients into two cohorts, cohort I contained 10 patients with accurate survival OS and OS. Time were used to verify the prognostic effect of prognostic biomarkers, and cohort II contained 8 patients with detailed pathological information were used to analyze the association between prognostic biomarkers and grade and stage. Information of these two cohorts is in Supplementary Tables [152]2 and [153]3. IHC staining and quantification were performed for HMGA1 in cohort I and KRT8 in cohort II. In cohort I, High protein abundance of HMGA1 is significantly associated with poor prognosis (Fig. [154]6B, p = 0.012). RNA-Seq data from TCGA-LUAD also demonstrated that HMGA1 significantly affected patient survival (Fig. [155]6B, p = 0.0016). In the comparison of HMGA1 staining results between patients within different overall survival time in cohort I, it was evident that HMGA1 expression was significantly higher in tissues of NSCLC patients with shorter overall survival time (Fig. [156]6C). Additionally, in cohort II, KRT8 protein expression was significantly associated with tumor grade, with tumor grade higher than II showing significantly higher KRT8 protein expression than those with other patients (Fig. [157]6D). In summary, we validated the predictive strength of the biomarkers in the prognostic model and their relationships with tumor stage and grade in both publicly available IHC data and an in-house NSCLC patient cohort. These results highlight the prognosis prediction effectiveness of our lactate metabolism-related prognostic model and its associated biomarkers. In previous study, HMGA1 was verified to promote the proliferation of the NSCLC cell line^[158]32. Here, we further validated its effects on other malignant phenotypes by conducting colony formation, invasion, migration, and wound healing assays in NSCLC cell lines. In the A549 cell model, the inhibition of HMGA1 expression (A549-Si-1 and A549-Si-2) significantly decreased the capabilities of colony formation, invasion, migration, and wound healing compared with the control group (A549-SINC) (Fig. [159]7A). This indicates that HMGA1 plays a crucial role in promoting these cellular behaviors. Statistical analysis further confirmed these differences, with all four indices significantly lower in the siRNA interference group compared to the control group. In the case of overexpressed HMGA1 (A549-OE), the abilities of colony formation, invasion, migration, and wound healing were significantly higher than those in the control group (A549-HA, Fig. [160]7B). This further demonstrates that high expression of HMGA1 enhances these cellular behaviors. Similarly, most indices showed statistically significant differences, highlighting the pivotal role of HMGA1 in regulating these cellular functions. The tumor phenotype experiments in H1299 and PC9 with knocked out and overexpressed HMGA1 yielded similar results (Supplementary Figs. [161]3–[162]5). These findings further support the potential critical role of HMGA1 in cancer progression, including tumor growth, metastasis, and invasion. Fig. 7. [163]Fig. 7 [164]Open in a new tab The effect of HMGA1 on colony formation, invasion, migration, and wound healing in A549 cells. (A) Analysis of A549 cells with suppressed HMGA1 expression. Colony formation, invasion, migration, and wound healing percentages of control cells (A549-SiNC) are compared to HMGA1 knockdown cells (A549-Si-1 and A549-Si-2) using siRNA. (B) Analysis of A549 cells with HMGA1 overexpression. Colony formation, invasion, migration, and wound healing percentages of control cells (A549-HA) are compared to cells with increased HMGA1 expression (A549-OE). Results are presented as mean ± SD. Statistical significance was determined using appropriate tests (e.g., t-test or ANOVA): **P < 0.01, ***P < 0.001. Discussion Our study systematically revealed the significant role of lactate metabolism in non-small cell lung cancer (NSCLC) through multi-level analysis, elucidating the impact of associated genes on patient prognosis. First, we extracted gene sets related to lactate production and transportation from the MSigDB and GeneCard database. After rigorous filtering, we identified a final set of 364 LM-related genes. In independent single-cell sequencing cohorts of malignant epithelial cells, we found that the UCell scores of these LM-related genes were significantly positively correlated with malignant cells, epithelial cells and stromal cells. This indicates that lactate metabolism is not only active in tumor cells but also plays an essential role in non-malignant cells under the tumor microenvironment. At the patient level, we applied 364 lactate metabolism-related genes to multiple independent cohorts. Based on the GSVA scores of the LM gene set (LM-strength), patients were categorized into two groups: high lactate metabolism strength (LM-high) and low lactate metabolism strength (LM-low). Survival analysis results indicated that patients in the LM-low group had significantly better prognosis compared to the LM-high group, underscoring the reliability of tumor-specific lactate metabolism-related biomarkers derived from scRNA-Seq a prognostic indicator. To identify the key genes that play a core role in lactate metabolism and prognosis, we employed the GA-EnCox algorithm to screen five significant prognostic genes from 364 lactate metabolism-related genes: CCT6A, HMGA1, KRT8, PSMB7, and LYPD3. To address the challenges in NSCLC prognostic prediction and biomarker screening, we integrated bulk transcriptomic and single-cell transcriptomic data while considering the targetability of prognostic biomarkers. We developed a tumor prognostic prediction model using a genetic algorithm and an Elastic Cox model. This approach formulates the prognostic biomarker optimization problem based on the Elastic Cox model, with tumor prognosis indicators as the optimization objective. The genetic algorithm is employed to solve the above optimization problem for optimizing the selection of key prognosis biomarkers iteratively, resulting in the GA-EnCox prognostic model, which demonstrates superior prognostic prediction performance compared to the existing tumor prognostic models across multiple NSCLC datasets. The study validates the effectiveness of the NSCLC prognostic biomarkers identified by the proposed model from various perspectives, including immune infiltration and drug response. The prognostic model constructed based on these five genes demonstrated excellent performance in multiple independent datasets, with survival analysis showing that patients in the low-risk group had significantly higher survival rates than those in the high-risk group. Multivariate Cox regression analysis further confirmed the independent prognostic value of this model. Compared to other lactate metabolism-related prognostic models, our model shows a clear advantage in predictive power, with the highest average C-index. This may be attributed to our optimized gene selection using the GA-EnCox algorithm and the critical roles of the selected genes in tumor metabolism and progression. Previous studies often focused on individual genes or did not fully consider the lactate metabolism pathway, our research systematically analyzed the overall impact of lactate metabolism, providing more accurate prognostic predictions. Our findings indicate that these five genes have been confirmed to be associated with malignant tumor phenotypes in NSCLC, demonstrating the validity of our approach. We are the first to propose these five genes as a collective set of biomarkers, among which CCT6A is introduced for the first time as a prognostic biomarker. Specifically, silencing CCT6A inhibited the proliferation and migration of LUAD cells and increased apoptosis rates. CCT6A interacts with the STAT1 protein, protecting STAT1 from ubiquitin-mediated degradation, enhancing its stability, and promoting the transcription of hexokinase 2 (HK2), thereby stimulating aerobic glycolysis and progression of LUAD^[165]33. Additionally, aberrant expression of KRT8 is closely related to tumor progression, including drug resistance, cell migration, and cell adhesion^[166]34,[167]35. PSMB7 belongs to the human proteasome gene family (PSM), which consists of 49 genes and plays a critical role in cancer proteasome inhibition. PSMB7 promotes the proliferation, metastasis, and bortezomib (Bortezomib) resistance of multiple myeloma^[168]36. This evidence supports the potential of PSMB7 as a therapeutic target. LYPD3, also known as C4.4A, is a membrane protein partially anchored to the cell surface through glycosylphosphatidylinositol (GPI). Studies have shown that LYPD3 is highly expressed in several human malignancies and is associated with poor prognosis. Overactivation of the JUP/AGR2/LYPD3 signaling axis contributes to the regulation of melanoma cell stemness and promotes glycolysis through a pathway dependent on glucose transporter 1^[169]37. HMGA1 is an abundant non-histone chromatin protein associated with embryonic development, cancer, and cellular senescence. HMGA1 reduces the sensitivity of esophageal squamous cell carcinoma (ESCC) to ferroptosis by upregulating the expression of SLC7A11, maintaining intracellular glutathione homeostasis, and inhibiting the accumulation of malondialdehyde (MDA), thereby restricting ferroptosis^[170]38. Furthermore, HMGA1 interacts with the transcription factor SP1 to enhance its binding to the TKT promoter, thereby activating the pentose phosphate pathway and promoting the proliferation and tumor growth of ESCC cells^[171]6. In breast cancer, HMGA1 mediates the epithelial-mesenchymal transition (EMT) of invasive breast cancer through the Wnt/β-catenin pathway^[172]39. These studies support our findings that high HMGA1 expression is associated with immune therapy response, suggesting that HMGA1 may serve as a potential biomarker for predicting the efficacy of immunotherapy. To explore further the mechanisms behind the prognostic differences associated with lactate metabolism-related risk, we analyzed the differences between the high-risk and low-risk groups in terms of gene expression, immune infiltration, drug sensitivity, and immunotherapy response. In the differential gene expression analysis, for instance, genes such as KRT81, REG4, LYPD3, ANLN, and UCA1 were notably upregulated in the high-risk group, whereas genes like TAC4, BTG2 and PGC were significantly downregulated in the low-risk group. Pathway enrichment analysis of the differentially expressed genes revealed significant activation of tumor proliferation-related pathways in the high-risk group, while pathways related to cell adhesion and immune response were notably suppressed. Regarding immune infiltration, the high-risk subtype exhibited lower tumor purity scores, immune scores, and stromal scores compared to the low-risk subtype, indicating higher tumor purity and lower infiltration of immune and stromal cells in the high-risk phenotype. Further deconvolution analysis in bulk RNA-Seq and correlation analysis in independent scRNA-Seq cohorts revealed significantly higher infiltration of M0 macrophages in the high-risk subtype, whereas the low-risk subtype showed higher infiltration of CD4 + memory T cells. We also analyzed the differences in immune infiltration between the high and low risk groups in multiple scRNA-Seq databases. Macrophages and regulatory T cells in the high risk group were generally highly infiltrated, while monocytes and DC in the low risk group were generally highly infiltrated. In drug sensitivity analysis, we found a significant correlation between risk score and the IC50 values of various drugs. Patients in the high-risk group were more sensitive to drugs like Thapsigargin_180, Fulvestrant_1414, suggesting that these drugs may be more effective against patients with high lactate metabolism-related risk phenotypes. Conversely, patients in the low-risk group were more sensitive to drugs like I.BET.762_275, ACY.1215_264, indicating potential efficacy against patients with low lactate metabolism-related risk phenotypes. This offers the possibility of personalized drug selection based on risk status. Regarding immunotherapy response, patients in the low-risk group were more likely to respond to immune checkpoint inhibitors. This observation aligns with their tumor microenvironment characterized by lower levels of immune exclusion score and MDSC. Among the five biomarkers, four exhibited significant differences in the immune response subgroup. Notably, CCT6A and HGMA1 were confirmed in an independent NSCLC aPD-1 treatment cohort through bulk RNA-Seq, indicating that both serve as biomarkers for immunotherapy response inhibition. Finally, we further validated these lactate metabolism-related prognosis biomarkers at protein level in public database and in-house NSCLC patient cohorts. We validated the predictive strength of the biomarkers in the prognostic model and their relationships with tumor stage and grade. This further validates the accuracy of our prognostic model and the clinical significance of the prognostic biomarkers. This study has certain clinical significance. By quantifying the expression of these five biomarkers in the tumor tissue of patients, it is possible to predict survival risk score and immunotherapy response to a certain extent, thereby guiding therapeutic strategies. Moreover, combination therapy represents a promising direction. Currently, several clinical trials targeting cancers such as renal carcinoma and melanoma are underway, combining biomarker inhibitors with existing immune checkpoint blockade (ICB) therapies, such as PD1 or PDL1-targeted immune checkpoint inhibitors with VEGFR-targeted TKIs^[173]40. This indicates that combining lactate metabolism-related biomarkers with immune checkpoint inhibitors might be a highly promising therapeutic strategy. However, these biomarkers and the prognostic models constructed based on them still have many limitations in clinical practice: The tumor microenvironment is highly complex, with numerous factors influencing immune infiltration, the model based on only five genes is insufficient to comprehensively capture immune suppression and other patient phenotypes. Other malignant phenotypes associated with drug resistance also exist. Lactate-related prognostic biomarkers need to be combined with other biomarkers, requiring further research into new biomarkers. In summary, we proposed a novel workflow and methodology for identifying lactate-related prognostic biomarkers in tumors, and the identified biomarkers have been validated in multiple studies. In contrast to previous studies, which primarily focused on the function of individual genes or pathways, our work systematically integrates the correlation between lactate metabolism and NSCLC prognosis and obtained a set of biomarkers. By constructing a model that includes five key genes, we have improved the accuracy and reliability of prognostic predictions. The key genes identified—CCT6A, HMGA1, KRT8, PSMB7, and LYPD3—provide important clues for further exploration of lactate metabolism mechanisms and the development of new therapeutic strategies. Furthermore, this procedure exhibits general applicability and can be employed for the identification of any phenotype-related prognostic biomarkers in any type of cancer. This study was expected to provide important reference value for prognosis assessment and personalized treatment decision of cancer patients. However, this study has some limitations. The specific regulatory mechanisms of lactate metabolism on the immune microenvironment require further investigation, and the prognostic model established needs to be validated by more prospective clinical trials。 Methods To address the challenges in NSCLC prognostic prediction and biomarker screening, we developed a tumor prognostic prediction model using a genetic algorithm and an Elastic Cox model. This approach formulates the prognostic biomarker optimization problem based on the Elastic Cox model, with tumor prognosis indicators as the optimization objective, while the genes related to lactate metabolism as the optimization variables. Selected biomarkers should meet the following constraints: highly express in malignant cell subsets in multiple scRNA-Seq cohorts, have correlation to lactate metabolism strength and accord with targetability. The genes used to calculate prognostic risk in this model serve as prognostic biomarkers. The genetic algorithm is employed to solve the above optimization problem for optimizing the selection of key prognosis biomarkers iteratively, resulting in the GA-EnCox prognostic model. Selection of lactate metabolism-related genes Firstly, lactate production and transport-related gene sets were obtained from MSigDB (Shown in Supplementary Table [174]1). Additionally, genes with a relevance score greater than 10 for “lactate” were retrieved from GeneCard, resulting in a basic LM gene set comprising 288 genes. Within three independent NSCLC scRNA-Seq cohorts: Wu Zhou et al cohorts, Goveia et al cohorts and Maynard et al cohorts^[175]26–[176]28, genes showing a significant positive correlation with the basic LM UCell score in malignant cells (PCC > 0, p-adjust < 1e−4) were selected. Concurrently, Seurat FindMarker was utilized to identify markers highly expressed in malignant cells compared to other cells (logFC > 0.15, min.pct = 0.1, p-adjust < 1e−4). The intersection of these two criteria formed the LM-related gene set. To ensure gene intervenability, markers from immune cells (e.g., B cells, DCs) were excluded. Ultimately, intersecting LM-related gene sets from three independent NSCLC scRNA-Seq cohorts, a subset of 364 intervenable LM-related genes was identified. Bulk RNA-Seq and microarray data processing: RNA-Seq count matrices for TCGA-LUAD were obtained from the GDC data portal, and patient survival and clinical information were sourced from the UCSC Xena database (n = 596). The datasets [177]GSE13213 (n = 117), [178]GSE26939 (n = 115), [179]GSE30219 (n = 85), [180]GSE31210 (n = 226), and [181]GSE41271 (n = 182) were utilized to validate the prognostic classification capabilities of the LM-related gene set and the LM-related gene prognostic model. The RNA-Seq data of NSCLC patients before PDL1 ICB treatment from [182]GSE126044, [183]GSE135222, [184]GSE207422 were used to validate the relationship between the LM risk score and the response to PDL1 ICB treatment, with all aforementioned data downloaded from GEO. LM-strength was quantified as the GSVA^[185]41,[186]42 score of LM-related genes. Prior to calculating the GSVA enrichment scores, the count matrices were transformed to log2(TPM + 1). Differential expression between high-risk and low-risk samples was analyzed using the DeSeq2^[187]43 method (FDR < 0.01). ESTIMATE was employed to calculate tumor purity, stromal scores, and immune infiltration scores, thereby evaluating the content of immune and stromal cells in the tumor microenvironment. CIBERSORTx^[188]44 was used to estimate the proportions of various immune cell infiltrations (e.g., B cells, T cells, monocytes) from bulk RNA-Seq data. OncoPredict^[189]45 was employed to predict the response of each LUAD sample to small-molecule drugs from the GDSC and CTRP databases (IC50), while TIDE was employed to estimate the levels of immune checkpoint abundance, T cell functionality, and immunotherapy response. ScRNA-Seq data processing The processing of single-cell RNA sequencing (scRNA-Seq) data was conducted using the R Seurat package^[190]46. For all of scRNA-Seq data, cells were first filtered to retain those with mitochondrial gene content of less than 20%, ribosomal gene content of less than 50%, gene count greater than 200, and a housekeeping gene score (mean expression of ACTB, GAPDH, and MALAT1) greater than 1^[191]47. Normalization was performed using SCTransform under default parameters, and data from multiple patients were integrated using Harmony^[192]48 (max.iter = 20). Clustering of the integrated expression matrix was then conducted with the top 50 principal components and a resolution of 0.6. For [193]GSE207422, Cell type annotation of single-cell subsets was executed using both SingleR^[194]49 (HPCA data as reference) and ScType^[195]50 (“Lung” and “Immune System” as tissue) methods. LM-strength in scRNA-Seq was quantified by the UCell score of LM-related genes. Further analysis in [196]GSE207422 involved clustering T-NK cell and monocyte subsets separately (resolution = 0.3, with clustering results provided in Supplementary Fig. [197]1 and further subdividing subsets based on marker genes (selected by FindAllMarkers function under default parameters). For Adams Kaminski et al. 2020 and He Fan et al. 2021, Cell type annotations were curated from^[198]51. To calculate the correlation between gene expression from scRNA-Seq and the proportion of various cell types in the tumor microenvironment, AggregateExpression was used to compute the average gene expression across all cells, converted to log2(CPM + 1), and the Pearson correlation coefficient (PCC) was calculated relative to the proportions of different cell types in the TME. Construction of GA-EnCox: A LM-related prognostic biomarker selection method For the 364 LM-related genes, the GA-EnCox algorithm was employed to select prognostic genes and construct a prognostic model. In the genetic algorithm, each bit of the chromosome corresponds to a prognostic biomarker (gene) to be selected, with a value of either 0 or 1. The chromosome consists of 364 bits, where a value of 1 at a specific position indicates that the corresponding prognostic biomarker (gene) is selected. The initial population consists of 500 chromosomes, all of which are randomly generated. The algorithm flowchart is illustrated in Fig. [199]3A. The crossover strategy adopted was a two-point crossover, with a crossover probability of 0.5. The mutation strategy implemented was a bit-flip mutation, with a mutation probability of 0.2. The selection strategy used was tournament selection, and the number of iterations of the genetic algorithm was set to 200. For fitness calculation, the Cox Elastic-net model was employed to evaluate the survival risk of the selected LM genes. Compared to Lasso-Cox, the Cox Elastic-net model incorporates an L2 regularization term into the partial likelihood function to balance model fitting precision and generalization capability. The partial likelihood function is as follows: graphic file with name 41598_2025_85620_Article_Equa.gif Inline graphic represents the partial likelihood function in Cox regression, Inline graphic denotes the regression coefficients, Inline graphic represents the regularization coefficient, and Inline graphic signifies the coefficient for the L1 penalty term in the regularization component. For the Cox Elastic-net model, parameters are set with Inline graphic =0.9 and Inline graphic =0.2. A fivefold cross-validation is performed on the TCGA-LUAD dataset using the average C-index as the fitness of each chromosome in GA, in which the C-index is calculated based on the predicted survival risk. The genes corresponding to the chromosome with the highest fitness throughout the iterations is selected to fit the model. Genes with non-zero variable coefficients are chosen as the final LM-associated prognostic genes. These LM-associated prognostic genes are used to calculate patient survival risk through linear regression. The Cox Elastic-net and genetic algorithm are implemented based on Python’s scikit-survival and DEAP^[200]52 libraries, respectively. Tumor phenotyping experimental procedures Patient samples collection Tissue specimens were collected from 16 patients with non-small cell lung cancer (NSCLC) at Xuzhou Central Hospital, Xuzhou, China, through surgical resections performed between January 2021 and December 2023. None of the patients had received preoperative radiotherapy, chemotherapy, or targeted therapy. We divided the patients into two cohorts. cohort I composed of 10 patients with accurate survival time was used to verify the prognostic effect of prognostic biomarkers; cohort II composed of 8 patients with detailed pathological stages was used to analyze the relationship between prognostic biomarkers and grade and stage. Tissue samples were fixed in formalin, trimmed into pieces measuring approximately 1.5 × 2.0 cm with a thickness of 0.2–0.3 cm, and processed through dehydration and clearing steps to remove water and alcohol. The tissues were then embedded in paraffin, sectioned into 4–5 micron slices using a microtome, and mounted onto adhesive slides to produce unstained tissue slides for immunohistochemical analysis. Clinical and pathological characteristics of the patients are summarized in Supplemental Table [201]2. The study was approved by the Ethics Committee of Xuzhou Central Hospital for Biomedical Research (Approval No. XZXY-LJ-20191210-045), and written informed consent was obtained from all participants. All procedures involving human participants were conducted in accordance with ethical guidelines and regulations. Cell culture and transfection Human lung adenocarcinoma cell lines A549, H1299 were obtained from ATCC (cat # CCL-185, cat # CRL-5803, respectively). PC9 cell line were purchased from Chinese Academy of Sciences (Beijing, China). A549 and H1299 were cultured in RPMI-1640 medium supplemented with 10% FBS and 1% P/S at 37 °C under 5% CO2 conditions. PC9 were cultured in DMEM medium supplemented with 10% FBS and 1% P/S at 37 °C under 5% CO2 conditions. HMGA1 knockdown was performed using two specific siRNAs. Over expression of HMGA1 was achieved by transfecting pcDNA3.1-HMGA1 plasmid. Transfections were conducted using jetPRIME® (Polyplus) according to the manufacturer’s instructions. Colony formation assay Cells in the logarithmic growth phase were digested with trypsin and resuspended in complete medium (RPMI-1640 with 10% FBS for A549, H1299 and DMEM with 10% FBS for PC9) to create a cell suspension, which was then counted. A total of 1,000 cells per well were seeded into 6-well culture plates for each experimental group and cultured for 7–10 days. The medium was replaced and the cell status was observed every 3 days. Upon completion of the cloning process, cells were photographed under a microscope, washed once with PBS, and fixed with 1 mL of 4% paraformaldehyde for 30–60 min, followed by another wash with PBS. Subsequently, 1 mL of crystal violet staining solution was added to each well and incubated for 10–20 min to stain the cells. Transwell migration and invasion assay Invasion assays were conducted using transwell chambers coated with Matrigel (8 μm pore size). The lower chamber was filled with 800 μL of culture medium containing 10% FBS. Place the chamber into a 24-well plate and add 200 μL cell suspension with a concentration of 3 × 10^5 cells containing serum-free medium into the upper chamber. After culturing for 36 to 48 h, remove the chamber, rinse it with PBS, and use a cotton swab to remove the cells and Matrigel from the upper chamber. Fix the chamber with 4% paraformaldehyde for 20 min, stain with 0.1% crystal violet for 30 min, and rinse it three times with PBS. Cells from five random areas per well were photographed and counted using ImageJ software. As for migration assays, matrigel was not utilized, 2 × 10^5 cells was used in the upper chamber and culturing lasted for 24 h. Wound healing assay Cells were cultured in 6-well plates. Upon reaching 90% confluence, a sterile pipette tip was employed to draw a straight line across the cell layer to create a wound, followed by washing the cells with PBS to get rid of cellular fragments. Fresh culture medium was added, and wound closure was observed at 24 h using an inverted microscope. The wound area was measured using ImageJ software. Immunohistochemistry Unstained tissue slides in cohort I were used for IHC analysis to detect the target proteins HGMA1 while slides in cohort II were used for KRT8 staining. Slides were incubated with either rabbit anti-HMGA1 primary antibody (cat # 29895–1-ap, Proteintech, dilution 1:100) or rabbit anti-KRT8 primary antibody (cat # 17514–1-ap, Proteintech, dilution 1:1000). The chromogenic reaction was performed using DAB substrate, and the slides were counterstained with hematoxylin, which stained the nuclei blue, while positive signals appeared as brownish-yellow. Images were captured using Olympus microscopes and the PANNORAMIC-MIDI scanner. Aipathwell software was utilized for image analysis. The expression of HGMA1 and KRT8 was quantified by IRS and H-score. The immunoreactivity score (IRS) was calculated as the product of staining intensity (SI) and the percentage of positive cells (PP). Staining intensity was graded as 0 (no staining), 1 (light yellow, weak positive), 2 (brown–yellow, moderate positive), or 3 (brown, strong positive). The percentage of positive cells was scored as 0 (0–5%), 1 (6–25%), 2 (26–50%), 3 (51–75%), or 4 (> 75%). Additionally, the H-Score was determined using the formula H-Score = ∑(pi × i) where i represents the staining intensity (0: no staining, 1: weak positive, 2: moderate positive, 3: strong positive) and pi represents the percentage of cells at each intensity level. For survival analysis of HMGA1 in cohort I, we set the threshold for high expression at H-score = 100. Statistical analysis All data were analyzed using R software (Version 4.3.2). The Wilcoxon rank-sum test was employed for between-group comparisons, with all statistical tests being two-tailed and p-values less than 0.05 considered statistically significant. Correlation analysis was conducted using Pearson’s correlation method, and correlation p-values were adjusted using the BH method. Survival analysis was performed using the Kaplan–Meier method, with differences in survival assessed via the log-rank test. The prognostic C-index was utilized for model evaluation. Supplementary Information [202]Supplementary Material 1.^ (4.3MB, docx) Acknowledgements