Abstract

   Lactate metabolism (LM) plays a crucial role in tumor progression and
   therapy resistance in non-small cell lung cancer (NSCLC). Several
   methods had been developed for NSCLC prognosis prediction based on
   lactate metabolism-related information. The existing methods for the
   construction of prognosis prediction models are mostly based on single
   models such as linear models, SVM, and decision trees. Prognosis
   biomarkers and prognosis prediction models based on this kind of
   methods often have limited prognostic performance. In this study, we
   proposed a novel methodology for constructing prognosis prediction
   model and identifying lactate-related prognostic biomarkers in NSCLC.
   We first screened for lactate metabolism-related malignant genes from
   the scRNA-Seq data of NSCLC malignant cells. We proposed a Cox
   elastic-net regression combined with genetic algorithm (GA-EnCox) to
   predict prognosis and optimize the selection of key biomarkers. We
   identified five key LM-related genes (LYPD3, KRT8, CCT6A, PSMB7, and
   HMGA1) that significantly correlated with patient prognosis in LUAD
   cohorts. The prognostic model constructed with these genes outperformed
   other currently popular models across multiple datasets, demonstrating
   stable predictive capability. Survival analysis based on bulk RNA-Seq
   data demonstrated that the low-risk group had significantly better
   overall survival compared to the high-risk group. Further analysis
   revealed that lactate metabolism-related prognosis risk might be
   associated with monocyte lineages such as macrophages and DC’s
   infiltration and these prognosis biomarkers may indicate the
   therapeutic responses of immune checkpoint inhibitors for NSCLC
   patients. More importantly, we validated HMGA1 and KRT8 at protein
   level and their association with histologic grades, stages, and
   clinical outcomes in consistently treated in-house NSCLC cohorts.
   Finally, we experimentally validated one of the biomarkers, HMGA1,
   confirming its role in promoting malignant phenotypes of NSCLC. This
   study provides valuable insights into the role of lactate
   metabolism-related biomarkers and their impact on patient outcomes, it
   was expected to provide important reference value for prognosis
   assessment and personalized treatment decision of NSCLC patients.

   Subject terms: Tumour biomarkers, Non-small-cell lung cancer,
   Computational biology and bioinformatics

Introduction

   Lung cancer is the most frequently diagnosed cancer worldwide and the
   leading cause of cancer-related deaths^[40]1–[41]3. Non-small cell lung
   cancer (NSCLC) accounts for 80–85% of all lung cancer cases^[42]4.
   Despite advancements in therapeutic approaches, the prognosis for NSCLC
   remains poor, with a low five-year survival rate^[43]5. The high
   heterogeneity and complex molecular mechanisms of NSCLC complicate
   treatment efforts, necessitating a deeper understanding of its
   biological characteristics to develop novel therapeutic
   strategies^[44]6. The significance of glucose metabolism in cancer has
   been recognized for nearly a century, with studies demonstrating a high
   dependence of tumor cells on glucose, known as the “Warburg
   effect”^[45]7. Under this effect, tumor cells preferentially utilize
   anaerobic glycolysis over oxidative phosphorylation even under aerobic
   conditions. Although anaerobic glycolysis has a lower ATP yield, its
   rapid rate provides the necessary energy and metabolic precursors for
   rapid proliferation^[46]8,[47]9. The substantial lactate production in
   tumor metabolism is transported out of cells by monocarboxylate
   transporters MCT1 and MCT4, a process critical for maintaining tumor
   cell survival and growth^[48]10. Lactate increases the resistance of
   chemo-therapy such as PI3K inhibitors or tamoxifen in colon cancer and
   estrogen receptor (ER) positive breast cancer^[49]11–[50]13. Also,
   metabolic symbiosis mediated by lactate results in drug resistance of
   TKI inhibitors in gastric cancer^[51]14.

   Identifying NSCLC-related biomarkers to predict prognosis and
   immunotherapy responses is essential for developing effective
   personalized treatment strategies. Several genes, such as LMNB2 and
   FNDC4, have already been confirmed as prognostic biomarkers for NSCLC.
   However, lactate metabolism-related prognostic biomarkers for NSCLC
   have not yet been thoroughly investigated. Although several studies
   have demonstrated that targeting lactate metabolism-related enzymes and
   transporters effectively inhibits tumor growth^[52]15–[53]18. The
   development of inhibitors progressing to clinical application remains
   limited. The narrow substrate-binding site of LDHA poses significant
   challenges for its selective and effective inhibition in cancer, and no
   LDH inhibitors with significant selectivity and in vivo efficacy have
   been identified thus far^[54]19. Moreover, given the critical role MCTs
   play in maintaining tissue homeostasis, inhibiting MCTs may lead to
   toxicities, such as metabolic alterations (ketoacidosis) due to reduced
   ketone body uptake in MCT1-deficient cells^[55]10. Consequently,
   exploring alternative lactate metabolism-related targets for indirect
   regulation of lactate metabolism is warranted.

   Several sets of lactate-associated tumor biomarkers in NSCLC have been
   proposed^[56]3,[57]20–[58]25, yet they exhibit notable limitations. The
   existing methods are mostly based on single models such as linear
   models, SVM, and decision trees. Prognosis biomarkers selected based on
   a single model often have limited prognostic performance. Additionally,
   these methods rarely consider the targetability of the prognostic
   biomarkers identified by the model and barely optimize the selection of
   key biomarkers affecting prognostic prediction performance based on
   prognostic prediction metrics.

   In this study, we proposed a novel methodology for constructing
   prognosis prediction model and identifying lactate-related prognostic
   biomarkers in NSCLC. This approach formulates the prognostic biomarker
   optimization problem based on the Elastic Cox model, with tumor
   prognosis indicators as the optimization objective. The genetic
   algorithm (GA) is employed to solve the above optimization problem for
   optimizing the selection of key prognosis biomarkers iteratively,
   resulting in the GA-EnCox prognostic model, which demonstrates superior
   prognostic prediction performance compared to the existing tumor
   prognostic models across multiple NSCLC datasets. The overview of this
   study is illustrated in Fig. [59]1. We first identified lactate
   metabolism-related malignant genes (LM-related gene) by integrating
   genes implicated in the lactate metabolism process with malignant cell
   markers from scRNA-Seq data. Further, using the GA-EnCox method, we
   selected five biomarkers from LM-related gene, which are highly
   correlated with NSCLC prognosis, and validated the superior prognostic
   predictive capacity of the 5-LM-related gene signature across various
   independent NSCLC gene expression databases. We then analyzed the
   impact of predicted risk on pathway activity, immune cell infiltration,
   small molecule drug response, and immunotherapy response. Finally, we
   validated HMGA1 and KRT8 at protein level and their association with
   histologic grades, stages, and clinical outcomes in consistently
   treated in-house NSCLC cohort and validate HGMA1 effect on malignant
   phenotypes in NSCLC cell-lines.

Fig. 1.

   [60]Fig. 1
   [61]Open in a new tab

   The overview of the development, validation, and clinical utility of
   core LM-related prognostic biomarkers. Abbreviations: LUAD, Lung
   adenocarcinoma; NSCLC, non-small-cell lung cancer; LM: lactate
   metabolism.

Result

Lactate metabolism-related genes identified from NSCLC scRNA-Seq

   We obtained a gene set associated with lactate production and transport
   from MSigDB and retrieved genes related to lactate metabolism from
   GeneCard, resulting in a basic LM-related gene set comprising 277 genes
   (Supplementary Table [62]1). We next considered screening
   tumor-specific lactate-associated genes from multiple independent
   scRNA-Seq cohorts. In malignant cells in Wu Zhou et al cohorts^[63]26,
   Goveia et al. cohorts and Maynard et al cohorts^[64]27,[65]28, we
   selected genes that were significantly positively correlated with the
   enrichment scores of the basic-LM gene set and were specifically highly
   expressed in malignant cells. To ensure potential for therapeutic
   intervention, we excluded marker genes of cells that play a positive
   role in tumor immunity. Tumor-specific lactate-related genes were
   identified independently from three distinct scRNA-Seq cohorts. The
   intersection of these three sets resulted in a final collection of 364
   tumor-specific lactate-related genes (Fig. [66]2A).

Fig. 2.

   [67]Fig. 2
   [68]Open in a new tab

   Analysis and prognostic validation of lactate metabolism-related genes
   identified from single-cell sequencing data. (A) Overlap of lactate
   metabolism-related genes identified in three independent NSCLC
   scRNA-Seq cohorts. (B) DimPlot showing malignant cells and lactate
   metabolism intensity in three cohorts. (C) Violin plot displaying the
   enrichment of lactate metabolism-related genes across various cell
   types. (D) Kaplan–Meier survival curves for the TCGA-LUAD,
   [69]GSE13213, and [70]GSE31210 cohorts comparing survival probabilities
   between LM-high and LM-low groups.

   Next, we performed UMI dimensionality reduction clustering analysis on
   single-cell sequencing data from NSCLC tissues and we assessed lactate
   metabolism intensity based on 364 LM-related genes by UCell in
   different cohorts (Fig. [71]2B) and examined the enrichment of lactate
   metabolism-related genes in various cell types (Fig. [72]2C). The
   results indicated that, in addition to malignant cells, epithelial
   cells and stromal cells also exhibited higher LM strength. To further
   investigate lactate metabolism in NSCLC patients, we applied these 364
   LM-related genes to the TCGA-LUAD cohort to classify patients according
   to the intensity of lactate metabolism. Based on the GSVA levels of
   LM-related genes, we divided the patients into LM-high and LM-low
   groups. We analyzed the chromosome effects of lactate metabolism
   associated groups in three independent cohorts: TCGA-LUAD,
   [73]GSE13213, and [74]GSE31210. (Fig. [75]2D). In the TCGA-LUAD
   dataset, survival analysis revealed a significant difference in
   survival rates between the LM-high and LM-low groups (p = 0.00064),
   with the LM-low group demonstrating significantly better survival
   outcomes compared to the LM-high group (HR [95% CI] = 0.595 [0.445,
   0.795]). Similarly, survival analysis in the [76]GSE13213 also
   indicated that patients in the LM-low group had significantly better
   survival outcomes than those in the LM-high group (p < 0.0001, HR [95%
   CI] = 0.257 [0.146, 0.453]). Additionally, in the [77]GSE31210, we
   observed analogous results, with the LM-low group exhibiting
   significantly higher survival rates compared to the LM-high group
   (p < 0.0001, HR [95% CI] = 0.178 [0.092, 0.346]). Across multiple
   independent cohorts, the application of LM-related gene sets identified
   consistently revealed that lactate metabolism intensity is closely
   related to the prognosis of NSCLC patients, with the LM-low group
   demonstrating significantly better survival outcomes compared to the
   LM-high group. In summary, these data suggest that lactate metabolism
   intensity may be an important factor influencing the prognosis of NSCLC
   patients.

GA-EnCox identified lactate metabolism-related core biomarkers with stronger
predictive ability

   The intensity of lactate metabolism is closely associated with patient
   survival prognosis, but the specific LM-related genes that play core
   roles in this process remain unclear. To investigate further the
   relationship between lactate metabolism intensity and prognosis in LUAD
   patients, we developed a tumor prognostic prediction model using
   genetic algorithm and Elastic Cox, to predict NSCLC patients prognosis
   and identify the key biomarkers for prognosis, thereby providing more
   reliable biomarkers for personalized treatment. Therefore, we utilized
   the GA-EnCox (Cox elastic-net combined with genetic algorithm)
   algorithm for screening from 364 LM-related genes (Fig. [78]3A). This
   algorithm aims to combine the global search capability of genetic
   algorithms with the feature selection advantages of the Cox elastic-net
   model to find the optimal combination of prognostic biomarker genes. As
   shown in Fig. [79]3B, the fitness of each chromosome in GA increased
   gradually and reached stability. After iterating 100 times, the average
   C-index tends to stabilize, reaching the maximum value at iter = 173,
   C-index = 0.6542, corresponding to the selection of 190 LM-related
   genes. Increasing the L1-penalty weight in EnCox for gene selection,
   five genes with non-zero regression coefficients in EnCox were
   identified as the final LM-related prognostic genes: LYPD3, KRT8,
   CCT6A, PSMB7, and HMGA1. The regression coefficients of these genes, as
   shown in Fig. [80]3C, indicate their significant role in predicting
   prognosis in LUAD patients. To validate the prognostic value and
   biological significance of the lactate metabolism-related key genes
   identified, we further analyzed these five genes.

Fig. 3.

   [81]Fig. 3
   [82]Open in a new tab

   Identification and validation of lactate metabolism-related prognostic
   genes in LUAD patients. (A) Workflow diagram illustrating the
   application of the GA-EnCox algorithm to identify optimal lactate
   metabolism-related genes for prognosis in LUAD patients. (B) Plot of
   fitness (C-index) across iterations of the GA-EnCox algorithm, showing
   stabilization after approximately 100 iterations. (C) Bar graph showing
   the regression coefficients of the five selected lactate
   metabolism-related genes (LYPD3, KRT8, CCT6A, PSMB7, HMGA1). (D)
   Kaplan–Meier survival curves stratified by risk groups in the
   TCGA-LUAD, [83]GSE26939, and [84]GSE30129 datasets. High-risk groups
   exhibited significantly lower survival rates compared to low-risk
   groups.

   To begin, we stratified patients into risk groups based on the
   expression of five genes in the TCGA-LUAD, [85]GSE26939, and
   [86]GSE30129 cohorts, followed by the generation of Kaplan–Meier
   survival curves (Fig. [87]3D). The findings indicated that patients in
   the low-risk group exhibited significantly higher survival rates
   compared to those in the high-risk group, with the differences being
   highly statistically significant. These results validate the stable
   predictive capability of our selected gene combination for the
   prognosis of NSCLC patients. Further, we conducted univariate and
   multivariate Cox regression analyses on TCGA-LUAD, [88]GSE13123,
   [89]GSE26939 cohorts. As shown in Table [90]1, in the univariate Cox
   regression analysis, LM risk score (calculated based on the expression
   of five LM-related genes) demonstrated statistical significance in all
   three cohorts. For example, in TCGA-LUAD, patients in the high LM risk
   group exhibited a significantly increased risk of mortality (HR = 1.97,
   95% CI 1.46–2.67, p < 0.0001), nearly doubling the risk compared to the
   low-risk group. In [91]GSE26939, LM risk score is the only
   statistically significant variable in univariate Cox regression. These
   findings indicate that the LM risk score effectively predicts patient
   survival outcomes. In the multivariate Cox regression analysis, as
   shown in Table [92]1, LM risk score still demonstrated statistical
   significance and showed the strongest prognostic differentiation
   significance across all three cohorts. To summarize, whether in
   univariate Cox analysis or multivariate Cox analysis, the survival risk
   calculated by the prognostic model consisting of the five prognostic
   genes outperformed clinical or pathological information in predicting
   patient prognosis. This suggests that incorporating the LM risk score
   alongside existing clinical characteristics can enhance the predictive
   capability for patient prognosis.

Table 1.

   Univariate and multivariate cox regression analysis of prognostic
   factors in NSCLC cancer patients.
   Variable Univariable cox Multivariable cox
   HR p value HR p value
   TCGA-LUAD
    Age  > 60/ < 60 1.14 (0.83–1.58) 0.421 1.3 (0.93–1.8) 0.122
    M_stage  > M0/M0 1.02 (0.74–1.4) 0.927 1.01 (0.73–1.39) 0.969
    T_stage  > T2/ < = T2 1.53 (1.11–2.13) 0.010 0.98 (0.69–1.41) 0.932
    N_stage  > N0/N0 2.58 (1.93–3.47)  < 0.001 1.3 (0.81–2.1) 0.277
    Grade  > II/ < = II 2.91 (2.14–3.95)  < 0.001 2.03 (1.21–3.4) 0.007
    LM_risk High_risk/Low_risk 1.97 (1.46–2.67)  < 0.001 1.69 (1.23–2.31)
   0.001
   [93]GSE13213
    Age  > 60/ < 60 1.04 (0.59–1.83) 0.891 1.42 (0.75–2.69) 0.282
    Gender F/M 1.36 (0.77–2.39) 0.286 1.1 (0.58–2.06) 0.777
    Stage  > = II/ < II 2.54 (1.45–4.45) 0.001 1.31 (0.43–3.93) 0.636
    T_stage  > T1/T1 1.54 (0.87–2.73) 0.142 1.19 (0.65–2.18) 0.577
    N_stage  > N0/N0 2.71 (1.53–4.8) 0.001 2.28 (0.73–7.14) 0.157
    LM_risk High_risk/Low_risk 3.23 (1.75–5.93)  < 0.001 3.15 (1.64–6.05)
   0.001
   [94]GSE26939
    Age  > 60/ < 60 1.69 (0.93–3.07) 0.086 1.35 (0.7–2.59) 0.375
    Gender F/M 1.63 (0.96–2.78) 0.069 1.44 (0.78–2.64) 0.241
    Grade III/ < III 1.27 (0.75–2.17) 0.377 1.22 (0.7–2.13) 0.491
    Stage  > = II/ < II 1.64 (0.96–2.79) 0.069 1.73 (0.99–3.03) 0.056
    Smoke 1/0 1 (0.43–2.34) 1.000 0.79 (0.32–1.98) 0.617
    Risk High_risk/Low_risk 1.81 (1.06–3.09) 0.029 1.88 (1.07–3.3) 0.027
   [95]Open in a new tab

   The most significant prognostic factor in are highlighted in bold.

   We compared the prognostic model constructed based on these five genes
   with other currently popular prognostic gene models to highlight the
   performance of the GA-EnCox model. The comparison was conducted across
   six different NSCLC datasets: TCGA-LUAD, [96]GSE13213, [97]GSE26939,
   [98]GSE30219, [99]GSE31210, and [100]GSE41271 (Table [101]2). We used
   the C-index (concordance index) as an indicator of the model’s
   discriminatory power, ranging from 0.5 to 1, with values closer to 1
   indicating stronger predictive ability.

Table 2.

   Comparative performance rankings of prognostic models across multiple
   NSCLC gene expression datasets.
   TCGA-LUAD [102]GSE13213 [103]GSE26939 [104]GSE30219 [105]GSE31210
   [106]GSE41271 Average_rank
   ^[107]22 0.539 0.538 0.547 0.558 0.656 0.580 7.67
   ^[108]21 0.662 0.594 0.500 0.620 0.691 0.558 6.50
   ^[109]29 0.658 0.680 0.647 0.765 0.704 0.685 3.00
   ^[110]25 0.668 0.644 0.631 0.701 0.714 0.644 3.50
   ^[111]24 0.669 0.573 0.583 0.668 0.720 0.604 4.67
   ^[112]20 0.643 0.635 0.625 0.680 0.671 0.630 5.50
   ^[113]3 0.660 0.678 0.665 0.654 0.785 0.649 3.17
   Ours 0.663 0.690 0.665 0.707 0.724 0.674 2.00
   [114]Open in a new tab

   The best and second best results are highlighted in bold and underlined
   respectively.

   The comparison results indicated that our model performed outstandingly
   across most datasets, achieving an average ranking of 2.00, making it
   the top-performing model among all. Following closely was the model
   bulit by^[115]29, with an average ranking of 3.00. Xu et al.^[116]3 and
   Zhao et al.^[117]25 also showed good performance, with an average
   ranking of 3.17 and 3.50, respectively. Overall evaluation suggests
   that the prognostic model constructed based on LYPD3, KRT8, CCT6A,
   PSMB7, and HMGA1 consistently demonstrated excellent performance across
   multiple independent validation datasets, with average rankings
   significantly surpassing those of other models (Table [118]2). These
   results underscore the accuracy and efficacy of the GA-EnCox algorithm,
   indicate that the prognostic model composed of these five genes
   possesses stable predictive capability, providing a potentially
   powerful tool for prognostic assessment in NSCLC.

Exploring differences in gene expression, pathways, and immune infiltration
in risk groups in LUAD

   To explore further the differences between high-risk and low-risk
   groups in LUAD, we analyzed the differential genes, pathways, tumor
   purity, and immune infiltration between the high-risk and low-risk
   groups. Differential expression analysis revealed significant
   variations in gene expression between the high-risk and low-risk
   groups, which identified 840 downregulated and 842 upregulated genes in
   the high-risk group. The distribution of these differentially expressed
   genes can be visually observed in the volcano plot (Fig. [119]4A). For
   instance, genes such as KRT81, REG4, LYPD3, ANLN, and UCA1 were notably
   upregulated in the high-risk group, whereas genes like TAC4, BTG2 and
   PGC were significantly downregulated in the low-risk group. These
   differences in gene expression may be closely related to changes in
   lactate metabolism activity. Pathway enrichment analysis of the
   differentially expressed genes (Fig. [120]4B) indicated that the
   high-risk group exhibited significantly elevated activity in pathways
   related to cell proliferation, such as cell cycle and DNA replication.
   Conversely, the activity of pathways associated with immune response
   and cell adhesion, like cell adhesion molecule and cytokine-cytokine
   receptor interaction, was significantly reduced, indicating a more
   malignant tumor phenotype and notably weakened tumor immunity in the
   high-risk group.。

Fig. 4.

   [121]Fig. 4
   [122]Open in a new tab

   Analysis of risk groups identified by 5-LM-gene prognostic model in
   lung adenocarcinoma patients. (A) Volcano plot representing the
   differentially expressed genes between high-risk and low-risk groups.
   The cutoff for adjusted p-value is 0.01. The plot highlights 840
   downregulated genes and 842 upregulated genes in the high-risk group
   compared to the low-risk group. (B) KEGG pathway enrichment analysis of
   differentially expressed genes between the high-risk and low-risk
   groups. Pathways enriched by upregulated genes are shown in red, while
   those enriched by downregulated genes are shown in blue. (C) Boxplots
   for ESTIMATE Score, Immune Score, and Stromal Score comparing the tumor
   microenvironment characteristics between high-risk and low-risk groups.
   (D) Boxplots showing the proportions of various immune cell types
   between high-risk and low-risk groups in TCGA-LUAD as assessed by the
   CIBERSORTx algorithm. (E) Boxplots showing the proportions of monocytes
   and cDC1 in NSCLC tissues between high-risk and low-risk groups from He
   Fan et al. scRNA-Seq cohort. (F) Boxplots showing the proportions of
   monocytes and macrophages in NSCLC tissues between high-risk and
   low-risk groups from Adams Kaminski et al. scRNA-Seq cohort.
   Statistical significance was determined using Wilcoxon rank-sum test:
   *P < 0.05, **P < 0.01, ***P < 0.001, ****P < 0.0001.

   To explore further the mechanisms underlying this prognostic
   difference, we performed further analyses on immune infiltration.
   First, substantial differences were observed between the high-risk
   group and the low-risk group in terms of tumor malignancy and immune
   cell infiltration. Evaluation of the tumor purity using the ESTIMATE
   (Fig. [123]4C) revealed that the ESTIMATE Score, Immune Score, and
   Stromal Score of the low lactate group were significantly higher than
   those of the high lactate group. This indicates that the TME of
   patients in the low lactate group harbors a higher content of immune
   and stromal cells, contributing to lower tumor malignancy. Further
   analysis using the CIBERSORTx algorithm to assess the infiltration
   levels of various immune cells (Fig. [124]4D) showed that the low risk
   group exhibited higher proportions of most immune cell types,
   particularly key immune cells such as B cells, T cells, and DCs. The
   deconvolution results of [125]GSE50081 and [126]GSE68465 (Supplementary
   Fig. [127]1). Also showed that in the low-risk group, tumor
   immune-promoting cell infiltration such as DC, memory CD4 + T cells was
   higher than that in the high-risk group, while macrophage infiltration
   was higher in the high risk group. To further verify the difference in
   the immune cell infiltration between the high-risk group and the
   low-risk group, we analyzed two independent scRNA-Seq cohort from Adams
   Kaminski et al. 2020^[128]30 and He Fan et al. 2021^[129]31. We
   calculated potential risk from the pseudo count of two scRNA-Seq cohort
   using the 5-gene-prognosis model obtained from TCGA-LUAD, and
   classified patients in them into high-risk and low-risk groups. We
   found that in two independent scRNA-Seq cohort, the proportion of
   monocyte and DC germline infiltration in the low-risk group was
   significantly higher than that in the high-risk group
   (Figs. [130]4E,F), while the proportion of macrophage infiltration in
   the low-risk group was significantly lower than that in the high-risk
   group (Figs. [131]4E,F). These findings further confirmed that lactate
   metabolism-related prognosis risk was related to monocyte lineages such
   as macrophages and DC’s infiltration.

   To explore further the relationship between these 5 lactate
   metabolism-related genes and immune cell infiltration in the tumor
   tissues of NSCLC patients, we utilized an independent scRNA-Seq dataset
   ([132]GSE207422). The distribution of different cell types was
   visualized using the UMAP dimensionality reduction method
   (Supplementary Fig. [133]1). Subsequently, we analyzed the correlation
   between the five genes and the infiltration proportions of different
   immune cell subsets (Supplementary Fig. [134]1). CCT6A, HMGA1, and
   PSMB7 were significantly positively correlated with the number of
   endothelial cells and the infiltration proportion of
   ALDH3A1 + CD8 + Treg cells, while CCT6A was significantly negatively
   correlated with neural crest cell infiltration. These findings suggest
   that the selected lactate metabolism-related genes are closely
   associated with immunosuppressive cells infiltration, further
   supporting the immunosuppressive role of these genes in tumor immune
   regulation and demonstrating the biological significance of the
   GA-EnCox algorithm.

The five proposed prognostic biomarkers and prognostic model guide small
molecule therapy and immunotherapy

   Regarding drug response, we analyzed the relationship between risk and
   the sensitivity to various drugs (IC50). Results indicated that risk
   score significantly influenced patient drug response (Supplementary
   Fig. [135]2). For Thapsigargin_180, risk score was significantly
   negatively correlated with drug response (r = − 0.593, p = 3.31e−49),
   suggesting that higher risk score corresponds to greater sensitivity to
   this drug. Similar negative correlations were observed for
   Fulvestrant_1414, Gefitinib_1010, GSK1904529A_202, indicating that
   patients with higher lactate metabolism-related risk score might
   respond better to these drugs. Conversely, for I.BET.762_275, risk
   score was significantly positively correlated with drug response
   (r = 0.404, p = 3.09e-21), suggesting that higher lactate
   metabolism-related risk score correlates with lower sensitivity to this
   drug. This implies that high risk score might increase resistance to
   certain drugs, affecting treatment outcomes. Therefore, five
   biomarkers-related risk score could serve as a predictor for patient
   response to specific drugs, providing a basis for personalized
   treatment strategies.

   To evaluate the predictive value of risk score for immune therapy
   response, we conducted a TIDE (Tumor Immune Dysfunction and Exclusion)
   analysis. The results showed (Fig. [136]5A) that the high-risk group
   had significantly higher TIDE response score than the low-risk group,
   indicating that patients with high lactate metabolism-related risk
   score might be less sensitive to immune checkpoint inhibitors and other
   immune therapies. Overall, a significant positive correlation was
   observed between the risk score and the TIDE response score, which
   further substantiates this conclusion. (Fig. [137]5B). In two
   independent LUAD immunotherapy response cohorts, the predicted risk
   scores for treatment responders were significantly lower than those for
   non-responders, further validating the predictive ability of the risk
   score for immunotherapy response (Fig. [138]5A). Further analysis
   revealed that (Supplementary Fig. [139]2) the immune exclusion score
   was significantly lower in the low-risk group, suggesting that their
   TME is more conducive to immune cell infiltration and function.
   Conversely, the high-risk group had significantly higher levels of
   MDSC, suggesting that high lactate metabolism-related risk score may
   promote the activity of immunosuppressive cells, inhibiting anti-tumor
   immune responses. Four of the five prognostic biomarkers (CCT6A, HMGA1,
   KRT8, LYPD3) exhibited significantly higher expression in the
   non-responsive group compared to the responsive group. (Fig. [140]5C).
   We validated the relationship between the expression levels of these
   five key genes and the efficacy of immune checkpoint inhibitor (aPD1)
   therapy in NSCLC patients. In three independent PD1 ICB treatment
   cohorts, CCT6A and HMGA1 were all highly expressed in the MPR (pCR)
   group compared to the NMPR group, thereby validating them as biomarkers
   for immunotherapy response, indicative of immunotherapy inhibition
   (Fig. [141]5D). In summary, patients with low lactate
   metabolism-related risk score showed advantages in overall survival due
   to better immune cell infiltration, higher anti-tumor immune activity,
   and better responses to certain drugs and immune therapies. These
   findings highlight the potential of lactate metabolism intensity as an
   important biomarker for predicting prognosis and personalizing
   treatment in NSCLC patients.

Fig. 5.

   [142]Fig. 5
   [143]Open in a new tab

   Analysis of the risk score and five prognostic biomarkers in relation
   to immune therapy response and immune cell infiltration. (A) Boxplots
   illustrating the relationship between risk groups and TIDE response
   scores generated from TCGA-LUAD and the relationship between PD1 immune
   checkpoint inhibitor response and risk scores generated from
   [144]GSE135222 and [145]GSE126044 (DCB: durable clinical benefit, NDB:
   non-durable clinical benefit). (B) Scatter plot showing the linear
   correlation between the 5-gene risk score and TIDE response scores.
   Pearson correlation coefficients (r) and p-values are shown. (C)
   Boxplot showing the expression level differences of the five LM-related
   prognostic biomarkers between treatment response groups. (D) Boxplots
   illustrating the expression levels of the five LM-related prognostic
   biomarkers in response to immune checkpoint inhibitor (aPD1) therapy in
   LUAD patients, comparing major pathologic response (pathologic complete
   response) (MPR(pCR)) and non-major pathologic response (NMPR) groups in
   [146]GSE207422, [147]GSE135222, [148]GSE126044 bulk RNA-seq data.
   Statistical significance was determined using Wilcoxon rank-sum test:
   *P < 0.05, **P < 0.01, ***P < 0.001.

   In summary, through comprehensive studies on prognosis, survival
   analysis, immune infiltration, and immunotherapy response, we validated
   the accuracy and biological significance of the lactate
   metabolism-related genes screened by the GA-EnCox algorithm. These
   genes not only effectively predict the survival prognosis of NSCLC
   patients but are also closely associated with the tumor immune
   microenvironment and indicate the efficacy of immunotherapy.

Verification of the lactate metabolism-related prognostic biomarkers in
cell-lines and NSCLC patient samples

   To validate the differences of lactate metabolism-related prognostic
   biomarkers between NSCLC patient tissues and normal tissues at protein
   level, we obtained the immunohistochemical (IHC) quantification results
   of the five prognostic biomarkers in from the Human Protein Atlas
   (HPA). As shown in Fig. [149]6A, HMGA1, KRT8, PSMB7, and CCT6A were
   highly expressed in tumor tissues, with expression levels markedly
   higher than those in normal lung tissues. Although LYPD3 exhibited
   relatively low protein expression level in both tumor and normal lung
   tissues, its expression in tumor tissues was still higher than those in
   normal lung tissues.

Fig. 6.

   [150]Fig. 6
   [151]Open in a new tab

   Validation of the five prognostic biomarkers using IHC. (A) Validation
   of the differences of lactate metabolism-related prognostic biomarkers
   between NSCLC patient tissues and normal tissues at protein level, IHC
   results and average TPM of the five prognostic biomarkers were obtained
   from the Human Protein Atlas (HPA). (B) Kaplan–Meier survival curves
   stratified by HMGA1 expression at transcriptome or protein level in the
   TCGA-LUAD and in-house NSCLC cohort I. (C) IHC results showing the
   expression level differences of HMGA1 between high risk and low risk
   patients from in-house NSCLC cohort I. (D) Boxplots illustrating the
   protein levels of KRT8 comparing high NSCLC grade and low NSCLC grade
   in in-house NSCLC cohort II. Statistical significance was determined
   using t-test: *P < 0.05, **P < 0.01, ***P < 0.001.

   Furthermore, to validate the prognostic effect of the protein
   expression of the five lactate metabolism-related biomarkers and their
   relationship with clinical-pathological variables (e.g., grade and
   stage) in well-defined, consistently treated in-house cohorts, we
   collaborated with Xuzhou Central Hospital and obtained 18 tumor tissue
   samples from NSCLC patients from 2021.1 to 2023.12. We divided the
   patients into two cohorts, cohort I contained 10 patients with accurate
   survival OS and OS. Time were used to verify the prognostic effect of
   prognostic biomarkers, and cohort II contained 8 patients with detailed
   pathological information were used to analyze the association between
   prognostic biomarkers and grade and stage. Information of these two
   cohorts is in Supplementary Tables [152]2 and [153]3. IHC staining and
   quantification were performed for HMGA1 in cohort I and KRT8 in cohort
   II. In cohort I, High protein abundance of HMGA1 is significantly
   associated with poor prognosis (Fig. [154]6B, p = 0.012). RNA-Seq data
   from TCGA-LUAD also demonstrated that HMGA1 significantly affected
   patient survival (Fig. [155]6B, p = 0.0016). In the comparison of HMGA1
   staining results between patients within different overall survival
   time in cohort I, it was evident that HMGA1 expression was
   significantly higher in tissues of NSCLC patients with shorter overall
   survival time (Fig. [156]6C). Additionally, in cohort II, KRT8 protein
   expression was significantly associated with tumor grade, with tumor
   grade higher than II showing significantly higher KRT8 protein
   expression than those with other patients (Fig. [157]6D).

   In summary, we validated the predictive strength of the biomarkers in
   the prognostic model and their relationships with tumor stage and grade
   in both publicly available IHC data and an in-house NSCLC patient
   cohort. These results highlight the prognosis prediction effectiveness
   of our lactate metabolism-related prognostic model and its associated
   biomarkers.

   In previous study, HMGA1 was verified to promote the proliferation of
   the NSCLC cell line^[158]32. Here, we further validated its effects on
   other malignant phenotypes by conducting colony formation, invasion,
   migration, and wound healing assays in NSCLC cell lines. In the A549
   cell model, the inhibition of HMGA1 expression (A549-Si-1 and
   A549-Si-2) significantly decreased the capabilities of colony
   formation, invasion, migration, and wound healing compared with the
   control group (A549-SINC) (Fig. [159]7A). This indicates that HMGA1
   plays a crucial role in promoting these cellular behaviors. Statistical
   analysis further confirmed these differences, with all four indices
   significantly lower in the siRNA interference group compared to the
   control group. In the case of overexpressed HMGA1 (A549-OE), the
   abilities of colony formation, invasion, migration, and wound healing
   were significantly higher than those in the control group (A549-HA,
   Fig. [160]7B). This further demonstrates that high expression of HMGA1
   enhances these cellular behaviors. Similarly, most indices showed
   statistically significant differences, highlighting the pivotal role of
   HMGA1 in regulating these cellular functions. The tumor phenotype
   experiments in H1299 and PC9 with knocked out and overexpressed HMGA1
   yielded similar results (Supplementary Figs. [161]3–[162]5). These
   findings further support the potential critical role of HMGA1 in cancer
   progression, including tumor growth, metastasis, and invasion.

Fig. 7.

   [163]Fig. 7
   [164]Open in a new tab

   The effect of HMGA1 on colony formation, invasion, migration, and wound
   healing in A549 cells. (A) Analysis of A549 cells with suppressed HMGA1
   expression. Colony formation, invasion, migration, and wound healing
   percentages of control cells (A549-SiNC) are compared to HMGA1
   knockdown cells (A549-Si-1 and A549-Si-2) using siRNA. (B) Analysis of
   A549 cells with HMGA1 overexpression. Colony formation, invasion,
   migration, and wound healing percentages of control cells (A549-HA) are
   compared to cells with increased HMGA1 expression (A549-OE). Results
   are presented as mean ± SD. Statistical significance was determined
   using appropriate tests (e.g., t-test or ANOVA): **P < 0.01,
   ***P < 0.001.

Discussion

   Our study systematically revealed the significant role of lactate
   metabolism in non-small cell lung cancer (NSCLC) through multi-level
   analysis, elucidating the impact of associated genes on patient
   prognosis. First, we extracted gene sets related to lactate production
   and transportation from the MSigDB and GeneCard database. After
   rigorous filtering, we identified a final set of 364 LM-related genes.
   In independent single-cell sequencing cohorts of malignant epithelial
   cells, we found that the UCell scores of these LM-related genes were
   significantly positively correlated with malignant cells, epithelial
   cells and stromal cells. This indicates that lactate metabolism is not
   only active in tumor cells but also plays an essential role in
   non-malignant cells under the tumor microenvironment. At the patient
   level, we applied 364 lactate metabolism-related genes to multiple
   independent cohorts. Based on the GSVA scores of the LM gene set
   (LM-strength), patients were categorized into two groups: high lactate
   metabolism strength (LM-high) and low lactate metabolism strength
   (LM-low). Survival analysis results indicated that patients in the
   LM-low group had significantly better prognosis compared to the LM-high
   group, underscoring the reliability of tumor-specific lactate
   metabolism-related biomarkers derived from scRNA-Seq a prognostic
   indicator.

   To identify the key genes that play a core role in lactate metabolism
   and prognosis, we employed the GA-EnCox algorithm to screen five
   significant prognostic genes from 364 lactate metabolism-related genes:
   CCT6A, HMGA1, KRT8, PSMB7, and LYPD3.

   To address the challenges in NSCLC prognostic prediction and biomarker
   screening, we integrated bulk transcriptomic and single-cell
   transcriptomic data while considering the targetability of prognostic
   biomarkers. We developed a tumor prognostic prediction model using a
   genetic algorithm and an Elastic Cox model. This approach formulates
   the prognostic biomarker optimization problem based on the Elastic Cox
   model, with tumor prognosis indicators as the optimization objective.
   The genetic algorithm is employed to solve the above optimization
   problem for optimizing the selection of key prognosis biomarkers
   iteratively, resulting in the GA-EnCox prognostic model, which
   demonstrates superior prognostic prediction performance compared to the
   existing tumor prognostic models across multiple NSCLC datasets. The
   study validates the effectiveness of the NSCLC prognostic biomarkers
   identified by the proposed model from various perspectives, including
   immune infiltration and drug response.

   The prognostic model constructed based on these five genes demonstrated
   excellent performance in multiple independent datasets, with survival
   analysis showing that patients in the low-risk group had significantly
   higher survival rates than those in the high-risk group. Multivariate
   Cox regression analysis further confirmed the independent prognostic
   value of this model. Compared to other lactate metabolism-related
   prognostic models, our model shows a clear advantage in predictive
   power, with the highest average C-index. This may be attributed to our
   optimized gene selection using the GA-EnCox algorithm and the critical
   roles of the selected genes in tumor metabolism and progression.
   Previous studies often focused on individual genes or did not fully
   consider the lactate metabolism pathway, our research systematically
   analyzed the overall impact of lactate metabolism, providing more
   accurate prognostic predictions.

   Our findings indicate that these five genes have been confirmed to be
   associated with malignant tumor phenotypes in NSCLC, demonstrating the
   validity of our approach. We are the first to propose these five genes
   as a collective set of biomarkers, among which CCT6A is introduced for
   the first time as a prognostic biomarker. Specifically, silencing CCT6A
   inhibited the proliferation and migration of LUAD cells and increased
   apoptosis rates. CCT6A interacts with the STAT1 protein, protecting
   STAT1 from ubiquitin-mediated degradation, enhancing its stability, and
   promoting the transcription of hexokinase 2 (HK2), thereby stimulating
   aerobic glycolysis and progression of LUAD^[165]33. Additionally,
   aberrant expression of KRT8 is closely related to tumor progression,
   including drug resistance, cell migration, and cell
   adhesion^[166]34,[167]35. PSMB7 belongs to the human proteasome gene
   family (PSM), which consists of 49 genes and plays a critical role in
   cancer proteasome inhibition. PSMB7 promotes the proliferation,
   metastasis, and bortezomib (Bortezomib) resistance of multiple
   myeloma^[168]36. This evidence supports the potential of PSMB7 as a
   therapeutic target. LYPD3, also known as C4.4A, is a membrane protein
   partially anchored to the cell surface through
   glycosylphosphatidylinositol (GPI). Studies have shown that LYPD3 is
   highly expressed in several human malignancies and is associated with
   poor prognosis. Overactivation of the JUP/AGR2/LYPD3 signaling axis
   contributes to the regulation of melanoma cell stemness and promotes
   glycolysis through a pathway dependent on glucose transporter
   1^[169]37. HMGA1 is an abundant non-histone chromatin protein
   associated with embryonic development, cancer, and cellular senescence.
   HMGA1 reduces the sensitivity of esophageal squamous cell carcinoma
   (ESCC) to ferroptosis by upregulating the expression of SLC7A11,
   maintaining intracellular glutathione homeostasis, and inhibiting the
   accumulation of malondialdehyde (MDA), thereby restricting
   ferroptosis^[170]38. Furthermore, HMGA1 interacts with the
   transcription factor SP1 to enhance its binding to the TKT promoter,
   thereby activating the pentose phosphate pathway and promoting the
   proliferation and tumor growth of ESCC cells^[171]6. In breast cancer,
   HMGA1 mediates the epithelial-mesenchymal transition (EMT) of invasive
   breast cancer through the Wnt/β-catenin pathway^[172]39. These studies
   support our findings that high HMGA1 expression is associated with
   immune therapy response, suggesting that HMGA1 may serve as a potential
   biomarker for predicting the efficacy of immunotherapy.

   To explore further the mechanisms behind the prognostic differences
   associated with lactate metabolism-related risk, we analyzed the
   differences between the high-risk and low-risk groups in terms of gene
   expression, immune infiltration, drug sensitivity, and immunotherapy
   response. In the differential gene expression analysis, for instance,
   genes such as KRT81, REG4, LYPD3, ANLN, and UCA1 were notably
   upregulated in the high-risk group, whereas genes like TAC4, BTG2 and
   PGC were significantly downregulated in the low-risk group. Pathway
   enrichment analysis of the differentially expressed genes revealed
   significant activation of tumor proliferation-related pathways in the
   high-risk group, while pathways related to cell adhesion and immune
   response were notably suppressed. Regarding immune infiltration, the
   high-risk subtype exhibited lower tumor purity scores, immune scores,
   and stromal scores compared to the low-risk subtype, indicating higher
   tumor purity and lower infiltration of immune and stromal cells in the
   high-risk phenotype. Further deconvolution analysis in bulk RNA-Seq and
   correlation analysis in independent scRNA-Seq cohorts revealed
   significantly higher infiltration of M0 macrophages in the high-risk
   subtype, whereas the low-risk subtype showed higher infiltration of
   CD4 + memory T cells. We also analyzed the differences in immune
   infiltration between the high and low risk groups in multiple scRNA-Seq
   databases. Macrophages and regulatory T cells in the high risk group
   were generally highly infiltrated, while monocytes and DC in the low
   risk group were generally highly infiltrated.

   In drug sensitivity analysis, we found a significant correlation
   between risk score and the IC50 values of various drugs. Patients in
   the high-risk group were more sensitive to drugs like Thapsigargin_180,
   Fulvestrant_1414, suggesting that these drugs may be more effective
   against patients with high lactate metabolism-related risk phenotypes.
   Conversely, patients in the low-risk group were more sensitive to drugs
   like I.BET.762_275, ACY.1215_264, indicating potential efficacy against
   patients with low lactate metabolism-related risk phenotypes. This
   offers the possibility of personalized drug selection based on risk
   status. Regarding immunotherapy response, patients in the low-risk
   group were more likely to respond to immune checkpoint inhibitors. This
   observation aligns with their tumor microenvironment characterized by
   lower levels of immune exclusion score and MDSC. Among the five
   biomarkers, four exhibited significant differences in the immune
   response subgroup. Notably, CCT6A and HGMA1 were confirmed in an
   independent NSCLC aPD-1 treatment cohort through bulk RNA-Seq,
   indicating that both serve as biomarkers for immunotherapy response
   inhibition.

   Finally, we further validated these lactate metabolism-related
   prognosis biomarkers at protein level in public database and in-house
   NSCLC patient cohorts. We validated the predictive strength of the
   biomarkers in the prognostic model and their relationships with tumor
   stage and grade. This further validates the accuracy of our prognostic
   model and the clinical significance of the prognostic biomarkers.

   This study has certain clinical significance. By quantifying the
   expression of these five biomarkers in the tumor tissue of patients, it
   is possible to predict survival risk score and immunotherapy response
   to a certain extent, thereby guiding therapeutic strategies. Moreover,
   combination therapy represents a promising direction. Currently,
   several clinical trials targeting cancers such as renal carcinoma and
   melanoma are underway, combining biomarker inhibitors with existing
   immune checkpoint blockade (ICB) therapies, such as PD1 or
   PDL1-targeted immune checkpoint inhibitors with VEGFR-targeted
   TKIs^[173]40. This indicates that combining lactate metabolism-related
   biomarkers with immune checkpoint inhibitors might be a highly
   promising therapeutic strategy. However, these biomarkers and the
   prognostic models constructed based on them still have many limitations
   in clinical practice: The tumor microenvironment is highly complex,
   with numerous factors influencing immune infiltration, the model based
   on only five genes is insufficient to comprehensively capture immune
   suppression and other patient phenotypes. Other malignant phenotypes
   associated with drug resistance also exist. Lactate-related prognostic
   biomarkers need to be combined with other biomarkers, requiring further
   research into new biomarkers.

   In summary, we proposed a novel workflow and methodology for
   identifying lactate-related prognostic biomarkers in tumors, and the
   identified biomarkers have been validated in multiple studies. In
   contrast to previous studies, which primarily focused on the function
   of individual genes or pathways, our work systematically integrates the
   correlation between lactate metabolism and NSCLC prognosis and obtained
   a set of biomarkers. By constructing a model that includes five key
   genes, we have improved the accuracy and reliability of prognostic
   predictions. The key genes identified—CCT6A, HMGA1, KRT8, PSMB7, and
   LYPD3—provide important clues for further exploration of lactate
   metabolism mechanisms and the development of new therapeutic
   strategies. Furthermore, this procedure exhibits general applicability
   and can be employed for the identification of any phenotype-related
   prognostic biomarkers in any type of cancer. This study was expected to
   provide important reference value for prognosis assessment and
   personalized treatment decision of cancer patients. However, this study
   has some limitations. The specific regulatory mechanisms of lactate
   metabolism on the immune microenvironment require further
   investigation, and the prognostic model established needs to be
   validated by more prospective clinical trials。

Methods

   To address the challenges in NSCLC prognostic prediction and biomarker
   screening, we developed a tumor prognostic prediction model using a
   genetic algorithm and an Elastic Cox model. This approach formulates
   the prognostic biomarker optimization problem based on the Elastic Cox
   model, with tumor prognosis indicators as the optimization objective,
   while the genes related to lactate metabolism as the optimization
   variables. Selected biomarkers should meet the following constraints:
   highly express in malignant cell subsets in multiple scRNA-Seq cohorts,
   have correlation to lactate metabolism strength and accord with
   targetability. The genes used to calculate prognostic risk in this
   model serve as prognostic biomarkers. The genetic algorithm is employed
   to solve the above optimization problem for optimizing the selection of
   key prognosis biomarkers iteratively, resulting in the GA-EnCox
   prognostic model.

Selection of lactate metabolism-related genes

   Firstly, lactate production and transport-related gene sets were
   obtained from MSigDB (Shown in Supplementary Table [174]1).
   Additionally, genes with a relevance score greater than 10 for
   “lactate” were retrieved from GeneCard, resulting in a basic LM gene
   set comprising 288 genes. Within three independent NSCLC scRNA-Seq
   cohorts: Wu Zhou et al cohorts, Goveia et al cohorts and Maynard et al
   cohorts^[175]26–[176]28, genes showing a significant positive
   correlation with the basic LM UCell score in malignant cells (PCC > 0,
   p-adjust < 1e−4) were selected. Concurrently, Seurat FindMarker was
   utilized to identify markers highly expressed in malignant cells
   compared to other cells (logFC > 0.15, min.pct = 0.1, p-adjust < 1e−4).
   The intersection of these two criteria formed the LM-related gene set.
   To ensure gene intervenability, markers from immune cells (e.g., B
   cells, DCs) were excluded. Ultimately, intersecting LM-related gene
   sets from three independent NSCLC scRNA-Seq cohorts, a subset of 364
   intervenable LM-related genes was identified.

Bulk RNA-Seq and microarray data processing:

   RNA-Seq count matrices for TCGA-LUAD were obtained from the GDC data
   portal, and patient survival and clinical information were sourced from
   the UCSC Xena database (n = 596). The datasets [177]GSE13213 (n = 117),
   [178]GSE26939 (n = 115), [179]GSE30219 (n = 85), [180]GSE31210
   (n = 226), and [181]GSE41271 (n = 182) were utilized to validate the
   prognostic classification capabilities of the LM-related gene set and
   the LM-related gene prognostic model. The RNA-Seq data of NSCLC
   patients before PDL1 ICB treatment from [182]GSE126044, [183]GSE135222,
   [184]GSE207422 were used to validate the relationship between the LM
   risk score and the response to PDL1 ICB treatment, with all
   aforementioned data downloaded from GEO. LM-strength was quantified as
   the GSVA^[185]41,[186]42 score of LM-related genes. Prior to
   calculating the GSVA enrichment scores, the count matrices were
   transformed to log2(TPM + 1). Differential expression between high-risk
   and low-risk samples was analyzed using the DeSeq2^[187]43 method
   (FDR < 0.01). ESTIMATE was employed to calculate tumor purity, stromal
   scores, and immune infiltration scores, thereby evaluating the content
   of immune and stromal cells in the tumor microenvironment.
   CIBERSORTx^[188]44 was used to estimate the proportions of various
   immune cell infiltrations (e.g., B cells, T cells, monocytes) from bulk
   RNA-Seq data. OncoPredict^[189]45 was employed to predict the response
   of each LUAD sample to small-molecule drugs from the GDSC and CTRP
   databases (IC50), while TIDE was employed to estimate the levels of
   immune checkpoint abundance, T cell functionality, and immunotherapy
   response.

ScRNA-Seq data processing

   The processing of single-cell RNA sequencing (scRNA-Seq) data was
   conducted using the R Seurat package^[190]46. For all of scRNA-Seq
   data, cells were first filtered to retain those with mitochondrial gene
   content of less than 20%, ribosomal gene content of less than 50%, gene
   count greater than 200, and a housekeeping gene score (mean expression
   of ACTB, GAPDH, and MALAT1) greater than 1^[191]47. Normalization was
   performed using SCTransform under default parameters, and data from
   multiple patients were integrated using Harmony^[192]48
   (max.iter = 20). Clustering of the integrated expression matrix was
   then conducted with the top 50 principal components and a resolution of
   0.6.

   For [193]GSE207422, Cell type annotation of single-cell subsets was
   executed using both SingleR^[194]49 (HPCA data as reference) and
   ScType^[195]50 (“Lung” and “Immune System” as tissue) methods.
   LM-strength in scRNA-Seq was quantified by the UCell score of
   LM-related genes. Further analysis in [196]GSE207422 involved
   clustering T-NK cell and monocyte subsets separately (resolution = 0.3,
   with clustering results provided in Supplementary Fig. [197]1 and
   further subdividing subsets based on marker genes (selected by
   FindAllMarkers function under default parameters). For Adams Kaminski
   et al. 2020 and He Fan et al. 2021, Cell type annotations were curated
   from^[198]51.

   To calculate the correlation between gene expression from scRNA-Seq and
   the proportion of various cell types in the tumor microenvironment,
   AggregateExpression was used to compute the average gene expression
   across all cells, converted to log2(CPM + 1), and the Pearson
   correlation coefficient (PCC) was calculated relative to the
   proportions of different cell types in the TME.

Construction of GA-EnCox: A LM-related prognostic biomarker selection method

   For the 364 LM-related genes, the GA-EnCox algorithm was employed to
   select prognostic genes and construct a prognostic model. In the
   genetic algorithm, each bit of the chromosome corresponds to a
   prognostic biomarker (gene) to be selected, with a value of either 0 or
   1. The chromosome consists of 364 bits, where a value of 1 at a
   specific position indicates that the corresponding prognostic biomarker
   (gene) is selected. The initial population consists of 500 chromosomes,
   all of which are randomly generated. The algorithm flowchart is
   illustrated in Fig. [199]3A. The crossover strategy adopted was a
   two-point crossover, with a crossover probability of 0.5. The mutation
   strategy implemented was a bit-flip mutation, with a mutation
   probability of 0.2. The selection strategy used was tournament
   selection, and the number of iterations of the genetic algorithm was
   set to 200. For fitness calculation, the Cox Elastic-net model was
   employed to evaluate the survival risk of the selected LM genes.
   Compared to Lasso-Cox, the Cox Elastic-net model incorporates an L2
   regularization term into the partial likelihood function to balance
   model fitting precision and generalization capability. The partial
   likelihood function is as follows:
   graphic file with name 41598_2025_85620_Article_Equa.gif

   Inline graphic represents the partial likelihood function in Cox
   regression, Inline graphic denotes the regression coefficients, Inline
   graphic represents the regularization coefficient, and Inline graphic
   signifies the coefficient for the L1 penalty term in the regularization
   component. For the Cox Elastic-net model, parameters are set with
   Inline graphic =0.9 and Inline graphic =0.2. A fivefold
   cross-validation is performed on the TCGA-LUAD dataset using the
   average C-index as the fitness of each chromosome in GA, in which the
   C-index is calculated based on the predicted survival risk. The genes
   corresponding to the chromosome with the highest fitness throughout the
   iterations is selected to fit the model. Genes with non-zero variable
   coefficients are chosen as the final LM-associated prognostic genes.
   These LM-associated prognostic genes are used to calculate patient
   survival risk through linear regression. The Cox Elastic-net and
   genetic algorithm are implemented based on Python’s scikit-survival and
   DEAP^[200]52 libraries, respectively.

Tumor phenotyping experimental procedures

Patient samples collection

   Tissue specimens were collected from 16 patients with non-small cell
   lung cancer (NSCLC) at Xuzhou Central Hospital, Xuzhou, China, through
   surgical resections performed between January 2021 and December 2023.
   None of the patients had received preoperative radiotherapy,
   chemotherapy, or targeted therapy. We divided the patients into two
   cohorts. cohort I composed of 10 patients with accurate survival time
   was used to verify the prognostic effect of prognostic biomarkers;
   cohort II composed of 8 patients with detailed pathological stages was
   used to analyze the relationship between prognostic biomarkers and
   grade and stage. Tissue samples were fixed in formalin, trimmed into
   pieces measuring approximately 1.5 × 2.0 cm with a thickness of
   0.2–0.3 cm, and processed through dehydration and clearing steps to
   remove water and alcohol. The tissues were then embedded in paraffin,
   sectioned into 4–5 micron slices using a microtome, and mounted onto
   adhesive slides to produce unstained tissue slides for
   immunohistochemical analysis. Clinical and pathological characteristics
   of the patients are summarized in Supplemental Table [201]2. The study
   was approved by the Ethics Committee of Xuzhou Central Hospital for
   Biomedical Research (Approval No. XZXY-LJ-20191210-045), and written
   informed consent was obtained from all participants. All procedures
   involving human participants were conducted in accordance with ethical
   guidelines and regulations.

Cell culture and transfection

   Human lung adenocarcinoma cell lines A549, H1299 were obtained from
   ATCC (cat # CCL-185, cat # CRL-5803, respectively). PC9 cell line were
   purchased from Chinese Academy of Sciences (Beijing, China). A549 and
   H1299 were cultured in RPMI-1640 medium supplemented with 10% FBS and
   1% P/S at 37 °C under 5% CO2 conditions. PC9 were cultured in DMEM
   medium supplemented with 10% FBS and 1% P/S at 37 °C under 5% CO2
   conditions. HMGA1 knockdown was performed using two specific siRNAs.
   Over expression of HMGA1 was achieved by transfecting pcDNA3.1-HMGA1
   plasmid. Transfections were conducted using jetPRIME® (Polyplus)
   according to the manufacturer’s instructions.

Colony formation assay

   Cells in the logarithmic growth phase were digested with trypsin and
   resuspended in complete medium (RPMI-1640 with 10% FBS for A549, H1299
   and DMEM with 10% FBS for PC9) to create a cell suspension, which was
   then counted. A total of 1,000 cells per well were seeded into 6-well
   culture plates for each experimental group and cultured for 7–10 days.
   The medium was replaced and the cell status was observed every 3 days.
   Upon completion of the cloning process, cells were photographed under a
   microscope, washed once with PBS, and fixed with 1 mL of 4%
   paraformaldehyde for 30–60 min, followed by another wash with PBS.
   Subsequently, 1 mL of crystal violet staining solution was added to
   each well and incubated for 10–20 min to stain the cells.

Transwell migration and invasion assay

   Invasion assays were conducted using transwell chambers coated with
   Matrigel (8 μm pore size). The lower chamber was filled with 800 μL of
   culture medium containing 10% FBS. Place the chamber into a 24-well
   plate and add 200 μL cell suspension with a concentration of 3 × 10^5
   cells containing serum-free medium into the upper chamber. After
   culturing for 36 to 48 h, remove the chamber, rinse it with PBS, and
   use a cotton swab to remove the cells and Matrigel from the upper
   chamber. Fix the chamber with 4% paraformaldehyde for 20 min, stain
   with 0.1% crystal violet for 30 min, and rinse it three times with PBS.
   Cells from five random areas per well were photographed and counted
   using ImageJ software. As for migration assays, matrigel was not
   utilized, 2 × 10^5 cells was used in the upper chamber and culturing
   lasted for 24 h.

Wound healing assay

   Cells were cultured in 6-well plates. Upon reaching 90% confluence, a
   sterile pipette tip was employed to draw a straight line across the
   cell layer to create a wound, followed by washing the cells with PBS to
   get rid of cellular fragments. Fresh culture medium was added, and
   wound closure was observed at 24 h using an inverted microscope. The
   wound area was measured using ImageJ software.

Immunohistochemistry

   Unstained tissue slides in cohort I were used for IHC analysis to
   detect the target proteins HGMA1 while slides in cohort II were used
   for KRT8 staining. Slides were incubated with either rabbit anti-HMGA1
   primary antibody (cat # 29895–1-ap, Proteintech, dilution 1:100) or
   rabbit anti-KRT8 primary antibody (cat # 17514–1-ap, Proteintech,
   dilution 1:1000). The chromogenic reaction was performed using DAB
   substrate, and the slides were counterstained with hematoxylin, which
   stained the nuclei blue, while positive signals appeared as
   brownish-yellow. Images were captured using Olympus microscopes and the
   PANNORAMIC-MIDI scanner. Aipathwell software was utilized for image
   analysis. The expression of HGMA1 and KRT8 was quantified by IRS and
   H-score. The immunoreactivity score (IRS) was calculated as the product
   of staining intensity (SI) and the percentage of positive cells (PP).
   Staining intensity was graded as 0 (no staining), 1 (light yellow, weak
   positive), 2 (brown–yellow, moderate positive), or 3 (brown, strong
   positive). The percentage of positive cells was scored as 0 (0–5%), 1
   (6–25%), 2 (26–50%), 3 (51–75%), or 4 (> 75%). Additionally, the
   H-Score was determined using the formula H-Score = ∑(pi × i) where i
   represents the staining intensity (0: no staining, 1: weak positive, 2:
   moderate positive, 3: strong positive) and pi represents the percentage
   of cells at each intensity level. For survival analysis of HMGA1 in
   cohort I, we set the threshold for high expression at H-score = 100.

Statistical analysis

   All data were analyzed using R software (Version 4.3.2). The Wilcoxon
   rank-sum test was employed for between-group comparisons, with all
   statistical tests being two-tailed and p-values less than 0.05
   considered statistically significant. Correlation analysis was
   conducted using Pearson’s correlation method, and correlation p-values
   were adjusted using the BH method. Survival analysis was performed
   using the Kaplan–Meier method, with differences in survival assessed
   via the log-rank test. The prognostic C-index was utilized for model
   evaluation.

Supplementary Information

   [202]Supplementary Material 1.^ (4.3MB, docx)

Acknowledgements