Abstract Hepatocellular carcinoma (HCC) is a chronic liver disease characterized by persistent tumor growth, contributing significantly to mortality rates worldwide. Consequently, there is an urgent need to develop effective diagnostic and treatment strategies for HCC. This study aims to identify crucial genes for early HCC diagnosis to mitigate disease progression and to investigate differences in immune cell infiltration between early-stage and late-stage HCC. We integrated two published datasets for a comprehensive analysis, identifying 575 DEGs subjected to GSEA to reveal pathways distinguishing early-stage from late-stage HCC. Notably, the gene SLC6A8 emerged as a potential diagnostic biomarker for late-stage HCC through machine learning (LASSO-LR/SVM-RFE/RF-Boruta). ROC curves for SLC6A8 were utilized to evaluate diagnostic accuracy. The ImmuCellAI algorithm assessed immune cell composition differences between early and late-stage HCC, revealing that SLC6A8 expression positively correlates with resting Tfh cells and Th2, while negatively correlating with B cells, indicating its association with immune cell infiltration patterns. To strengthen our results, we further analyzed SLC6A8 expression using single-cell transcriptome data, confirming notably overexpression in late-stage HCC, particularly in key liver cell types such as Hepatocyte cells. Overall, our study nominates SLC6A8 as a dual biomarker for HCC Staging and precision therapy. Supplementary Information The online version contains supplementary material available at 10.1007/s12672-025-02351-3. Introduction Hepatocellular carcinoma (HCC), a fourth leading cause of death over the world in 2018, has been a globally severe disease [[28]1]. Most patients are diagnosed at a late stage, and the only potentially curative approaches for treating are hepatic resection and liver transplantation, which have low treatment rate [[29]2]. So, it is critical to find the diagnostic biomarker of HCC as early as possible. However, rather than focusing on the severity of HCC, many researchers have tried to build a classification to distinguish between normal and HCC patients [[30]3–[31]5]. In Addition, tumor immune microenvironment plays an important role in tumor progression and treatment outcome [[32]6, [33]7]. In recent years, immunotherapy has become a hopeful approach for treating cancer [[34]8]. However, response rates among HCC patients have been limited [[35]9]. Therefore, conducting a comprehensive analysis of different types of immune cells within tumors could aid in identifying new biomarkers for prognosis and therapeutic efficacy in HCC patients. In recent decades, HCC has been a major focus of research interest. For instance, specific genes like CLTA, TALDO1 and CSTB, identified as gene signature through single cell RNA sequencing, have been associated with survival outcome and immunotherapy response [[36]10]. Hang D et al. conducted mass spectrometry analysis for metabolomics to identify new predictive biomarkers and pathways in HCC [[37]11]. Genes such as Receptor Activity Modifying Protein 3 (RAMP3) and CD68 Molecule (CD68) have shown significantly expressed in HCC based on machine-learning algorithms [[38]12]. Overall, these genes are considered vital candidates for potential diagnostic markers in HCC clinical settings. Recently, the rapid development of next-generation sequencing technologies provides amount of RNA sequencing data of HCC. However, RNA sequencing is typically conducted in "bulk," capturing average gene expression patterns from a multitude of cells [[39]13]. Notably, single-cell RNA sequencing single-cell RNA sequencing is an innovative sequencing technique that offers valuable insights into the characterization of individual immune cells or tumor cells [[40]14]. At the same time, bioinformatic analysis and machine learning have emerged as increasingly promising strategies for comprehensive and in-depth analysis of large datasets, such as transcriptome sequences, and interdisciplinary collaborations have been instrumental in advancing clinical therapeutic methods [[41]15]. The advent of modern computer-assisted medical science has provided significant guidance and hope for previously untreatable diseases, such as utilizing the XGBoost algorithm for HCC diagnosis [[42]12]. To meet the demand for early diagnosis, numerous efforts have focused on developing new methods based on deep learning analysis [[43]16]. Achieving accurate clinical diagnosis of COPD remains a critical goal. In this study, we screened for potential biomarkers indicating the late stage of HCC using established bioinformatic tools. Two transcriptome datasets were selected from the published database (TCGA, ICGC) for analysis. After identifying DEGs (Differential expressed genes) using limma, we conducted GO and GSEA analyses on these DEGs. Notably, we identified a candidate diagnostic biomarker, SLC6A8, which was the intersection of genes selected by LASSO, SVM-REF and RF-Boruta methods and validated it using another dataset. Additionally, we investigated changes in immune cell composition between early-stage and late-stage HCC samples using ImmuCellAI (Immune Cell Abundance Identifier) to analyze immune cell infiltration. Furthermore, we linked immune cell infiltration to a potential HCC diagnostic biomarker, validated through single-cell data and in vitro experiments. Overall, we identified a novel gene that may guide future clinical management of HCC patients. Materials and method Patient samples Frozen HCC tissues were randomly obtained with informed consent from patients who underwent radical resections in the Second Affiliated Hospital of Harbin Medical University. After selection, 32 HCC patients were enrolled in this study and informed consent was obtained from all patients. Ethical consent was granted from the Ethics Committee of the second Affiliated Hospital of Harbin Medical University, Harbin, China. Patients enrolled in our study received neither radiation therapy nor chemotherapy prior to surgery. Data collection and download Standardized RNA-Seq reads (Release 28) of LIHC-US and LIRI-JP projects were obtained from The Cancer Genome Atlas (TCGA, [44]https://cancergenome.nih.gov/) and International Cancer Genome Collaboration (ICGC, [45]https://dcc.icgc.org/). [46]GSE14520, the validation dataset with chip data was downloaded from GEO database ([47]https://www.ncbi.nlm.nih.gov/geo/). Only patients with tumor stage and samples from the primary site were included in the analysis. For LIHC-US (TCGA), 231 patients were retained. For LIRI-JP (ICGC), 344 patients were met the criterion. Finally, we extracted 218 patients from the [48]GSE14520. The HCC scRNA-seq dataset [49]GSE149614 was load from the GEO database ([50]https://www.ncbi.nlm.nih.gov/geo/) and included 10 patients, only the primary tumor samples including 3 TNM stage I, 1 TNM stage II, 2 TNM stage III and 4 TNM stage IV samples will be taken for downstream analysis. According to the TNM, HCC disease was divided into four stages: stage I, stage II, stage III, and stage IV. In the following analysis, we defined stage I and stage II as HCC early-stage, stage III and stage IV as HCC late-stage. A merge data cohort comprising of LIHC-US and LIRI-JP datasets through “limma” and “snm” R packages, and the [51]GSE14520 was as the validation dataset to confirm the analysis results. The [52]GSE149614, an HCC scRNA-seq dataset, comprising of 4 HCC early-stage and 6 HCC early-stage patients was also taken as the validation dataset to investigate the results at single-cell level. The clinical information of these cohorts involved in this study had presented in Supplemental Table 1. Data integration The LIHC-US and LIRI-JP datasets were merged and processed using the R packages limma (version 3.58.11) and SNM (version 1.50.0) for batch effect correction. First, the voom function from limma was applied to the raw count data (dge object) with parameters design = covDesignNorm and normalize.method = "none" to stabilize variance. Subsequently, the SNM algorithm was implemented with raw.dat = vdge$E (voom-transformed expression matrix), adjusting for biological variables (bio.var) while removing technical confounding factors (adj.var) using rm.adj = TRUE. Diagnostic outputs were enabled via verbose = TRUE and diagnose = TRUE in Supplemental Fig. 1D [[53]17, [54]18]. What’s more, the results of other approaches used to remove the batch effect in the different dataset had showed in Supplemental Figure 1A–C. Principal component analysis (PCA) shows the samples distribution before and after batch correction. Differential expression genes DEGs form the merged dataset were with the cutoff criteria of |logFC|> 0.5, adjusted-pvalue < 0.05 by using “limma” (version 3.58.11) R package. A linear model was then fitted with lmFit and differential expression was assessed using the eBayes function. Functional enrichment analysis Gene ontology (GO) enrichment analysis was performed to investigate the DEG biological significant by using the clusterProfiler (version 4.10.1) R package [[55]19]. The Molecular Signatures Database (MSigDB) Hallmark Gene Sets and Kyoto Encyclopedia of Genes and Genomes (KEGG) Pathway were used for pathway enrichment analysis by using msigdbr (version 7.5.1) and massdatabase (version 1.0.10) R package. GSVA analysis was performed using the GSVA (version 1.50.1) R package [[56]20]. Both GSEA and GSVA were performed with the criteria of P- value < 0.05. Immune cell infiltration The ImmuCellAI, a gene set signature-based method, is a deconvolution algorithm, which can estimate the abundance of 24 immune cell types [[57]21]. A single sample gene set enrichment analysis (ssGSEA) algorithm was applied to deconvolve bulk gene expression matrix. Immune cell infiltration (24 various cell types including 18 T-cell subsets) was precisely quantified in each gene expression profile. The detailed process was carried out by using ImmuneCellAI (version 0.1.0) R package. Potential biomarkers selection To find the potential prognostic gene biomarkers from the DEGs, we screened them by three machine-learning algorithms. (1) least absolute shrinkage and selection operator with logistic regression (LASSO-LR: family = "binomial", keep = TRUE, type.measure = "mse", nfolds = 10, alpha = 1, nlambda = 100), (2) support vector machine-recursive feature elimination (SVM-RFE: sizes = seq(1, (ncol(X_train)—1), 20), metric = "RMSE", rfeControl = rfeControl(functions = rfFuncs, method = "cv", number = 5)) algorithms, (3) Boruta with random forest (Brouta-RF: method = "repeatedcv", number = 10, repeats = 3, search = "random", classProbs = TRUE, verboseIter = TRUE). LASSO can identify genes significantly associated with different samples, which represents a regression analysis algorithm that applies regularization for variable selection using “glmnet” (version 4.1.7) R package [[58]22]. In this study, LASSO with logistic regression first reduced the dimensionality successfully (425 original genes to 20 meaningful genes). Specifically, LASSO method with crossing validation using the mean of square error as cost function was performed, then they shrank into a few more important features according to mean of square error. SVM-RFE represents a widely used supervised machine-learning protocol for classification and regression, used to find the best variables by deleting SVM-generated eigenvectors. The “Caret” (version 6.0.94) R package via grid search method is employed to select hyperparameters for all classifiers using tenfold cross-validation for the training dataset. SVM-RFE can identify the diagnostic value of biomarkers with higher discriminative power using the “e1071” (version 1.7.13) R package. The Boruta algorithm is a wrapper built around the random forest classification algorithm. The Boruta (version 8.0.0) R package is to capture all the important, interesting features you might have in the dataset with respect to an outcome variable. The “Caret” (version 6.0.94) R package via grid search method is employed to select hyperparameters of random forest for all classifiers using tenfold cross-validation for the training dataset. Brouta-RF can identify the diagnostic value of biomarkers with higher discriminative power using the “randomforest” (version 4.7–1.1) R package. Ultimately, we combined the overlapping genes among LASSO-LR, SVM-RFE and Brouta-RF algorithm for further analysis. A two-sided P value < 0.05 between the early stage and the late stage of HCC was statistically significant on these genes. Then, we validated their expression level to estimate their ability to be candidate diagnostic biomarkers in the [59]GSE14520 data. ROC of diagnostic biomarker The nonparametric Wilcoxon rank-sum test was used to perform inter-group comparisons of continuous variables. The degree of efficacy of each diagnostic biomarker was assessed using receiver operating characteristic (ROC) curves by “pROC” (version 1.18.5) and “multipleROC” (version 0.1.1) R package, which is the gold standard to prove the diagnostic accuracy and test the efficacy of diagnostic biomarkers in the [60]GSE14520 cohort. Single cell transcriptome data processing and analyzing [61]GSE149614 raw data was downloaded from GEO databases. In the process of single cell transcriptome data processing, we carried out normalization, scaling, clustering of cells and achieved 6 main cell types using Seurat (version 5.0.3) R package. Single cells were extracted with the criteria of nFeature_RNA > 250, percent.mt < 20%, percent. Log10GenesPerUMI > 0.8, nCount_RNA > 500 to removing doublet and dead cells by Seurat R package (version 5.0.3). Then we normalized the filtered gene-barcode matrices using “LogNormalize” method with the “NormalizeData” function. The top 2000 highly variable genes were found by the “FindVariableFeatures” function using the “vst” method which were centered and scaled using “ScaleData” before. Then we performed principal component analysis (PCA) based on these 2000 highly variable genes with the intention of dimensionality reduction, then dimensionality-reduced clusters were showed on the 2D map produced with the t-distributed t-SNE using function “FindNeighbors”, “FindClusters” and “runTSNE” from Seurat. Kruskal–Wallis test was used to estimate the difference of gene expression level. Real-time quantitative reverse transcriptase-PCR Total RNA was extracted from liver cancer tissue using Trizol (Invitrogen, Carlsbad, America) reagent before cDNA was obtained using the Trans-Script All in-one First-strand cDNA Synthesis Supermix for qPCR Kit (TransGen Biotech, Beijing, China). Real-time quantitative PCR was performed in Step One ABI real-time. PCR System through SYBR Green Master (Roche, Basel, Switzerland). GAPDH, Forward primer: 5’-CATGTTCGTCATGGGTGTGAA-3’, Reverse primer:5’-GGCATGGACTGTGGTCATGAG-3’. SLC6A8, Forward primer: 5’-GGCAGCTACAACCGCTTCAACA-3’, Reverse primer:5’-CAGGATGGAGAAGACCACGAAG-3’. Statistical analysis R (version 4.3.3) and Rstudio (version 2023.12.1 + 402) were used for statistical analysis. Wilcox-rank-sum test was carried out to analyze the significant differences of features between the early-stage and late-stage groups. Spearman correlation coefficient was used to identify the correlations between genes and immune cells. A P-value < 0.05 indicated statistical significance (*P < 0.05; **P < 0.01; ***P < 0.001). Results Identification of DEGs in the HCC early stage and late stage We performed differential expression analysis between 397 early-stage HCC samples and 178 late-stage HCC samples in the merge cohort (LIHC-US and LIRI-JP) by utilizing the “limma” R package. Figure [62]1 showed the schematic workflow of this study. Fig. 1. [63]Fig. 1 [64]Open in a new tab Schematic workflow of this study 137 genes were significantly enriched in the early-stage group and 291 genes significantly enriched in the late-stage group (Fig. [65]2A, |logFC|> 0.5 and AdjustedPvalue < 0.05, Supplemental Table 2). The heatmap of DEGs also had been presented in Fig. [66]2B. Fig. 2. [67]Fig. 2 [68]Open in a new tab DEGs between the early-stage group and the late-stage group in HCC merge dataset (LIHC-US and LIRI-JP). A Volcano of DEGs. Dark green dots represent enriched-in-early-stage genes, light purple dots represent enriched-in-late-stage genes and grey dots represent non-significant genes. B Heatmap of DEGs. The colors of column annotations represent the different groups Functional analysis of DEGs by GO and KEGG enrichment analysis To investigate the potential biological significances of DEG, we used clusterProfiler and GSVA R package to characterize the GO and KEGG pathway. The GO analysis showed that EEGs involved with response to xenobiotic stimulus, steroid metabolic process, cytoplasmic vesicle lumen, high − density lipoprotein particle, endopeptidase regulator activity, steroid hydroxylase activity, and so on (Fig. [69]3A, Supplemental Table 3). Fig. 3. [70]Fig. 3 [71]Open in a new tab Functional analysis of DEGs between the early-stage group and the late-stage group in HCC merge cohort. A Dotplot of enriched GO terms in DEGs. B Dotplot of enriched KEGG pathway in DEGs. C GSEA results in HCC early-stage samples. D GSEA results in HCC late-stage samples Moreover, p53 signaling pathway, ECM − receptor interaction, ABC transporters, Steroid hormone biosynthesis, and PPAR signaling pathway were identified by GSEA (Fig. [72]3B, Supplemental Table 4). In particular, nitrogen metabolism, one carbon pool by folate, taurine and hypotaurine metabolism and tryptophan metabolism were mainly enriched in the HCC early stage (Fig. [73]3C). Inversely, cell cycle, mismatch repair, p53 signaling pathway, and sphingolipid metabolism were significantly enriched in the HCC late stage (Fig. [74]3D). Additionally, GSVA results showed ABC transporters was enriched in the HCC late stage (Supplemental Figure 2). Significant changes between two stages in immune cells by ImmuneCellAI With the ImmuCellAI, a latest immune cell infiltration algorithm, we obtained the immune cell expression matrix from the whole gene expression matrix of the merge dataset (Fig. [75]4A). Not only Tfh cells and Th2 cells of the HCC early stage were significantly higher than the HCC late stage, but also CD8 naïve cells, Monocyte and Neutrophil were enriched in the HCC early stage. In the opposite, the HCC late stage showed a higher proportion of B cells compared to HCC early stage (Fig. [76]4B). The heatmap of immune cells also showed the tendency between two stages in samples (Fig. [77]4C). Furthermore, B cells were negatively correlated with most immune cells including CD8 naïve cells, Tr1 cells and Monocyte (Fig. [78]4D). On the other hand, Tfh cells, Th2 cells, and Monocyte were positively correlated with CD8 naïve cells (Fig. [79]4D). Fig. 4. [80]Fig. 4 [81]Open in a new tab Evaluation and visualization of immune cell infiltration between the HCC early stage and late stage in merge cohort. A Stacked barplot of infiltrating immune cells in the HCC early-stage and late-stage samples. B Boxplot of differential expression of 21 infiltrating immune cells between the HCC early stage and late stage in merge cohort. Wilcoxon rank-sum test was used for differential analysis. *P < 0.05; **P < 0.01. ns, non-significance (Dark green in the boxes represents the HCC early stage and light purple represents the HCC late stage). C Heatmap of 21 infiltrating immune cells. D Correlation matrix of 21 immune cell infiltration between the HCC early stage and late stage in merge cohort. Red and blue indicate positive and negative correlations, respectively. The darker color shows the stronger the correlation Potential gene biomarkers identified by multiple machine learning approaches To find the potential gene biomarkers to distinguish the early- and late-stage of HCC, we utilized three machine learning algorithms to identify the diagnostic markers from the DEGs. 20 potential DEGs were identified by the least absolute shrinkage and selection operator (LASSO) logistic regression algorithm (Supplemental Figure 3A). What’s more, we used the 20 biomarkers to build LASSO logistic regression diagnostic model, whose performance was pretty good (AUC = 0.732, Supplemental Figure 3B). Of the 20 biomarkers, 7 DEGs had positive coefficients and 13 DEGs had negative coefficients (Supplemental Figure 3C, Supplemental Table 5). The Boruta algorithm was used to classify 39 features from the DEGs (Supplemental Figure 3D). Moreover, the random forest diagnostic model based on the 39 biomarkers had also well performance (AUC = 0.704, Supplemental Figure 3E). SLC6A8 gene had the highest importance score (Mean decrease accuracy from random forest) in Supplemental Figure 3F (Supplemental Table 6). Another machine learning algorithm named the support vector machine-recursive feature elimination (SVM-RFE) to identify 41 features (Supplemental Figure 3G, Supplemental Table 7). and the performance of the diagnostic model for distinguish the HCC early stage and the HCC late stage was also well (AUC = 0.698, Supplemental Figure 3H). Next, 10 intersection biomarkers including SLC6A8, FTCD, CYP2C9, ANGPT2, ENO1, CNGA1, KCNJ15, SLC39A4, ETV1, and ACSL6 were extracted from abovementioned features (Fig. [82]5A). Sequentially, we validated the gene expression level of the 10 biomarkers in the validation [83]GSE14520 dataset and found only 6 genes (SLC6A8, FTCD, CYP2C9, ANGPT2, ENO1, CNGA1) had the consistent results with the merge dataset (Supplemental Figure 4A, B). 3 genes (FTCD, CYP2C9, CNGA1) were significantly enriched in the HCC early stage, while other 3 genes (SLC6A8, ANGPT2, ENO1) were significantly higher in the HCC late stage in both merge dataset and validation dataset (Supplemental Figure 4A, B). Fig. 5. [84]Fig. 5 [85]Open in a new tab Validation and ROC of SLC6A8 diagnostic marker identified from the merge datasets by multiple machine learning algorithms. A The overlapping genes among three machine learning methods. B, C The expression level of SLC6A8 diagnostic marker between the HCC early stage and the HCC late stage in the merge dataset and validation [86]GSE14520 dataset, respectively. Wilcoxon rank-sum test was used for differential analysis. D, E ROC validation of diagnostic validity of the SLC6A8 diagnostic marker in the merge dataset and validation [87]GSE14520 dataset, respectively (Sens: sensitivity; Spec: Specificity; PPV: Positive Predictive Value; NPV: Negative Predictive Value) To find the diagnostic makers from the 6 genes, we focused on the SLC6A8 gene which had higher expression in the HCC late stage (Fig. [88]5B, C) and performed the ROC analysis to check whether the SLC6A8 gene had well performance of diagnostic validity in both merge dataset. Finally, the AUC of ROC validation were 0.654 (Fig. [89]5D) and 0.701 (Fig. [90]5E) in the merge dataset and validation dataset, respectively, indicating the SLC6A8 gene had not bad performance as a diagnostic biomarker. Correlation analysis between SLC6A8 and immune cells We used spearman correlation to characterize the association between 6 genes (SLC6A8, FTCD, CYP2C9, ANGPT2, ENO1, CNGA1) and infiltrating immune cells. The gene expression of most these genes had strong correlation with the 21 immune cells (Fig. [91]6A). SLC6A8 was significantly associated with 15 immune cells including DC, Exhausted, Th17 and CD8 naïve. Especially, the DC and CD8 naïve had the highest correlation coefficient with SLC6A8 (Fig. [92]6B). SLC6A8 had significantly positive correlation with DC cells (r = 0.284, p = 3.9e−12, Fig. [93]6C) and other cells (Supplemental Figure 5A–I), and significantly negative correlation with CD8 naïve (r = − 0.22, p = 1e−07, Fig. [94]6D) and other cells (Supplemental Figure 5J–L). Fig. 6. [95]Fig. 6 [96]Open in a new tab Correlation between SLC6A8 and 21 immune cells. A Heatmap of correlation between 6 biomarkers (SLC6A8, FTCD, CYP2C9, ANGPT2, ENO1, CNGA1) and 21 immune cells. Color represents the spearman correlation coefficient. *P < 0.05; + P < 0.01. B Lollipop diagram of correlation between SLC6A8 and 22 immune cell infiltration. Circle color represents the p-value (red indicates significance) and circle size represents the absolute value of coefficient. C, D Scatter plots of correlation between SLC6A8 and immune cells whose had the highest correlation coefficients (DC and CD8_naive). Dark green represents the HCC early-stage samples and light purple represents the HCC late-stage samples Furthermore, we explore the associations between SLC6A8 and marker genes of immune cells as reported by Miao, Ya-Ru, et al. (2020) in Advanced Science. Upon analysis, we found significant correlations between SLC6A8 and the majority of marker genes for dendritic cells (Supplemental Figure 6A). For instance, the chemokine receptor CXCR3, which exerts substantial immunological influence on cancer, was positively correlated with SLC6A8 (Supplemental Figure 6B). In contrast, the CYP4F3 gene showed a negative correlation (Supplemental Figure 6C). Similar observations were made in another immune cell type, CD8 naïve, where we noted genes that were both positively and negatively correlated with the SLC6A8 gene (Supplemental Figure 6D–F). Expression level of SLC6A8 in single-cell transcriptomic data The cell types of [97]GSE149614 dataset were showed by using t-SNE based on the Seurat-class object with the 2000 highly variable genes after data processing including normalization, scaling, clustering and so on (Fig. [98]7A). The expression of individual cells and each cell types for SLC6A8 are displayed in the Fig. [99]7B, C, respectively. Interesting, Hepatocyte, which was the dominant cells in liver presented the highest expression level of SLC6A8 among 6 main cell types (Fig. [100]7C). Following, we separated cells according to the tumor stage of patients into two groups (HCC early stage and HCC late stage) to validate whether the HCC enhanced SLC6A8 gene expression, which had been discovery in the bulk-seq RNA dataset. As expected, SLC6A8 gene had significantly higher expression (p < 2e−16) in the HCC late stage compared to the HCC early stage in Hepatocyte (Fig. [101]7D) by Wilcoxon rank-sum test, which was consistent with the previous analysis. Fig. 7. [102]Fig. 7 [103]Open in a new tab SLC6A8 expression in single-cell [104]GSE149614 transcriptomic dataset. A The t-distributed stochastic neighbor embedding (t-SNE) plot of the 6 identified main cell types in [105]GSE173896 dataset. B t-SNE map highlighting the expression of SLC6A8 gene. C Bubble plot showing the expression of the SLC6A8 related different cell types. Dot size represents the percent expressed, and color represents average expression. D Correlation of SLC6A8 with 6 types of main cell types between the HCC early stage and late stage in single cell transcriptome data. E SLC6A8 expression of early-stage and late-stage HCC patients. Wilcoxon rank-sum test was used for differential analysis. *P < 0.05 Subsequently, we detected the expression of SLC6A8 in patients with early and late liver cancer. Compared to early-stage HCC patients, the expression of SLC6A8 was higher in late-stage HCC patients (Fig. [106]7E). This result indicated that the expression level of SLC6A8 was related to the classification of liver cancer. Discussion HCC with high morbidity and mora rate has become a leading cause of cancer-related death worldwide [[107]23]. Furthermore, HCC was diagnosed at the late tumor stage, which missing the appropriate treatment options [[108]24]. Therefore, it is crucial for patient management to predict the disease at an early stage. Hence, the development of gene biomarkers that distinguish the early stage from the late stage of HCC is the primary goal of this study. To explore the potential biomarkers in the prediction of the early stage, we used multiple machine learning algorithms to compare the early stage and late stage of HCC based on the differentially expressed genes (DEGs). Apart from that, we also investigated the biological function of DEGs and immune cells between the early stage and the late stage. In this study, 137 down-regulated genes and 291 up-regulated genes were identified, and some of DEGs were enriched in p53 signaling pathway and tryptophan metabolism, which are cancer-related pathway [[109]25, [110]26]. Interestingly, p53 signaling pathway with p53 protein mutations resulting in uncontrolled cell proliferation and cancer tumors [[111]27], was enriched in the late stage of HCC, suggesting that there was more the loss of tumor-suppressing function at the late stage of HCC. Inversely, tryptophan metabolism promoting tumor cell intrinsic malignant properties as well as restricts antitumor immunity [[112]28], was enriched in the early stage of HCC, indicating that the immune system had initially changed in the early stage. Moreover, the immune infiltration analysis also found that the immunity between two stages was significantly different, such as Tfh (T follicular helper) cells, providing essential help to B cells for effective antibody-mediated immune responses [[113]29], and Th2, facilitating tissue repair [[114]30] were both enriched in the early stage. Ten DEGs from the intersection of genes i.e. SLC6A8, FTCD, CYP2C9, ANGPT2, ENO1, CNGA1, KCNJ15, SLC39A4, ETV1, and ACSL6 were screened by using LASSO analysis, RF-Boruta and SVM-REF. Previously, different reports uncover their involvement in a variety of malignancies [[115]31–[116]35]. From comparing their expression value between two stages of HCC in the merge and validation cohort, we found SLC6A8, ANGPT2 and ENO1 were observed as up-regulated genes in the late stage. Then, we used ROC curves to assess the efficacy of SLC6A8 on the diagnosis of HCC. The AUC of SLC6A8 were 0.654 and 0.701 in the merge cohort and the validation cohort, suggesting that SLC6A8 is a potential diagnostic biomarker to distinguish the early stage and the late stage of HCC. Especially, it has been shown that SLC6A8 is associated with the initiation and progression of human cancers [[117]36]. Evidently, another study also demonstrated that SLC6A8 knockdown suppresses the invasion and migration of HCC [[118]37]. Compared to established HCC biomarkers such as AFP (alpha-fetoprotein) and DCP (des-gamma-carboxy prothrombin), which exhibit poor detection in early-stage HCC [[119]38], SLC6A8 showed higher performance. Clinically, integrating SLC6A8 detection into existing AFP-based diagnostic panels may improve early detection rates, particularly in AFP-negative patients [[120]38]. Future studies should prioritize developing SLC6A8-targeted assays for serum or imaging-based validation. There was different immune capacity between two stages of HCC, but whether the immunity associated with the SLC6A8 is still unknown. To further analyze these relationships, we used spearman correlation coefficient and found that the up-regulated gene SLC6A8 was positively associated with DC (dendritic cells), which provide antigens and co-stimulatory signals to cells of the adaptive immune system [[121]39], and negatively associated with CD8-naïve cells (naïve CD8 + T cells), which selectively detect and eradicate cancer cells by targeting the antigens including tumor-specific neoantigens and self-antigens from tumors [[122]40]. In our study, other immune cells also associated with gene SLC6A8, inferring that SLC6A8 might affect the tumor progression by regulating the immune cells. In the literature, it has revealed the positive relationship between DC cells and SLC6A8-mediated creatine transport [[123]41]. Another recent study also demonstrated that SLC6A8 may be involved in the development of cancer by participating in the Notch signaling pathway, which playing important role in the specification of the immune cells in the NSCLC [[124]31]. Although the correlation coefficient in (Fig. [125]6C, D) appears modest (r = 0.284; r = − 0.22), the statistically significant p-value (p < 0.05) likely reflects the large sample size (n = 575), which increases statistical power to detect even small effects. However, the biological relevance of such weak associations requires cautious interpretation and further validation in independent cohorts. We also investigated whether the expression of gene SLC6A8 at single-cell transcript level and in vitro trial was the same as bulk-RNA sequencing level. Interestingly, six cell types including Hepatocyte which is the main functional cells of the live [[126]42] were identified in the liver tissue which was consistent with the previous studies [[127]43]. Furthermore, we found that the most important live cell Hepatocyte cells had higher values of SLC6A8 in the late stage of HCC. At the same time, the Myeloid cell also had slightly significantly higher values of SLC6A8 in the late stage of HCC. Additionally, the in vitro trial also showed the obviously significant expression changes of SLC6A8 between two groups. With the repeated verification of discovery on the SLC6A8, we believe the gene SLC6A8 may be the potential diagnostic marker. To validate the mechanistic role of SLC6A8 in HCC, future studies could employ CRISPR/Cas9-mediated knockout models to assess its impact on tumor proliferation and metastasis. Additionally, metabolomic profiling of SLC6A8-deficient HCC cells may reveal alterations in creatine-phosphate-ATP cycling, linking its function to metabolic reprogramming. In vivo validation using patient-derived xenografts (PDXs) treated with SLC6A8 inhibitors (e.g., β-guanidinopropionic acid [[128]4]) could further establish its therapeutic potential. In conclusion, we demonstrated that gene SLC6A8 was significantly up-regulated in the late stage of HCC based on the transcriptomic data. In addition, gene SLC6A8 was associated with immune cell infiltration, which provides a potential target for more precise and personalized immunotherapy. Therefore, the gene SLC6A8 may be a potential diagnostic marker. Conclusion In summary, we identified the DEGs by comparing the early stage and the late stage of HCC and found that they were associated with the cancer-related pathway and SLC6A8, a potential diagnostic biomarker for clinical diagnosis between the early stage and the late stage of HCC was verified not only in the single-cell transcriptomic data but also in vitro trial. Limitation of the study We have identified potential diagnostic biomarker of the early stage and the late stage of HCC based on the transcriptomic data and verified it in the single-cell transcriptomic data and in vitro trial in our study. One of the limitations is that the sample size of our merge data may be not very large. But we have integrated the two most influential research (LIHC-US and LIRI-JP) with more than 100 samples into the merge dataset. Furthermore, this study only focused on biomarker exploration in the transcriptomic levels. The future study could develop the combination biomarkers from the genomic, epigenetic and metabolomic data. Furthermore, the future time-series studies may employ the methodologies provided by the review to conduct causal inferences [[129]44], thereby identifying the causal relationships between biomarkers and the progression of cancer. While the AUC scores (0.65 and 0.7) indicate moderate diagnostic accuracy, we acknowledge that further model optimization (e.g., hyperparameter tuning or ensemble methods) or integration of multimodal data features (e.g., imaging or clinical variables) could enhance the robustness of the model in future studies. Another limitation is the heterogeneity of HCC. Future studies should expand to multi-ethnic cohorts and incorporate spatial transcriptomics to address tumor microenvironment heterogeneity. Supplementary Information [130]12672_2025_2351_MOESM1_ESM.tiff^ (1.4MB, tiff) Supplementary materials 1: Figure. 1. PCA of multiple types of gene expression matrix for batch removal on merge data cohort (LIHC-US and LIRI-JP). A. PCA of raw gene expression matrix. B. PCA of adjusted gene expression matrix by ComBat function from sva (version 3.50.0) R package. C. PCA of adjusted gene expression matrix by removeBatchEffect function from limma (version 3.58.1) R package. D. PCA of adjusted gene expression matrix through VOOM and SNM function from limma (version 3.58.11) and SNM (version 1.50.0) R package. Dark green represents the HCC early-stage samples and light purple represents the HCC late-stage samples. [131]12672_2025_2351_MOESM2_ESM.tiff^ (1.6MB, tiff) Supplementary materials 2: Figure 2. GSVA enrichment analysis of gene expression matrix between the HCC early stage and late stage in merge data. [132]12672_2025_2351_MOESM3_ESM.tiff^ (3.5MB, tiff) Supplementary materials 3: Figure 3. Machine learning algorithms to screen potential diagnostic markers from DEGs between the HCC early stage and late stage in merge data. A. LASSO logistic regression found 20 biomarkers. B. The performance of LASSO logistic regression diagnostic model built by 20 biomarkers. C. The coefficient of 20 biomarkers in LASSO logistic regression. D. 39 biomarkers were identified by Boruta and random forest. E. The performance of random forest diagnostic model built by 39 biomarkers. F. The importance scores of 39 biomarkers in random forest. G. 41 biomarkers were identified by Recursive Feature Elimination feature selection. H. The performance of SVM diagnostic model built by 41 biomarkers. [133]12672_2025_2351_MOESM4_ESM.tiff^ (1.7MB, tiff) Supplementary materials 4: Figure 4. 10 overlapping genes screened from machine learning algorithms were validated in the validation cohort. A. The expression level of overlapping genes SLC6A8, FTCD, CYP2C9, ANGPT2, ENO1, CNGA1, KCNJ15, SLC39A4, ETV1, and ACSL6 between the HCC early stage and the HCC late stage in the merge dataset, respectively. B. The expression level of overlapping genes SLC6A8, FTCD, CYP2C9, ANGPT2, ENO1, CNGA1, KCNJ15, SLC39A4, ETV1, and ACSL6 between the HCC early stage and the HCC late stage in the validation GSE14520 dataset, respectively. [134]12672_2025_2351_MOESM5_ESM.tiff^ (2.2MB, tiff) Supplementary materials 5: Figure 5. Correlation between SLC6A8 and 12 immune cells. A-B. Scatter plots of correlation between SLC6A8 expression and different immune cell contents, Exhausted, NKT, nTreg, CD4 naïve, Cytotoxic, CD8 T, B cell, Tr1, NK, Monocyte, Neutrophil, and Th17. [135]12672_2025_2351_MOESM6_ESM.tiff^ (1.7MB, tiff) Supplementary materials 6: Figure 6. Associations between SLC6A8 and marker genes of Immune cells from (Miao, Ya‐Ru, et al. 2020. Advanced science). Correlation between SLC6A8 and marker genes of dendritic cells (A), CXCR3 (B) and CYP4F3 (C). Correlation between SLC6A8 and marker genes of CD8 naïve (D), TKNS2 (E) and MOGAT2 (F). The correlation coefficient is calculated using spearman. [136]12672_2025_2351_MOESM7_ESM.xlsx^ (724.6KB, xlsx) Supplementary materials 7: Table 1: Clinicopathologic characteristics of patients. Supplemental Table 2: Differentially expressed genes between the early stage and the late stage in the merged Hepatocellular Carcinoma (HCC) cohort. Supplemental Table 3: Enriched Gene Ontology terms of Differentially expressed genes. Supplemental Table 4: Enriched KEGG pathway of differentially expressed genes by GSEA. Supplemental Table 5: Gene signature from feature selection by least absolute shrinkage and selection operator (LASSO) and logistic regression (LR). Supplemental Table 6: Gene markers from feature selection by Boruta and random forest (RF). Supplemental Table 7: Gene markers from feature selection by Recursive Feature Elimination (RFE) and support vector machine (SVM). [137]Supplementary materials 8^ (3.5MB, docx) Acknowledgements