Abstract

   Hepatocellular carcinoma (HCC) is a chronic liver disease characterized
   by persistent tumor growth, contributing significantly to mortality
   rates worldwide. Consequently, there is an urgent need to develop
   effective diagnostic and treatment strategies for HCC. This study aims
   to identify crucial genes for early HCC diagnosis to mitigate disease
   progression and to investigate differences in immune cell infiltration
   between early-stage and late-stage HCC. We integrated two published
   datasets for a comprehensive analysis, identifying 575 DEGs subjected
   to GSEA to reveal pathways distinguishing early-stage from late-stage
   HCC. Notably, the gene SLC6A8 emerged as a potential diagnostic
   biomarker for late-stage HCC through machine learning
   (LASSO-LR/SVM-RFE/RF-Boruta). ROC curves for SLC6A8 were utilized to
   evaluate diagnostic accuracy. The ImmuCellAI algorithm assessed immune
   cell composition differences between early and late-stage HCC,
   revealing that SLC6A8 expression positively correlates with resting Tfh
   cells and Th2, while negatively correlating with B cells, indicating
   its association with immune cell infiltration patterns. To strengthen
   our results, we further analyzed SLC6A8 expression using single-cell
   transcriptome data, confirming notably overexpression in late-stage
   HCC, particularly in key liver cell types such as Hepatocyte cells.
   Overall, our study nominates SLC6A8 as a dual biomarker for HCC Staging
   and precision therapy.

Supplementary Information

   The online version contains supplementary material available at
   10.1007/s12672-025-02351-3.

Introduction

   Hepatocellular carcinoma (HCC), a fourth leading cause of death over
   the world in 2018, has been a globally severe disease [[28]1]. Most
   patients are diagnosed at a late stage, and the only potentially
   curative approaches for treating are hepatic resection and liver
   transplantation, which have low treatment rate [[29]2]. So, it is
   critical to find the diagnostic biomarker of HCC as early as possible.
   However, rather than focusing on the severity of HCC, many researchers
   have tried to build a classification to distinguish between normal and
   HCC patients [[30]3–[31]5]. In Addition, tumor immune microenvironment
   plays an important role in tumor progression and treatment outcome
   [[32]6, [33]7]. In recent years, immunotherapy has become a hopeful
   approach for treating cancer [[34]8]. However, response rates among HCC
   patients have been limited [[35]9]. Therefore, conducting a
   comprehensive analysis of different types of immune cells within tumors
   could aid in identifying new biomarkers for prognosis and therapeutic
   efficacy in HCC patients.

   In recent decades, HCC has been a major focus of research interest. For
   instance, specific genes like CLTA, TALDO1 and CSTB, identified as gene
   signature through single cell RNA sequencing, have been associated with
   survival outcome and immunotherapy response [[36]10]. Hang D et al.
   conducted mass spectrometry analysis for metabolomics to identify new
   predictive biomarkers and pathways in HCC [[37]11]. Genes such as
   Receptor Activity Modifying Protein 3 (RAMP3) and CD68 Molecule (CD68)
   have shown significantly expressed in HCC based on machine-learning
   algorithms [[38]12]. Overall, these genes are considered vital
   candidates for potential diagnostic markers in HCC clinical settings.

   Recently, the rapid development of next-generation sequencing
   technologies provides amount of RNA sequencing data of HCC. However,
   RNA sequencing is typically conducted in "bulk," capturing average gene
   expression patterns from a multitude of cells [[39]13]. Notably,
   single-cell RNA sequencing single-cell RNA sequencing is an innovative
   sequencing technique that offers valuable insights into the
   characterization of individual immune cells or tumor cells [[40]14]. At
   the same time, bioinformatic analysis and machine learning have emerged
   as increasingly promising strategies for comprehensive and in-depth
   analysis of large datasets, such as transcriptome sequences, and
   interdisciplinary collaborations have been instrumental in advancing
   clinical therapeutic methods [[41]15]. The advent of modern
   computer-assisted medical science has provided significant guidance and
   hope for previously untreatable diseases, such as utilizing the XGBoost
   algorithm for HCC diagnosis [[42]12]. To meet the demand for early
   diagnosis, numerous efforts have focused on developing new methods
   based on deep learning analysis [[43]16]. Achieving accurate clinical
   diagnosis of COPD remains a critical goal.

   In this study, we screened for potential biomarkers indicating the late
   stage of HCC using established bioinformatic tools. Two transcriptome
   datasets were selected from the published database (TCGA, ICGC) for
   analysis. After identifying DEGs (Differential expressed genes) using
   limma, we conducted GO and GSEA analyses on these DEGs. Notably, we
   identified a candidate diagnostic biomarker, SLC6A8, which was the
   intersection of genes selected by LASSO, SVM-REF and RF-Boruta methods
   and validated it using another dataset. Additionally, we investigated
   changes in immune cell composition between early-stage and late-stage
   HCC samples using ImmuCellAI (Immune Cell Abundance Identifier) to
   analyze immune cell infiltration. Furthermore, we linked immune cell
   infiltration to a potential HCC diagnostic biomarker, validated through
   single-cell data and in vitro experiments. Overall, we identified a
   novel gene that may guide future clinical management of HCC patients.

Materials and method

Patient samples

   Frozen HCC tissues were randomly obtained with informed consent from
   patients who underwent radical resections in the Second Affiliated
   Hospital of Harbin Medical University. After selection, 32 HCC patients
   were enrolled in this study and informed consent was obtained from all
   patients. Ethical consent was granted from the Ethics Committee of the
   second Affiliated Hospital of Harbin Medical University, Harbin, China.
   Patients enrolled in our study received neither radiation therapy nor
   chemotherapy prior to surgery.

Data collection and download

   Standardized RNA-Seq reads (Release 28) of LIHC-US and LIRI-JP projects
   were obtained from The Cancer Genome Atlas (TCGA,
   [44]https://cancergenome.nih.gov/) and International Cancer Genome
   Collaboration (ICGC, [45]https://dcc.icgc.org/). [46]GSE14520, the
   validation dataset with chip data was downloaded from GEO database
   ([47]https://www.ncbi.nlm.nih.gov/geo/). Only patients with tumor stage
   and samples from the primary site were included in the analysis. For
   LIHC-US (TCGA), 231 patients were retained. For LIRI-JP (ICGC), 344
   patients were met the criterion. Finally, we extracted 218 patients
   from the [48]GSE14520. The HCC scRNA-seq dataset [49]GSE149614 was load
   from the GEO database ([50]https://www.ncbi.nlm.nih.gov/geo/) and
   included 10 patients, only the primary tumor samples including 3 TNM
   stage I, 1 TNM stage II, 2 TNM stage III and 4 TNM stage IV samples
   will be taken for downstream analysis.

   According to the TNM, HCC disease was divided into four stages: stage
   I, stage II, stage III, and stage IV. In the following analysis, we
   defined stage I and stage II as HCC early-stage, stage III and stage IV
   as HCC late-stage. A merge data cohort comprising of LIHC-US and
   LIRI-JP datasets through “limma” and “snm” R packages, and the
   [51]GSE14520 was as the validation dataset to confirm the analysis
   results. The [52]GSE149614, an HCC scRNA-seq dataset, comprising of 4
   HCC early-stage and 6 HCC early-stage patients was also taken as the
   validation dataset to investigate the results at single-cell level. The
   clinical information of these cohorts involved in this study had
   presented in Supplemental Table 1.

Data integration

   The LIHC-US and LIRI-JP datasets were merged and processed using the R
   packages limma (version 3.58.11) and SNM (version 1.50.0) for batch
   effect correction. First, the voom function from limma was applied to
   the raw count data (dge object) with parameters design = covDesignNorm
   and normalize.method = "none" to stabilize variance. Subsequently, the
   SNM algorithm was implemented with raw.dat = vdge$E (voom-transformed
   expression matrix), adjusting for biological variables (bio.var) while
   removing technical confounding factors (adj.var) using rm.adj = TRUE.
   Diagnostic outputs were enabled via verbose = TRUE and diagnose = TRUE
   in Supplemental Fig. 1D [[53]17, [54]18]. What’s more, the results of
   other approaches used to remove the batch effect in the different
   dataset had showed in Supplemental Figure 1A–C. Principal component
   analysis (PCA) shows the samples distribution before and after batch
   correction.

Differential expression genes

   DEGs form the merged dataset were with the cutoff criteria of
   |logFC|> 0.5, adjusted-pvalue < 0.05 by using “limma” (version 3.58.11)
   R package. A linear model was then fitted with lmFit and differential
   expression was assessed using the eBayes function.

Functional enrichment analysis

   Gene ontology (GO) enrichment analysis was performed to investigate the
   DEG biological significant by using the clusterProfiler (version
   4.10.1) R package [[55]19]. The Molecular Signatures Database (MSigDB)
   Hallmark Gene Sets and Kyoto Encyclopedia of Genes and Genomes (KEGG)
   Pathway were used for pathway enrichment analysis by using msigdbr
   (version 7.5.1) and massdatabase (version 1.0.10) R package. GSVA
   analysis was performed using the GSVA (version 1.50.1) R package
   [[56]20]. Both GSEA and GSVA were performed with the criteria of P-
   value < 0.05.

Immune cell infiltration

   The ImmuCellAI, a gene set signature-based method, is a deconvolution
   algorithm, which can estimate the abundance of 24 immune cell types
   [[57]21]. A single sample gene set enrichment analysis (ssGSEA)
   algorithm was applied to deconvolve bulk gene expression matrix. Immune
   cell infiltration (24 various cell types including 18 T-cell subsets)
   was precisely quantified in each gene expression profile. The detailed
   process was carried out by using ImmuneCellAI (version 0.1.0) R
   package.

Potential biomarkers selection

   To find the potential prognostic gene biomarkers from the DEGs, we
   screened them by three machine-learning algorithms. (1) least absolute
   shrinkage and selection operator with logistic regression (LASSO-LR:
   family = "binomial", keep = TRUE, type.measure = "mse", nfolds = 10,
   alpha = 1, nlambda = 100), (2) support vector machine-recursive feature
   elimination (SVM-RFE: sizes = seq(1, (ncol(X_train)—1), 20),
   metric = "RMSE", rfeControl = rfeControl(functions = rfFuncs,
   method = "cv", number = 5)) algorithms, (3) Boruta with random forest
   (Brouta-RF: method = "repeatedcv", number = 10, repeats = 3,
   search = "random", classProbs = TRUE, verboseIter = TRUE).

   LASSO can identify genes significantly associated with different
   samples, which represents a regression analysis algorithm that applies
   regularization for variable selection using “glmnet” (version 4.1.7) R
   package [[58]22]. In this study, LASSO with logistic regression first
   reduced the dimensionality successfully (425 original genes to 20
   meaningful genes). Specifically, LASSO method with crossing validation
   using the mean of square error as cost function was performed, then
   they shrank into a few more important features according to mean of
   square error.

   SVM-RFE represents a widely used supervised machine-learning protocol
   for classification and regression, used to find the best variables by
   deleting SVM-generated eigenvectors. The “Caret” (version 6.0.94) R
   package via grid search method is employed to select hyperparameters
   for all classifiers using tenfold cross-validation for the training
   dataset. SVM-RFE can identify the diagnostic value of biomarkers with
   higher discriminative power using the “e1071” (version 1.7.13) R
   package.

   The Boruta algorithm is a wrapper built around the random forest
   classification algorithm. The Boruta (version 8.0.0) R package is to
   capture all the important, interesting features you might have in the
   dataset with respect to an outcome variable. The “Caret” (version
   6.0.94) R package via grid search method is employed to select
   hyperparameters of random forest for all classifiers using tenfold
   cross-validation for the training dataset. Brouta-RF can identify the
   diagnostic value of biomarkers with higher discriminative power using
   the “randomforest” (version 4.7–1.1) R package.

   Ultimately, we combined the overlapping genes among LASSO-LR, SVM-RFE
   and Brouta-RF algorithm for further analysis. A two-sided P
   value < 0.05 between the early stage and the late stage of HCC was
   statistically significant on these genes. Then, we validated their
   expression level to estimate their ability to be candidate diagnostic
   biomarkers in the [59]GSE14520 data.

ROC of diagnostic biomarker

   The nonparametric Wilcoxon rank-sum test was used to perform
   inter-group comparisons of continuous variables. The degree of efficacy
   of each diagnostic biomarker was assessed using receiver operating
   characteristic (ROC) curves by “pROC” (version 1.18.5) and
   “multipleROC” (version 0.1.1) R package, which is the gold standard to
   prove the diagnostic accuracy and test the efficacy of diagnostic
   biomarkers in the [60]GSE14520 cohort.

Single cell transcriptome data processing and analyzing

   [61]GSE149614 raw data was downloaded from GEO databases. In the
   process of single cell transcriptome data processing, we carried out
   normalization, scaling, clustering of cells and achieved 6 main cell
   types using Seurat (version 5.0.3) R package. Single cells were
   extracted with the criteria of nFeature_RNA > 250, percent.mt < 20%,
   percent. Log10GenesPerUMI > 0.8, nCount_RNA > 500 to removing doublet
   and dead cells by Seurat R package (version 5.0.3). Then we normalized
   the filtered gene-barcode matrices using “LogNormalize” method with the
   “NormalizeData” function. The top 2000 highly variable genes were found
   by the “FindVariableFeatures” function using the “vst” method which
   were centered and scaled using “ScaleData” before. Then we performed
   principal component analysis (PCA) based on these 2000 highly variable
   genes with the intention of dimensionality reduction, then
   dimensionality-reduced clusters were showed on the 2D map produced with
   the t-distributed t-SNE using function “FindNeighbors”, “FindClusters”
   and “runTSNE” from Seurat. Kruskal–Wallis test was used to estimate the
   difference of gene expression level.

Real-time quantitative reverse transcriptase-PCR

   Total RNA was extracted from liver cancer tissue using Trizol
   (Invitrogen, Carlsbad, America) reagent before cDNA was obtained using
   the Trans-Script All in-one First-strand cDNA Synthesis Supermix for
   qPCR Kit (TransGen Biotech, Beijing, China). Real-time quantitative PCR
   was performed in Step One ABI real-time. PCR System through SYBR Green
   Master (Roche, Basel, Switzerland). GAPDH, Forward primer:
   5’-CATGTTCGTCATGGGTGTGAA-3’, Reverse
   primer:5’-GGCATGGACTGTGGTCATGAG-3’. SLC6A8, Forward primer:
   5’-GGCAGCTACAACCGCTTCAACA-3’, Reverse
   primer:5’-CAGGATGGAGAAGACCACGAAG-3’.

Statistical analysis

   R (version 4.3.3) and Rstudio (version 2023.12.1 + 402) were used for
   statistical analysis. Wilcox-rank-sum test was carried out to analyze
   the significant differences of features between the early-stage and
   late-stage groups. Spearman correlation coefficient was used to
   identify the correlations between genes and immune cells. A
   P-value < 0.05 indicated statistical significance (*P < 0.05;
   **P < 0.01; ***P < 0.001).

Results

Identification of DEGs in the HCC early stage and late stage

   We performed differential expression analysis between 397 early-stage
   HCC samples and 178 late-stage HCC samples in the merge cohort (LIHC-US
   and LIRI-JP) by utilizing the “limma” R package. Figure [62]1 showed
   the schematic workflow of this study.

Fig. 1.

   [63]Fig. 1
   [64]Open in a new tab

   Schematic workflow of this study

   137 genes were significantly enriched in the early-stage group and 291
   genes significantly enriched in the late-stage group (Fig. [65]2A,
   |logFC|> 0.5 and AdjustedPvalue < 0.05, Supplemental Table 2). The
   heatmap of DEGs also had been presented in Fig. [66]2B.

Fig. 2.

   [67]Fig. 2
   [68]Open in a new tab

   DEGs between the early-stage group and the late-stage group in HCC
   merge dataset (LIHC-US and LIRI-JP). A Volcano of DEGs. Dark green dots
   represent enriched-in-early-stage genes, light purple dots represent
   enriched-in-late-stage genes and grey dots represent non-significant
   genes. B Heatmap of DEGs. The colors of column annotations represent
   the different groups

Functional analysis of DEGs by GO and KEGG enrichment analysis

   To investigate the potential biological significances of DEG, we used
   clusterProfiler and GSVA R package to characterize the GO and KEGG
   pathway. The GO analysis showed that EEGs involved with response to
   xenobiotic stimulus, steroid metabolic process, cytoplasmic vesicle
   lumen, high − density lipoprotein particle, endopeptidase regulator
   activity, steroid hydroxylase activity, and so on (Fig. [69]3A,
   Supplemental Table 3).

Fig. 3.

   [70]Fig. 3
   [71]Open in a new tab

   Functional analysis of DEGs between the early-stage group and the
   late-stage group in HCC merge cohort. A Dotplot of enriched GO terms in
   DEGs. B Dotplot of enriched KEGG pathway in DEGs. C GSEA results in HCC
   early-stage samples. D GSEA results in HCC late-stage samples

   Moreover, p53 signaling pathway, ECM − receptor interaction, ABC
   transporters, Steroid hormone biosynthesis, and PPAR signaling pathway
   were identified by GSEA (Fig. [72]3B, Supplemental Table 4). In
   particular, nitrogen metabolism, one carbon pool by folate, taurine and
   hypotaurine metabolism and tryptophan metabolism were mainly enriched
   in the HCC early stage (Fig. [73]3C). Inversely, cell cycle, mismatch
   repair, p53 signaling pathway, and sphingolipid metabolism were
   significantly enriched in the HCC late stage (Fig. [74]3D).
   Additionally, GSVA results showed ABC transporters was enriched in the
   HCC late stage (Supplemental Figure 2).

Significant changes between two stages in immune cells by ImmuneCellAI

   With the ImmuCellAI, a latest immune cell infiltration algorithm, we
   obtained the immune cell expression matrix from the whole gene
   expression matrix of the merge dataset (Fig. [75]4A). Not only Tfh
   cells and Th2 cells of the HCC early stage were significantly higher
   than the HCC late stage, but also CD8 naïve cells, Monocyte and
   Neutrophil were enriched in the HCC early stage. In the opposite, the
   HCC late stage showed a higher proportion of B cells compared to HCC
   early stage (Fig. [76]4B). The heatmap of immune cells also showed the
   tendency between two stages in samples (Fig. [77]4C). Furthermore, B
   cells were negatively correlated with most immune cells including CD8
   naïve cells, Tr1 cells and Monocyte (Fig. [78]4D). On the other hand,
   Tfh cells, Th2 cells, and Monocyte were positively correlated with CD8
   naïve cells (Fig. [79]4D).

Fig. 4.

   [80]Fig. 4
   [81]Open in a new tab

   Evaluation and visualization of immune cell infiltration between the
   HCC early stage and late stage in merge cohort. A Stacked barplot of
   infiltrating immune cells in the HCC early-stage and late-stage
   samples. B Boxplot of differential expression of 21 infiltrating immune
   cells between the HCC early stage and late stage in merge cohort.
   Wilcoxon rank-sum test was used for differential analysis. *P < 0.05;
   **P < 0.01. ns, non-significance (Dark green in the boxes represents
   the HCC early stage and light purple represents the HCC late stage). C
   Heatmap of 21 infiltrating immune cells. D Correlation matrix of 21
   immune cell infiltration between the HCC early stage and late stage in
   merge cohort. Red and blue indicate positive and negative correlations,
   respectively. The darker color shows the stronger the correlation

Potential gene biomarkers identified by multiple machine learning approaches

   To find the potential gene biomarkers to distinguish the early- and
   late-stage of HCC, we utilized three machine learning algorithms to
   identify the diagnostic markers from the DEGs. 20 potential DEGs were
   identified by the least absolute shrinkage and selection operator
   (LASSO) logistic regression algorithm (Supplemental Figure 3A). What’s
   more, we used the 20 biomarkers to build LASSO logistic regression
   diagnostic model, whose performance was pretty good (AUC = 0.732,
   Supplemental Figure 3B). Of the 20 biomarkers, 7 DEGs had positive
   coefficients and 13 DEGs had negative coefficients (Supplemental
   Figure 3C, Supplemental Table 5). The Boruta algorithm was used to
   classify 39 features from the DEGs (Supplemental Figure 3D). Moreover,
   the random forest diagnostic model based on the 39 biomarkers had also
   well performance (AUC = 0.704, Supplemental Figure 3E). SLC6A8 gene had
   the highest importance score (Mean decrease accuracy from random
   forest) in Supplemental Figure 3F (Supplemental Table 6). Another
   machine learning algorithm named the support vector machine-recursive
   feature elimination (SVM-RFE) to identify 41 features (Supplemental
   Figure 3G, Supplemental Table 7). and the performance of the diagnostic
   model for distinguish the HCC early stage and the HCC late stage was
   also well (AUC = 0.698, Supplemental Figure 3H).

   Next, 10 intersection biomarkers including SLC6A8, FTCD, CYP2C9,
   ANGPT2, ENO1, CNGA1, KCNJ15, SLC39A4, ETV1, and ACSL6 were extracted
   from abovementioned features (Fig. [82]5A). Sequentially, we validated
   the gene expression level of the 10 biomarkers in the validation
   [83]GSE14520 dataset and found only 6 genes (SLC6A8, FTCD, CYP2C9,
   ANGPT2, ENO1, CNGA1) had the consistent results with the merge dataset
   (Supplemental Figure 4A, B). 3 genes (FTCD, CYP2C9, CNGA1) were
   significantly enriched in the HCC early stage, while other 3 genes
   (SLC6A8, ANGPT2, ENO1) were significantly higher in the HCC late stage
   in both merge dataset and validation dataset (Supplemental Figure 4A,
   B).

Fig. 5.

   [84]Fig. 5
   [85]Open in a new tab

   Validation and ROC of SLC6A8 diagnostic marker identified from the
   merge datasets by multiple machine learning algorithms. A The
   overlapping genes among three machine learning methods. B, C The
   expression level of SLC6A8 diagnostic marker between the HCC early
   stage and the HCC late stage in the merge dataset and validation
   [86]GSE14520 dataset, respectively. Wilcoxon rank-sum test was used for
   differential analysis. D, E ROC validation of diagnostic validity of
   the SLC6A8 diagnostic marker in the merge dataset and validation
   [87]GSE14520 dataset, respectively (Sens: sensitivity; Spec:
   Specificity; PPV: Positive Predictive Value; NPV: Negative Predictive
   Value)

   To find the diagnostic makers from the 6 genes, we focused on the
   SLC6A8 gene which had higher expression in the HCC late stage
   (Fig. [88]5B, C) and performed the ROC analysis to check whether the
   SLC6A8 gene had well performance of diagnostic validity in both merge
   dataset. Finally, the AUC of ROC validation were 0.654 (Fig. [89]5D)
   and 0.701 (Fig. [90]5E) in the merge dataset and validation dataset,
   respectively, indicating the SLC6A8 gene had not bad performance as a
   diagnostic biomarker.

Correlation analysis between SLC6A8 and immune cells

   We used spearman correlation to characterize the association between 6
   genes (SLC6A8, FTCD, CYP2C9, ANGPT2, ENO1, CNGA1) and infiltrating
   immune cells. The gene expression of most these genes had strong
   correlation with the 21 immune cells (Fig. [91]6A). SLC6A8 was
   significantly associated with 15 immune cells including DC, Exhausted,
   Th17 and CD8 naïve. Especially, the DC and CD8 naïve had the highest
   correlation coefficient with SLC6A8 (Fig. [92]6B). SLC6A8 had
   significantly positive correlation with DC cells (r = 0.284,
   p = 3.9e−12, Fig. [93]6C) and other cells (Supplemental Figure 5A–I),
   and significantly negative correlation with CD8 naïve (r = − 0.22,
   p = 1e−07, Fig. [94]6D) and other cells (Supplemental Figure 5J–L).

Fig. 6.

   [95]Fig. 6
   [96]Open in a new tab

   Correlation between SLC6A8 and 21 immune cells. A Heatmap of
   correlation between 6 biomarkers (SLC6A8, FTCD, CYP2C9, ANGPT2, ENO1,
   CNGA1) and 21 immune cells. Color represents the spearman correlation
   coefficient. *P < 0.05; + P < 0.01. B Lollipop diagram of correlation
   between SLC6A8 and 22 immune cell infiltration. Circle color represents
   the p-value (red indicates significance) and circle size represents the
   absolute value of coefficient. C, D Scatter plots of correlation
   between SLC6A8 and immune cells whose had the highest correlation
   coefficients (DC and CD8_naive). Dark green represents the HCC
   early-stage samples and light purple represents the HCC late-stage
   samples

   Furthermore, we explore the associations between SLC6A8 and marker
   genes of immune cells as reported by Miao, Ya-Ru, et al. (2020) in
   Advanced Science. Upon analysis, we found significant correlations
   between SLC6A8 and the majority of marker genes for dendritic cells
   (Supplemental Figure 6A). For instance, the chemokine receptor CXCR3,
   which exerts substantial immunological influence on cancer, was
   positively correlated with SLC6A8 (Supplemental Figure 6B). In
   contrast, the CYP4F3 gene showed a negative correlation (Supplemental
   Figure 6C). Similar observations were made in another immune cell type,
   CD8 naïve, where we noted genes that were both positively and
   negatively correlated with the SLC6A8 gene (Supplemental Figure 6D–F).

Expression level of SLC6A8 in single-cell transcriptomic data

   The cell types of [97]GSE149614 dataset were showed by using t-SNE
   based on the Seurat-class object with the 2000 highly variable genes
   after data processing including normalization, scaling, clustering and
   so on (Fig. [98]7A). The expression of individual cells and each cell
   types for SLC6A8 are displayed in the Fig. [99]7B, C, respectively.
   Interesting, Hepatocyte, which was the dominant cells in liver
   presented the highest expression level of SLC6A8 among 6 main cell
   types (Fig. [100]7C). Following, we separated cells according to the
   tumor stage of patients into two groups (HCC early stage and HCC late
   stage) to validate whether the HCC enhanced SLC6A8 gene expression,
   which had been discovery in the bulk-seq RNA dataset. As expected,
   SLC6A8 gene had significantly higher expression (p < 2e−16) in the HCC
   late stage compared to the HCC early stage in Hepatocyte (Fig. [101]7D)
   by Wilcoxon rank-sum test, which was consistent with the previous
   analysis.

Fig. 7.

   [102]Fig. 7
   [103]Open in a new tab

   SLC6A8 expression in single-cell [104]GSE149614 transcriptomic dataset.
   A The t-distributed stochastic neighbor embedding (t-SNE) plot of the 6
   identified main cell types in [105]GSE173896 dataset. B t-SNE map
   highlighting the expression of SLC6A8 gene. C Bubble plot showing the
   expression of the SLC6A8 related different cell types. Dot size
   represents the percent expressed, and color represents average
   expression. D Correlation of SLC6A8 with 6 types of main cell types
   between the HCC early stage and late stage in single cell transcriptome
   data. E SLC6A8 expression of early-stage and late-stage HCC patients.
   Wilcoxon rank-sum test was used for differential analysis. *P < 0.05

   Subsequently, we detected the expression of SLC6A8 in patients with
   early and late liver cancer. Compared to early-stage HCC patients, the
   expression of SLC6A8 was higher in late-stage HCC patients
   (Fig. [106]7E). This result indicated that the expression level of
   SLC6A8 was related to the classification of liver cancer.

Discussion

   HCC with high morbidity and mora rate has become a leading cause of
   cancer-related death worldwide [[107]23]. Furthermore, HCC was
   diagnosed at the late tumor stage, which missing the appropriate
   treatment options [[108]24]. Therefore, it is crucial for patient
   management to predict the disease at an early stage. Hence, the
   development of gene biomarkers that distinguish the early stage from
   the late stage of HCC is the primary goal of this study. To explore the
   potential biomarkers in the prediction of the early stage, we used
   multiple machine learning algorithms to compare the early stage and
   late stage of HCC based on the differentially expressed genes (DEGs).
   Apart from that, we also investigated the biological function of DEGs
   and immune cells between the early stage and the late stage.

   In this study, 137 down-regulated genes and 291 up-regulated genes were
   identified, and some of DEGs were enriched in p53 signaling pathway and
   tryptophan metabolism, which are cancer-related pathway [[109]25,
   [110]26]. Interestingly, p53 signaling pathway with p53 protein
   mutations resulting in uncontrolled cell proliferation and cancer
   tumors [[111]27], was enriched in the late stage of HCC, suggesting
   that there was more the loss of tumor-suppressing function at the late
   stage of HCC. Inversely, tryptophan metabolism promoting tumor cell
   intrinsic malignant properties as well as restricts antitumor immunity
   [[112]28], was enriched in the early stage of HCC, indicating that the
   immune system had initially changed in the early stage. Moreover, the
   immune infiltration analysis also found that the immunity between two
   stages was significantly different, such as Tfh (T follicular helper)
   cells, providing essential help to B cells for effective
   antibody-mediated immune responses [[113]29], and Th2, facilitating
   tissue repair [[114]30] were both enriched in the early stage.

   Ten DEGs from the intersection of genes i.e. SLC6A8, FTCD, CYP2C9,
   ANGPT2, ENO1, CNGA1, KCNJ15, SLC39A4, ETV1, and ACSL6 were screened by
   using LASSO analysis, RF-Boruta and SVM-REF. Previously, different
   reports uncover their involvement in a variety of malignancies
   [[115]31–[116]35]. From comparing their expression value between two
   stages of HCC in the merge and validation cohort, we found SLC6A8,
   ANGPT2 and ENO1 were observed as up-regulated genes in the late stage.
   Then, we used ROC curves to assess the efficacy of SLC6A8 on the
   diagnosis of HCC. The AUC of SLC6A8 were 0.654 and 0.701 in the merge
   cohort and the validation cohort, suggesting that SLC6A8 is a potential
   diagnostic biomarker to distinguish the early stage and the late stage
   of HCC. Especially, it has been shown that SLC6A8 is associated with
   the initiation and progression of human cancers [[117]36]. Evidently,
   another study also demonstrated that SLC6A8 knockdown suppresses the
   invasion and migration of HCC [[118]37]. Compared to established HCC
   biomarkers such as AFP (alpha-fetoprotein) and DCP (des-gamma-carboxy
   prothrombin), which exhibit poor detection in early-stage HCC
   [[119]38], SLC6A8 showed higher performance. Clinically, integrating
   SLC6A8 detection into existing AFP-based diagnostic panels may improve
   early detection rates, particularly in AFP-negative patients [[120]38].
   Future studies should prioritize developing SLC6A8-targeted assays for
   serum or imaging-based validation.

   There was different immune capacity between two stages of HCC, but
   whether the immunity associated with the SLC6A8 is still unknown. To
   further analyze these relationships, we used spearman correlation
   coefficient and found that the up-regulated gene SLC6A8 was positively
   associated with DC (dendritic cells), which provide antigens and
   co-stimulatory signals to cells of the adaptive immune system
   [[121]39], and negatively associated with CD8-naïve cells (naïve
   CD8 + T cells), which selectively detect and eradicate cancer cells by
   targeting the antigens including tumor-specific neoantigens and
   self-antigens from tumors [[122]40]. In our study, other immune cells
   also associated with gene SLC6A8, inferring that SLC6A8 might affect
   the tumor progression by regulating the immune cells. In the
   literature, it has revealed the positive relationship between DC cells
   and SLC6A8-mediated creatine transport [[123]41]. Another recent study
   also demonstrated that SLC6A8 may be involved in the development of
   cancer by participating in the Notch signaling pathway, which playing
   important role in the specification of the immune cells in the NSCLC
   [[124]31]. Although the correlation coefficient in (Fig. [125]6C, D)
   appears modest (r = 0.284; r = − 0.22), the statistically significant
   p-value (p < 0.05) likely reflects the large sample size (n = 575),
   which increases statistical power to detect even small effects.
   However, the biological relevance of such weak associations requires
   cautious interpretation and further validation in independent cohorts.

   We also investigated whether the expression of gene SLC6A8 at
   single-cell transcript level and in vitro trial was the same as
   bulk-RNA sequencing level. Interestingly, six cell types including
   Hepatocyte which is the main functional cells of the live [[126]42]
   were identified in the liver tissue which was consistent with the
   previous studies [[127]43]. Furthermore, we found that the most
   important live cell Hepatocyte cells had higher values of SLC6A8 in the
   late stage of HCC. At the same time, the Myeloid cell also had slightly
   significantly higher values of SLC6A8 in the late stage of HCC.
   Additionally, the in vitro trial also showed the obviously significant
   expression changes of SLC6A8 between two groups. With the repeated
   verification of discovery on the SLC6A8, we believe the gene SLC6A8 may
   be the potential diagnostic marker. To validate the mechanistic role of
   SLC6A8 in HCC, future studies could employ CRISPR/Cas9-mediated
   knockout models to assess its impact on tumor proliferation and
   metastasis. Additionally, metabolomic profiling of SLC6A8-deficient HCC
   cells may reveal alterations in creatine-phosphate-ATP cycling, linking
   its function to metabolic reprogramming. In vivo validation using
   patient-derived xenografts (PDXs) treated with SLC6A8 inhibitors (e.g.,
   β-guanidinopropionic acid [[128]4]) could further establish its
   therapeutic potential.

   In conclusion, we demonstrated that gene SLC6A8 was significantly
   up-regulated in the late stage of HCC based on the transcriptomic data.
   In addition, gene SLC6A8 was associated with immune cell infiltration,
   which provides a potential target for more precise and personalized
   immunotherapy. Therefore, the gene SLC6A8 may be a potential diagnostic
   marker.

Conclusion

   In summary, we identified the DEGs by comparing the early stage and the
   late stage of HCC and found that they were associated with the
   cancer-related pathway and SLC6A8, a potential diagnostic biomarker for
   clinical diagnosis between the early stage and the late stage of HCC
   was verified not only in the single-cell transcriptomic data but also
   in vitro trial.

Limitation of the study

   We have identified potential diagnostic biomarker of the early stage
   and the late stage of HCC based on the transcriptomic data and verified
   it in the single-cell transcriptomic data and in vitro trial in our
   study. One of the limitations is that the sample size of our merge data
   may be not very large. But we have integrated the two most influential
   research (LIHC-US and LIRI-JP) with more than 100 samples into the
   merge dataset. Furthermore, this study only focused on biomarker
   exploration in the transcriptomic levels. The future study could
   develop the combination biomarkers from the genomic, epigenetic and
   metabolomic data. Furthermore, the future time-series studies may
   employ the methodologies provided by the review to conduct causal
   inferences [[129]44], thereby identifying the causal relationships
   between biomarkers and the progression of cancer. While the AUC scores
   (0.65 and 0.7) indicate moderate diagnostic accuracy, we acknowledge
   that further model optimization (e.g., hyperparameter tuning or
   ensemble methods) or integration of multimodal data features (e.g.,
   imaging or clinical variables) could enhance the robustness of the
   model in future studies. Another limitation is the heterogeneity of
   HCC. Future studies should expand to multi-ethnic cohorts and
   incorporate spatial transcriptomics to address tumor microenvironment
   heterogeneity.

Supplementary Information

   [130]12672_2025_2351_MOESM1_ESM.tiff^ (1.4MB, tiff)

   Supplementary materials 1: Figure. 1. PCA of multiple types of gene
   expression matrix for batch removal on merge data cohort (LIHC-US and
   LIRI-JP). A. PCA of raw gene expression matrix. B. PCA of adjusted gene
   expression matrix by ComBat function from sva (version 3.50.0) R
   package. C. PCA of adjusted gene expression matrix by removeBatchEffect
   function from limma (version 3.58.1) R package. D. PCA of adjusted gene
   expression matrix through VOOM and SNM function from limma (version
   3.58.11) and SNM (version 1.50.0) R package. Dark green represents the
   HCC early-stage samples and light purple represents the HCC late-stage
   samples.
   [131]12672_2025_2351_MOESM2_ESM.tiff^ (1.6MB, tiff)

   Supplementary materials 2: Figure 2. GSVA enrichment analysis of gene
   expression matrix between the HCC early stage and late stage in merge
   data.
   [132]12672_2025_2351_MOESM3_ESM.tiff^ (3.5MB, tiff)

   Supplementary materials 3: Figure 3. Machine learning algorithms to
   screen potential diagnostic markers from DEGs between the HCC early
   stage and late stage in merge data. A. LASSO logistic regression found
   20 biomarkers. B. The performance of LASSO logistic regression
   diagnostic model built by 20 biomarkers. C. The coefficient of 20
   biomarkers in LASSO logistic regression. D. 39 biomarkers were
   identified by Boruta and random forest. E. The performance of random
   forest diagnostic model built by 39 biomarkers. F. The importance
   scores of 39 biomarkers in random forest. G. 41 biomarkers were
   identified by Recursive Feature Elimination feature selection. H. The
   performance of SVM diagnostic model built by 41 biomarkers.
   [133]12672_2025_2351_MOESM4_ESM.tiff^ (1.7MB, tiff)

   Supplementary materials 4: Figure 4. 10 overlapping genes screened from
   machine learning algorithms were validated in the validation cohort. A.
   The expression level of overlapping genes SLC6A8, FTCD, CYP2C9, ANGPT2,
   ENO1, CNGA1, KCNJ15, SLC39A4, ETV1, and ACSL6 between the HCC early
   stage and the HCC late stage in the merge dataset, respectively. B. The
   expression level of overlapping genes SLC6A8, FTCD, CYP2C9, ANGPT2,
   ENO1, CNGA1, KCNJ15, SLC39A4, ETV1, and ACSL6 between the HCC early
   stage and the HCC late stage in the validation GSE14520 dataset,
   respectively.
   [134]12672_2025_2351_MOESM5_ESM.tiff^ (2.2MB, tiff)

   Supplementary materials 5: Figure 5. Correlation between SLC6A8 and 12
   immune cells. A-B. Scatter plots of correlation between SLC6A8
   expression and different immune cell contents, Exhausted, NKT, nTreg,
   CD4 naïve, Cytotoxic, CD8 T, B cell, Tr1, NK, Monocyte, Neutrophil, and
   Th17.
   [135]12672_2025_2351_MOESM6_ESM.tiff^ (1.7MB, tiff)

   Supplementary materials 6: Figure 6. Associations between SLC6A8 and
   marker genes of Immune cells from (Miao, Ya‐Ru, et al. 2020. Advanced
   science). Correlation between SLC6A8 and marker genes of dendritic
   cells (A), CXCR3 (B) and CYP4F3 (C). Correlation between SLC6A8 and
   marker genes of CD8 naïve (D), TKNS2 (E) and MOGAT2 (F). The
   correlation coefficient is calculated using spearman.
   [136]12672_2025_2351_MOESM7_ESM.xlsx^ (724.6KB, xlsx)

   Supplementary materials 7: Table 1: Clinicopathologic characteristics
   of patients. Supplemental Table 2: Differentially expressed genes
   between the early stage and the late stage in the merged Hepatocellular
   Carcinoma (HCC) cohort. Supplemental Table 3: Enriched Gene Ontology
   terms of Differentially expressed genes. Supplemental Table 4: Enriched
   KEGG pathway of differentially expressed genes by GSEA. Supplemental
   Table 5: Gene signature from feature selection by least absolute
   shrinkage and selection operator (LASSO) and logistic regression (LR).
   Supplemental Table 6: Gene markers from feature selection by Boruta and
   random forest (RF). Supplemental Table 7: Gene markers from feature
   selection by Recursive Feature Elimination (RFE) and support vector
   machine (SVM).
   [137]Supplementary materials 8^ (3.5MB, docx)

Acknowledgements