Abstract Background Cancer-associated fibroblasts (CAFs) are key components of the hepatocellular carcinoma (HCC) tumor microenvironment (TME). regulating tumor proliferation, metastasis, therapy resistance, immune evasion via diverse mechanisms. A deeper understanding of the l diversity of CAFs is essential for predicting patient prognosis and guiding treatment strategies. Methods We examined the diversity of CAFs in HCC by integrating single-cell, bulk, and spatial transcriptome analyses. Results Using a training cohort of 88 HCC single-cell RNA sequencing (scRNA-seq) samples and a validation cohort of 94 samples, encompassing over 1.2 million cells, we classified three fibroblast subpopulations in HCC: HLA-DRB1 + CAF, MMP11 + CAF, and VEGFA + CAF based on highly expressed genes of which, which are primarily located in normal tissue, tumor boundaries, and tumor interiors, respectively. Cell trajectory analysis revealed that VEGFA + CAFs are at the terminal stage of differentiation, which, notably, is tumor-specific. VEGFA + CAFs were significantly associated with patient survival, and the hypoxic microenvironment was found to be a major factor inducing VEGFA + CAFs. Through cellular communication with capillary endothelial cells (CapECs), VEGFA + CAFs promoted intra-tumoral angiogenesis, facilitating tumor progression and metastasis. Additionally, a machine learning model developed using high-expression genes from VEGFA + CAFs demonstrated high accuracy in predicting prognosis and sorafenib response in HCC patients. Conclusions We characterized three fibroblast subpopulations in HCC and revealed their distinct spatial distributions within the tumor. VEGFA + CAFs, which was induced by hypoxic TME, were associated with poorer prognosis, as they promote tumor angiogenesis through cellular communication with CapECs. Our findings provide novel insights and pave the way for individualized therapy in HCC patients. Supplementary Information The online version contains supplementary material available at 10.1186/s12967-025-06192-0. Keywords: Tumor-associated fibroblasts, Hepatocellular carcinoma, VEGFA + CAF, Hypoxic microenvironment, Angiogen esis Background Liver cancer is the third leading cause of cancer-related deaths and the sixth most common cancer globally, with HCC accounting for 80–85% of all liver cancer cases [[30]1]. Despite surgical resection, the recurrence rate remains high, and patients with unresectable HCC have limited therapeutic options and poor prognosis [[31]2]. Over 80% of HCC cases are characterized by extensive hepatic fibrosis due to the activation, proliferation, and accumulation of fibroblasts [[32]3]. A prominent feature of the HCC tumor TME is the abundant presence of CAFs, which secrete various cytokines, chemokines, and growth factors that support cancer cell survival [[33]4]. CAF-derived factors not only promote cancer cell survival but also modify the immune landscape by suppressing immune effector cells and recruiting immunosuppressive cells, allowing cancer cells to evade immune surveillance [[34]5]. Fibroblasts play a critical role in TME remodeling, influencing tumor growth, metastasis, treatment resistance, and immune evasion, often likened to the "soil" in which cancer “seeds” thrive [[35]6]. Therefore, understanding the diversityof fibroblasts in the HCC TME is essential. Recent advances in scRNA-seq and spatial transcriptomics (ST) have overcome technical barriers to studying cellular heterogeneity in complex tissues like cancer [[36]7, [37]8], providing valuable insights into fibroblast diversity in HCC. CAFs in primary liver cancer have been classified into eight subtypes, with DAB2 + and SPP1 + tumor-associated macrophages (TAMs) shown to promote immune barrier formation by enhancing FAP + CAF function through TGF-β, PDGF, and ADM signaling [[38]9]. In HCC, six CAF subtypes have been identified based on gene expression and function, with POSTN + CAFs recruiting SPP1 + macrophages and elevating SPP1 expression through the IL-6/STAT3 pathway [[39]10]. Additionally, five CAF subpopulations have been classified, with CD36 + CAFs shown to recruit CD33 + myeloid-derived suppressor cells (MDSCs) in an MIF and CD74 dependent manner, enhancing MDSC-mediated immunosuppression and promoting tumor growth [[40]11]. However, these studies did not account for the potential impact of mural cells that is composed of pericytes and smooth muscle cells in mesenchymal cells on fibroblast behavior, highlighting the need for standardized approaches to exclude mural cell effects for a comprehensive, high-resolution assessment of CAFs. In this study, we characterized three distinct fibroblast subtypes in HCC, revealing their functional diversity and spatial distribution. Inference of cell differentiation trajectories highlighted the critical role of the TME in driving fibroblast differentiation. Additionally, we analyzed the complex biological functions resulting from intercellular communication patterns, paving the way for the development of therapeutic approaches targeting CAF in HCC. Methods Data collection A total of 89 HCC-related samples from the Xue et al. study [[41]12] were used as the discovery cohort, including 10 adjacent liver (AL) samples and 79 HCC samples. The raw FASTQ data (BioProject ID: PRJCA007744) is available in the Chinese National Center for Biological Information (CNCB) ([42]https://www.cncb.ac.cn/). Additionally, scRNA-seq data for 94 HCC and AL samples were obtained as a validation cohort from the Gene Expression Omnibus (GEO) database ([43]GSE149614, [44]GSE151530, [45]GSE189903, [46]GSE202642, [47]GSE156625) [[48]11, [49]13–[50]16]. For ST, samples from five HCC patients—including adjacent tumor, tumor, and tumor margin regions—were sourced from Wu et al. [[51]17], totaling 15 slides. RNA-seq and microarray data were collected from The Cancer Genome Atlas (TCGA-LIHC, n = 424), GEO ([52]GSE14520, n = 488) [[53]18], the International Cancer Genome Consortium (ICGC-LIRI-JP, n = 389), and the National Omics Data Encyclopedia (NODE, OEP000321, n = 316). Single-cell data processing Raw FASTQ data from the discovery cohort were filtered and aligned using Starsolo (v2.7.9a) with the human reference genome GRCh38. All validation cohort scRNA-seq data were directly merged after downloading from GEO. Given the large cell number in the training set, preliminary filtering, dimension reduction, and clustering were performed using scanpy (v1.6) [[54]19] in Python. Filtering criteria included: (1) 500–6000 detected genes, (2) UMI counts between 1000 and 30,000 per cell, (3) mitochondrial gene content below 10%. We selected 2000 highly variable genes (HVGs) using highly_variable_genes, calculated the top 50 principal components (PCs) via pca, regressed out mitochondrial gene effects, and scaled each gene to unit variance. Nearest neighbor graphs were constructed with the neighbors function, and clustering was performed using the louvain function (resolution = 0.1). Dimension reduction and visualization were achieved using uniform manifold approximation and projection (UMAP). For each clustered subpopulation, dimension reduction and clustering were further refined using Seurat (v4.3.0) in R. For the validation cohort, Seurat [[55]20] was used to analyze merged gene expression matrices with parameters consistent with those described above. To mitigate batch effects between samples and datasets, Harmony (v1.2.0) was applied to both discovery and validation cohorts. Identification of fibroblasts Mesenchymal cells were characterized based on established markers (LUM, COL1A1, and RGS5). The expression levels of fibroblast and mural cell markers were calculated using the addmodulescore function. Fibroblasts were classified as cells with a fibroblast score at least 0.1 higher than the mural cell score, while mural cells were defined as cells with a mural cell score exceeding the fibroblast score by 0.1. Cells with a score difference of less than 0.1 were categorized as undetermined. Functional enrichment analysis To understand the functional characteristics of the different cell subpopulations, we first identified highly expressed genes for each subpopulation using the FindAllMarkers function (adjusted P < 0.05, log2FC > 0.25). Functional enrichment analysis of the top 30 genes for each subpopulation was then performed using the GO and KEGG databases through the R package clusterProfiler (v4.10.0) [[56]21]. An FDR of less than 0.05 was considered significant. Additionally, the R package progeny (v1.24.0) [[57]22] was used to evaluate the function of 14 pathways within the cells. For single-cell data, each cell was functionally scored using the addmodulescore function. For bulk RNA-seq data, GSVA (v1.52.3) [[58]23] was used to assign functional scores to samples based on predefined gene sets. Cell preference analysis To assess the prevalence of distinct cell types across different tissues, we used the Ro/e, which represents the ratio of observed to expected cell counts. An Ro/e greater than 1 indicates enrichment of a specific cell cluster within a tissue, while an Ro/e less than 1 indicates depletion. Additionally, the odds ratio (OR) algorithm was used to further analyze cell preference. Here, an OR greater than 1.5 suggests a significant enrichment of cells in the specified sample category, while an OR less than 0.5 indicates substantial depletion. Bulk deconvolution analysis To assess the infiltration of different cell subpopulations in bulk RNA-seq and microarray data, we estimated cell abundance in each sample using CIBERSORTx [[59]24]. This was conducted in two steps. First, due to CIBERSORTx limitations, we randomly sampled each cell type, reserving 500 cells per type to create a single-cell count expression matrix and derive a signature matrix. Next, the constructed signature matrices and bulk mixture files were used for CIBERSORTx analyses. Survival analysis To link single-cell data with patient phenotypes from bulk sequencing, we utilized the R package Scissor (v2.0.0) [[60]25]. Scissor + cells were associated with poorer patient survival, while Scissor − cells correlated with better prognosis. Additionally, the bulk sequencing matrix was transformed into a cell-type abundance matrix using CIBERSORTx, which integrated single-cell and bulk sequencing data. The abundance of fibroblast subpopulations was used as a variable to assess the relationship between these subpopulations and patient survival. We performed univariate Cox regression analyses with the R package survival (v3.5.7), and Kaplan–Meier survival curves were generated for selected cell subgroups. Log-rank and Cox p-values of less than 0.05 indicated statistically significant associations. Optimal grouping thresholds and survival analyses for each cohort were calculated using the R package survminer (v0.4.9). To evaluate whether specific gene expression levels in fibroblast subpopulations could serve as a prognosis predictor, we applied the GSVA method to generate a score based on bulk sequencing gene sets and assessed its correlation with patient survival. CAFs differentiation trajectory analysis To investigate potential cell differentiation trajectory, we used the R packages Monocle2 (v2.26.0) [[61]26] and slingshot (v2.7.0) [[62]27] for cell trajectory analysis. The fitGAM function in slingshot identified genes associated with pseudotime, enabling us to track gene expression changes over the inferred differentiation timeline. Spatial transcriptomics (ST)analysis We used the Seurat package for dimensionality reduction and clustering analyses on the downloaded ST data. Unlike single-cell analyses, ST data were not subjected to quality control. To assess the spatial distribution of fibroblast subpopulations, we scored highly expressed genes of CAFs using the addmodulescore function. The spatial distribution of these subpopulations was further validated by integrating single-cell transcriptomic and ST data with the R package celltrek (v0.0.94) [[63]28], which generated sparse plots via a random forest model to simulate CAF distribution. For spatial colocalization analysis of VEGFA + CAF and CapEc, we first scored the top 50 highly expressed genes of these cell types for each Visium sample using the AddModuleScore function, estimating their abundance at each spot. Spatial coordinates were extracted, scaled, and distances to the six nearest spots were calculated using the knn function from the dbscan package (v1.1-12). Spots scoring above the 75th percentile for VEGFA + CAF were designated as starting points, while those above the 75th percentile for CapEc were defined as endpoints. Colocalized spots were identified when the distance between starting and endpoint spots was less than six. For each starting spot, the CapEc abundance of colocalized spots was normalized to obtain a standardized neighbor enrichment score. Cell communication analysis Cellular communication between fibroblasts and endothelial cells was initially analyzed using CellChat (v2.1.2) [[64]29]. First, the overall communication strength and frequency between all endothelial cells and fibroblast subpopulations were calculated. Key cell types interacting with VEGFA + CAFs were identified, along with the main senders and receivers within angiogenesis-related pathways. To infer specific signaling interactions between endothelial cells and fibroblasts and assess their impact on downstream target genes, we used the R package NicheNet (v2.1.0) [[65]30]. Here, fibroblasts served as signal senders, and endothelial cells as receivers. The top 30 ligands, receptors, and target genes ranked by aupr_corrected—were visualized in a heatmap. Functional enrichment of target and receiver genes was conducted using the clusterProfiler package, with GO, KEGG, databases as references. Pathways with a corrected p-value < 0.05 were deemed