Abstract

Background

   Cancer-associated fibroblasts (CAFs) are key components of the
   hepatocellular carcinoma (HCC) tumor microenvironment (TME). regulating
   tumor proliferation, metastasis, therapy resistance, immune evasion via
   diverse mechanisms. A deeper understanding of the l diversity of CAFs
   is essential for predicting patient prognosis and guiding treatment
   strategies.

Methods

   We examined the diversity of CAFs in HCC by integrating single-cell,
   bulk, and spatial transcriptome analyses.

Results

   Using a training cohort of 88 HCC single-cell RNA sequencing
   (scRNA-seq) samples and a validation cohort of 94 samples, encompassing
   over 1.2 million cells, we classified three fibroblast subpopulations
   in HCC: HLA-DRB1 + CAF, MMP11 + CAF, and VEGFA + CAF based on highly
   expressed genes of which, which are primarily located in normal tissue,
   tumor boundaries, and tumor interiors, respectively. Cell trajectory
   analysis revealed that VEGFA + CAFs are at the terminal stage of
   differentiation, which, notably, is tumor-specific. VEGFA + CAFs were
   significantly associated with patient survival, and the hypoxic
   microenvironment was found to be a major factor inducing VEGFA + CAFs.
   Through cellular communication with capillary endothelial cells
   (CapECs), VEGFA + CAFs promoted intra-tumoral angiogenesis,
   facilitating tumor progression and metastasis. Additionally, a machine
   learning model developed using high-expression genes from VEGFA + CAFs
   demonstrated high accuracy in predicting prognosis and sorafenib
   response in HCC patients.

Conclusions

   We characterized three fibroblast subpopulations in HCC and revealed
   their distinct spatial distributions within the tumor. VEGFA + CAFs,
   which was induced by hypoxic TME, were associated with poorer
   prognosis, as they promote tumor angiogenesis through cellular
   communication with CapECs. Our findings provide novel insights and pave
   the way for individualized therapy in HCC patients.

Supplementary Information

   The online version contains supplementary material available at
   10.1186/s12967-025-06192-0.

   Keywords: Tumor-associated fibroblasts, Hepatocellular carcinoma,
   VEGFA + CAF, Hypoxic microenvironment, Angiogen esis

Background

   Liver cancer is the third leading cause of cancer-related deaths and
   the sixth most common cancer globally, with HCC accounting for 80–85%
   of all liver cancer cases [[30]1]. Despite surgical resection, the
   recurrence rate remains high, and patients with unresectable HCC have
   limited therapeutic options and poor prognosis [[31]2]. Over 80% of HCC
   cases are characterized by extensive hepatic fibrosis due to the
   activation, proliferation, and accumulation of fibroblasts [[32]3]. A
   prominent feature of the HCC tumor TME is the abundant presence of
   CAFs, which secrete various cytokines, chemokines, and growth factors
   that support cancer cell survival [[33]4]. CAF-derived factors not only
   promote cancer cell survival but also modify the immune landscape by
   suppressing immune effector cells and recruiting immunosuppressive
   cells, allowing cancer cells to evade immune surveillance [[34]5].
   Fibroblasts play a critical role in TME remodeling, influencing tumor
   growth, metastasis, treatment resistance, and immune evasion, often
   likened to the "soil" in which cancer “seeds” thrive [[35]6].
   Therefore, understanding the diversityof fibroblasts in the HCC TME is
   essential.

   Recent advances in scRNA-seq and spatial transcriptomics (ST) have
   overcome technical barriers to studying cellular heterogeneity in
   complex tissues like cancer [[36]7, [37]8], providing valuable insights
   into fibroblast diversity in HCC. CAFs in primary liver cancer have
   been classified into eight subtypes, with DAB2 + and
   SPP1 + tumor-associated macrophages (TAMs) shown to promote immune
   barrier formation by enhancing FAP + CAF function through TGF-β, PDGF,
   and ADM signaling [[38]9]. In HCC, six CAF subtypes have been
   identified based on gene expression and function, with POSTN + CAFs
   recruiting SPP1 + macrophages and elevating SPP1 expression through the
   IL-6/STAT3 pathway [[39]10]. Additionally, five CAF subpopulations have
   been classified, with CD36 + CAFs shown to recruit
   CD33 + myeloid-derived suppressor cells (MDSCs) in an MIF and CD74
   dependent manner, enhancing MDSC-mediated immunosuppression and
   promoting tumor growth [[40]11]. However, these studies did not account
   for the potential impact of mural cells that is composed of pericytes
   and smooth muscle cells in mesenchymal cells on fibroblast behavior,
   highlighting the need for standardized approaches to exclude mural cell
   effects for a comprehensive, high-resolution assessment of CAFs.

   In this study, we characterized three distinct fibroblast subtypes in
   HCC, revealing their functional diversity and spatial distribution.
   Inference of cell differentiation trajectories highlighted the critical
   role of the TME in driving fibroblast differentiation. Additionally, we
   analyzed the complex biological functions resulting from intercellular
   communication patterns, paving the way for the development of
   therapeutic approaches targeting CAF in HCC.

Methods

Data collection

   A total of 89 HCC-related samples from the Xue et al. study [[41]12]
   were used as the discovery cohort, including 10 adjacent liver (AL)
   samples and 79 HCC samples. The raw FASTQ data (BioProject ID:
   PRJCA007744) is available in the Chinese National Center for Biological
   Information (CNCB) ([42]https://www.cncb.ac.cn/). Additionally,
   scRNA-seq data for 94 HCC and AL samples were obtained as a validation
   cohort from the Gene Expression Omnibus (GEO) database ([43]GSE149614,
   [44]GSE151530, [45]GSE189903, [46]GSE202642, [47]GSE156625) [[48]11,
   [49]13–[50]16]. For ST, samples from five HCC patients—including
   adjacent tumor, tumor, and tumor margin regions—were sourced from Wu et
   al. [[51]17], totaling 15 slides. RNA-seq and microarray data were
   collected from The Cancer Genome Atlas (TCGA-LIHC, n = 424), GEO
   ([52]GSE14520, n = 488) [[53]18], the International Cancer Genome
   Consortium (ICGC-LIRI-JP, n = 389), and the National Omics Data
   Encyclopedia (NODE, OEP000321, n = 316).

Single-cell data processing

   Raw FASTQ data from the discovery cohort were filtered and aligned
   using Starsolo (v2.7.9a) with the human reference genome GRCh38. All
   validation cohort scRNA-seq data were directly merged after downloading
   from GEO. Given the large cell number in the training set, preliminary
   filtering, dimension reduction, and clustering were performed using
   scanpy (v1.6) [[54]19] in Python. Filtering criteria included: (1)
   500–6000 detected genes, (2) UMI counts between 1000 and 30,000 per
   cell, (3) mitochondrial gene content below 10%. We selected 2000 highly
   variable genes (HVGs) using highly_variable_genes, calculated the top
   50 principal components (PCs) via pca, regressed out mitochondrial gene
   effects, and scaled each gene to unit variance. Nearest neighbor graphs
   were constructed with the neighbors function, and clustering was
   performed using the louvain function (resolution = 0.1). Dimension
   reduction and visualization were achieved using uniform manifold
   approximation and projection (UMAP). For each clustered subpopulation,
   dimension reduction and clustering were further refined using Seurat
   (v4.3.0) in R. For the validation cohort, Seurat [[55]20] was used to
   analyze merged gene expression matrices with parameters consistent with
   those described above. To mitigate batch effects between samples and
   datasets, Harmony (v1.2.0) was applied to both discovery and validation
   cohorts.

Identification of fibroblasts

   Mesenchymal cells were characterized based on established markers (LUM,
   COL1A1, and RGS5). The expression levels of fibroblast and mural cell
   markers were calculated using the addmodulescore function. Fibroblasts
   were classified as cells with a fibroblast score at least 0.1 higher
   than the mural cell score, while mural cells were defined as cells with
   a mural cell score exceeding the fibroblast score by 0.1. Cells with a
   score difference of less than 0.1 were categorized as undetermined.

Functional enrichment analysis

   To understand the functional characteristics of the different cell
   subpopulations, we first identified highly expressed genes for each
   subpopulation using the FindAllMarkers function (adjusted P < 0.05,
   log2FC > 0.25). Functional enrichment analysis of the top 30 genes for
   each subpopulation was then performed using the GO and KEGG databases
   through the R package clusterProfiler (v4.10.0) [[56]21]. An FDR of
   less than 0.05 was considered significant. Additionally, the R package
   progeny (v1.24.0) [[57]22] was used to evaluate the function of 14
   pathways within the cells. For single-cell data, each cell was
   functionally scored using the addmodulescore function. For bulk RNA-seq
   data, GSVA (v1.52.3) [[58]23] was used to assign functional scores to
   samples based on predefined gene sets.

Cell preference analysis

   To assess the prevalence of distinct cell types across different
   tissues, we used the Ro/e, which represents the ratio of observed to
   expected cell counts. An Ro/e greater than 1 indicates enrichment of a
   specific cell cluster within a tissue, while an Ro/e less than 1
   indicates depletion. Additionally, the odds ratio (OR) algorithm was
   used to further analyze cell preference. Here, an OR greater than 1.5
   suggests a significant enrichment of cells in the specified sample
   category, while an OR less than 0.5 indicates substantial depletion.

Bulk deconvolution analysis

   To assess the infiltration of different cell subpopulations in bulk
   RNA-seq and microarray data, we estimated cell abundance in each sample
   using CIBERSORTx [[59]24]. This was conducted in two steps. First, due
   to CIBERSORTx limitations, we randomly sampled each cell type,
   reserving 500 cells per type to create a single-cell count expression
   matrix and derive a signature matrix. Next, the constructed signature
   matrices and bulk mixture files were used for CIBERSORTx analyses.

Survival analysis

   To link single-cell data with patient phenotypes from bulk sequencing,
   we utilized the R package Scissor (v2.0.0) [[60]25]. Scissor + cells
   were associated with poorer patient survival, while Scissor − cells
   correlated with better prognosis.

   Additionally, the bulk sequencing matrix was transformed into a
   cell-type abundance matrix using CIBERSORTx, which integrated
   single-cell and bulk sequencing data. The abundance of fibroblast
   subpopulations was used as a variable to assess the relationship
   between these subpopulations and patient survival. We performed
   univariate Cox regression analyses with the R package survival
   (v3.5.7), and Kaplan–Meier survival curves were generated for selected
   cell subgroups. Log-rank and Cox p-values of less than 0.05 indicated
   statistically significant associations. Optimal grouping thresholds and
   survival analyses for each cohort were calculated using the R package
   survminer (v0.4.9).

   To evaluate whether specific gene expression levels in fibroblast
   subpopulations could serve as a prognosis predictor, we applied the
   GSVA method to generate a score based on bulk sequencing gene sets and
   assessed its correlation with patient survival.

CAFs differentiation trajectory analysis

   To investigate potential cell differentiation trajectory, we used the R
   packages Monocle2 (v2.26.0) [[61]26] and slingshot (v2.7.0) [[62]27]
   for cell trajectory analysis. The fitGAM function in slingshot
   identified genes associated with pseudotime, enabling us to track gene
   expression changes over the inferred differentiation timeline.

Spatial transcriptomics (ST)analysis

   We used the Seurat package for dimensionality reduction and clustering
   analyses on the downloaded ST data. Unlike single-cell analyses, ST
   data were not subjected to quality control. To assess the spatial
   distribution of fibroblast subpopulations, we scored highly expressed
   genes of CAFs using the addmodulescore function. The spatial
   distribution of these subpopulations was further validated by
   integrating single-cell transcriptomic and ST data with the R package
   celltrek (v0.0.94) [[63]28], which generated sparse plots via a random
   forest model to simulate CAF distribution.

   For spatial colocalization analysis of VEGFA + CAF and CapEc, we first
   scored the top 50 highly expressed genes of these cell types for each
   Visium sample using the AddModuleScore function, estimating their
   abundance at each spot. Spatial coordinates were extracted, scaled, and
   distances to the six nearest spots were calculated using the knn
   function from the dbscan package (v1.1-12). Spots scoring above the
   75th percentile for VEGFA + CAF were designated as starting points,
   while those above the 75th percentile for CapEc were defined as
   endpoints. Colocalized spots were identified when the distance between
   starting and endpoint spots was less than six. For each starting spot,
   the CapEc abundance of colocalized spots was normalized to obtain a
   standardized neighbor enrichment score.

Cell communication analysis

   Cellular communication between fibroblasts and endothelial cells was
   initially analyzed using CellChat (v2.1.2) [[64]29]. First, the overall
   communication strength and frequency between all endothelial cells and
   fibroblast subpopulations were calculated. Key cell types interacting
   with VEGFA + CAFs were identified, along with the main senders and
   receivers within angiogenesis-related pathways.

   To infer specific signaling interactions between endothelial cells and
   fibroblasts and assess their impact on downstream target genes, we used
   the R package NicheNet (v2.1.0) [[65]30]. Here, fibroblasts served as
   signal senders, and endothelial cells as receivers. The top 30 ligands,
   receptors, and target genes ranked by aupr_corrected—were visualized in
   a heatmap. Functional enrichment of target and receiver genes was
   conducted using the clusterProfiler package, with GO, KEGG, databases
   as references. Pathways with a corrected p-value < 0.05 were deemed