Abstract

   Pulmonary arterial hypertension (PAH) is a progressive cardiovascular
   disease characterized by elevated pulmonary arterial pressure, leading
   to right heart failure and death. Despite advancements in diagnosis and
   treatment, it remains incurable, and its mechanisms are poorly
   understood. This study aimed to integrate multi-omics data analysis and
   machine learning techniques to uncover the molecular characteristics
   and subtypes of PAH, providing insights into precise diagnosis and
   therapeutic strategies. We employed consensus clustering to classify
   PAH patients into subgroups based on multi-omics data. Differential
   expression and enrichment analyses were conducted to identify key genes
   and pathways. Machine learning models were developed to predict PAH
   subtypes and assess their diagnostic performance. PAH patients were
   divided into two subgroups: C1 and C2. The C2 subgroup showed
   significantly upregulated hypoxia-related genes, indicating distinct
   pathogenic mechanisms. Key genes associated with hypoxia, immune
   regulation, and inflammation were identified, alongside enriched
   pathways such as TNF, IL-17, and HIF-1 in the C2 subgroup. Machine
   learning models achieved high accuracy (AUC > 0.85) in distinguishing
   hypoxia-associated subtypes, supporting their utility for precise
   diagnosis. Potential therapeutic targets were identified in the TNF and
   HIF-1 pathways. This study provides novel insights into PAH’s molecular
   subtypes and their distinct mechanisms, offering diagnostic tools and
   potential therapeutic targets for personalized treatment. Validation in
   larger cohorts and experimental studies is essential to confirm the
   identified biomarkers and pathways.

Supplementary Information

   The online version contains supplementary material available at
   10.1038/s41598-025-99025-5.

   Keywords: Pulmonary arterial hypertension, Hypoxia, Machine learning,
   Immune regulation and inflammation

   Subject terms: Computational biology and bioinformatics, Mechanisms of
   disease

Introduction

   PAH is a severe cardiopulmonary disease characterized by elevated
   pulmonary arterial pressure, leading to right heart failure and
   death^[32]1. The central pathological feature of PAH is pulmonary
   vascular remodeling, involving endothelial and smooth muscle cell
   (SMCs) proliferation, vasoconstriction, and increased vascular
   resistance, which overloads the right ventricle^[33]2,[34]3. Despite
   advances in diagnosis and treatment, the disease remains incurable,
   underscoring the need to identify key pathological mechanisms for
   targeted interventions.

   Hypoxia has been demonstrated to be closely associated with the
   progression of PAH. It disrupts endothelial barrier integrity, induces
   apoptosis, and promotes vascular remodeling^[35]4,[36]5. These changes
   enhance SMCs proliferation, vessel wall thickening, and abnormal
   neovascularization, exacerbating pulmonary vascular
   resistance^[37]5–[38]7. Key hypoxia-regulated pathways include
   hypoxia-inducible factor-1 (HIF-1), which drives cell proliferation,
   angiogenesis, and metabolic shifts, and the tumor necrosis factor (TNF)
   pathway, which intensifies inflammation and endothelial
   damage^[39]8,[40]9. Understanding the role of hypoxia in PAH and
   elucidating the underlying mechanisms by which it drives disease
   progression are of crucial importance.

   With the advancement of sequencing technology, research has gradually
   shifted from Bulk RNA sequencing (Bulk RNA-seq) to Single-cell
   sequencing (scRNA-seq). This technological transformation enables
   researchers to explore microenvironment in diseases at the single-cell
   level^[41]10,[42]11. Moreover, machine learning enables the
   construction of more robust predictive models to identify key genes,
   thereby enhancing our ability to pinpoint critical molecular drivers in
   PAH pathogenesis. This study aims to systematically explore the genes
   and pathways driving PAH using computational and machine learning
   approaches. By analyzing large-scale transcriptomic and single-cell
   data, we identify feature genes and construct classification models to
   predict disease onset and identify biomarkers for early diagnosis.
   Comparative algorithm analysis highlights key genes and pathways
   central to PAH progression, providing insights into personalized
   treatment. Furthermore, identifying aberrantly expressed genes offers
   potential therapeutic targets, addressing the limited efficacy of
   current treatments and improving clinical outcomes for PAH patients.

Methods

Data collection

   We integrated three publicly available datasets from the GEO database
   ([43]https://www.ncbi.nlm.nih.gov/gds/) for the analysis of PAH:
   [44]GSE117261 (58 PAH patients and 25 controls), [45]GSE113439 (15 PAH
   patients and 11 controls), and [46]GSE53408 (12 PAH patients and 11
   controls), totaling 85 PAH patients and 47 controls. Additionally, we
   downloaded single-cell samples and their corresponding normal controls
   from [47]GSE210248 and [48]GSE228643. [49]GSE210248 includes 3 PAH
   samples and 3 controls, while [50]GSE228643 includes 4 samples of
   pulmonary fibrosis-associated PAH and 2 control samples (Table [51]1).
   Furthermore, we obtained 493 hypoxia-related genes from the Molecular
   Signatures Database
   ([52]https://www.gsea-msigdb.org/gsea/msigdb/index.jsp) (Supplementary
   Table 1).

Table 1.

   RNA sequence data information.
   Platform      Samples                                       Sequencing type
   [53]GSE117261 IPAH:31; APAH:18; HPAH:5; Other:4; Control:25 Bulk-RNAseq
   [54]GSE113439 IPAH:6; CTD:4; CHD:4; Control:11              Bulk-RNAseq
   [55]GSE53408  IPAH:12; Control:11                           Bulk-RNAseq
   [56]GSE210248 IPAH:3; Control:3                             scRNAseq
   [57]GSE228643 PHPF:3; Control:2                             scRNAseq
   [58]Open in a new tab

   IPAH idiopathic pulmonary arterial hypertension, APAH associated
   pulmonary arterial hypertension, HPAH hereditary pulmonary arterial
   hypertension, CTD connective tissue disease, CHD congenital heart
   disease.

   It is important to note that, following the international
   classification of pulmonary arterial hypertension (2022 ESC/ERS
   Guidelines for the Diagnosis and Treatment of Pulmonary
   Hypertension)^[59]12, most Bulk-RNAseq samples included in our analysis
   are classified as PAH (Group 1 PH), ensuring consistency in sample
   selection and analysis. Our study investigates the role of hypoxia in
   the onset and progression of PAH; therefore, we included PHPF (PH
   induced by chronic lung diseases and hypoxia; Group 3 PH) in the
   scRNAseq analysis to validate the correlation between key genes and
   cellular hypoxic phenotypes.

Microarray data processing, batch effect correction, and principal component
analysis

   In this study, we downloaded three microarray datasets from the GEO
   database ([60]GSE113439, [61]GSE117261, and [62]GSE53408) and
   integrated their gene expression matrices and clinical information. To
   address potential technical biases between datasets, we applied the
   combat function from the ‘sva’ package for batch effect correction,
   ensuring the reliability of the analysis results. The processed
   expression data were then normalized using the ‘limma’ package to
   further reduce technical variations across samples. To assess the
   similarity between samples, we performed principal component analysis
   (PCA). PCA helped identify the main patterns of variation and visually
   represented the distribution of samples from different groups (e.g.,
   PAH and control), highlighting the gene expression differences between
   groups. For a deeper investigation of gene expression patterns, we
   extracted the 1,000 genes with the highest standard deviation and
   generated a gene expression heatmap.

Consensus clustering-based PAH subtype identification and analysis

   In this study, we performed consensus clustering analysis on PAH
   samples to identify potential subtypes. First, we extracted the PAH
   group samples from the pre-processed expression matrix and, in
   combination with an externally imported list of hypoxia-related genes,
   selected genes associated with the hypoxic response for further
   analysis. Based on this, we applied the ‘ConsensusClusterPlus’ package
   for consensus clustering, setting the maximum number of clusters to 9.
   We performed 500 iterations with an 80% sample extraction ratio
   (pItem = 0.8) to enhance the stability of the clustering results. To
   assess the reliability of the clustering, we used Euclidean distance as
   the similarity measure and employed the K-means clustering algorithm.
   The resulting consensus matrix and internal clustering consistency
   results were saved in PDF format. After clustering, we calculated the
   Proportion of Ambiguous Clustering (PAC) value, which was used to
   evaluate the stability of clustering across different numbers of
   clusters. The optimal number of clusters was determined based on the
   minimum PAC value, which led to the selection of the most appropriate
   subtype number for PAH samples. Once the optimal number of clusters was
   identified, we extracted the classification results for those clusters
   and integrated them with clinical data. By verifying the matching of
   sample names, we ensured that the clustering results accurately
   corresponded with the clinical data. Finally, we generated a clinical
   dataset containing subtype information for subsequent differential
   analysis and biological interpretation.

Hypoxia-related gene set analysis based on GSVA

   We performed gene set variation analysis (GSVA) to evaluate the
   enrichment of hypoxia-related genes in PAH samples. First, a hypoxia
   gene set was constructed, and GSVA was applied to compute enrichment
   scores for each sample relative to the hypoxia gene set. The resulting
   hypoxia enrichment scores were then used to generate bar plots of the
   enrichment scores for different sample groups (control, c1, and c2),
   followed by statistical comparisons.

Differential gene analysis and functional enrichment in PAH subtypes

   In this study, we utilized differential gene analysis, GSVA, and GSEA
   to evaluate the gene expression characteristics of PAH subtypes.
   Differential gene analysis was performed using the ‘limma’ package to
   identify genes that were significantly differentially expressed between
   subtypes. The gene expression changes were visualized using volcano
   plots and heatmaps. Subsequently, we conducted Gene Ontology (GO) and
   Kyoto Encyclopedia of Genes and Genomes (KEGG) enrichment analysis
   using the ‘clusterProfiler’ package to uncover functional pathways
   associated with the differentially expressed genes^[63]13. GSVA was
   applied to perform enrichment analysis for the different subtypes, and
   GSEA was further employed to assess the functional enrichment of
   upregulated and downregulated genes between the two subtypes. Finally,
   the enrichment results were presented through multiple visualization
   methods, providing support for subsequent biological investigations.

Core gene identification in C2 subtype vs. control and C2 vs. C1 subtypes

   In this study, we extracted differentially expressed genes (DEGs) from
   the C2 subtype versus the control group and from the C2 subtype versus
   the C1 subtype, followed by an overlapping analysis of these DEGs.
   First, we imported DEG data for both the C2 subtype vs. control group
   (non-stable expression genes) and C2 vs. C1 subtype comparisons. Next,
   we used the intersection function to identify overlapping genes between
   the two gene sets, and these overlapping genes were saved as CSV files
   for further analysis. Subsequently, a Venn diagram was generated using
   the ‘ggvenn’ package to visualize the shared and unique DEGs between
   the C2 subtype, control group, and C1 subtype.

Classification model construction and evaluation using multiple machine
learning algorithms

   In this study, we employed a combination of 12 machine learning
   algorithms to construct and evaluate 113 classification models aimed at
   predicting the subtypes of PAH. First, we extracted common genes and
   samples from both the training and test datasets and performed data
   normalization to ensure the validity of the models. Within a
   cross-validation framework, feature selection and model construction
   were integrated. We utilized algorithms such as Lasso, Ridge, Elastic
   Net (Enet), Random Forest, Support Vector Machine (SVM), and XGBoost to
   select feature genes with predictive significance for classification.
   The model construction process was carried out in two steps: first,
   potential predictive variables were selected using feature selection
   algorithms to retain the most relevant gene set; then, classification
   models were built based on the selected genes. To ensure model
   robustness, a minimum number of variables was specified to prevent
   overly simplistic models that could lead to poor performance. During
   the model validation phase, we evaluated the trained models by
   calculating risk scores, predicting binary outcomes (0 or 1) for
   samples, and computing the area under the curve (AUC) values for the
   models on both the training and test datasets. These AUC values were
   used to assess the performance of the models in the classification
   task. To visually compare the performance of different models across
   datasets, heatmaps were generated to display the AUC results. The final
   results demonstrated that the models constructed using multiple
   algorithm combinations effectively predicted PAH subtypes, identifying
   the optimal model and its key feature genes. The optimal model
   exhibited high AUC values across multiple datasets, highlighting its
   superiority in PAH subtype classification.

Elastic net model construction

   Elastic Net is a regression model that integrates the advantages of
   Lasso (L1 regularization) and Ridge (L2 regularization), balancing the
   penalty on regression coefficients to reduce overfitting. Overfitting
   occurs when a model performs well on training data but poorly on test
   data, often due to excessive sensitivity to noise in the training data.
   Regularization helps mitigate this by reducing the model’s sensitivity
   to noise. Lasso performs feature selection by shrinking some
   coefficients to zero, while Ridge stabilizes the model by shrinking all
   coefficients, making it useful in cases of multicollinearity. In this
   study, we selected Elastic Net (α = 0.1) for variable selection, based
   on the optimal model derived from machine learning. We performed
   10-fold cross-validation to choose the optimal regularization parameter
   (λ) and identified a set of genes with non-zero regression coefficients
   for prediction. Using the “glmnet” package in R, we conducted Lasso
   regression analysis on hub genes, constructing the final prognostic
   risk score model. The risk score (RiskScore) was calculated using the
   following formula:
   graphic file with name d33e357.gif

   Where Coef represents the regression coefficient of the gene in the
   multivariate Cox regression analysis, and n is the total number of hub
   genes.

GeneMANIA protein–protein interaction network prediction

   GeneMANIA is a bioinformatics tool and database designed to predict
   gene functions and construct and visualize gene-gene interaction
   networks. It integrates multiple data sources to assist researchers in
   understanding the relationships between genes and inferring their
   functions. In this study, we input the identified differential genes
   into the GeneMANIA database to predict protein-protein interaction
   networks.

Clinical candidate drug prediction

   The Connectivity Map (CMap) database is a bioinformatics resource
   designed to elucidate the relationships between genes, diseases, and
   drugs. It systematically collects gene expression data following drug,
   gene, and small molecule interventions, enabling researchers to explore
   drug mechanisms, discover new drugs, or identify new uses for existing
   drugs. In this study, we obtained the upregulated DEGs from the
   comparison of C2 vs. C1, intersected them with the protein interaction
   network molecules identified through GeneMANIA analysis, and
   subsequently input the results into the CMap database for further
   analysis.

Single-cell RNA sequencing data analysis

   Single-cell RNA sequencing data were processed using the “Seurat”
   package. Quality control (QC) metrics, including the number of genes,
   number of transcripts, mitochondrial gene percentage, and ribosomal
   gene percentage, were visualized using violin plots (VlnPlot). The QC
   criteria applied were: nFeature_RNA > 200, nFeature_RNA < 5000, and
   percent.mt < 40. Correlations between these QC metrics were displayed
   using scatter plots (FeatureScatter). The data were then normalized
   using the LogNormalize method, and the top 2000 highly variable genes,
   exhibiting substantial expression variability, were identified for
   subsequent dimensionality reduction and clustering analysis.For
   dimensionality reduction, PCA was performed on the highly variable
   genes, followed by UMAP for visualization. To address potential batch
   effects, the Harmony algorithm was employed for batch correction, and
   the corrected data were visualized using UMAP. Clustering was optimized
   by constructing a clustering tree and adjusting different resolution
   parameters. Marker genes for each cluster were identified using the
   FindAllMarkers function, and the clustering results were visualized. To
   ensure accurate annotation, manual annotation was performed for each
   cluster to assign specific cell types, such as neutrophils, monocytes,
   and T cells. These annotated labels were visualized using UMAP,
   providing a clear representation of cell type distribution in the
   reduced dimensional space.

Monocle cell trajectory analysis

   To explore the dynamic changes of cells during development or disease
   progression, we performed cell trajectory analysis using Monocle 3.
   First, the normalized expression matrix, cell metadata, and highly
   variable genes were extracted from the Seurat object to construct a
   Monocle CellDataSet object. Using Monocle’s dimensionality reduction
   and ordering algorithms, we inferred the cell trajectory based on gene
   expression patterns and identified the differentiation pathways
   associated with different cell states. By visualizing the cell
   trajectory, we were able to intuitively illustrate the developmental
   process of cells over time or across different conditions, uncovering
   potential differentiation relationships and key transition points.

CellChat cell communication analysis

   To investigate intercellular communication patterns, we utilized the
   CellChat tool for analysis. First, the expression matrix and cell type
   information were extracted from the single-cell RNA sequencing data
   (e.g., Seurat object) to construct the CellChat object. CellChat then
   inferred the intercellular communication network based on a known
   ligand-receptor interaction database. By analyzing ligand-receptor
   pairs and communication pathways, we identified key signaling pathways
   between different cell types and visualized these communication
   networks, thereby revealing the complex interactions among cells and
   their potential biological implications.

Data analysis

   All analyses were performed using R version 4.3.0. A p-value of less
   than 0.05 was considered statistically significant.

Results

Principal component analysis, differential gene heatmap, and consensus
clustering reveal biological differences and subtypes in PAH patients and
control groups

   We analyzed gene expression data from PAH patients and control groups
   across three datasets: [64]GSE117261, [65]GSE113439, and [66]GSE53408.
   PCA was performed for dimensionality reduction, revealing a partial
   separation between PAH and control groups along the first two principal
   components, indicating potential biological differences (Fig. [67]1A).
   To further explore these differences, we identified the top 1,000
   differentially expressed genes and visualized them in a heatmap
   (Fig. [68]1B). The heatmap highlighted distinct genetic profiles
   between PAH and control samples, underscoring their molecular
   divergence. Consensus clustering of patient data, performed using the
   ConsensusClusterPlus package, identified K = 2 as the most stable
   clustering solution (Fig. [69]1C–E). This result suggests the presence
   of two distinct molecular subtypes within the PAH group, providing
   valuable insights into patient heterogeneity and potential avenues for
   targeted therapeutic strategies.

Fig. 1.

   [70]Fig. 1
   [71]Open in a new tab

   Clustering analysis of PAH patients based on integrated GEO datasets.
   (A) PCA plot showing the integration of three GEO datasets
   ([72]GSE117261, [73]GSE113439, and [74]GSE53408) containing data from
   PAH and control samples. The aim is to assess whether there is a
   biological difference between the two groups. (B) Heatmap of the top
   1000 most significantly differentially expressed genes, extracted from
   the integrated data. The heatmap highlights the differential expression
   between PAH and control samples, making these differences more visually
   prominent. (C–E) Consensus clustering analysis of PAH data using the
   ConsensusClusterPlus package, performed at different values of K. The
   most stable clustering module is observed when K = 2.

Identification of hypoxia-related subgroups in PAH and functional
differential analysis

   We identified hypoxia-related subgroups in PAH patients and analyzed
   gene and functional differences between hypoxic (c2) and non-hypoxic
   (c1) subgroups. Using 493 hypoxia-related genes from the MSigDB
   database, Gene Set Variation Analysis (GSVA) showed significantly
   higher expression of hypoxia-related genes in the c2 subgroup compared
   to c1 and normal controls (Fig. [75]2A). Differential gene analysis
   identified 459 upregulated and 250 downregulated genes in c2
   (Fig. [76]2B, C), with several upregulated genes closely linked to
   hypoxic responses. KEGG pathway enrichment analysis revealed
   significant involvement of TNF, IL-17, and HIF-1 pathways in c2,
   underscoring their roles in the hypoxic response (Fig. [77]2D). Based
   on these findings, we defined c2 as the hypoxia-related subgroup.

Fig. 2.

   [78]Fig. 2
   [79]Open in a new tab

   GSVA analysis of hypoxia-related subgroups and functional enrichment
   differences between hypoxia and non-hypoxia subgroups. (A) GSVA (Gene
   Set Variation Analysis) was performed using a hypoxia-related gene
   signature from MSigDB, comprising 493 genes. The analysis was conducted
   for c1 and c2 subgroups, as well as normal samples, to assess their
   hypoxia-related gene activity. The resulting GSVA scores were
   visualized in a bar plot, identifying c2 as the hypoxia-associated
   subgroup. (B) Differential gene expression analysis between c2 and c1
   subgroups was performed using the limma package (c2 vs. c1). A log fold
   change (logFC) threshold of 0.585 and a p-value cutoff of 0.05 were
   applied. (C) Volcano plot of differentially expressed genes between c2
   and c1 subgroups, with 459 upregulated genes (shown on the top) and 250
   downregulated genes. Genes with an adjusted p-value < 0.005 and an
   absolute logFC > 1.5 are indicated, highlighting several interesting
   genes. (D) Functional enrichment analysis of upregulated and
   downregulated genes. The enrichment of upregulated genes (shown in the
   plot) revealed multiple pathways potentially involved in disease
   progression, including TNF, IL-17, and HIF-1 signaling pathways. (E) GO
   (Gene Ontology) enrichment analysis of upregulated and downregulated
   genes, further revealing significant biological processes and cellular
   components associated with these gene sets.

   Gene Ontology (GO) functional enrichment analysis further highlighted
   enrichment of inflammation, oxygen response, and cell activation
   processes in c2 (Fig. [80]2E). These findings deepen our understanding
   of hypoxic mechanisms in PAH, offering insights into potential
   therapeutic targets.

Construction of clinical diagnostic model and gene selection

   Next, we constructed a clinical diagnostic model for the
   hypoxia-related subgroup. We first extracted the intersection of
   differentially expressed genes between the comparisons of c2 vs. c1 and
   c2 vs. control, resulting in a total of 475 genes (Fig. [81]3A). We
   then applied a combination of 12 algorithms with variable selection and
   prediction capabilities to model these genes, generating 113 model
   combinations in total. To ensure model simplicity, we set a minimum of
   5 variables per model, ultimately selecting 31 effective model
   combinations (Fig. [82]3B). Among these, the Elastic Net model with
   α = 0.1 performed best, showing excellent AUC values across different
   cohorts, particularly in the combat and [83]GSE254517 datasets, where
   the AUC reached 0.971 and 0.943, respectively.

Fig. 3.

   [84]Fig. 3
   [85]Open in a new tab

   Clinical diagnostic model construction: The intersection of
   differentially expressed genes (c2 vs. c1 and c2 vs. control) was
   extracted and applied to 113 combinations of 12 algorithms with
   variable selection and prediction methods for model construction. (A)
   Venn diagram of the differentially expressed genes. (B) The model was
   constrained by a minimum of five variables, resulting in 31 final
   combinations. The optimal model was achieved using an elastic net with
   alpha = 0.1. (C) Elastic net analysis identified a model comprising 16
   genes.

   Subsequently, we further analyzed the best-performing Elastic Net
   model. Through Lasso regression, we identified 16 genes closely
   associated with disease prediction (Fig. [86]3C). The formula for the
   risk score was calculated as follows:

   RiskScore = (0.6449) × VCAM1 + (− 0.0906) × TMEM204+(− 0.8461)× CCL2 +
   (− 0.3694) × HIGD1B + (− 1.1957) × HIF3A + (− 0.3488) × CYP1B1 +
   (0.0951) × ADAMTS9 + (0.0803) × LXN + (− 0.5070) × RPL22L1 + (0.3955) ×
   ETV5 + (1.3083) × AREG + (0.1288) × ITGA2 + (− 0.6339) × BTNL9 +
   (0.9003) × LIFR + (0.3237) × TFPI2 +(− 0.5212) × RGN (Fig. [87]3C).
   These genes form the core components of the diagnostic model, providing
   precise biological markers for clinical use.

Drug prediction and PPI network analysis

   We analyzed the protein-protein interaction (PPI) network of genes
   identified in our model using the GeneMANIA database, resulting in a
   network involving 36 genes (Fig. [88]4A). Intersection of these genes
   with upregulated differentially expressed genes from the c2 vs. c1
   comparison identified 14 core genes, including VCAM1, LXN, ITGA2, and
   AREG, which may play key roles in PAH pathology (Fig. [89]4B). These 14
   core genes were submitted to the CMap database for drug prediction
   analysis (Fig. [90]4C). Mechanism classification highlighted drugs
   related to phosphodiesterase inhibitors, endothelin receptor
   antagonists, calcium channel blockers, and prostacyclin analogs as
   significantly associated with the selected genes, suggesting their
   potential therapeutic relevance. CMap scoring ranked drugs by
   therapeutic potential, with lower-score drugs (marked in blue) showing
   greater promise (Fig. [91]4D), which may have been underexplored in
   clinical contexts, could provide additional insights for PAH treatment.

Fig. 4.

   [92]Fig. 4
   [93]Open in a new tab

   Drug prediction: (A) The model genes were input into the GeneMinia
   database to predict the protein-protein interaction (PPI) network. A
   PPI network of 36 genes was obtained. (B) The PPI network genes were
   intersected with the upregulated differentially expressed genes in C2
   vs. C1, resulting in 14 hub genes. (C) These 14 hub genes were
   subsequently input into the Cmap database for drug prediction.
   Phosphodiesterase inhibitors, endothelin receptor agonists, calcium
   ion-related drugs, and prostacyclin drugs were extracted and
   statistically analyzed. (D) Cmap assigns a score to each drug; the
   lower the score, the more likely the drug has therapeutic potential.

Analysis of quality control, cell classification, and gene expression
features in PAH and control samples from single-cell RNA sequencing data of
pulmonary arterial hypertension

   We analyzed the [94]GSE210248 dataset to perform quality control for
   PAH and control samples. The results (Supplementary Fig. 1A) showed
   that gene expression levels, RNA counts, and mitochondrial gene
   proportions in all samples met the analysis criteria. Based on the
   established thresholds, 22,704 cells were included in the analysis.
   Next, we performed cell classification (Supplementary Fig. 1B) and
   determined the optimal resolution value of 0.2, which divided the cells
   into 14 clusters (Fig. [95]5A, Supplementary Fig. 1B). Using relevant
   markers, the 14 clusters were classified into 10 cell types:
   Endothelium (VWF, TSPAN7), Epithelium (SFTPC, EPCAM, KRT19, KRT7),
   Fibroblast (DCN), Macrophage (CD14, C1QA, CD68), Mast Cells (TPSAB1),
   Monocyte (S100A8), NK Cells (NKG7), Smooth Muscle Cells/SMC (TAGLN,
   BGN), T Cells (CXCR4, CCL5), and T/NK Cells (NKG7, CXCR4, CCL5)
   (Fig. [96]5B, C). A bubble plot further illustrated the expression of
   marker genes in each cell type (Supplementary Fig. 1C), where bubble
   size and color indicate the expression intensity and the proportion of
   positive cells for each marker gene. Additionally, we generated
   heatmaps showing the top 10 highly expressed genes in each cell cluster
   (Supplementary Fig. 1D).

Fig. 5.

   [97]Fig. 5
   [98]Open in a new tab

   Cell subpopulation clustering and annotation: (A) The 21,139 cells were
   clustered into 14 groups at a resolution of 0.2. (B,C) Based on cell
   markers, the 14 clusters were classified into 9 cell types
   ([99]GSE210248). (D) The 22,704 cells were clustered into 13 groups at
   a resolution of 0.5. (E,F) Based on cell markers, the 13 clusters were
   classified into 9 cell types ([100]GSE228643).

Analysis of quality control, cell classification, and gene expression
features in PHPF and control samples from pulmonary arterial hypertension
with pulmonary fibrosis

   We analyzed the [101]GSE228643 dataset, starting with quality control
   for PHPF and control samples. QC results confirmed that gene expression
   levels, RNA counts, and mitochondrial gene proportions met analysis
   criteria, with 21,139 cells included after applying thresholds
   (Supplementary Fig. 2A). Cell classification was performed at an
   optimal resolution of 0.6, resulting in 15 distinct clusters
   (Fig. [102]5D, Supplementary Fig. 2B). A bubble plot illustrated marker
   gene expression for each cell type, with bubble size and color
   representing gene expression intensity and the proportion of positive
   cells, respectively (Supplementary Fig. 2C). Based on marker genes, the
   15 clusters were categorized into 9 cell types: Endothelium (VWF,
   TSPAN7), Epithelium (SFTPC, EPCAM, KRT19, KRT7), Fibroblast (DCN),
   Macrophage (CD14, C1QA, CD68), Mast Cells (TPSAB1), Monocyte (S100A8),
   NK Cells (NKG7), Smooth Muscle Cells/SMC (TAGLN, BGN), B Cells (MS4A1),
   and T/NK Cells (NKG7, CXCR4, CCL5) (Fig. [103]5E, F). We also generated
   heatmaps showing gene expression patterns across different clusters,
   which provide insights into the functional characteristics of the cell
   clusters (Supplementary Fig. 2D). These analyses support a deeper
   understanding of the differences between PHPF and CON samples and the
   cellular heterogeneity within the dataset.

Analysis of significant changes in SMC in disease groups and differential
expression of key genes

   We examined cell type distribution across tissues to investigate
   differences between disease and control groups. In the PAH dataset
   ([104]GSE210248), visualization revealed a significantly higher
   proportion of SMCs in the disease group (22.5%) compared to the control
   group (11.5%) (Fig. [105]6A, B). Similarly, in the PHPF dataset
   ([106]GSE228643), SMCs accounted for 28.5% in the disease group versus
   only 0.6% in controls (Fig. [107]6C, D), emphasizing the critical role
   of SMCs in disease progression.

Fig. 6.

   [108]Fig. 6
   [109]Open in a new tab

   Cell subpopulation clustering and annotation: (A) Distribution of
   different cells across tissues. (B) Proportion of different cell types
   across tissues ([110]GSE210248). (C) Distribution of different cells
   across tissues. (D) Proportion of different cell types across tissues
   ([111]GSE228643). (E) Expression of CYP1B1, ITGA2, RPL22L1, and VCAM1
   across different cell types and tissues ([112]GSE210248). (F)
   Expression of CYP1B1, ITGA2, RPL22L1, and VCAM1 across different cell
   types and tissues ([113]GSE228643).

   Next, we analyzed the expression of key genes (CYP1B1, ITGA2, RPL22L1,
   and VCAM1) across cell types and tissues (Fig. [114]6E, F). These genes
   were significantly upregulated in SMCs in both disease datasets,
   particularly in the disease groups. Notably, CYP1B1, implicated in SMC
   proliferation according to previous studies, likely plays a central
   role in driving SMC expansion in these conditions. These findings
   highlight the importance of SMCs and their associated genes in disease
   progression, offering valuable insights for future research and
   potential therapeutic interventions.

Analysis of four SMC clusters and exploration of their oxygen-related
properties and vascular remodeling mechanisms

   We re-clustered SMCs from the PAH single-cell dataset into four
   clusters (Fig. [115]7A) and analyzed their distribution across tissues.
   Cluster 2 showed a marked increase in disease-associated tissues,
   suggesting its critical role in disease progression (Fig. [116]7B).
   Using known marker genes, we classified the clusters into subtypes,
   identifying cluster 2 as closely associated with hypoxic conditions,
   indicating its significance under oxygen-deprived states
   (Fig. [117]7C). Gene expression analysis revealed elevated levels of
   CYP1B1, ITGA2, RPL22L1, and VCAM1 in cluster 2, aligning with previous
   studies on hypoxia-related changes, confirming cluster 2 as an
   oxygen-associated subtype (Fig. [118]7D). GO and KEGG enrichment
   analyses showed that cluster 2 is linked to collagen, ECM, AGE, growth
   factors, and the PI3K signaling pathway, highlighting its role in
   vascular remodeling (Fig. [119]7E). Pseudotime analysis suggested that
   cluster 2 may arise from cluster 1, which tends toward synthetic SMCs,
   reflecting dynamic changes and differentiation during disease
   progression (Fig. [120]7F). These findings establish cluster 2 as a
   pivotal oxygen-associated SMC subtype involved in PAH pathology and
   vascular remodeling.

Fig. 7.

   [121]Fig. 7
   [122]Open in a new tab

   SMC re-clustering and analysis: (A) The SMCs were re-clustered into 4
   groups. (B) Distribution and proportion of the 4 clusters across
   tissues, with emphasis on cluster 2. (C) Subclassification of clusters
   based on markers. (D) Expression of CYP1B1, ITGA2, RPL22L1, and VCAM1
   in the clusters. (E) GO and KEGG enrichment analysis. (F)
   Pseudotemporal ordering of the clusters.

Analysis of five SMC clusters and their relationship with synthetic phenotype
and vascular remodeling

   We re-clustered SMCs from the PHPF single-cell dataset into five
   clusters (Fig. [123]8A). Analysis of cluster distribution across
   tissues revealed that cluster 1 was predominant in diseased tissues,
   suggesting its pivotal role in the pathological process (Fig. [124]8B).
   Subtype classification using known marker genes identified cluster 1 as
   a synthetic SMC subtype, characterized by elevated expression of
   CYP1B1, ITGA2, RPL22L1, and VCAM1 (Fig. [125]8C). These findings align
   with its active involvement in matrix synthesis during disease
   progression (Fig. [126]8D). GO and KEGG enrichment analyses revealed
   that cluster 1 is significantly associated with collagen, ECM, AGE,
   growth factors, and PI3K signaling pathways, further supporting its
   role in vascular remodeling through matrix component synthesis
   (Fig. [127]8E). Pseudotime analysis indicated that cluster 1 may
   differentiate into cluster 2, with cluster 1 showing synthetic
   characteristics and cluster 2 exhibiting traits related to oxidation
   and fibroblast activity (Fig. [128]8F). These findings establish
   cluster 1 as a key synthetic SMC subtype involved in ECM synthesis and
   vascular remodeling during disease progression, with the potential to
   differentiate into other functional cell populations, highlighting its
   significance in the pathology of PHPF.

Fig. 8.

   [129]Fig. 8
   [130]Open in a new tab

   SMC re-clustering and analysis: (A) The SMCs were re-clustered into 5
   groups. (B) Distribution and proportion of the 5 clusters across
   tissues. (C) Subclassification of clusters based on markers. (D)
   Expression of CYP1B1, ITGA2, RPL22L1, and VCAM1 in the clusters. (E) GO
   and KEGG enrichment analysis. (F) Pseudotemporal ordering of the
   clusters.

Analysis of the interaction between COLLAGEN and FN1 signaling pathways in
PAH and PHPF samples and their role in immune activation and vascular
remodeling

   We analyzed the signaling pathway profiles in PAH, PHPF, and healthy
   control (CON) samples, highlighting key differences in signaling
   intensity and cell-type interactions. In PAH samples, overall signaling
   patterns revealed a significant enhancement of pathways such as
   COLLAGEN and FN1 compared to CON samples (Supplementary Fig. 3A).
   Analysis of signaling reception across cell types showed that immune
   cells and fibroblasts in PAH had stronger responses to these signals
   than other cell types (Supplementary Fig. 3B). Examination of the
   COLLAGEN signaling network indicated increased interactions among
   monocytes, macrophages, and SMCs in PAH (Supplementary Fig. 3C).
   Similarly, FN1 pathway analysis revealed heightened interactions
   between immune and stromal cells in PAH samples (Supplementary
   Fig. 3D). In PHPF samples, overall signaling and reception patterns
   also showed notably higher COLLAGEN and FN1 signaling compared to CON
   samples (Supplementary Fig. 3E, F). Network analysis of these pathways
   highlighted increased interactions between immune cells and fibroblasts
   in PHPF, with significantly elevated activity in COLLAGEN and FN1
   pathways (Supplementary Fig. 3G, H). These findings emphasize the
   critical roles of COLLAGEN and FN1 signaling in vascular remodeling and
   immune cell activation in both PAH and PHPF, underscoring their
   potential as key pathways in disease progression.

Comparison and analysis of intercellular interactions in PAH and healthy
control (CON) tissues

   We performed CellChat analysis to investigate intercellular
   interactions in PAH and healthy control (CON) tissues, uncovering key
   molecular processes underlying PAH pathology. The analysis revealed
   significantly increased immune cell interactions in PAH tissues,
   highlighting the pivotal role of immune cells in the disease
   (Supplementary Fig. 4A). Signaling strength analysis showed markedly
   higher interactions between fibroblasts and SMCs in PAH compared to
   CON, supporting their significant role in vascular remodeling
   (Supplementary Fig. 4B). Specific pathway analysis revealed enhanced
   COLLAGEN signaling in PAH, emphasizing its critical involvement in the
   remodeling process (Supplementary Figs. 3 A, B, 4 C). Receptor-ligand
   interaction analysis further demonstrated distinct alterations in
   COLLAGEN signaling in PAH tissues (Supplementary Figs. 3 C, D, 4D).
   Detailed analysis of COLLAGEN signaling interactions showed that
   monocytes predominantly initiate this signaling in PAH, linking immune
   cell activation to vascular remodeling (Supplementary Fig. 4E).
   Additionally, expression analysis revealed significantly higher levels
   of COLLAGEN and its ligands, including CD44, ITGA1, and ITGA10, in PAH.
   Monocytes exhibited high SDC4 expression, reinforcing their role as key
   drivers of COLLAGEN signaling in PAH (Supplementary Fig. 4F). These
   findings underscore the critical involvement of COLLAGEN signaling and
   immune cell activation in PAH, providing insights into potential
   therapeutic targets for vascular remodeling.

Comparison and analysis of intercellular interactions in PHPF and healthy
control (CON) tissues

   CellChat analysis of PHPF and healthy control (CON) tissues validated
   key findings regarding cell-cell interactions and signaling pathways.
   In PHPF tissues, interactions between immune cells (monocytes,
   macrophages) and SMCs were significantly enhanced compared to CON,
   suggesting these cells play critical roles in PHPF pathology
   (Supplementary Fig. 5A). The signaling intensity analysis showed
   stronger signals in monocytes, macrophages, and SMCs in PHPF,
   highlighting their involvement in vascular remodeling (Supplementary
   Fig. 5B). Detailed analysis of specific signaling pathways revealed
   that COLLAGEN signaling was markedly enhanced in PHPF tissues
   (Supplementary Figs. 3E, F, 5 C). Examination of receptor-ligand
   interactions in the COLLAGEN pathway identified monocytes as key
   drivers of this signaling in PHPF, linking immune cell activation to
   vascular remodeling (Supplementary Figs. 3G, H, 5D). Intensity analysis
   confirmed that monocyte-driven COLLAGEN signaling dominated in PHPF
   compared to CON (Supplementary Fig. 5E). Additionally, expression
   analysis of COLLAGEN signaling and its ligands showed significantly
   higher levels of COLLAGEN and its ligand CD44 in PHPF tissues. SDC4 was
   notably overexpressed in monocytes in PHPF, reinforcing the pivotal
   role of monocytes in initiating COLLAGEN signaling (Supplementary
   Fig. 5F). These findings emphasize the critical contribution of immune
   cells and COLLAGEN signaling to the pathology of PHPF.

Reclustering analysis of endothelial cells reveals functional heterogeneity

   To further investigate the heterogeneity of PAH-associated endothelial
   cells (ECs), we extracted ECs from the [131]GSE210248 dataset and
   performed dimensionality reduction and reclustering based on
   single-cell RNA sequencing (scRNA-seq) data. Following quality control
   and normalization, we utilized Seurat for UMAP dimensionality reduction
   and classified ECs into 13 subpopulations (Cluster 0–12) based on
   characteristic gene expression (Supplementary Fig. 6B). A heatmap of
   gene expression patterns (Supplementary Fig. 6A) illustrates the key
   marker genes of different EC subpopulations, revealing significant
   differences in gene expression profiles, suggesting that these
   subpopulations may serve distinct biological functions in PAH
   progression. Further analysis of cell proportion distribution
   (Supplementary Fig. 6C) showed that certain EC subpopulations were
   significantly enriched in the PAH group, while others decreased,
   indicating a potential PAH-induced remodeling of EC composition.
   Additionally, we identified a set of key genes and visualized their
   expression patterns across subpopulations using a bubble plot
   (Supplementary Fig. 6D). The results indicate that some subpopulations
   highly express VWF, CDH5, and PECAM1, suggesting they represent typical
   vascular endothelial cells, whereas others exhibit enriched expression
   of ACKR1, SELE, and ICAM1, indicating inflammatory characteristics.
   Moreover, certain EC subpopulations showed high expression of PLVAP and
   CXCL12, suggesting their potential roles in vascular permeability
   regulation and angiogenesis. Further gene expression analysis
   (Supplementary Fig. 6E) demonstrated distinct expression levels of
   specific genes (CYP1B1, ITGA2, RPL24L1, and VCAM1) among different EC
   subpopulations, further supporting the notion that ECs undergo
   functional differentiation during PAH progression.

Functional analysis of EC subpopulations

   To elucidate the biological functions of EC subpopulations, we
   performed Gene Ontology (GO) and Kyoto Encyclopedia of Genes and
   Genomes (KEGG) pathway enrichment analyses (Supplementary Fig. 7). GO
   analysis (Supplementary Fig. 7A) revealed that different EC
   subpopulations were primarily enriched in functional categories such as
   cell adhesion (cadherin binding), chemokine signaling (chemokine
   receptor activity), and cytoskeletal structure (structural constituent
   of cytoskeleton). KEGG analysis (Supplementary Fig. 7B) further
   indicated that these subpopulations were potentially involved in key
   pathways, including inflammatory signaling (TNF signaling pathway),
   vascular adhesion (Focal adhesion), and leukocyte transendothelial
   migration (Leukocyte transendothelial migration). Considering the
   alterations in EC subpopulation proportions between the PAH and control
   groups, the enrichment of these pathways suggests that PAH may
   influence vascular inflammatory responses, vascular permeability, and
   remodeling processes by modulating the composition and function of
   specific EC subpopulations.

Discussion

   In this study, we investigated the molecular features of PAH by
   integrating computational analyses and machine learning techniques. Key
   findings include the identification of hub genes in hypoxic PAH
   subgroups and the prediction of potential therapeutic drugs for these
   groups. Using single-cell RNA sequencing, we observed significant
   expression of hub genes in the hypoxic subgroup of SMCs. Additionally,
   cell-cell communication analysis revealed enhanced interactions among
   different cell subpopulations, providing insights into the cellular
   dynamics underlying PAH pathology.

   While there are significant differences in the pathological mechanisms
   between pulmonary hypertension associated with chronic lung diseases
   (Group 3 PH) and pulmonary arterial hypertension (Group 1 PAH), hypoxia
   contributes to disease progression in both contexts. In PH, hypoxia
   serves as the central driver of pathogenesis, whereas in PAH, hypoxia
   may synergize with other molecular pathways to accelerate vascular
   remodeling and disease advancement. Hypoxia-induced PAH represents a
   distinct form of PH compared to PH associated with chronic lung
   diseases. Notably, regional hypoxia within the pulmonary vasculature
   may drive disease pathogenesis even in the absence of systemic
   hypoxemia. Naeije et al. reported that approximately half of PAH
   patients exhibit decreased arterial oxygen tension (PaO[2])^[132]14. A
   retrospective cohort study also identified mixed venous oxygen tension
   (PvO₂) as a key prognostic factor in PAH^[133]15. Thompson et al.
   further suggested that progressive mixed venous hypoxemia may
   contribute to IPAH development, underscoring the potential role of
   regional hypoxia in PAH pathophysiology^[134]16. Furthermore, BMPR2 has
   been widely recognized as a central molecular factor in PAH. Studies
   have shown that BMPR-II deficiency leads to mitochondrial dysfunction,
   resulting in mitochondrial DNA damage and apoptosis, impairing distal
   pulmonary artery regeneration following hypoxia and
   reoxygenation^[135]17. Notably, primary cells derived from PAH patients
   with BMPR2 mutations exhibit increased susceptibility to
   mitochondria-induced apoptosis after hypoxia and reoxygenation^[136]17.
   Additionally, HIF-1α, a key regulator of hypoxic adaptation in
   pulmonary vascular cells, is upregulated in the lung tissue of PAH
   patients, further supporting the relevance of hypoxia-related
   mechanisms in PAH pathogenesis^[137]18–[138]21. In this study, the c2
   subgroup was closely associated with hypoxia, and KEGG functional
   enrichment analysis further confirmed its significant enrichment in the
   HIF-1 signaling pathway, which has been recognized as a key
   hypoxia-related pathway^[139]22,[140]23. Additionally, pathway
   enrichment analysis revealed significant activation of the TNF (tumor
   necrosis factor) and IL-17 (interleukin-17) signaling pathways in the
   c2 subgroup, highlighting the central role of the interplay between
   hypoxia and the immune-inflammatory network in PAH progression. These
   findings suggest that the pathogenesis of PAH extends beyond vascular
   remodeling to encompass metabolic disorders, immune dysregulation, and
   inflammatory responses, providing new directions for biomarker
   discovery and targeted therapy development. Next, we developed a
   diagnostic model for the hypoxia-associated PAH subtype and predicted
   its potential therapeutic drugs. The machine learning-based diagnostic
   model demonstrated excellent performance in both the training and
   validation datasets (AUC > 0.85), confirming its accuracy in
   identifying the hypoxic subtype. Additionally, CMap drug prediction
   analysis revealed the associations between existing therapies and the
   hypoxia-associated PAH subtype. Notably, we found that first-generation
   phosphodiesterase-V inhibitors, such as tadalafil and bosentan, did not
   exhibit significant therapeutic potential in the hypoxia-associated PAH
   subtype, whereas second-generation phosphodiesterase-V inhibitors
   showed greater potential for treatment. These advancements contribute
   to better diagnostic and treatment strategies for PAH.

   Next, in our single-cell analysis, we found a significant increase in
   the number of SMCs in PAH, suggesting that SMCs may be closely related
   to the occurrence and progression of PAH. Previous studies have
   reported that PFKFB3 promotes vascular remodeling in pulmonary
   hypertension by mediating collagen synthesis and proliferation of SMCs.
   This process occurs through increased glycolysis and lactate levels,
   leading to ERK1/2-dependent phosphorylation of calpain-2 and subsequent
   activation of calpain^[141]24, This abnormal metabolic pattern further
   promotes hypoxia. In addition, CircATP2B4 promotes hypoxia-induced SMC
   proliferation and migration through the miR-223/ATR axis, thereby
   accelerating PAH progression^[142]25. Consistent with these findings,
   our study identified a set of hub genes (CYP1B1, ITGA2, RPL22L1, and
   VCAM1) that were significantly upregulated in SMCs and hypoxia-related
   SMC clusters in two single-cell datasets. Among them, Cytochrome P450
   Family 1 Subfamily B Member 1 (CYP1B1) has been demonstrated to be
   closely associated with SMC pathology^[143]26. Furthermore, we explored
   the role of ECs in PAH. Previous studies have shown that phenotypic
   changes in ECs can lead to PAH, such as smoking-induced pulmonary EC
   apoptosis and dysfunction, which, along with hypoxia, promote PAH
   progression^[144]27,[145]28. Apoptotic ECs can also release TGF-β1 and
   VEGF, thereby inducing SMC proliferation and exacerbating PAH
   development. In our study, we similarly found that EC populations were
   enriched in chemokine secretion and inflammatory signaling pathways,
   which is consistent with previous findings. Notably, the role of the
   TGF-β signaling pathway in PAH has garnered increasing attention. For
   example, studies have shown that MED1 regulates the BMP/TGF-β signaling
   pathway in endothelial cells, thereby influencing PAH
   development^[146]10. dditionally, the miR-182/Myadm axis modulates the
   crosstalk between SMCs and ECs, balancing BMP and TGF-β signaling
   pathways to regulate hypoxia-induced PAH progression^[147]29.

   Although this study made significant discoveries through multi-omics
   data integration and advanced machine learning, several limitations
   remain. First, the limited sample size, despite using multiple
   datasets, poses challenges, especially for certain patient subtypes,
   potentially impacting model robustness and increasing overfitting risk,
   limiting generalizability to larger cohorts. Second, dataset
   heterogeneity, including differences in sample sources, processing
   methods, and sequencing platforms, may introduce variability in the
   results. Third, while the models demonstrated high accuracy and AUC
   values on internal datasets, their generalizability to external
   datasets remains unverified. Finally, the biological functions and
   mechanisms of key genes and pathways identified require further
   experimental validation to confirm their roles in PAH pathogenesis.

   Future research should focus on several aspects: First, validate the
   identified key genes through in vitro and in vivo experiments to
   confirm their roles in PAH pathogenesis. Second, apply and optimize
   machine learning models on larger, multi-center datasets to assess
   their predictive accuracy and clinical utility across diverse
   populations. Third, explore additional gene pathways and cell types
   involved in PAH using advanced technologies like single cell sequencing
   to uncover the disease’s complex mechanisms. These efforts will
   strengthen the foundation for personalized diagnosis and treatment
   strategies.

Electronic supplementary material

   Below is the link to the electronic supplementary material.
   [148]Supplementary Material 1^ (4.6MB, docx)

Author contributions

   JHX, JLC, JJZ& JFW: Conceptualization, Data curation, Formal analysis
   and Writing—original draft; CL & JFW: Writing—review and editing. All
   authors approved the final version of this manuscript.

Data availability

   The raw data can be accessed from the GEO database
   ([149]https://www.ncbi.nlm.nih.gov/geo/). Intermediate files obtained
   during the analysis are provided in the supplementary materials. Access
   to the original code is available upon request to the corresponding
   author.

Declarations

Competing interests

   The authors declare no competing interests.

Consent for publication

   All authors have read and approved of its submission to this journal.

Footnotes

   Publisher’s note

   Springer Nature remains neutral with regard to jurisdictional claims in
   published maps and institutional affiliations.

   Jianghao Xiong and Jiali Chen contributed equally to this work.

References