Abstract Pulmonary arterial hypertension (PAH) is a progressive cardiovascular disease characterized by elevated pulmonary arterial pressure, leading to right heart failure and death. Despite advancements in diagnosis and treatment, it remains incurable, and its mechanisms are poorly understood. This study aimed to integrate multi-omics data analysis and machine learning techniques to uncover the molecular characteristics and subtypes of PAH, providing insights into precise diagnosis and therapeutic strategies. We employed consensus clustering to classify PAH patients into subgroups based on multi-omics data. Differential expression and enrichment analyses were conducted to identify key genes and pathways. Machine learning models were developed to predict PAH subtypes and assess their diagnostic performance. PAH patients were divided into two subgroups: C1 and C2. The C2 subgroup showed significantly upregulated hypoxia-related genes, indicating distinct pathogenic mechanisms. Key genes associated with hypoxia, immune regulation, and inflammation were identified, alongside enriched pathways such as TNF, IL-17, and HIF-1 in the C2 subgroup. Machine learning models achieved high accuracy (AUC > 0.85) in distinguishing hypoxia-associated subtypes, supporting their utility for precise diagnosis. Potential therapeutic targets were identified in the TNF and HIF-1 pathways. This study provides novel insights into PAH’s molecular subtypes and their distinct mechanisms, offering diagnostic tools and potential therapeutic targets for personalized treatment. Validation in larger cohorts and experimental studies is essential to confirm the identified biomarkers and pathways. Supplementary Information The online version contains supplementary material available at 10.1038/s41598-025-99025-5. Keywords: Pulmonary arterial hypertension, Hypoxia, Machine learning, Immune regulation and inflammation Subject terms: Computational biology and bioinformatics, Mechanisms of disease Introduction PAH is a severe cardiopulmonary disease characterized by elevated pulmonary arterial pressure, leading to right heart failure and death^[32]1. The central pathological feature of PAH is pulmonary vascular remodeling, involving endothelial and smooth muscle cell (SMCs) proliferation, vasoconstriction, and increased vascular resistance, which overloads the right ventricle^[33]2,[34]3. Despite advances in diagnosis and treatment, the disease remains incurable, underscoring the need to identify key pathological mechanisms for targeted interventions. Hypoxia has been demonstrated to be closely associated with the progression of PAH. It disrupts endothelial barrier integrity, induces apoptosis, and promotes vascular remodeling^[35]4,[36]5. These changes enhance SMCs proliferation, vessel wall thickening, and abnormal neovascularization, exacerbating pulmonary vascular resistance^[37]5–[38]7. Key hypoxia-regulated pathways include hypoxia-inducible factor-1 (HIF-1), which drives cell proliferation, angiogenesis, and metabolic shifts, and the tumor necrosis factor (TNF) pathway, which intensifies inflammation and endothelial damage^[39]8,[40]9. Understanding the role of hypoxia in PAH and elucidating the underlying mechanisms by which it drives disease progression are of crucial importance. With the advancement of sequencing technology, research has gradually shifted from Bulk RNA sequencing (Bulk RNA-seq) to Single-cell sequencing (scRNA-seq). This technological transformation enables researchers to explore microenvironment in diseases at the single-cell level^[41]10,[42]11. Moreover, machine learning enables the construction of more robust predictive models to identify key genes, thereby enhancing our ability to pinpoint critical molecular drivers in PAH pathogenesis. This study aims to systematically explore the genes and pathways driving PAH using computational and machine learning approaches. By analyzing large-scale transcriptomic and single-cell data, we identify feature genes and construct classification models to predict disease onset and identify biomarkers for early diagnosis. Comparative algorithm analysis highlights key genes and pathways central to PAH progression, providing insights into personalized treatment. Furthermore, identifying aberrantly expressed genes offers potential therapeutic targets, addressing the limited efficacy of current treatments and improving clinical outcomes for PAH patients. Methods Data collection We integrated three publicly available datasets from the GEO database ([43]https://www.ncbi.nlm.nih.gov/gds/) for the analysis of PAH: [44]GSE117261 (58 PAH patients and 25 controls), [45]GSE113439 (15 PAH patients and 11 controls), and [46]GSE53408 (12 PAH patients and 11 controls), totaling 85 PAH patients and 47 controls. Additionally, we downloaded single-cell samples and their corresponding normal controls from [47]GSE210248 and [48]GSE228643. [49]GSE210248 includes 3 PAH samples and 3 controls, while [50]GSE228643 includes 4 samples of pulmonary fibrosis-associated PAH and 2 control samples (Table [51]1). Furthermore, we obtained 493 hypoxia-related genes from the Molecular Signatures Database ([52]https://www.gsea-msigdb.org/gsea/msigdb/index.jsp) (Supplementary Table 1). Table 1. RNA sequence data information. Platform Samples Sequencing type [53]GSE117261 IPAH:31; APAH:18; HPAH:5; Other:4; Control:25 Bulk-RNAseq [54]GSE113439 IPAH:6; CTD:4; CHD:4; Control:11 Bulk-RNAseq [55]GSE53408 IPAH:12; Control:11 Bulk-RNAseq [56]GSE210248 IPAH:3; Control:3 scRNAseq [57]GSE228643 PHPF:3; Control:2 scRNAseq [58]Open in a new tab IPAH idiopathic pulmonary arterial hypertension, APAH associated pulmonary arterial hypertension, HPAH hereditary pulmonary arterial hypertension, CTD connective tissue disease, CHD congenital heart disease. It is important to note that, following the international classification of pulmonary arterial hypertension (2022 ESC/ERS Guidelines for the Diagnosis and Treatment of Pulmonary Hypertension)^[59]12, most Bulk-RNAseq samples included in our analysis are classified as PAH (Group 1 PH), ensuring consistency in sample selection and analysis. Our study investigates the role of hypoxia in the onset and progression of PAH; therefore, we included PHPF (PH induced by chronic lung diseases and hypoxia; Group 3 PH) in the scRNAseq analysis to validate the correlation between key genes and cellular hypoxic phenotypes. Microarray data processing, batch effect correction, and principal component analysis In this study, we downloaded three microarray datasets from the GEO database ([60]GSE113439, [61]GSE117261, and [62]GSE53408) and integrated their gene expression matrices and clinical information. To address potential technical biases between datasets, we applied the combat function from the ‘sva’ package for batch effect correction, ensuring the reliability of the analysis results. The processed expression data were then normalized using the ‘limma’ package to further reduce technical variations across samples. To assess the similarity between samples, we performed principal component analysis (PCA). PCA helped identify the main patterns of variation and visually represented the distribution of samples from different groups (e.g., PAH and control), highlighting the gene expression differences between groups. For a deeper investigation of gene expression patterns, we extracted the 1,000 genes with the highest standard deviation and generated a gene expression heatmap. Consensus clustering-based PAH subtype identification and analysis In this study, we performed consensus clustering analysis on PAH samples to identify potential subtypes. First, we extracted the PAH group samples from the pre-processed expression matrix and, in combination with an externally imported list of hypoxia-related genes, selected genes associated with the hypoxic response for further analysis. Based on this, we applied the ‘ConsensusClusterPlus’ package for consensus clustering, setting the maximum number of clusters to 9. We performed 500 iterations with an 80% sample extraction ratio (pItem = 0.8) to enhance the stability of the clustering results. To assess the reliability of the clustering, we used Euclidean distance as the similarity measure and employed the K-means clustering algorithm. The resulting consensus matrix and internal clustering consistency results were saved in PDF format. After clustering, we calculated the Proportion of Ambiguous Clustering (PAC) value, which was used to evaluate the stability of clustering across different numbers of clusters. The optimal number of clusters was determined based on the minimum PAC value, which led to the selection of the most appropriate subtype number for PAH samples. Once the optimal number of clusters was identified, we extracted the classification results for those clusters and integrated them with clinical data. By verifying the matching of sample names, we ensured that the clustering results accurately corresponded with the clinical data. Finally, we generated a clinical dataset containing subtype information for subsequent differential analysis and biological interpretation. Hypoxia-related gene set analysis based on GSVA We performed gene set variation analysis (GSVA) to evaluate the enrichment of hypoxia-related genes in PAH samples. First, a hypoxia gene set was constructed, and GSVA was applied to compute enrichment scores for each sample relative to the hypoxia gene set. The resulting hypoxia enrichment scores were then used to generate bar plots of the enrichment scores for different sample groups (control, c1, and c2), followed by statistical comparisons. Differential gene analysis and functional enrichment in PAH subtypes In this study, we utilized differential gene analysis, GSVA, and GSEA to evaluate the gene expression characteristics of PAH subtypes. Differential gene analysis was performed using the ‘limma’ package to identify genes that were significantly differentially expressed between subtypes. The gene expression changes were visualized using volcano plots and heatmaps. Subsequently, we conducted Gene Ontology (GO) and Kyoto Encyclopedia of Genes and Genomes (KEGG) enrichment analysis using the ‘clusterProfiler’ package to uncover functional pathways associated with the differentially expressed genes^[63]13. GSVA was applied to perform enrichment analysis for the different subtypes, and GSEA was further employed to assess the functional enrichment of upregulated and downregulated genes between the two subtypes. Finally, the enrichment results were presented through multiple visualization methods, providing support for subsequent biological investigations. Core gene identification in C2 subtype vs. control and C2 vs. C1 subtypes In this study, we extracted differentially expressed genes (DEGs) from the C2 subtype versus the control group and from the C2 subtype versus the C1 subtype, followed by an overlapping analysis of these DEGs. First, we imported DEG data for both the C2 subtype vs. control group (non-stable expression genes) and C2 vs. C1 subtype comparisons. Next, we used the intersection function to identify overlapping genes between the two gene sets, and these overlapping genes were saved as CSV files for further analysis. Subsequently, a Venn diagram was generated using the ‘ggvenn’ package to visualize the shared and unique DEGs between the C2 subtype, control group, and C1 subtype. Classification model construction and evaluation using multiple machine learning algorithms In this study, we employed a combination of 12 machine learning algorithms to construct and evaluate 113 classification models aimed at predicting the subtypes of PAH. First, we extracted common genes and samples from both the training and test datasets and performed data normalization to ensure the validity of the models. Within a cross-validation framework, feature selection and model construction were integrated. We utilized algorithms such as Lasso, Ridge, Elastic Net (Enet), Random Forest, Support Vector Machine (SVM), and XGBoost to select feature genes with predictive significance for classification. The model construction process was carried out in two steps: first, potential predictive variables were selected using feature selection algorithms to retain the most relevant gene set; then, classification models were built based on the selected genes. To ensure model robustness, a minimum number of variables was specified to prevent overly simplistic models that could lead to poor performance. During the model validation phase, we evaluated the trained models by calculating risk scores, predicting binary outcomes (0 or 1) for samples, and computing the area under the curve (AUC) values for the models on both the training and test datasets. These AUC values were used to assess the performance of the models in the classification task. To visually compare the performance of different models across datasets, heatmaps were generated to display the AUC results. The final results demonstrated that the models constructed using multiple algorithm combinations effectively predicted PAH subtypes, identifying the optimal model and its key feature genes. The optimal model exhibited high AUC values across multiple datasets, highlighting its superiority in PAH subtype classification. Elastic net model construction Elastic Net is a regression model that integrates the advantages of Lasso (L1 regularization) and Ridge (L2 regularization), balancing the penalty on regression coefficients to reduce overfitting. Overfitting occurs when a model performs well on training data but poorly on test data, often due to excessive sensitivity to noise in the training data. Regularization helps mitigate this by reducing the model’s sensitivity to noise. Lasso performs feature selection by shrinking some coefficients to zero, while Ridge stabilizes the model by shrinking all coefficients, making it useful in cases of multicollinearity. In this study, we selected Elastic Net (α = 0.1) for variable selection, based on the optimal model derived from machine learning. We performed 10-fold cross-validation to choose the optimal regularization parameter (λ) and identified a set of genes with non-zero regression coefficients for prediction. Using the “glmnet” package in R, we conducted Lasso regression analysis on hub genes, constructing the final prognostic risk score model. The risk score (RiskScore) was calculated using the following formula: graphic file with name d33e357.gif Where Coef represents the regression coefficient of the gene in the multivariate Cox regression analysis, and n is the total number of hub genes. GeneMANIA protein–protein interaction network prediction GeneMANIA is a bioinformatics tool and database designed to predict gene functions and construct and visualize gene-gene interaction networks. It integrates multiple data sources to assist researchers in understanding the relationships between genes and inferring their functions. In this study, we input the identified differential genes into the GeneMANIA database to predict protein-protein interaction networks. Clinical candidate drug prediction The Connectivity Map (CMap) database is a bioinformatics resource designed to elucidate the relationships between genes, diseases, and drugs. It systematically collects gene expression data following drug, gene, and small molecule interventions, enabling researchers to explore drug mechanisms, discover new drugs, or identify new uses for existing drugs. In this study, we obtained the upregulated DEGs from the comparison of C2 vs. C1, intersected them with the protein interaction network molecules identified through GeneMANIA analysis, and subsequently input the results into the CMap database for further analysis. Single-cell RNA sequencing data analysis Single-cell RNA sequencing data were processed using the “Seurat” package. Quality control (QC) metrics, including the number of genes, number of transcripts, mitochondrial gene percentage, and ribosomal gene percentage, were visualized using violin plots (VlnPlot). The QC criteria applied were: nFeature_RNA > 200, nFeature_RNA < 5000, and percent.mt < 40. Correlations between these QC metrics were displayed using scatter plots (FeatureScatter). The data were then normalized using the LogNormalize method, and the top 2000 highly variable genes, exhibiting substantial expression variability, were identified for subsequent dimensionality reduction and clustering analysis.For dimensionality reduction, PCA was performed on the highly variable genes, followed by UMAP for visualization. To address potential batch effects, the Harmony algorithm was employed for batch correction, and the corrected data were visualized using UMAP. Clustering was optimized by constructing a clustering tree and adjusting different resolution parameters. Marker genes for each cluster were identified using the FindAllMarkers function, and the clustering results were visualized. To ensure accurate annotation, manual annotation was performed for each cluster to assign specific cell types, such as neutrophils, monocytes, and T cells. These annotated labels were visualized using UMAP, providing a clear representation of cell type distribution in the reduced dimensional space. Monocle cell trajectory analysis To explore the dynamic changes of cells during development or disease progression, we performed cell trajectory analysis using Monocle 3. First, the normalized expression matrix, cell metadata, and highly variable genes were extracted from the Seurat object to construct a Monocle CellDataSet object. Using Monocle’s dimensionality reduction and ordering algorithms, we inferred the cell trajectory based on gene expression patterns and identified the differentiation pathways associated with different cell states. By visualizing the cell trajectory, we were able to intuitively illustrate the developmental process of cells over time or across different conditions, uncovering potential differentiation relationships and key transition points. CellChat cell communication analysis To investigate intercellular communication patterns, we utilized the CellChat tool for analysis. First, the expression matrix and cell type information were extracted from the single-cell RNA sequencing data (e.g., Seurat object) to construct the CellChat object. CellChat then inferred the intercellular communication network based on a known ligand-receptor interaction database. By analyzing ligand-receptor pairs and communication pathways, we identified key signaling pathways between different cell types and visualized these communication networks, thereby revealing the complex interactions among cells and their potential biological implications. Data analysis All analyses were performed using R version 4.3.0. A p-value of less than 0.05 was considered statistically significant. Results Principal component analysis, differential gene heatmap, and consensus clustering reveal biological differences and subtypes in PAH patients and control groups We analyzed gene expression data from PAH patients and control groups across three datasets: [64]GSE117261, [65]GSE113439, and [66]GSE53408. PCA was performed for dimensionality reduction, revealing a partial separation between PAH and control groups along the first two principal components, indicating potential biological differences (Fig. [67]1A). To further explore these differences, we identified the top 1,000 differentially expressed genes and visualized them in a heatmap (Fig. [68]1B). The heatmap highlighted distinct genetic profiles between PAH and control samples, underscoring their molecular divergence. Consensus clustering of patient data, performed using the ConsensusClusterPlus package, identified K = 2 as the most stable clustering solution (Fig. [69]1C–E). This result suggests the presence of two distinct molecular subtypes within the PAH group, providing valuable insights into patient heterogeneity and potential avenues for targeted therapeutic strategies. Fig. 1. [70]Fig. 1 [71]Open in a new tab Clustering analysis of PAH patients based on integrated GEO datasets. (A) PCA plot showing the integration of three GEO datasets ([72]GSE117261, [73]GSE113439, and [74]GSE53408) containing data from PAH and control samples. The aim is to assess whether there is a biological difference between the two groups. (B) Heatmap of the top 1000 most significantly differentially expressed genes, extracted from the integrated data. The heatmap highlights the differential expression between PAH and control samples, making these differences more visually prominent. (C–E) Consensus clustering analysis of PAH data using the ConsensusClusterPlus package, performed at different values of K. The most stable clustering module is observed when K = 2. Identification of hypoxia-related subgroups in PAH and functional differential analysis We identified hypoxia-related subgroups in PAH patients and analyzed gene and functional differences between hypoxic (c2) and non-hypoxic (c1) subgroups. Using 493 hypoxia-related genes from the MSigDB database, Gene Set Variation Analysis (GSVA) showed significantly higher expression of hypoxia-related genes in the c2 subgroup compared to c1 and normal controls (Fig. [75]2A). Differential gene analysis identified 459 upregulated and 250 downregulated genes in c2 (Fig. [76]2B, C), with several upregulated genes closely linked to hypoxic responses. KEGG pathway enrichment analysis revealed significant involvement of TNF, IL-17, and HIF-1 pathways in c2, underscoring their roles in the hypoxic response (Fig. [77]2D). Based on these findings, we defined c2 as the hypoxia-related subgroup. Fig. 2. [78]Fig. 2 [79]Open in a new tab GSVA analysis of hypoxia-related subgroups and functional enrichment differences between hypoxia and non-hypoxia subgroups. (A) GSVA (Gene Set Variation Analysis) was performed using a hypoxia-related gene signature from MSigDB, comprising 493 genes. The analysis was conducted for c1 and c2 subgroups, as well as normal samples, to assess their hypoxia-related gene activity. The resulting GSVA scores were visualized in a bar plot, identifying c2 as the hypoxia-associated subgroup. (B) Differential gene expression analysis between c2 and c1 subgroups was performed using the limma package (c2 vs. c1). A log fold change (logFC) threshold of 0.585 and a p-value cutoff of 0.05 were applied. (C) Volcano plot of differentially expressed genes between c2 and c1 subgroups, with 459 upregulated genes (shown on the top) and 250 downregulated genes. Genes with an adjusted p-value < 0.005 and an absolute logFC > 1.5 are indicated, highlighting several interesting genes. (D) Functional enrichment analysis of upregulated and downregulated genes. The enrichment of upregulated genes (shown in the plot) revealed multiple pathways potentially involved in disease progression, including TNF, IL-17, and HIF-1 signaling pathways. (E) GO (Gene Ontology) enrichment analysis of upregulated and downregulated genes, further revealing significant biological processes and cellular components associated with these gene sets. Gene Ontology (GO) functional enrichment analysis further highlighted enrichment of inflammation, oxygen response, and cell activation processes in c2 (Fig. [80]2E). These findings deepen our understanding of hypoxic mechanisms in PAH, offering insights into potential therapeutic targets. Construction of clinical diagnostic model and gene selection Next, we constructed a clinical diagnostic model for the hypoxia-related subgroup. We first extracted the intersection of differentially expressed genes between the comparisons of c2 vs. c1 and c2 vs. control, resulting in a total of 475 genes (Fig. [81]3A). We then applied a combination of 12 algorithms with variable selection and prediction capabilities to model these genes, generating 113 model combinations in total. To ensure model simplicity, we set a minimum of 5 variables per model, ultimately selecting 31 effective model combinations (Fig. [82]3B). Among these, the Elastic Net model with α = 0.1 performed best, showing excellent AUC values across different cohorts, particularly in the combat and [83]GSE254517 datasets, where the AUC reached 0.971 and 0.943, respectively. Fig. 3. [84]Fig. 3 [85]Open in a new tab Clinical diagnostic model construction: The intersection of differentially expressed genes (c2 vs. c1 and c2 vs. control) was extracted and applied to 113 combinations of 12 algorithms with variable selection and prediction methods for model construction. (A) Venn diagram of the differentially expressed genes. (B) The model was constrained by a minimum of five variables, resulting in 31 final combinations. The optimal model was achieved using an elastic net with alpha = 0.1. (C) Elastic net analysis identified a model comprising 16 genes. Subsequently, we further analyzed the best-performing Elastic Net model. Through Lasso regression, we identified 16 genes closely associated with disease prediction (Fig. [86]3C). The formula for the risk score was calculated as follows: RiskScore = (0.6449) × VCAM1 + (− 0.0906) × TMEM204+(− 0.8461)× CCL2 + (− 0.3694) × HIGD1B + (− 1.1957) × HIF3A + (− 0.3488) × CYP1B1 + (0.0951) × ADAMTS9 + (0.0803) × LXN + (− 0.5070) × RPL22L1 + (0.3955) × ETV5 + (1.3083) × AREG + (0.1288) × ITGA2 + (− 0.6339) × BTNL9 + (0.9003) × LIFR + (0.3237) × TFPI2 +(− 0.5212) × RGN (Fig. [87]3C). These genes form the core components of the diagnostic model, providing precise biological markers for clinical use. Drug prediction and PPI network analysis We analyzed the protein-protein interaction (PPI) network of genes identified in our model using the GeneMANIA database, resulting in a network involving 36 genes (Fig. [88]4A). Intersection of these genes with upregulated differentially expressed genes from the c2 vs. c1 comparison identified 14 core genes, including VCAM1, LXN, ITGA2, and AREG, which may play key roles in PAH pathology (Fig. [89]4B). These 14 core genes were submitted to the CMap database for drug prediction analysis (Fig. [90]4C). Mechanism classification highlighted drugs related to phosphodiesterase inhibitors, endothelin receptor antagonists, calcium channel blockers, and prostacyclin analogs as significantly associated with the selected genes, suggesting their potential therapeutic relevance. CMap scoring ranked drugs by therapeutic potential, with lower-score drugs (marked in blue) showing greater promise (Fig. [91]4D), which may have been underexplored in clinical contexts, could provide additional insights for PAH treatment. Fig. 4. [92]Fig. 4 [93]Open in a new tab Drug prediction: (A) The model genes were input into the GeneMinia database to predict the protein-protein interaction (PPI) network. A PPI network of 36 genes was obtained. (B) The PPI network genes were intersected with the upregulated differentially expressed genes in C2 vs. C1, resulting in 14 hub genes. (C) These 14 hub genes were subsequently input into the Cmap database for drug prediction. Phosphodiesterase inhibitors, endothelin receptor agonists, calcium ion-related drugs, and prostacyclin drugs were extracted and statistically analyzed. (D) Cmap assigns a score to each drug; the lower the score, the more likely the drug has therapeutic potential. Analysis of quality control, cell classification, and gene expression features in PAH and control samples from single-cell RNA sequencing data of pulmonary arterial hypertension We analyzed the [94]GSE210248 dataset to perform quality control for PAH and control samples. The results (Supplementary Fig. 1A) showed that gene expression levels, RNA counts, and mitochondrial gene proportions in all samples met the analysis criteria. Based on the established thresholds, 22,704 cells were included in the analysis. Next, we performed cell classification (Supplementary Fig. 1B) and determined the optimal resolution value of 0.2, which divided the cells into 14 clusters (Fig. [95]5A, Supplementary Fig. 1B). Using relevant markers, the 14 clusters were classified into 10 cell types: Endothelium (VWF, TSPAN7), Epithelium (SFTPC, EPCAM, KRT19, KRT7), Fibroblast (DCN), Macrophage (CD14, C1QA, CD68), Mast Cells (TPSAB1), Monocyte (S100A8), NK Cells (NKG7), Smooth Muscle Cells/SMC (TAGLN, BGN), T Cells (CXCR4, CCL5), and T/NK Cells (NKG7, CXCR4, CCL5) (Fig. [96]5B, C). A bubble plot further illustrated the expression of marker genes in each cell type (Supplementary Fig. 1C), where bubble size and color indicate the expression intensity and the proportion of positive cells for each marker gene. Additionally, we generated heatmaps showing the top 10 highly expressed genes in each cell cluster (Supplementary Fig. 1D). Fig. 5. [97]Fig. 5 [98]Open in a new tab Cell subpopulation clustering and annotation: (A) The 21,139 cells were clustered into 14 groups at a resolution of 0.2. (B,C) Based on cell markers, the 14 clusters were classified into 9 cell types ([99]GSE210248). (D) The 22,704 cells were clustered into 13 groups at a resolution of 0.5. (E,F) Based on cell markers, the 13 clusters were classified into 9 cell types ([100]GSE228643). Analysis of quality control, cell classification, and gene expression features in PHPF and control samples from pulmonary arterial hypertension with pulmonary fibrosis We analyzed the [101]GSE228643 dataset, starting with quality control for PHPF and control samples. QC results confirmed that gene expression levels, RNA counts, and mitochondrial gene proportions met analysis criteria, with 21,139 cells included after applying thresholds (Supplementary Fig. 2A). Cell classification was performed at an optimal resolution of 0.6, resulting in 15 distinct clusters (Fig. [102]5D, Supplementary Fig. 2B). A bubble plot illustrated marker gene expression for each cell type, with bubble size and color representing gene expression intensity and the proportion of positive cells, respectively (Supplementary Fig. 2C). Based on marker genes, the 15 clusters were categorized into 9 cell types: Endothelium (VWF, TSPAN7), Epithelium (SFTPC, EPCAM, KRT19, KRT7), Fibroblast (DCN), Macrophage (CD14, C1QA, CD68), Mast Cells (TPSAB1), Monocyte (S100A8), NK Cells (NKG7), Smooth Muscle Cells/SMC (TAGLN, BGN), B Cells (MS4A1), and T/NK Cells (NKG7, CXCR4, CCL5) (Fig. [103]5E, F). We also generated heatmaps showing gene expression patterns across different clusters, which provide insights into the functional characteristics of the cell clusters (Supplementary Fig. 2D). These analyses support a deeper understanding of the differences between PHPF and CON samples and the cellular heterogeneity within the dataset. Analysis of significant changes in SMC in disease groups and differential expression of key genes We examined cell type distribution across tissues to investigate differences between disease and control groups. In the PAH dataset ([104]GSE210248), visualization revealed a significantly higher proportion of SMCs in the disease group (22.5%) compared to the control group (11.5%) (Fig. [105]6A, B). Similarly, in the PHPF dataset ([106]GSE228643), SMCs accounted for 28.5% in the disease group versus only 0.6% in controls (Fig. [107]6C, D), emphasizing the critical role of SMCs in disease progression. Fig. 6. [108]Fig. 6 [109]Open in a new tab Cell subpopulation clustering and annotation: (A) Distribution of different cells across tissues. (B) Proportion of different cell types across tissues ([110]GSE210248). (C) Distribution of different cells across tissues. (D) Proportion of different cell types across tissues ([111]GSE228643). (E) Expression of CYP1B1, ITGA2, RPL22L1, and VCAM1 across different cell types and tissues ([112]GSE210248). (F) Expression of CYP1B1, ITGA2, RPL22L1, and VCAM1 across different cell types and tissues ([113]GSE228643). Next, we analyzed the expression of key genes (CYP1B1, ITGA2, RPL22L1, and VCAM1) across cell types and tissues (Fig. [114]6E, F). These genes were significantly upregulated in SMCs in both disease datasets, particularly in the disease groups. Notably, CYP1B1, implicated in SMC proliferation according to previous studies, likely plays a central role in driving SMC expansion in these conditions. These findings highlight the importance of SMCs and their associated genes in disease progression, offering valuable insights for future research and potential therapeutic interventions. Analysis of four SMC clusters and exploration of their oxygen-related properties and vascular remodeling mechanisms We re-clustered SMCs from the PAH single-cell dataset into four clusters (Fig. [115]7A) and analyzed their distribution across tissues. Cluster 2 showed a marked increase in disease-associated tissues, suggesting its critical role in disease progression (Fig. [116]7B). Using known marker genes, we classified the clusters into subtypes, identifying cluster 2 as closely associated with hypoxic conditions, indicating its significance under oxygen-deprived states (Fig. [117]7C). Gene expression analysis revealed elevated levels of CYP1B1, ITGA2, RPL22L1, and VCAM1 in cluster 2, aligning with previous studies on hypoxia-related changes, confirming cluster 2 as an oxygen-associated subtype (Fig. [118]7D). GO and KEGG enrichment analyses showed that cluster 2 is linked to collagen, ECM, AGE, growth factors, and the PI3K signaling pathway, highlighting its role in vascular remodeling (Fig. [119]7E). Pseudotime analysis suggested that cluster 2 may arise from cluster 1, which tends toward synthetic SMCs, reflecting dynamic changes and differentiation during disease progression (Fig. [120]7F). These findings establish cluster 2 as a pivotal oxygen-associated SMC subtype involved in PAH pathology and vascular remodeling. Fig. 7. [121]Fig. 7 [122]Open in a new tab SMC re-clustering and analysis: (A) The SMCs were re-clustered into 4 groups. (B) Distribution and proportion of the 4 clusters across tissues, with emphasis on cluster 2. (C) Subclassification of clusters based on markers. (D) Expression of CYP1B1, ITGA2, RPL22L1, and VCAM1 in the clusters. (E) GO and KEGG enrichment analysis. (F) Pseudotemporal ordering of the clusters. Analysis of five SMC clusters and their relationship with synthetic phenotype and vascular remodeling We re-clustered SMCs from the PHPF single-cell dataset into five clusters (Fig. [123]8A). Analysis of cluster distribution across tissues revealed that cluster 1 was predominant in diseased tissues, suggesting its pivotal role in the pathological process (Fig. [124]8B). Subtype classification using known marker genes identified cluster 1 as a synthetic SMC subtype, characterized by elevated expression of CYP1B1, ITGA2, RPL22L1, and VCAM1 (Fig. [125]8C). These findings align with its active involvement in matrix synthesis during disease progression (Fig. [126]8D). GO and KEGG enrichment analyses revealed that cluster 1 is significantly associated with collagen, ECM, AGE, growth factors, and PI3K signaling pathways, further supporting its role in vascular remodeling through matrix component synthesis (Fig. [127]8E). Pseudotime analysis indicated that cluster 1 may differentiate into cluster 2, with cluster 1 showing synthetic characteristics and cluster 2 exhibiting traits related to oxidation and fibroblast activity (Fig. [128]8F). These findings establish cluster 1 as a key synthetic SMC subtype involved in ECM synthesis and vascular remodeling during disease progression, with the potential to differentiate into other functional cell populations, highlighting its significance in the pathology of PHPF. Fig. 8. [129]Fig. 8 [130]Open in a new tab SMC re-clustering and analysis: (A) The SMCs were re-clustered into 5 groups. (B) Distribution and proportion of the 5 clusters across tissues. (C) Subclassification of clusters based on markers. (D) Expression of CYP1B1, ITGA2, RPL22L1, and VCAM1 in the clusters. (E) GO and KEGG enrichment analysis. (F) Pseudotemporal ordering of the clusters. Analysis of the interaction between COLLAGEN and FN1 signaling pathways in PAH and PHPF samples and their role in immune activation and vascular remodeling We analyzed the signaling pathway profiles in PAH, PHPF, and healthy control (CON) samples, highlighting key differences in signaling intensity and cell-type interactions. In PAH samples, overall signaling patterns revealed a significant enhancement of pathways such as COLLAGEN and FN1 compared to CON samples (Supplementary Fig. 3A). Analysis of signaling reception across cell types showed that immune cells and fibroblasts in PAH had stronger responses to these signals than other cell types (Supplementary Fig. 3B). Examination of the COLLAGEN signaling network indicated increased interactions among monocytes, macrophages, and SMCs in PAH (Supplementary Fig. 3C). Similarly, FN1 pathway analysis revealed heightened interactions between immune and stromal cells in PAH samples (Supplementary Fig. 3D). In PHPF samples, overall signaling and reception patterns also showed notably higher COLLAGEN and FN1 signaling compared to CON samples (Supplementary Fig. 3E, F). Network analysis of these pathways highlighted increased interactions between immune cells and fibroblasts in PHPF, with significantly elevated activity in COLLAGEN and FN1 pathways (Supplementary Fig. 3G, H). These findings emphasize the critical roles of COLLAGEN and FN1 signaling in vascular remodeling and immune cell activation in both PAH and PHPF, underscoring their potential as key pathways in disease progression. Comparison and analysis of intercellular interactions in PAH and healthy control (CON) tissues We performed CellChat analysis to investigate intercellular interactions in PAH and healthy control (CON) tissues, uncovering key molecular processes underlying PAH pathology. The analysis revealed significantly increased immune cell interactions in PAH tissues, highlighting the pivotal role of immune cells in the disease (Supplementary Fig. 4A). Signaling strength analysis showed markedly higher interactions between fibroblasts and SMCs in PAH compared to CON, supporting their significant role in vascular remodeling (Supplementary Fig. 4B). Specific pathway analysis revealed enhanced COLLAGEN signaling in PAH, emphasizing its critical involvement in the remodeling process (Supplementary Figs. 3 A, B, 4 C). Receptor-ligand interaction analysis further demonstrated distinct alterations in COLLAGEN signaling in PAH tissues (Supplementary Figs. 3 C, D, 4D). Detailed analysis of COLLAGEN signaling interactions showed that monocytes predominantly initiate this signaling in PAH, linking immune cell activation to vascular remodeling (Supplementary Fig. 4E). Additionally, expression analysis revealed significantly higher levels of COLLAGEN and its ligands, including CD44, ITGA1, and ITGA10, in PAH. Monocytes exhibited high SDC4 expression, reinforcing their role as key drivers of COLLAGEN signaling in PAH (Supplementary Fig. 4F). These findings underscore the critical involvement of COLLAGEN signaling and immune cell activation in PAH, providing insights into potential therapeutic targets for vascular remodeling. Comparison and analysis of intercellular interactions in PHPF and healthy control (CON) tissues CellChat analysis of PHPF and healthy control (CON) tissues validated key findings regarding cell-cell interactions and signaling pathways. In PHPF tissues, interactions between immune cells (monocytes, macrophages) and SMCs were significantly enhanced compared to CON, suggesting these cells play critical roles in PHPF pathology (Supplementary Fig. 5A). The signaling intensity analysis showed stronger signals in monocytes, macrophages, and SMCs in PHPF, highlighting their involvement in vascular remodeling (Supplementary Fig. 5B). Detailed analysis of specific signaling pathways revealed that COLLAGEN signaling was markedly enhanced in PHPF tissues (Supplementary Figs. 3E, F, 5 C). Examination of receptor-ligand interactions in the COLLAGEN pathway identified monocytes as key drivers of this signaling in PHPF, linking immune cell activation to vascular remodeling (Supplementary Figs. 3G, H, 5D). Intensity analysis confirmed that monocyte-driven COLLAGEN signaling dominated in PHPF compared to CON (Supplementary Fig. 5E). Additionally, expression analysis of COLLAGEN signaling and its ligands showed significantly higher levels of COLLAGEN and its ligand CD44 in PHPF tissues. SDC4 was notably overexpressed in monocytes in PHPF, reinforcing the pivotal role of monocytes in initiating COLLAGEN signaling (Supplementary Fig. 5F). These findings emphasize the critical contribution of immune cells and COLLAGEN signaling to the pathology of PHPF. Reclustering analysis of endothelial cells reveals functional heterogeneity To further investigate the heterogeneity of PAH-associated endothelial cells (ECs), we extracted ECs from the [131]GSE210248 dataset and performed dimensionality reduction and reclustering based on single-cell RNA sequencing (scRNA-seq) data. Following quality control and normalization, we utilized Seurat for UMAP dimensionality reduction and classified ECs into 13 subpopulations (Cluster 0–12) based on characteristic gene expression (Supplementary Fig. 6B). A heatmap of gene expression patterns (Supplementary Fig. 6A) illustrates the key marker genes of different EC subpopulations, revealing significant differences in gene expression profiles, suggesting that these subpopulations may serve distinct biological functions in PAH progression. Further analysis of cell proportion distribution (Supplementary Fig. 6C) showed that certain EC subpopulations were significantly enriched in the PAH group, while others decreased, indicating a potential PAH-induced remodeling of EC composition. Additionally, we identified a set of key genes and visualized their expression patterns across subpopulations using a bubble plot (Supplementary Fig. 6D). The results indicate that some subpopulations highly express VWF, CDH5, and PECAM1, suggesting they represent typical vascular endothelial cells, whereas others exhibit enriched expression of ACKR1, SELE, and ICAM1, indicating inflammatory characteristics. Moreover, certain EC subpopulations showed high expression of PLVAP and CXCL12, suggesting their potential roles in vascular permeability regulation and angiogenesis. Further gene expression analysis (Supplementary Fig. 6E) demonstrated distinct expression levels of specific genes (CYP1B1, ITGA2, RPL24L1, and VCAM1) among different EC subpopulations, further supporting the notion that ECs undergo functional differentiation during PAH progression. Functional analysis of EC subpopulations To elucidate the biological functions of EC subpopulations, we performed Gene Ontology (GO) and Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway enrichment analyses (Supplementary Fig. 7). GO analysis (Supplementary Fig. 7A) revealed that different EC subpopulations were primarily enriched in functional categories such as cell adhesion (cadherin binding), chemokine signaling (chemokine receptor activity), and cytoskeletal structure (structural constituent of cytoskeleton). KEGG analysis (Supplementary Fig. 7B) further indicated that these subpopulations were potentially involved in key pathways, including inflammatory signaling (TNF signaling pathway), vascular adhesion (Focal adhesion), and leukocyte transendothelial migration (Leukocyte transendothelial migration). Considering the alterations in EC subpopulation proportions between the PAH and control groups, the enrichment of these pathways suggests that PAH may influence vascular inflammatory responses, vascular permeability, and remodeling processes by modulating the composition and function of specific EC subpopulations. Discussion In this study, we investigated the molecular features of PAH by integrating computational analyses and machine learning techniques. Key findings include the identification of hub genes in hypoxic PAH subgroups and the prediction of potential therapeutic drugs for these groups. Using single-cell RNA sequencing, we observed significant expression of hub genes in the hypoxic subgroup of SMCs. Additionally, cell-cell communication analysis revealed enhanced interactions among different cell subpopulations, providing insights into the cellular dynamics underlying PAH pathology. While there are significant differences in the pathological mechanisms between pulmonary hypertension associated with chronic lung diseases (Group 3 PH) and pulmonary arterial hypertension (Group 1 PAH), hypoxia contributes to disease progression in both contexts. In PH, hypoxia serves as the central driver of pathogenesis, whereas in PAH, hypoxia may synergize with other molecular pathways to accelerate vascular remodeling and disease advancement. Hypoxia-induced PAH represents a distinct form of PH compared to PH associated with chronic lung diseases. Notably, regional hypoxia within the pulmonary vasculature may drive disease pathogenesis even in the absence of systemic hypoxemia. Naeije et al. reported that approximately half of PAH patients exhibit decreased arterial oxygen tension (PaO[2])^[132]14. A retrospective cohort study also identified mixed venous oxygen tension (PvO₂) as a key prognostic factor in PAH^[133]15. Thompson et al. further suggested that progressive mixed venous hypoxemia may contribute to IPAH development, underscoring the potential role of regional hypoxia in PAH pathophysiology^[134]16. Furthermore, BMPR2 has been widely recognized as a central molecular factor in PAH. Studies have shown that BMPR-II deficiency leads to mitochondrial dysfunction, resulting in mitochondrial DNA damage and apoptosis, impairing distal pulmonary artery regeneration following hypoxia and reoxygenation^[135]17. Notably, primary cells derived from PAH patients with BMPR2 mutations exhibit increased susceptibility to mitochondria-induced apoptosis after hypoxia and reoxygenation^[136]17. Additionally, HIF-1α, a key regulator of hypoxic adaptation in pulmonary vascular cells, is upregulated in the lung tissue of PAH patients, further supporting the relevance of hypoxia-related mechanisms in PAH pathogenesis^[137]18–[138]21. In this study, the c2 subgroup was closely associated with hypoxia, and KEGG functional enrichment analysis further confirmed its significant enrichment in the HIF-1 signaling pathway, which has been recognized as a key hypoxia-related pathway^[139]22,[140]23. Additionally, pathway enrichment analysis revealed significant activation of the TNF (tumor necrosis factor) and IL-17 (interleukin-17) signaling pathways in the c2 subgroup, highlighting the central role of the interplay between hypoxia and the immune-inflammatory network in PAH progression. These findings suggest that the pathogenesis of PAH extends beyond vascular remodeling to encompass metabolic disorders, immune dysregulation, and inflammatory responses, providing new directions for biomarker discovery and targeted therapy development. Next, we developed a diagnostic model for the hypoxia-associated PAH subtype and predicted its potential therapeutic drugs. The machine learning-based diagnostic model demonstrated excellent performance in both the training and validation datasets (AUC > 0.85), confirming its accuracy in identifying the hypoxic subtype. Additionally, CMap drug prediction analysis revealed the associations between existing therapies and the hypoxia-associated PAH subtype. Notably, we found that first-generation phosphodiesterase-V inhibitors, such as tadalafil and bosentan, did not exhibit significant therapeutic potential in the hypoxia-associated PAH subtype, whereas second-generation phosphodiesterase-V inhibitors showed greater potential for treatment. These advancements contribute to better diagnostic and treatment strategies for PAH. Next, in our single-cell analysis, we found a significant increase in the number of SMCs in PAH, suggesting that SMCs may be closely related to the occurrence and progression of PAH. Previous studies have reported that PFKFB3 promotes vascular remodeling in pulmonary hypertension by mediating collagen synthesis and proliferation of SMCs. This process occurs through increased glycolysis and lactate levels, leading to ERK1/2-dependent phosphorylation of calpain-2 and subsequent activation of calpain^[141]24, This abnormal metabolic pattern further promotes hypoxia. In addition, CircATP2B4 promotes hypoxia-induced SMC proliferation and migration through the miR-223/ATR axis, thereby accelerating PAH progression^[142]25. Consistent with these findings, our study identified a set of hub genes (CYP1B1, ITGA2, RPL22L1, and VCAM1) that were significantly upregulated in SMCs and hypoxia-related SMC clusters in two single-cell datasets. Among them, Cytochrome P450 Family 1 Subfamily B Member 1 (CYP1B1) has been demonstrated to be closely associated with SMC pathology^[143]26. Furthermore, we explored the role of ECs in PAH. Previous studies have shown that phenotypic changes in ECs can lead to PAH, such as smoking-induced pulmonary EC apoptosis and dysfunction, which, along with hypoxia, promote PAH progression^[144]27,[145]28. Apoptotic ECs can also release TGF-β1 and VEGF, thereby inducing SMC proliferation and exacerbating PAH development. In our study, we similarly found that EC populations were enriched in chemokine secretion and inflammatory signaling pathways, which is consistent with previous findings. Notably, the role of the TGF-β signaling pathway in PAH has garnered increasing attention. For example, studies have shown that MED1 regulates the BMP/TGF-β signaling pathway in endothelial cells, thereby influencing PAH development^[146]10. dditionally, the miR-182/Myadm axis modulates the crosstalk between SMCs and ECs, balancing BMP and TGF-β signaling pathways to regulate hypoxia-induced PAH progression^[147]29. Although this study made significant discoveries through multi-omics data integration and advanced machine learning, several limitations remain. First, the limited sample size, despite using multiple datasets, poses challenges, especially for certain patient subtypes, potentially impacting model robustness and increasing overfitting risk, limiting generalizability to larger cohorts. Second, dataset heterogeneity, including differences in sample sources, processing methods, and sequencing platforms, may introduce variability in the results. Third, while the models demonstrated high accuracy and AUC values on internal datasets, their generalizability to external datasets remains unverified. Finally, the biological functions and mechanisms of key genes and pathways identified require further experimental validation to confirm their roles in PAH pathogenesis. Future research should focus on several aspects: First, validate the identified key genes through in vitro and in vivo experiments to confirm their roles in PAH pathogenesis. Second, apply and optimize machine learning models on larger, multi-center datasets to assess their predictive accuracy and clinical utility across diverse populations. Third, explore additional gene pathways and cell types involved in PAH using advanced technologies like single cell sequencing to uncover the disease’s complex mechanisms. These efforts will strengthen the foundation for personalized diagnosis and treatment strategies. Electronic supplementary material Below is the link to the electronic supplementary material. [148]Supplementary Material 1^ (4.6MB, docx) Author contributions JHX, JLC, JJZ& JFW: Conceptualization, Data curation, Formal analysis and Writing—original draft; CL & JFW: Writing—review and editing. All authors approved the final version of this manuscript. Data availability The raw data can be accessed from the GEO database ([149]https://www.ncbi.nlm.nih.gov/geo/). Intermediate files obtained during the analysis are provided in the supplementary materials. Access to the original code is available upon request to the corresponding author. Declarations Competing interests The authors declare no competing interests. Consent for publication All authors have read and approved of its submission to this journal. Footnotes Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations. Jianghao Xiong and Jiali Chen contributed equally to this work. References