Abstract Background Cardiovascular diseases (CVD) are the leading cause of global mortality, yet current treatments benefit only a subset of patients. To identify new potential treatment targets, we conducted the first proteome wide association study (PWAS) for 26 CVDs using plasma proteomics data from the largest cohort to date (53,022 individuals in the UK Biobank Pharma Proteomics Project (UKB-PPP)). Methods and results We calculated single nucleotide polymorphism (SNP)-protein weights using the UKB-PPP dataset and integrated these weights with genome-wide association study (GWAS) summary statistics for 26 CVDs across three categories (16 cardiac, 5 venous, and 5 cerebrovascular diseases) in up to 1,308,460 individuals. PWAS was performed using the Functional Summary-based Imputation (FUSION) framework to identify protein-disease associations. Replication was conducted in two independent human plasma proteomic datasets (comprising 7213 and 3301 participants, respectively). We identified 155 proteins associated with CVDs and further Mendelian randomization analysis revealed 72 proteins with evidence of a causal association. Of these, 26 out of 35 available proteins were validated. Notably, 33 of the 72 proteins were not previously implicated in GWAS of CVDs. For example, PROC was found to be associated with venous thromboembolism (P = 6.32 × 10^–7). We further conducted longitudinal analyses using plasma proteomics data and peripheral blood mononuclear cells single cell RNA-seq data. The results showed that 90.63% (29/32) of the detected proteins exhibited stable plasma expression, and 18 genes displayed stable expression in at least one cell type, particularly in CD14+ monocytes. We also utilized these proteins to construct disease diagnostic models, and notably, models for 14 out of 18 diseases achieved an area under the curve (AUC) exceeding 0.8, indicating promising diagnostic potential. Conclusions We identified 72 proteins that causally influence CVD risk, providing new mechanistic insights into CVD and may prove to be promising targets as CVD therapeutics. Graphical abstract [52]graphic file with name 12933_2025_2847_Figa_HTML.jpg Supplementary Information The online version contains supplementary material available at 10.1186/s12933-025-02847-w. Keywords: CVD, Human blood plasma proteomes, Causal proteins, PWAS Research insights What is currently known about this topic? * Cardiovascular diseases (CVD) are multifactorial disorders influenced by both heritable and environmental factors. * Proteins, as the final products of gene expression, are the main functional components of biological processes. Understanding the association between protein abundance and disease status may facilitate the identification of novel therapeutic targets. What is the key research question? * The key research question of this study is to identify CVD relevant proteins and provide potential therapeutic targets. What is new? * We identified 72 proteins causally associated with CVD, and 33 of them are encoded by genes that were not implicated in the original genome-wide association study. Disease diagnostic models constructed using these proteins demonstrated promising potential for accurate disease detection. How might this study influence clinical practice? * The CVD relevant proteins we identified provide new mechanistic insights into CVD and may prove to be promising targets as CVD therapeutics. Introduction Cardiovascular diseases (CVD) are a group of disorders of the heart and blood vessels. As the leading global cause of mortality, CVD took an estimated 17.9 million lives in 2019, accounting for 32% of all deaths worldwide [[53]1–[54]3]. In clinical practice, there are some commonly used treatments for CVD, such as the use of statins to reduce cardiovascular morbidity and mortality. However, current treatments are not suitable for all patients and have certain risks [[55]4]. Therefore, there is an urgent and critical need for new therapeutic targets for CVD [[56]5]. Proteins, as the final products of gene expression, are the main functional components of biological processes [[57]6]. Understanding the association between protein abundance and disease status may facilitate the identification of novel therapeutic targets [[58]7–[59]11]. However, due to the high cost of proteomic measurements, establishing associations between protein abundance and disease status in large-scale populations remains challenging [[60]12]. Proteome-wide association study (PWAS) is a powerful strategy to solve this problem. It employs single-nucleotide polymorphisms (SNPs) to genetically impute protein abundance levels, which are then integrated with genome-wide association study (GWAS) summary statistics for a trait to derive protein-trait associations [[61]13, [62]14]. Further summary data-based Mendelian randomization (SMR) can be used to identify proteins with evidence that suggests a causal association with the outcome [[63]15]. MR is a widely used approach that utilizes genetic variants as instrumental variables to infer the potential causal effect of an exposure on an outcome. Since genetic variations are randomly allocated during meiosis and fertilization, they are relatively independent of individuals’ subsequent behaviors. Therefore, MR analysis could minimize issues of confounding and reverse causality. This integrative analytical approach has been employed to identify novel potential therapeutic targets for neurological disorders [[64]13, [65]14, [66]16–[67]18] using brain proteomics data. However, this approach has not yet been used in the identification of CVD-relevant proteins. Here we performed the first PWAS for multiple CVDs using plasma proteomics data of the largest cohort to date (53,022 individuals from the UK Biobank Pharma Proteomics Project (UKB-PPP)) [[68]19]. Subsequently, we used the proteins identified by PWAS as exposures and CVDs as the outcome to detect proteins showing evidence of a causal association. The study design is shown in Fig. [69]1. We applied the above analytic approaches to the discovery dataset consisting of human plasma proteomic and genetic data from UKB-PPP [[70]19] and the GWAS of 26 CVDs spanning 3 categories (16 cardiac diseases, N = 234,829 to 1,030,836; 5 venous diseases, N = 388,830–484,598; 5 cerebrovascular diseases, N = 484,598 to 1,308,460) [[71]20–[72]25]. Additionally, we conducted replication analyses leveraging two independent human plasma proteomic datasets, encompassing 7213 participants from the Atherosclerosis Risk in Communities (ARIC) study and 3301 individuals from the INTERVAL study, to ensure robustness and reproducibility [[73]26, [74]27]. For functional interpretation of the identified proteins, enrichment analyses were performed to detect the pathways associated with CVD. Longitudinal stability analysis at plasma and cell-type levels was used to assess the expression stability of the proteins. We also pharmacologically annotated the proteins of interest with approved drugs to assess their feasibility as treatment targets. Fig. 1. [75]Fig. 1 [76]Open in a new tab Overview of the current study. Proteomics data were collected from three different sources: UKB, ARIC and INTERVAL. GWAS summary data for 26 CVDs spanning 3 categories (16 cardiac diseases, 5 venous diseases, 5 cerebrovascular diseases, up to 1,308,460 individuals) were included. PWAS were conducted using the proteomic datasets, followed by Mendelian randomization to assess potential causal relationships. Subsequently, functional annotation was conducted for genes identified through PWAS Results The analysis pipeline is shown in Supplementary Fig. [77]1. We first performed PWAS to obtain proteins associated with CVDs. Then SMR analysis was performed for causal inference, testing whether the proteins identified by PWAS were causally associated with CVDs. Finally, HEIDI test was used to make sure the SMR findings were not driven by linkage. Causal proteins from the initial analysis were carried forward for replication analysis and functional annotation. Discovery PWAS of multiple CVDs We generated the human blood plasma proteome model based on 53,022 UKB-PPP participants [[78]19]. The UKB-PPP OLINK proteome dataset included data for 2923 proteins [[79]19]. After quality control, 1715 proteins with SNP-based heritability (P < 0.05, h^2 > 0, Supplementary Table [80]1) were used for PWAS. The Pearson’s r between the predictive power of protein models and gene heritability was 0.938 (R^2 = 0.88, Supplementary Fig. [81]2), supporting the accuracy of our protein estimation model. The plasma proteome model results were integrated with the 26 CVD GWAS datasets using the FUSION pipeline [[82]28]. Detailed information of the 26 GWAS datasets is shown in Supplementary Table [83]2. We performed genetic correlation analysis and the results revealed that diseases within the same category typically exhibit stronger correlations. (Supplementary Fig. [84]3). As shown in Supplementary Table [85]3, we identified 341 protein-CVD pairs of associations after multiple testing corrections (P < 2.92 × 10^–5, 0.05/1715 proteins) (Fig. [86]2A). For each disease, when multiple genes corresponding to the associated proteins are located within a 1 Mb genomic window, these genes were subjected to further conditional analysis (Supplementary Fig. [87]4). This analysis was conducted on 53 unique genes (encompassing 87 protein-disease association pairs), with the results presented in Supplementary Table [88]4. The number of overlapping SNPs for each pair of genes was shown in Supplementary Table [89]5. To identify independent associations, we performed conditional analyses using a regression-based summary statistics approach [[90]28]. Finally, we obtained 314 independent PWAS association signals, including 155 unique proteins associated with CVD. The number of associated genes for each phenotype is shown in Fig. [91]2B. Taking heart failure as an example, PWAS identified 48 proteins associated with it (Fig. [92]2B). Eighteen genes from seven loci were subjected to conditional analysis. Six genes (SORT1, SHISA5, PDE5A, PGF, FURIN, DDT) were removed from the conditional analyses (Fig. [93]2C). Finally, we obtained 42 genes associated with heart failure. Fig. 2. [94]Fig. 2 [95]Open in a new tab Results of the PWAS analyses. A Manhattan plot for the PWAS results for CVD. Each dot represents the association between a disease and a gene, with the x-axis indicating the genomic location and the y-axis showing -log[10](P value). The gray horizontal line denotes the Bonferroni-corrected threshold, P < 2.92 × 10^–5. Associations for the three CVD categories are highlighted in red, green, blue, respectively. Labeled genes are the top results on each chromosome. B Number of genes identified by PWAS across the 26 CVD. Different colors indicate different disease categories. Genes identified through conditional analysis are shown in gray (n = 27). C Regional plots of PWAS loci from conditional analysis for heart failure. Conditional proteins include CELSR2, GSTM1, SPINK8, DAG1, HYAL1, FABP2, ACYP1, FES, GSTT2B and SUSD2. In each plot, the top panel highlights marginally associated PWAS genes (blue) and the retained genes (green). The bottom panel displays regional Manhattan plot of the data before (grey) and after (blue) conditioning on the predicted expression of the green genes. Chr: chromosome Causal analysis of the proteins identified by PWAS We employed Summary-data-based Mendelian Randomization (SMR) [[96]15] to further evaluate the causality of the 155 proteins. The accompanying heterogeneity in dependent instruments (HEIDI) test was used to ensure that the SMR findings were not driven by linkage disequilibrium. The SMR results confirmed the causal effects of 121 unique proteins on CVD risk. The unit of the effect estimates is the standard deviation of the protein abundance. However, HEIDI results argued against a causal role for 49 genes due to linkage disequilibrium. Therefore, 72 unique proteins provided evidence of a causal role in CVD based on the SMR/HEIDI analysis. Importantly, the direction of effect in SMR was consistent with that in PWAS for all proteins passing SMR/HEIDI analysis. Finally, a total of 72 proteins were retained for further analysis (Fig. [97]3A, Supplementary Table [98]6, Supplementary Table [99]7). Fig. 3. [100]Fig. 3 [101]Open in a new tab Results for the causal genes. A The heatmap displays PWAS results for the 72 genes that passed the causality tests. The color intensity indicates the direction and magnitude of the association. Genes replicated in independent PWAS analysis are indicated by circles, Causal genes are marked with “*” and the novel genes—defined as those without any variant (P < 5 × 10^–8) within ± 2 M window in the GWAS results—are labeled in red. B The Venn diagram illustrates the overlap of causal genes across three disease categories. C The number of novel genes identified in each disease. D Manhattan plots of the PWAS and GWAS results for CVDs. The top panel presents the Manhattan plot of the PWAS results. The horizontal dotted line indicates the Bonferroni-corrected threshold (P < 2.91 × 10^−5, 0.05/1715). Colored dots highlight novel genes not previously reported in GWAS. The bottom panel displays the GWAS Manhattan plot, where each point represents the association results of a SNP with CVD. The horizontal line indicates the genome-wide threshold (P value = 5 × 10^−8) Replication analysis using the proteomics data from ARIC and INTERVAL For the 72 causal proteins identified from the discovery dataset, 35 were available in at least one of the replication datasets. The number of detected proteins in the ARIC and INTERVAL projects was 28 and 22, respectively. For these proteins, we incorporated plasma protein profiles and the CVD GWAS datasets to perform replication PWAS and SMR analysis. In the replication PWAS analysis, the effect sizes of the validation and discovery cohorts were correlated, with correlation coefficients (r) of 0.91 for the ARIC dataset and 0.85 for the INTERVAL dataset (Supplementary Fig. [102]5). The association between 35 proteins and CVDs was replicated in at least one cohort (Supplementary Table [103]8). After SMR analysis, the causal effects of 12 proteins on CVDs were replicated in at least one of the validation datasets (Supplementary Table [104]8). The causal associations between three proteins (ASPN, COL15A1 and IL6R) and CVDs were replicated in both datasets. Common proteins associated with diseases in three CVD categories Ninety-two percent of the proteins (66/72) were identified in only one disease category (Fig. [105]3B). Only PROCR was associated with diseases in all three CVD categories. As shown in Fig. [106]3A, PROCR is negatively associated with 2 cardiac diseases and 1 cerebrovascular disease, and positively associated with 1 venous disease. The inconsistency in association direction might be due to the fact that PROCR is linked to anti-inflammatory and anticoagulant functions [[107]29]. Novelty of the CVD causal genes To assess the novelty of the 72 potentially causal genes, we checked the lowest P values for the SNPs within a 2 Mb window of these genes using the summary statistics from the CVD GWAS. We found that 33 genes were not located within 2 Mb of a GWAS signal (P < 5 × 10^–8), suggesting that these 33 genes are novel genes not implicated in the original GWAS (Fig. [108]3C). Twenty novel genes have not been detected in other CVD GWAS either. For example, PROC was found to be associated with venous thromboembolism (Fig. [109]3D top) and VSIG10 was found to be associated with two venous diseases (diseases of veins and varicose veins; Supplementary Fig. [110]6). The associations between these genes and CVDs were not detected across all GWAS datasets for the 26 CVD subtypes. The remaining 13 novel genes were not implicated in the original GWAS but have been detected in the GWAS of other CVDs. For example, our PWAS results showed that PCSK9 was associated with heart failure and calcific aortic valvular stenosis (Fig. [111]3D bottom). The GWAS of calcific aortic valvular stenosis didn’t detect the association signal of this gene, but it was found to be associated with heart failure in GWAS data. Gene ontology enrichment analysis To further elucidate the molecular mechanisms underlying the 72 identified proteins, we carried out a non-redundant gene ontology (GO) biological processes enrichment analysis using WebGestalt 2024 [[112]30, [113]31]. Five pathways were identified with FDR < 0.05. Of these, two pathways are associated with vessel/blood—related processes, and one is linked to lipid metabolism. Specifically, as shown in Fig. [114]4A, cardiac disease relevant genes are enriched in two pathways, namely the biological process involved in symbiotic interaction and regulation of plasma lipoprotein particle levels. Genes associated with venous diseases exhibited enrichment in three biological pathways, with two of these pathways falling under vessel/blood-related processes, specifically the coagulation pathway. No pathway was detected for the genes associated with cerebrovascular disease due to the limited number of genes. The complete GO enrichment analysis results are provided in Supplementary Table [115]9. Fig. 4. [116]Fig. 4 [117]Open in a new tab Function annotation of the identified genes. A Enriched Gene Ontology (GO) biological process (BP) terms for the causal genes across different categories. The color of each bar indicates the biological function category of the pathway: lipid-related process (faint yellow), vessel/blood-related process (purple) and other (gray). B Protein–protein interaction network constructed with the identified causal genes. Edges represent physical interaction, with thickness proportional to the interaction score. Genes associated with cardiac disease, venous disease and cerebrovascular disease are colored red, green, and blue, respectively. Node size reflects the number of connections. Community 1 includes six proteins associated with immunity/inflammation. Community 2 includes four proteins associated with lipid-related process. Community 3 includes five proteins involved in vessel/blood-related process, particularly the formation of fibrin clot. C Enriched mouse phenotypes associated with the causal genes across different disease categories. Bar colors denote biological function categories: immunity/inflammation (light coral), lipid-related process (faint yellow), vessel/blood-related process, (purple) and other (gray). D Results of the longitudinal stability analysis at the protein level (left) and single-cell level (right). At the protein level, genes are classified as stable (blue) or variable (red) based on a coefficient of variation (CV) threshold of 10%. Among the 72 causal proteins, 29 were classified as stable. The colored bars on the left reflect gene grouping based on PWAS results. For the single-cell data, a CV threshold of 10% was used. Gray indicates samples with low average expression (< 0.01 after normalization). Eighteen genes exhibit stable expression across 19 cell types. Different donors are indicated by different colors. E Gene-drug-disease network constructed using the identified causal genes. Edge colors indicate the disease category of the association genes. ICD: International Classification of Diseases Protein–protein interactions (PPI) network analysis To investigate the connectivity for the 72 proteins, we performed network-based analysis using STRING [[118]32]. The minimum required interaction score was 0.4. The constructed PPI network included 19 nodes and 19 edges, primarily comprising three protein communities (Fig. [119]4B). Proteins associated with cardiac disease predominantly clustered within a network centered around FN1 and APOE. Consistent with pathway enrichment analysis results, these proteins were mostly in the community of immunity/inflammation and lipid-related process. The proteins associated with venous disease were mainly centered around five proteins (F10, F11, PROC, PROCR and PROS1) implicated in blood/vessel-related processes, particularly the coagulation process. Interactions among the proteins associated with cerebrovascular diseases are relatively sparse. Complete information about communities is presented in Supplementary Table [120]10. Mouse phenotypic annotation of potential causal genes. We further evaluated whether the 72 proteins were associated with CVD-related phenotypes in mice using the Mouse Genome Informatics (MGI) database [[121]33]. We performed phenotype enrichment analysis using Fisher’s exact test. Consistent with the pathway enrichment and network analysis results, mutations in genes associated with cardiac diseases were enriched in phenotypes related to immunity/inflammation, lipid-related process, and vessel/blood-related process (Fig. [122]4C). Mutations in genes associated with venous disease were enriched in phenotypes related to vessel/blood-related process (Fig. [123]4C). These results further support the involvement of the identified proteins in CVD pathogenesis. The complete MGI annotation results are provided in Supplementary Table [124]11. Evaluate longitudinal stability at protein and single-cell level To evaluate the expression stability of the 72 proteins, we performed longitudinal analysis using plasma proteomics data and peripheral blood mononuclear cells (PBMC) single cell RNA-seq (scRNA-seq) data from GEO dataset [125]GSE190992 [[126]34]. The plasma proteomics data were collected from 6 healthy, non-smoking Caucasian donors over a 10-week period. Thirty-two of the 72 proteins were detected in this dataset. Among them, 90.63% (29/32) of the proteins exhibited stable expression in plasma (median coefficient of variation < 10%, Fig. [127]4D left). The PBMC scRNA-seq data were collected weekly from four donors over a course of 6 weeks. We identified 18 genes that displayed stable expression in at least one cell type (median coefficient of variation < 10%, Fig. [128]4D right). Notably, all these genes were stably expressed in CD14+  monocytes. According to previous studies, monocytes play a crucial role in both local ischemia and inflammatory responses, which are closely linked to the development of cardiovascular diseases [[129]35–[130]37]. Cell-type specific expression of the CVD causal proteins We investigated whether the 18 genes stably expressed in a certain cell type also exhibited enriched expression in that cell type. As shown in Supplementary Tables [131]12 and [132]14 genes exhibited enriched expression in cells where they were stably expressed. For example, IL1RN was stably expressed in CD14-Mono (Fig. [133]4D), and its expression in this cell type was higher than in other cell types. We further validated the enriched expression of these 14 genes in cells with their stable expression, using an independent scRNA-seq dataset derived from PBMCs of 11 individuals [[134]38]. The enriched expression of 10 genes was successfully validated (Supplementary Table [135]13). Potential of causal proteins in disease diagnostic model To validate the role of causal proteins identified through PWAS in early clinical disease screening, we utilized Support Vector Machines (SVM) to build protein-based disease diagnostic models using disease information from the UKB. Among 18 cardiovascular diseases with over 100 patients, 14 diseases showed area under the curve (AUC) values exceeding 0.8 in the test datasets (Supplementary Table [136]14, Supplementary Fig. [137]7). For example, the diagnostic model for coronary artery bypass grafting achieved the highest performance with an AUC of 0.89 (95% CI 0.86–0.91). Drug repurposing analyses identified potential therapeutic targets for CVD To investigate potential drug target genes, we constructed a gene-drug-disease network (Fig. [138]4E). The results showed that 21 of the 72 proteins were the targets of 41 drugs with c clinical trials completed or ongoing (Supplementary Table [139]15). Eleven drugs were already in use for treating circulatory system disorders. For example, Drotrecogin alfa, which targets PROCR and PROS1, has been used as an effective treatment for managing cerebrovascular ischemic events [[140]39]. The remaining 30 drugs for treating diseases from other systems might be considered for potential drug repurposing. For example, Menadione, which targets PROC, is currently used to treat vitamin K deficiency and prostate cancer. Discussion In this study, we integrated data from 26 CVD GWAS along with three large-scale human plasma proteomic datasets to conduct a comprehensive PWAS analysis. In total, we identified 115 independent protein-disease association pairs, involving 72 unique proteins associated with CVD. Among these, 33 represented novel associations that were not implicated in the original GWAS. We also explored potential biological mechanisms underlying CVD and identified promising targets for future drug development. The PWAS identified 72 genes with evidence consistent with a causal role in CVD, including 33 that were not implicated in the original GWAS. For example, PROC was identified as being associated with venous thromboembolism. PROC encodes a vitamin K-dependent enzyme that plays a crucial role in regulating human thrombosis and hemostasis [[141]40]. Consistent with our findings, previous studies have reported that reduced PROC levels in plasma can be used as a marker of increased risk of venous thrombosis [[142]41, [143]42]. In the PPI network, we also demonstrated that PROC, together with coagulation factors such as F11 and F10, forms a venous-related network. Longitudinal stability analysis showed that PROC was stably expressed in blood plasma. Currently, two drugs (Menadione and Cupric Chloride) that target PROC have completed clinical trials for the treatment of conditions such as fungal infections, prostate cancer, and vitamin K deficiency [[144]43]. Further studies are warranted to evaluate the therapeutic potential of these drugs for treating venous thromboembolism. Here we used PWAS coupled with subsequent SMR analysis to identify proteins causally associated with at least one of the 26 CVD subtypes, resulting in the identification of 72 proteins causally linked to CVDs. Recently, Schuermans et al. [[145]44] applied Cox regression followed by MR analysis on proteomics data from the UK Biobank, identifying 39 proteins causally associated with 4 CVD subtypes. These two analytical approaches are complementary in nature. PWAS is particularly useful for elucidating how genetic variation modulates protein functions to influence disease risk, whereas Cox regression enables the simultaneous estimation of multiple covariates’ effect on disease outcomes. Seven proteins were identified by both studies. In addition to differences in analytical methods, discrepancies in the proteomic datasets used may also contribute to the variation in results. We selected 1715 heritable proteins from the full set of proteins for PWAS analysis [[146]19], while they chose 1463 proteins in their analysis (1075 overlapping proteins). The novel proteins we identified in our study may offer additional insights into the pathogenesis of CVDs. Ninety-two percent (66/72) of the identified proteins are category-specific, suggesting that the underlying pathogenic mechanisms of the three disease categories are different. Only PROCR was associated with diseases in all three CVD categories. PROCR was positively associated with venous disorders but negatively associated with stroke and coronary artery disease. PROCR is a receptor for activated protein C, a serine protease that is activated as a part of and involved in the blood coagulation pathway. Consistent with our results, GWAS have shown that the minor G allele of rs867186 in this gene is associated with a higher risk of venous thromboembolism [[147]45, [148]46] but a lower risk of CAD [[149]47, [150]48]. A previous study has shown that PROCR is linked to CAD through anti-inflammatory mechanisms and to VTE through pro-thrombotic mechanisms [[151]29]. The longitudinal stability analysis revealed that 29 of the 32 detected proteins (90.63%) maintained stable expression in plasma. Stable proteins typically show little variation under healthy conditions. As noted by Vasaikar et al. [[152]34], their altered levels in disease states may highlight their potential as biomarkers for disease monitoring in future studies. In addition, PBMC scRNA data analysis identified 18 genes that exhibited stable expression in at least one cell type, and all 18 genes were stably expressed in CD14+ monocytes, highlighting the important role of CD14+ monocytes in CVD development. Consistently, previous studies have associated increased frequency of CD14+ monocytes with clinical CVD events and plaque vulnerability [[153]49, [154]50]. The density of CD14+ monocytes was found to be higher in patients with moderate‐severe heart failure compared to those with normal or mild LV impairment [[155]51, [156]52]. These results suggest that CD14+ monocytes might serve as biomarkers for CVD. Our study has several limitations. First, since the currently available proteomics and GWAS are mainly derived from European populations, our results are primarily applicable to the European population. Second, we focused on cis-regulatory elements when constructing models to assess protein influences. This is a common approach among researchers, as the current sample size in proteomics may not be sufficient to detect the trans effect. With larger-scale data available in the future, models considering both cis and trans effects can be constructed. In addition, the sample size of the dataset used for longitudinal analysis was relatively small, and studies with larger sample sizes are needed to confirm our findings. Future studies with larger, diverse populations and extended follow-up are needed to validate these preliminary findings. In summary, using the largest available proteomics data from UKB-PPP projects (a total of 1715 heritable proteins from 53,022 individuals), we performed a PWAS study for 26 CVDs. We identified 72 proteins that causally influence CVD risk. These proteins may serve as potential targets for future mechanistic and therapeutic studies aimed at finding effective treatments for CVD. Methods Human plasma proteomic and genetic data in UKB We generated the human blood plasma proteome models from 53,022 participants of European ancestry in the UKB-PPP [[157]19]. All UK Biobank participants provided informed consent and the conduct of our research complied with the ethical principles set forth in the Declaration of Helsinki [[158]53, [159]54]. The sample was selected in two batches from Consortium members and UK Biobank cohort, and the proteomic profiling was performed using standard Olink proteomics pipeline. Proximity Extension Assay antibodies matched to unique complementary oligonucleotides were bound to their respective target proteins and underwent quantification through next-generation sequencing. Following rigorous quality control measures, the normalized protein expression (NPX) values were computed using the Inter-Plate Control method. This NPX score effectively served as a quantitative measure of protein abundance in our samples. Genotype data matching the protein dataset underwent genotyping, imputation, and quality control steps as detailed in previous work [[160]53]. This included sex discrepancy, sex chromosome aneuploidy, and heterozygosity checks, with imputed variants filtered for INFO scores > 0.7. All chromosomal positions were updated to the hg38 assembly using LiftOver [[161]55]. Genotyping quality control was executed using PLINK2.0 software [[162]56]. Participants exhibiting over 5% missing genotypic data were removed from consideration. Moreover, variants displaying deviations from the Hardy–Weinberg equilibrium (with P values less than 1 × 10^–8), a genotype missing rate exceeding 5%, a minor allele frequency below 1%, or those not classified as SNPs were also excluded from the analysis. Following the preprocessing of both genotype and protein datasets, we adopted the FUSION software (download from [163]http://gusevlab.org/projects/fusion) [[164]28] to train the proteome model and we only considered the subset comprising 1,190,321 SNPs from the HapMap3 project [[165]57]. SNPs situated up to 500 kb away from either end of genes were defined as cis-SNPs. The model further incorporated adjustments for protein expression based on gender and age as covariates to refine the association analysis and account for potential confounding variables. CVD GWAS summary association statistics Our GWAS data were mainly collected from the GWAS catalog [[166]20–[167]25] and FinnGen database [[168]58]. In accordance with the ICD-10 standard of circulatory disorders, we selected GWAS studies involving a minimum of 5000 cases. When multiple studies of the same condition were identified, we opted for those with the largest sample sizes. Statins are drugs that lower cholesterol and reduce the risk of heart disease. The FinnGen database also classified “statin medication” in the ‘Diseases of the circulatory system’ category. Therefore, we included statin medication as one of the CVDs in our analysis. This stringent selection procedure resulted in a final cohort of 26 unique GWAS for our investigation. Based on the distinct pathophysiological mechanisms, we categorized the diseases into cardiac disease, venous disease, and cerebrovascular disease. Statistical approach Proteome-wide association studies (PWAS) We used the standard processes in the FUSION software to construct protein models and incorporate GWAS data for our PWAS analysis [[169]28]. After applying the previously outlined quality control measures to screen the sample and genotype data, we utilized GCTA software to estimate the SNP-based heritability for individual proteins [[170]59]. To expedite calculations, a random subset of approximately 10,000 individuals was selected from the full cohort for each protein’s heritability estimation. From the analysis of 2923 proteins, 1715 displayed heritability (h^2 > 0, P < 0.05, Supplementary Table [171]1). We then employed the FUSION software to estimate the impact of SNPs on protein abundance using multiple predictive models (top1, lasso, enet) and selected the most predictive model [[172]28]. Next, FUSION combined the genetic effect of CVD (i.e., CVD GWAS Z-score) with the protein abundance weights by calculating the linear sum of Z-score × weight for the independent SNPs at the locus to perform PWAS and obtain the protein-CVD associations. For the resulting PWAS analysis results, the Bonferroni correction was applied to account for multiple testing. Consequently, proteins with a P value < 2.92 × 10^–5 (0.05/1715) were considered to be associated with CVD in our discovery PWAS analysis. For multiple proteins whose corresponding genes were located within 1 Mb, we performed conditional and joint analyses to identify independent associations while excluding spurious signals driven by correlated marginal effects. We performed a stepwise regression model selection procedure to identify independent signals. The top protein within the locus was initially chosen to condition the effects of other proteins. Subsequently, the proteins with minimum conditional P value lower than the cutoff (Bonferroni corrected 0.05) were added to the selected sets. All selected proteins were then jointly fitted in a model and the proteins with the largest P value that is greater than the cutoff were dropped. This process was repeated until no proteins can be added or removed from the model. The proteins remaining in the model were regarded as potential independent proteins at their loci. These analyses were conducted using FUSION software. Causal analysis We subsequently employed the SMR [[173]15] to draw causal inferences on the associations identified from the PWAS. We leveraged recently published pQTL data [[174]19] derived from UKB-PPP study for the exposure proteins. GWAS data for the outcome CVDs were the same as those used in our PWAS analysis. SMR (version 1.3.1) was used with default parameters. MR relies on three fundamental assumptions: (1) the genetic variants (SNPs) must be strongly associated with the exposure; (2) there are no measured and unmeasured confounders of the instrument and the outcome; and (3) the SNPs influence the outcome only through their effect on the exposure, not via alternative pathways. To meet these conditions, SNPs were selected based on the following criteria: they had to be cis-pQTLs, defined as variants located within ± 1 Mb of the gene encoding the target protein, and show genome-wide threshold (P < 5 × 10^−8) with a minor allele frequency (MAF) > 0.1. Palindromic SNPs with minor allele frequencies over 0.42 were removed from the analysis [[175]60, [176]61]. And the clumping algorithm within PLINK [[177]56] was used to select independent SNPs (r^2 threshold = 0.001, window size = 1 Mb). LD information was obtained from 1000 Genomes Project’s European data (phase 3). For all instruments at a locus, we calculated a vector of z-statistics (Z = BETA/SE), which were then combined into a test statistic (T = ∑z^2). This statistic accounts for the LD structure via the SNP correlation matrix, and the significance of the test was assessed using the Saddlepoint approximation [[178]62]. In addition, the HEIDI test was applied to assess and differentiate between true causal effects and pleiotropic signals. The core principle of the HEIDI test is that if a trait and gene expression are influenced by the same causal variant, then all SNPs in LD with that variant should yield LD-dependent effect estimates. Therefore, HEIDI evaluates heterogeneity by testing whether the effect estimates of different SNPs deviate from the top cis-pQTL’s estimate. Our determination of causal relationships relied on P < 0.05 (Bonferroni corrected) for the SMR analysis and the unadjusted P > 0.05 from the HEIDI test. The P < 0.05 in the HEIDI test was interpreted as evidence for potential linkage due to two nearby but distinct causal variants in strong linkage disequilibrium. The SMR analysis followed the Strengthening the Reporting of Observational Studies in Epidemiology (STROBE) MR reporting guidelines, and the checklist is provided in Supplementary Table [179]16. Replication analysis We then performed the replication analysis in two other publicly available datasets. The modeling methodologies for these datasets have been documented in prior research [[180]26]. The Atherosclerosis Risk in Communities (ARIC) study included 4483 protein measurements from 7213 European participants [[181]26], while the INTERVAL study retained information on 3170 proteins for 3301 individuals [[182]27]. Subsequent to the heritability filtering phase, an ensemble of 2379 proteins (1348 in ARIC and 1031 in INTERVAL) was selected for incorporation into our replication verification. We conducted replication analyses of PWAS and causal inference results in two datasets. The Bonferroni method was used for multiple testing correction. Associations are considered as successfully validated only if their P values passed the Bonferroni correction threshold and their effect size directions aligned with those observed in the discovery set. PPI and GO enrichment To investigate the biological function of the identified genes, we employed the STRING [[183]32] database to perform network analysis. The Markov cluster (MCL) algorithm was used with an inflation parameter of 3. Subsequently, the derived network was refined and visually optimized utilizing Cytoscape [[184]63] software. In this visualization, node size represents the degree of connectivity for each gene, indicative of its interaction frequency within the network, while distinct indicate different gene categories, facilitating categorical distinction and interpretation. Additionally, we conducted functional enrichment analysis for causal genes pertinent to three categories diseases using the WebGestalt [[185]31] online platform, focusing on the GO biological processes. We selected the pathways with FDR-adjusted P < 0.05 and more than 3 overlapping genes. Longitudinal data stability analysis We conducted a longitudinal analysis utilizing the data set [186]GSE190992 from the Gene Expression Omnibus (GEO) database [[187]34]. Details for data collection methodology has been reported previously [[188]34]. The data encompassed proteomics measurements over a 10-week period for 6 healthy donors, assayed using the Olink Proximity Extension assay. The single-cell transcriptomic data were collected in over 6-week period from 4 of these donors. For each donor, we calculated the coefficient of variation (CV) for each gene at both the proteome and single-cell levels as a measure of stability (CV = standard deviation / mean × 100%). We selected a threshold of 10% as criteria for stable protein abundance in proteome data and stable gene expression in single-cell data. Cell-type specific expression of the CVD causal genes We utilized scRNA-seq data from 11 healthy donors sourced from the GEO database ([189]GSE244515) [[190]17]. During the preprocessing phase, cells expressing fewer than 200 genes or with mitochondrial gene content exceeding 15% were filtered out. For cell type annotation, we normalized the count matrices using the LogNormalize method with a scaling factor of 10,000. This normalization also facilitated identification of variable features. To integrate datasets from different samples and mitigate batch effects, we applied the Harmony integration method. These procedures were carried out using the Seurat package version 4.4.0 within the R environment. After quality control and normalization, the data included 27,484 genes across 371,086 cells. For the 70 CVD causal genes, we used the Wilcoxon rank sum test to compare the expression levels between target cell type and other cells. We applied FDR correction to the P values derived from the multiple testing across 27,484 genes. Finally, we retained the results of FDR P value < 0.05 and logFC > 1.5 were considered to have cell-type-specific expression. Potential of causal proteins in disease diagnostic models We constructed disease diagnostic models using 70 CVD causal proteins identified via PWAS. Patients were selected based on ICD-10 diagnostic codes (Data-Field: 41,270) from the UK Biobank. Only diseases with more than 100 cases were included, and individuals with more than 7 proteins missing were excluded. Individuals without any disease served as the control group. We used Support Vector Machines (SVM) to build these models. To address the issue of sample imbalance, where control samples greatly outnumber case samples, we randomly downsampled the control group to match the number of cases during model training. This approach achieved a 1:1 sample balance and reduced bias from uneven sample distribution. To ensure the robustness and generalizability of our models, we adopted a tenfold cross-validation method. In this procedure, the training dataset was randomly divided into ten equal parts. During each iteration of the cross-validation, nine folds were used for training the model, while the remaining part was reserved for validation. This process was repeated ten times, with each fold serving once as the validation set. Model performance was evaluated by averaging the results across all iterations, enable selection of the best predictive model. To further assess the reliability of our models, we conducted bootstrap sampling with 1000 iterations to estimate the area under the curve (AUC), its 95% confidence interval (CI), and other relevant evaluation metrics. The entire model construction process was implemented in R software (version 4.1.0), utilizing key packages including caret for model training and evaluation, pROC for ROC analysis, and e1071 for SVM implementation. Mouse genome informatics and drug analysis MGI database [[191]33] serves as a global repository for murine research, offering a comprehensive integration of genetic, genomic, and biological information. This platform facilitates research into human health and disease by enabling insights derived from mouse models. We found that many genes knockout mouse models exhibit phenotypes associated with circulatory system disease. Phenotype enrichment analysis was performed using Fisher’s exact test. We retained phenotypes that met the following criteria: at least three overlapping genes, FDR-adjusted P value < 0.05, and odds ratio (OR) > 1. Furthermore, we constructed a gene-drug-disease interaction network by integrating gene-drug associations from the DrugBank [[192]64] database and drug-disease relationships from the Therapeutic Target Database (TTD) [[193]65]. The network focused exclusively on drugs with approved clinical efficacy, excluding those that had been discontinued at any stage of development. Supplementary Information [194]Supplementary Material 1^ (5.8MB, pdf) [195]Supplementary Material 2^ (2.8MB, xlsx) Acknowledgements