Abstract

Background

   Cardiovascular diseases (CVD) are the leading cause of global
   mortality, yet current treatments benefit only a subset of patients. To
   identify new potential treatment targets, we conducted the first
   proteome wide association study (PWAS) for 26 CVDs using plasma
   proteomics data from the largest cohort to date (53,022 individuals in
   the UK Biobank Pharma Proteomics Project (UKB-PPP)).

Methods and results

   We calculated single nucleotide polymorphism (SNP)-protein weights
   using the UKB-PPP dataset and integrated these weights with genome-wide
   association study (GWAS) summary statistics for 26 CVDs across three
   categories (16 cardiac, 5 venous, and 5 cerebrovascular diseases) in up
   to 1,308,460 individuals. PWAS was performed using the Functional
   Summary-based Imputation (FUSION) framework to identify protein-disease
   associations. Replication was conducted in two independent human plasma
   proteomic datasets (comprising 7213 and 3301 participants,
   respectively). We identified 155 proteins associated with CVDs and
   further Mendelian randomization analysis revealed 72 proteins with
   evidence of a causal association. Of these, 26 out of 35 available
   proteins were validated. Notably, 33 of the 72 proteins were not
   previously implicated in GWAS of CVDs. For example, PROC was found to
   be associated with venous thromboembolism (P = 6.32 × 10^–7). We
   further conducted longitudinal analyses using plasma proteomics data
   and peripheral blood mononuclear cells single cell RNA-seq data. The
   results showed that 90.63% (29/32) of the detected proteins exhibited
   stable plasma expression, and 18 genes displayed stable expression in
   at least one cell type, particularly in CD14+ monocytes. We also
   utilized these proteins to construct disease diagnostic models, and
   notably, models for 14 out of 18 diseases achieved an area under the
   curve (AUC) exceeding 0.8, indicating promising diagnostic potential.

Conclusions

   We identified 72 proteins that causally influence CVD risk, providing
   new mechanistic insights into CVD and may prove to be promising targets
   as CVD therapeutics.

Graphical abstract

   [52]graphic file with name 12933_2025_2847_Figa_HTML.jpg

Supplementary Information

   The online version contains supplementary material available at
   10.1186/s12933-025-02847-w.

   Keywords: CVD, Human blood plasma proteomes, Causal proteins, PWAS

Research insights

   What is currently known about this topic?
     * Cardiovascular diseases (CVD) are multifactorial disorders
       influenced by both heritable and environmental factors.
     * Proteins, as the final products of gene expression, are the main
       functional components of biological processes. Understanding the
       association between protein abundance and disease status may
       facilitate the identification of novel therapeutic targets.

   What is the key research question?
     * The key research question of this study is to identify CVD relevant
       proteins and provide potential therapeutic targets.

   What is new?
     * We identified 72 proteins causally associated with CVD, and 33 of
       them are encoded by genes that were not implicated in the original
       genome-wide association study. Disease diagnostic models
       constructed using these proteins demonstrated promising potential
       for accurate disease detection.

   How might this study influence clinical practice?
     * The CVD relevant proteins we identified provide new mechanistic
       insights into CVD and may prove to be promising targets as CVD
       therapeutics.

Introduction

   Cardiovascular diseases (CVD) are a group of disorders of the heart and
   blood vessels. As the leading global cause of mortality, CVD took an
   estimated 17.9 million lives in 2019, accounting for 32% of all deaths
   worldwide [[53]1–[54]3]. In clinical practice, there are some commonly
   used treatments for CVD, such as the use of statins to reduce
   cardiovascular morbidity and mortality. However, current treatments are
   not suitable for all patients and have certain risks [[55]4].
   Therefore, there is an urgent and critical need for new therapeutic
   targets for CVD [[56]5].

   Proteins, as the final products of gene expression, are the main
   functional components of biological processes [[57]6]. Understanding
   the association between protein abundance and disease status may
   facilitate the identification of novel therapeutic targets
   [[58]7–[59]11]. However, due to the high cost of proteomic
   measurements, establishing associations between protein abundance and
   disease status in large-scale populations remains challenging [[60]12].
   Proteome-wide association study (PWAS) is a powerful strategy to solve
   this problem. It employs single-nucleotide polymorphisms (SNPs) to
   genetically impute protein abundance levels, which are then integrated
   with genome-wide association study (GWAS) summary statistics for a
   trait to derive protein-trait associations [[61]13, [62]14]. Further
   summary data-based Mendelian randomization (SMR) can be used to
   identify proteins with evidence that suggests a causal association with
   the outcome [[63]15]. MR is a widely used approach that utilizes
   genetic variants as instrumental variables to infer the potential
   causal effect of an exposure on an outcome. Since genetic variations
   are randomly allocated during meiosis and fertilization, they are
   relatively independent of individuals’ subsequent behaviors. Therefore,
   MR analysis could minimize issues of confounding and reverse causality.
   This integrative analytical approach has been employed to identify
   novel potential therapeutic targets for neurological disorders [[64]13,
   [65]14, [66]16–[67]18] using brain proteomics data. However, this
   approach has not yet been used in the identification of CVD-relevant
   proteins.

   Here we performed the first PWAS for multiple CVDs using plasma
   proteomics data of the largest cohort to date (53,022 individuals from
   the UK Biobank Pharma Proteomics Project (UKB-PPP)) [[68]19].
   Subsequently, we used the proteins identified by PWAS as exposures and
   CVDs as the outcome to detect proteins showing evidence of a causal
   association. The study design is shown in Fig. [69]1. We applied the
   above analytic approaches to the discovery dataset consisting of human
   plasma proteomic and genetic data from UKB-PPP [[70]19] and the GWAS of
   26 CVDs spanning 3 categories (16 cardiac diseases, N = 234,829 to
   1,030,836; 5 venous diseases, N = 388,830–484,598; 5 cerebrovascular
   diseases, N = 484,598 to 1,308,460) [[71]20–[72]25]. Additionally, we
   conducted replication analyses leveraging two independent human plasma
   proteomic datasets, encompassing 7213 participants from the
   Atherosclerosis Risk in Communities (ARIC) study and 3301 individuals
   from the INTERVAL study, to ensure robustness and reproducibility
   [[73]26, [74]27]. For functional interpretation of the identified
   proteins, enrichment analyses were performed to detect the pathways
   associated with CVD. Longitudinal stability analysis at plasma and
   cell-type levels was used to assess the expression stability of the
   proteins. We also pharmacologically annotated the proteins of interest
   with approved drugs to assess their feasibility as treatment targets.

Fig. 1.

   [75]Fig. 1
   [76]Open in a new tab

   Overview of the current study. Proteomics data were collected from
   three different sources: UKB, ARIC and INTERVAL. GWAS summary data for
   26 CVDs spanning 3 categories (16 cardiac diseases, 5 venous diseases,
   5 cerebrovascular diseases, up to 1,308,460 individuals) were included.
   PWAS were conducted using the proteomic datasets, followed by Mendelian
   randomization to assess potential causal relationships. Subsequently,
   functional annotation was conducted for genes identified through PWAS

Results

   The analysis pipeline is shown in Supplementary Fig. [77]1. We first
   performed PWAS to obtain proteins associated with CVDs. Then SMR
   analysis was performed for causal inference, testing whether the
   proteins identified by PWAS were causally associated with CVDs.
   Finally, HEIDI test was used to make sure the SMR findings were not
   driven by linkage. Causal proteins from the initial analysis were
   carried forward for replication analysis and functional annotation.

Discovery PWAS of multiple CVDs

   We generated the human blood plasma proteome model based on 53,022
   UKB-PPP participants [[78]19]. The UKB-PPP OLINK proteome dataset
   included data for 2923 proteins [[79]19]. After quality control, 1715
   proteins with SNP-based heritability (P < 0.05, h^2 > 0, Supplementary
   Table [80]1) were used for PWAS. The Pearson’s r between the predictive
   power of protein models and gene heritability was 0.938 (R^2 = 0.88,
   Supplementary Fig. [81]2), supporting the accuracy of our protein
   estimation model.

   The plasma proteome model results were integrated with the 26 CVD GWAS
   datasets using the FUSION pipeline [[82]28]. Detailed information of
   the 26 GWAS datasets is shown in Supplementary Table [83]2. We
   performed genetic correlation analysis and the results revealed that
   diseases within the same category typically exhibit stronger
   correlations. (Supplementary Fig. [84]3).

   As shown in Supplementary Table [85]3, we identified 341 protein-CVD
   pairs of associations after multiple testing corrections
   (P < 2.92 × 10^–5, 0.05/1715 proteins) (Fig. [86]2A). For each disease,
   when multiple genes corresponding to the associated proteins are
   located within a 1 Mb genomic window, these genes were subjected to
   further conditional analysis (Supplementary Fig. [87]4). This analysis
   was conducted on 53 unique genes (encompassing 87 protein-disease
   association pairs), with the results presented in Supplementary
   Table [88]4. The number of overlapping SNPs for each pair of genes was
   shown in Supplementary Table [89]5. To identify independent
   associations, we performed conditional analyses using a
   regression-based summary statistics approach [[90]28]. Finally, we
   obtained 314 independent PWAS association signals, including 155 unique
   proteins associated with CVD. The number of associated genes for each
   phenotype is shown in Fig. [91]2B. Taking heart failure as an example,
   PWAS identified 48 proteins associated with it (Fig. [92]2B). Eighteen
   genes from seven loci were subjected to conditional analysis. Six genes
   (SORT1, SHISA5, PDE5A, PGF, FURIN, DDT) were removed from the
   conditional analyses (Fig. [93]2C). Finally, we obtained 42 genes
   associated with heart failure.

Fig. 2.

   [94]Fig. 2
   [95]Open in a new tab

   Results of the PWAS analyses. A Manhattan plot for the PWAS results for
   CVD. Each dot represents the association between a disease and a gene,
   with the x-axis indicating the genomic location and the y-axis showing
   -log[10](P value). The gray horizontal line denotes the
   Bonferroni-corrected threshold, P < 2.92 × 10^–5. Associations for the
   three CVD categories are highlighted in red, green, blue, respectively.
   Labeled genes are the top results on each chromosome. B Number of genes
   identified by PWAS across the 26 CVD. Different colors indicate
   different disease categories. Genes identified through conditional
   analysis are shown in gray (n = 27). C Regional plots of PWAS loci from
   conditional analysis for heart failure. Conditional proteins include
   CELSR2, GSTM1, SPINK8, DAG1, HYAL1, FABP2, ACYP1, FES, GSTT2B and
   SUSD2. In each plot, the top panel highlights marginally associated
   PWAS genes (blue) and the retained genes (green). The bottom panel
   displays regional Manhattan plot of the data before (grey) and after
   (blue) conditioning on the predicted expression of the green genes.
   Chr: chromosome

Causal analysis of the proteins identified by PWAS

   We employed Summary-data-based Mendelian Randomization (SMR) [[96]15]
   to further evaluate the causality of the 155 proteins. The accompanying
   heterogeneity in dependent instruments (HEIDI) test was used to ensure
   that the SMR findings were not driven by linkage disequilibrium. The
   SMR results confirmed the causal effects of 121 unique proteins on CVD
   risk. The unit of the effect estimates is the standard deviation of the
   protein abundance. However, HEIDI results argued against a causal role
   for 49 genes due to linkage disequilibrium. Therefore, 72 unique
   proteins provided evidence of a causal role in CVD based on the
   SMR/HEIDI analysis. Importantly, the direction of effect in SMR was
   consistent with that in PWAS for all proteins passing SMR/HEIDI
   analysis. Finally, a total of 72 proteins were retained for further
   analysis (Fig. [97]3A, Supplementary Table [98]6, Supplementary
   Table [99]7).

Fig. 3.

   [100]Fig. 3
   [101]Open in a new tab

   Results for the causal genes. A The heatmap displays PWAS results for
   the 72 genes that passed the causality tests. The color intensity
   indicates the direction and magnitude of the association. Genes
   replicated in independent PWAS analysis are indicated by circles,
   Causal genes are marked with “*” and the novel genes—defined as those
   without any variant (P < 5 × 10^–8) within ± 2 M window in the GWAS
   results—are labeled in red. B The Venn diagram illustrates the overlap
   of causal genes across three disease categories. C The number of novel
   genes identified in each disease. D Manhattan plots of the PWAS and
   GWAS results for CVDs. The top panel presents the Manhattan plot of the
   PWAS results. The horizontal dotted line indicates the
   Bonferroni-corrected threshold (P < 2.91 × 10^−5, 0.05/1715). Colored
   dots highlight novel genes not previously reported in GWAS. The bottom
   panel displays the GWAS Manhattan plot, where each point represents the
   association results of a SNP with CVD. The horizontal line indicates
   the genome-wide threshold (P value = 5 × 10^−8)

Replication analysis using the proteomics data from ARIC and INTERVAL

   For the 72 causal proteins identified from the discovery dataset, 35
   were available in at least one of the replication datasets. The number
   of detected proteins in the ARIC and INTERVAL projects was 28 and 22,
   respectively. For these proteins, we incorporated plasma protein
   profiles and the CVD GWAS datasets to perform replication PWAS and SMR
   analysis.

   In the replication PWAS analysis, the effect sizes of the validation
   and discovery cohorts were correlated, with correlation coefficients
   (r) of 0.91 for the ARIC dataset and 0.85 for the INTERVAL dataset
   (Supplementary Fig. [102]5). The association between 35 proteins and
   CVDs was replicated in at least one cohort (Supplementary
   Table [103]8). After SMR analysis, the causal effects of 12 proteins on
   CVDs were replicated in at least one of the validation datasets
   (Supplementary Table [104]8). The causal associations between three
   proteins (ASPN, COL15A1 and IL6R) and CVDs were replicated in both
   datasets.

Common proteins associated with diseases in three CVD categories

   Ninety-two percent of the proteins (66/72) were identified in only one
   disease category (Fig. [105]3B). Only PROCR was associated with
   diseases in all three CVD categories. As shown in Fig. [106]3A, PROCR
   is negatively associated with 2 cardiac diseases and 1 cerebrovascular
   disease, and positively associated with 1 venous disease. The
   inconsistency in association direction might be due to the fact that
   PROCR is linked to anti-inflammatory and anticoagulant functions
   [[107]29].

Novelty of the CVD causal genes

   To assess the novelty of the 72 potentially causal genes, we checked
   the lowest P values for the SNPs within a 2 Mb window of these genes
   using the summary statistics from the CVD GWAS. We found that 33 genes
   were not located within 2 Mb of a GWAS signal (P < 5 × 10^–8),
   suggesting that these 33 genes are novel genes not implicated in the
   original GWAS (Fig. [108]3C). Twenty novel genes have not been detected
   in other CVD GWAS either. For example, PROC was found to be associated
   with venous thromboembolism (Fig. [109]3D top) and VSIG10 was found to
   be associated with two venous diseases (diseases of veins and varicose
   veins; Supplementary Fig. [110]6). The associations between these genes
   and CVDs were not detected across all GWAS datasets for the 26 CVD
   subtypes. The remaining 13 novel genes were not implicated in the
   original GWAS but have been detected in the GWAS of other CVDs. For
   example, our PWAS results showed that PCSK9 was associated with heart
   failure and calcific aortic valvular stenosis (Fig. [111]3D bottom).
   The GWAS of calcific aortic valvular stenosis didn’t detect the
   association signal of this gene, but it was found to be associated with
   heart failure in GWAS data.

Gene ontology enrichment analysis

   To further elucidate the molecular mechanisms underlying the 72
   identified proteins, we carried out a non-redundant gene ontology (GO)
   biological processes enrichment analysis using WebGestalt 2024
   [[112]30, [113]31]. Five pathways were identified with FDR < 0.05. Of
   these, two pathways are associated with vessel/blood—related processes,
   and one is linked to lipid metabolism. Specifically, as shown in
   Fig. [114]4A, cardiac disease relevant genes are enriched in two
   pathways, namely the biological process involved in symbiotic
   interaction and regulation of plasma lipoprotein particle levels. Genes
   associated with venous diseases exhibited enrichment in three
   biological pathways, with two of these pathways falling under
   vessel/blood-related processes, specifically the coagulation pathway.
   No pathway was detected for the genes associated with cerebrovascular
   disease due to the limited number of genes. The complete GO enrichment
   analysis results are provided in Supplementary Table [115]9.

Fig. 4.

   [116]Fig. 4
   [117]Open in a new tab

   Function annotation of the identified genes. A Enriched Gene Ontology
   (GO) biological process (BP) terms for the causal genes across
   different categories. The color of each bar indicates the biological
   function category of the pathway: lipid-related process (faint yellow),
   vessel/blood-related process (purple) and other (gray). B
   Protein–protein interaction network constructed with the identified
   causal genes. Edges represent physical interaction, with thickness
   proportional to the interaction score. Genes associated with cardiac
   disease, venous disease and cerebrovascular disease are colored red,
   green, and blue, respectively. Node size reflects the number of
   connections. Community 1 includes six proteins associated with
   immunity/inflammation. Community 2 includes four proteins associated
   with lipid-related process. Community 3 includes five proteins involved
   in vessel/blood-related process, particularly the formation of fibrin
   clot. C Enriched mouse phenotypes associated with the causal genes
   across different disease categories. Bar colors denote biological
   function categories: immunity/inflammation (light coral), lipid-related
   process (faint yellow), vessel/blood-related process, (purple) and
   other (gray). D Results of the longitudinal stability analysis at the
   protein level (left) and single-cell level (right). At the protein
   level, genes are classified as stable (blue) or variable (red) based on
   a coefficient of variation (CV) threshold of 10%. Among the 72 causal
   proteins, 29 were classified as stable. The colored bars on the left
   reflect gene grouping based on PWAS results. For the single-cell data,
   a CV threshold of 10% was used. Gray indicates samples with low average
   expression (< 0.01 after normalization). Eighteen genes exhibit stable
   expression across 19 cell types. Different donors are indicated by
   different colors. E Gene-drug-disease network constructed using the
   identified causal genes. Edge colors indicate the disease category of
   the association genes. ICD: International Classification of Diseases

Protein–protein interactions (PPI) network analysis

   To investigate the connectivity for the 72 proteins, we performed
   network-based analysis using STRING [[118]32]. The minimum required
   interaction score was 0.4. The constructed PPI network included 19
   nodes and 19 edges, primarily comprising three protein communities
   (Fig. [119]4B). Proteins associated with cardiac disease predominantly
   clustered within a network centered around FN1 and APOE. Consistent
   with pathway enrichment analysis results, these proteins were mostly in
   the community of immunity/inflammation and lipid-related process. The
   proteins associated with venous disease were mainly centered around
   five proteins (F10, F11, PROC, PROCR and PROS1) implicated in
   blood/vessel-related processes, particularly the coagulation process.
   Interactions among the proteins associated with cerebrovascular
   diseases are relatively sparse. Complete information about communities
   is presented in Supplementary Table [120]10.

Mouse phenotypic annotation of potential causal genes.

   We further evaluated whether the 72 proteins were associated with
   CVD-related phenotypes in mice using the Mouse Genome Informatics (MGI)
   database [[121]33]. We performed phenotype enrichment analysis using
   Fisher’s exact test. Consistent with the pathway enrichment and network
   analysis results, mutations in genes associated with cardiac diseases
   were enriched in phenotypes related to immunity/inflammation,
   lipid-related process, and vessel/blood-related process (Fig. [122]4C).
   Mutations in genes associated with venous disease were enriched in
   phenotypes related to vessel/blood-related process (Fig. [123]4C).
   These results further support the involvement of the identified
   proteins in CVD pathogenesis. The complete MGI annotation results are
   provided in Supplementary Table [124]11.

Evaluate longitudinal stability at protein and single-cell level

   To evaluate the expression stability of the 72 proteins, we performed
   longitudinal analysis using plasma proteomics data and peripheral blood
   mononuclear cells (PBMC) single cell RNA-seq (scRNA-seq) data from GEO
   dataset [125]GSE190992 [[126]34].

   The plasma proteomics data were collected from 6 healthy, non-smoking
   Caucasian donors over a 10-week period. Thirty-two of the 72 proteins
   were detected in this dataset. Among them, 90.63% (29/32) of the
   proteins exhibited stable expression in plasma (median coefficient of
   variation < 10%, Fig. [127]4D left).

   The PBMC scRNA-seq data were collected weekly from four donors over a
   course of 6 weeks. We identified 18 genes that displayed stable
   expression in at least one cell type (median coefficient of
   variation < 10%, Fig. [128]4D right). Notably, all these genes were
   stably expressed in CD14+  monocytes. According to previous studies,
   monocytes play a crucial role in both local ischemia and inflammatory
   responses, which are closely linked to the development of
   cardiovascular diseases [[129]35–[130]37].

Cell-type specific expression of the CVD causal proteins

   We investigated whether the 18 genes stably expressed in a certain cell
   type also exhibited enriched expression in that cell type. As shown in
   Supplementary Tables [131]12 and [132]14 genes exhibited enriched
   expression in cells where they were stably expressed. For example,
   IL1RN was stably expressed in CD14-Mono (Fig. [133]4D), and its
   expression in this cell type was higher than in other cell types. We
   further validated the enriched expression of these 14 genes in cells
   with their stable expression, using an independent scRNA-seq dataset
   derived from PBMCs of 11 individuals [[134]38]. The enriched expression
   of 10 genes was successfully validated (Supplementary Table [135]13).

Potential of causal proteins in disease diagnostic model

   To validate the role of causal proteins identified through PWAS in
   early clinical disease screening, we utilized Support Vector Machines
   (SVM) to build protein-based disease diagnostic models using disease
   information from the UKB. Among 18 cardiovascular diseases with over
   100 patients, 14 diseases showed area under the curve (AUC) values
   exceeding 0.8 in the test datasets (Supplementary Table [136]14,
   Supplementary Fig. [137]7). For example, the diagnostic model for
   coronary artery bypass grafting achieved the highest performance with
   an AUC of 0.89 (95% CI 0.86–0.91).

Drug repurposing analyses identified potential therapeutic targets for CVD

   To investigate potential drug target genes, we constructed a
   gene-drug-disease network (Fig. [138]4E). The results showed that 21 of
   the 72 proteins were the targets of 41 drugs with c clinical trials
   completed or ongoing (Supplementary Table [139]15). Eleven drugs were
   already in use for treating circulatory system disorders. For example,
   Drotrecogin alfa, which targets PROCR and PROS1, has been used as an
   effective treatment for managing cerebrovascular ischemic events
   [[140]39]. The remaining 30 drugs for treating diseases from other
   systems might be considered for potential drug repurposing. For
   example, Menadione, which targets PROC, is currently used to treat
   vitamin K deficiency and prostate cancer.

Discussion

   In this study, we integrated data from 26 CVD GWAS along with three
   large-scale human plasma proteomic datasets to conduct a comprehensive
   PWAS analysis. In total, we identified 115 independent protein-disease
   association pairs, involving 72 unique proteins associated with CVD.
   Among these, 33 represented novel associations that were not implicated
   in the original GWAS. We also explored potential biological mechanisms
   underlying CVD and identified promising targets for future drug
   development.

   The PWAS identified 72 genes with evidence consistent with a causal
   role in CVD, including 33 that were not implicated in the original
   GWAS. For example, PROC was identified as being associated with venous
   thromboembolism. PROC encodes a vitamin K-dependent enzyme that plays a
   crucial role in regulating human thrombosis and hemostasis [[141]40].
   Consistent with our findings, previous studies have reported that
   reduced PROC levels in plasma can be used as a marker of increased risk
   of venous thrombosis [[142]41, [143]42]. In the PPI network, we also
   demonstrated that PROC, together with coagulation factors such as F11
   and F10, forms a venous-related network. Longitudinal stability
   analysis showed that PROC was stably expressed in blood plasma.
   Currently, two drugs (Menadione and Cupric Chloride) that target PROC
   have completed clinical trials for the treatment of conditions such as
   fungal infections, prostate cancer, and vitamin K deficiency [[144]43].
   Further studies are warranted to evaluate the therapeutic potential of
   these drugs for treating venous thromboembolism.

   Here we used PWAS coupled with subsequent SMR analysis to identify
   proteins causally associated with at least one of the 26 CVD subtypes,
   resulting in the identification of 72 proteins causally linked to CVDs.
   Recently, Schuermans et al. [[145]44] applied Cox regression followed
   by MR analysis on proteomics data from the UK Biobank, identifying 39
   proteins causally associated with 4 CVD subtypes. These two analytical
   approaches are complementary in nature. PWAS is particularly useful for
   elucidating how genetic variation modulates protein functions to
   influence disease risk, whereas Cox regression enables the simultaneous
   estimation of multiple covariates’ effect on disease outcomes. Seven
   proteins were identified by both studies. In addition to differences in
   analytical methods, discrepancies in the proteomic datasets used may
   also contribute to the variation in results. We selected 1715 heritable
   proteins from the full set of proteins for PWAS analysis [[146]19],
   while they chose 1463 proteins in their analysis (1075 overlapping
   proteins). The novel proteins we identified in our study may offer
   additional insights into the pathogenesis of CVDs.

   Ninety-two percent (66/72) of the identified proteins are
   category-specific, suggesting that the underlying pathogenic mechanisms
   of the three disease categories are different. Only PROCR was
   associated with diseases in all three CVD categories. PROCR was
   positively associated with venous disorders but negatively associated
   with stroke and coronary artery disease. PROCR is a receptor for
   activated protein C, a serine protease that is activated as a part of
   and involved in the blood coagulation pathway. Consistent with our
   results, GWAS have shown that the minor G allele of rs867186 in this
   gene is associated with a higher risk of venous thromboembolism
   [[147]45, [148]46] but a lower risk of CAD [[149]47, [150]48]. A
   previous study has shown that PROCR is linked to CAD through
   anti-inflammatory mechanisms and to VTE through pro-thrombotic
   mechanisms [[151]29].

   The longitudinal stability analysis revealed that 29 of the 32 detected
   proteins (90.63%) maintained stable expression in plasma. Stable
   proteins typically show little variation under healthy conditions. As
   noted by Vasaikar et al. [[152]34], their altered levels in disease
   states may highlight their potential as biomarkers for disease
   monitoring in future studies. In addition, PBMC scRNA data analysis
   identified 18 genes that exhibited stable expression in at least one
   cell type, and all 18 genes were stably expressed in CD14+ monocytes,
   highlighting the important role of CD14+ monocytes in CVD development.
   Consistently, previous studies have associated increased frequency of
   CD14+ monocytes with clinical CVD events and plaque vulnerability
   [[153]49, [154]50]. The density of CD14+ monocytes was found to be
   higher in patients with moderate‐severe heart failure compared to those
   with normal or mild LV impairment [[155]51, [156]52]. These results
   suggest that CD14+ monocytes might serve as biomarkers for CVD.

   Our study has several limitations. First, since the currently available
   proteomics and GWAS are mainly derived from European populations, our
   results are primarily applicable to the European population. Second, we
   focused on cis-regulatory elements when constructing models to assess
   protein influences. This is a common approach among researchers, as the
   current sample size in proteomics may not be sufficient to detect the
   trans effect. With larger-scale data available in the future, models
   considering both cis and trans effects can be constructed. In addition,
   the sample size of the dataset used for longitudinal analysis was
   relatively small, and studies with larger sample sizes are needed to
   confirm our findings. Future studies with larger, diverse populations
   and extended follow-up are needed to validate these preliminary
   findings.

   In summary, using the largest available proteomics data from UKB-PPP
   projects (a total of 1715 heritable proteins from 53,022 individuals),
   we performed a PWAS study for 26 CVDs. We identified 72 proteins that
   causally influence CVD risk. These proteins may serve as potential
   targets for future mechanistic and therapeutic studies aimed at finding
   effective treatments for CVD.

Methods

Human plasma proteomic and genetic data in UKB

   We generated the human blood plasma proteome models from 53,022
   participants of European ancestry in the UKB-PPP [[157]19]. All UK
   Biobank participants provided informed consent and the conduct of our
   research complied with the ethical principles set forth in the
   Declaration of Helsinki [[158]53, [159]54]. The sample was selected in
   two batches from Consortium members and UK Biobank cohort, and the
   proteomic profiling was performed using standard Olink proteomics
   pipeline. Proximity Extension Assay antibodies matched to unique
   complementary oligonucleotides were bound to their respective target
   proteins and underwent quantification through next-generation
   sequencing. Following rigorous quality control measures, the normalized
   protein expression (NPX) values were computed using the Inter-Plate
   Control method. This NPX score effectively served as a quantitative
   measure of protein abundance in our samples.

   Genotype data matching the protein dataset underwent genotyping,
   imputation, and quality control steps as detailed in previous work
   [[160]53]. This included sex discrepancy, sex chromosome aneuploidy,
   and heterozygosity checks, with imputed variants filtered for INFO
   scores > 0.7. All chromosomal positions were updated to the hg38
   assembly using LiftOver [[161]55]. Genotyping quality control was
   executed using PLINK2.0 software [[162]56]. Participants exhibiting
   over 5% missing genotypic data were removed from consideration.
   Moreover, variants displaying deviations from the Hardy–Weinberg
   equilibrium (with P values less than 1 × 10^–8), a genotype missing
   rate exceeding 5%, a minor allele frequency below 1%, or those not
   classified as SNPs were also excluded from the analysis.

   Following the preprocessing of both genotype and protein datasets, we
   adopted the FUSION software (download from
   [163]http://gusevlab.org/projects/fusion) [[164]28] to train the
   proteome model and we only considered the subset comprising 1,190,321
   SNPs from the HapMap3 project [[165]57]. SNPs situated up to 500 kb
   away from either end of genes were defined as cis-SNPs. The model
   further incorporated adjustments for protein expression based on gender
   and age as covariates to refine the association analysis and account
   for potential confounding variables.

CVD GWAS summary association statistics

   Our GWAS data were mainly collected from the GWAS catalog
   [[166]20–[167]25] and FinnGen database [[168]58]. In accordance with
   the ICD-10 standard of circulatory disorders, we selected GWAS studies
   involving a minimum of 5000 cases. When multiple studies of the same
   condition were identified, we opted for those with the largest sample
   sizes. Statins are drugs that lower cholesterol and reduce the risk of
   heart disease. The FinnGen database also classified “statin medication”
   in the ‘Diseases of the circulatory system’ category. Therefore, we
   included statin medication as one of the CVDs in our analysis. This
   stringent selection procedure resulted in a final cohort of 26 unique
   GWAS for our investigation. Based on the distinct pathophysiological
   mechanisms, we categorized the diseases into cardiac disease, venous
   disease, and cerebrovascular disease.

Statistical approach

Proteome-wide association studies (PWAS)

   We used the standard processes in the FUSION software to construct
   protein models and incorporate GWAS data for our PWAS analysis
   [[169]28]. After applying the previously outlined quality control
   measures to screen the sample and genotype data, we utilized GCTA
   software to estimate the SNP-based heritability for individual proteins
   [[170]59]. To expedite calculations, a random subset of approximately
   10,000 individuals was selected from the full cohort for each protein’s
   heritability estimation. From the analysis of 2923 proteins, 1715
   displayed heritability (h^2 > 0, P < 0.05, Supplementary Table [171]1).
   We then employed the FUSION software to estimate the impact of SNPs on
   protein abundance using multiple predictive models (top1, lasso, enet)
   and selected the most predictive model [[172]28]. Next, FUSION combined
   the genetic effect of CVD (i.e., CVD GWAS Z-score) with the protein
   abundance weights by calculating the linear sum of Z-score × weight for
   the independent SNPs at the locus to perform PWAS and obtain the
   protein-CVD associations. For the resulting PWAS analysis results,
   the Bonferroni correction was applied to account for multiple testing.
   Consequently, proteins with a P value < 2.92 × 10^–5 (0.05/1715) were
   considered to be associated with CVD in our discovery PWAS analysis.

   For multiple proteins whose corresponding genes were located within
   1 Mb, we performed conditional and joint analyses to identify
   independent associations while excluding spurious signals driven by
   correlated marginal effects. We performed a stepwise regression model
   selection procedure to identify independent signals. The top protein
   within the locus was initially chosen to condition the effects of other
   proteins. Subsequently, the proteins with minimum conditional P value
   lower than the cutoff (Bonferroni corrected 0.05) were added to the
   selected sets. All selected proteins were then jointly fitted in a
   model and the proteins with the largest P value that is greater than
   the cutoff were dropped. This process was repeated until no proteins
   can be added or removed from the model. The proteins remaining in the
   model were regarded as potential independent proteins at their loci.
   These analyses were conducted using FUSION software.

Causal analysis

   We subsequently employed the SMR [[173]15] to draw causal inferences on
   the associations identified from the PWAS. We leveraged recently
   published pQTL data [[174]19] derived from UKB-PPP study for the
   exposure proteins. GWAS data for the outcome CVDs were the same as
   those used in our PWAS analysis. SMR (version 1.3.1) was used with
   default parameters. MR relies on three fundamental assumptions: (1) the
   genetic variants (SNPs) must be strongly associated with the exposure;
   (2) there are no measured and unmeasured confounders of the instrument
   and the outcome; and (3) the SNPs influence the outcome only through
   their effect on the exposure, not via alternative pathways. To meet
   these conditions, SNPs were selected based on the following criteria:
   they had to be cis-pQTLs, defined as variants located within ± 1 Mb of
   the gene encoding the target protein, and show genome-wide threshold
   (P < 5 × 10^−8) with a minor allele frequency (MAF) > 0.1. Palindromic
   SNPs with minor allele frequencies over 0.42 were removed from the
   analysis [[175]60, [176]61]. And the clumping algorithm within PLINK
   [[177]56] was used to select independent SNPs (r^2 threshold = 0.001,
   window size = 1 Mb). LD information was obtained from 1000 Genomes
   Project’s European data (phase 3). For all instruments at a locus, we
   calculated a vector of z-statistics (Z = BETA/SE), which were then
   combined into a test statistic (T = ∑z^2). This statistic accounts for
   the LD structure via the SNP correlation matrix, and the significance
   of the test was assessed using the Saddlepoint approximation [[178]62].
   In addition, the HEIDI test was applied to assess and differentiate
   between true causal effects and pleiotropic signals. The core principle
   of the HEIDI test is that if a trait and gene expression are influenced
   by the same causal variant, then all SNPs in LD with that variant
   should yield LD-dependent effect estimates. Therefore, HEIDI evaluates
   heterogeneity by testing whether the effect estimates of different SNPs
   deviate from the top cis-pQTL’s estimate. Our determination of causal
   relationships relied on P < 0.05 (Bonferroni corrected) for the SMR
   analysis and the unadjusted P > 0.05 from the HEIDI test. The P < 0.05
   in the HEIDI test was interpreted as evidence for potential linkage due
   to two nearby but distinct causal variants in strong linkage
   disequilibrium. The SMR analysis followed the Strengthening the
   Reporting of Observational Studies in Epidemiology (STROBE) MR
   reporting guidelines, and the checklist is provided in Supplementary
   Table [179]16.

Replication analysis

   We then performed the replication analysis in two other publicly
   available datasets. The modeling methodologies for these datasets have
   been documented in prior research [[180]26]. The Atherosclerosis Risk
   in Communities (ARIC) study included 4483 protein measurements from
   7213 European participants [[181]26], while the INTERVAL study retained
   information on 3170 proteins for 3301 individuals [[182]27]. Subsequent
   to the heritability filtering phase, an ensemble of 2379 proteins (1348
   in ARIC and 1031 in INTERVAL) was selected for incorporation into our
   replication verification. We conducted replication analyses of PWAS and
   causal inference results in two datasets. The Bonferroni method was
   used for multiple testing correction. Associations are considered as
   successfully validated only if their P values passed the Bonferroni
   correction threshold and their effect size directions aligned with
   those observed in the discovery set.

PPI and GO enrichment

   To investigate the biological function of the identified genes, we
   employed the STRING [[183]32] database to perform network analysis. The
   Markov cluster (MCL) algorithm was used with an inflation parameter of
   3. Subsequently, the derived network was refined and visually optimized
   utilizing Cytoscape [[184]63] software. In this visualization, node
   size represents the degree of connectivity for each gene, indicative of
   its interaction frequency within the network, while distinct indicate
   different gene categories, facilitating categorical distinction and
   interpretation. Additionally, we conducted functional enrichment
   analysis for causal genes pertinent to three categories diseases using
   the WebGestalt [[185]31] online platform, focusing on the GO biological
   processes. We selected the pathways with FDR-adjusted P < 0.05 and more
   than 3 overlapping genes.

Longitudinal data stability analysis

   We conducted a longitudinal analysis utilizing the data set
   [186]GSE190992 from the Gene Expression Omnibus (GEO) database
   [[187]34]. Details for data collection methodology has been reported
   previously [[188]34]. The data encompassed proteomics measurements over
   a 10-week period for 6 healthy donors, assayed using the Olink
   Proximity Extension assay. The single-cell transcriptomic data were
   collected in over 6-week period from 4 of these donors. For each donor,
   we calculated the coefficient of variation (CV) for each gene at both
   the proteome and single-cell levels as a measure of stability
   (CV = standard deviation / mean × 100%). We selected a threshold of 10%
   as criteria for stable protein abundance in proteome data and stable
   gene expression in single-cell data.

Cell-type specific expression of the CVD causal genes

   We utilized scRNA-seq data from 11 healthy donors sourced from the GEO
   database ([189]GSE244515) [[190]17]. During the preprocessing phase,
   cells expressing fewer than 200 genes or with mitochondrial gene
   content exceeding 15% were filtered out. For cell type annotation, we
   normalized the count matrices using the LogNormalize method with a
   scaling factor of 10,000. This normalization also facilitated
   identification of variable features. To integrate datasets from
   different samples and mitigate batch effects, we applied the Harmony
   integration method. These procedures were carried out using the Seurat
   package version 4.4.0 within the R environment. After quality control
   and normalization, the data included 27,484 genes across 371,086 cells.
   For the 70 CVD causal genes, we used the Wilcoxon rank sum test to
   compare the expression levels between target cell type and other cells.
   We applied FDR correction to the P values derived from the multiple
   testing across 27,484 genes. Finally, we retained the results of FDR P
   value < 0.05 and logFC > 1.5 were considered to have cell-type-specific
   expression.

Potential of causal proteins in disease diagnostic models

   We constructed disease diagnostic models using 70 CVD causal proteins
   identified via PWAS. Patients were selected based on ICD-10 diagnostic
   codes (Data-Field: 41,270) from the UK Biobank. Only diseases with more
   than 100 cases were included, and individuals with more than 7 proteins
   missing were excluded. Individuals without any disease served as the
   control group. We used Support Vector Machines (SVM) to build these
   models. To address the issue of sample imbalance, where control samples
   greatly outnumber case samples, we randomly downsampled the control
   group to match the number of cases during model training. This approach
   achieved a 1:1 sample balance and reduced bias from uneven sample
   distribution. To ensure the robustness and generalizability of our
   models, we adopted a tenfold cross-validation method. In this
   procedure, the training dataset was randomly divided into ten equal
   parts. During each iteration of the cross-validation, nine folds were
   used for training the model, while the remaining part was reserved for
   validation. This process was repeated ten times, with each fold serving
   once as the validation set. Model performance was evaluated by
   averaging the results across all iterations, enable selection of the
   best predictive model. To further assess the reliability of our models,
   we conducted bootstrap sampling with 1000 iterations to estimate the
   area under the curve (AUC), its 95% confidence interval (CI), and other
   relevant evaluation metrics. The entire model construction process was
   implemented in R software (version 4.1.0), utilizing key packages
   including caret for model training and evaluation, pROC for ROC
   analysis, and e1071 for SVM implementation.

Mouse genome informatics and drug analysis

   MGI database [[191]33] serves as a global repository for murine
   research, offering a comprehensive integration of genetic, genomic, and
   biological information. This platform facilitates research into human
   health and disease by enabling insights derived from mouse models. We
   found that many genes knockout mouse models exhibit phenotypes
   associated with circulatory system disease. Phenotype enrichment
   analysis was performed using Fisher’s exact test. We retained
   phenotypes that met the following criteria: at least three overlapping
   genes, FDR-adjusted P value < 0.05, and odds ratio (OR) > 1.
   Furthermore, we constructed a gene-drug-disease interaction network by
   integrating gene-drug associations from the DrugBank [[192]64] database
   and drug-disease relationships from the Therapeutic Target Database
   (TTD) [[193]65]. The network focused exclusively on drugs with approved
   clinical efficacy, excluding those that had been discontinued at any
   stage of development.

Supplementary Information

   [194]Supplementary Material 1^ (5.8MB, pdf)
   [195]Supplementary Material 2^ (2.8MB, xlsx)

Acknowledgements