Abstract

   Cardiac diseases represent common highly morbid conditions for which
   molecular mechanisms remain incompletely understood. Here we report the
   analysis of 1,459 protein measurements in 44,313 UK Biobank
   participants to characterize the circulating proteome associated with
   incident coronary artery disease, heart failure, atrial fibrillation
   and aortic stenosis. Multivariable-adjusted Cox regression identified
   820 protein–disease associations—including 441 proteins—at
   Bonferroni-adjusted P < 8.6 × 10^−6. Cis-Mendelian randomization
   suggested causal roles aligning with epidemiological findings for 4% of
   proteins identified in primary analyses, prioritizing therapeutic
   targets across cardiac diseases (for example, spondin-1 for atrial
   fibrillation and the Kunitz-type protease inhibitor 1 for coronary
   artery disease). Interaction analyses identified seven protein–disease
   associations that differed Bonferroni-significantly by sex. Models
   incorporating proteomic data (versus clinical risk factors alone)
   improved prediction for coronary artery disease, heart failure and
   atrial fibrillation. These results lay a foundation for future
   investigations to uncover disease mechanisms and assess the utility of
   protein-based prevention strategies for cardiac diseases.

   Subject terms: Cardiovascular diseases, Molecular medicine, Proteomics
     __________________________________________________________________

   Schuermans et al. identify a causal relationship between the
   circulating proteins spondin-1 and atrial fibrillation and SPINT1 and
   coronary artery disease and show that adding proteomic data improves
   clinical risk factor-based cardiovascular risk prediction.

Main

   Cardiac diseases represent the leading global cause of morbidity and
   mortality^[53]1, with coronary artery disease, heart failure, atrial
   fibrillation and aortic stenosis collectively accounting for more than
   90% of cardiac deaths^[54]1,[55]2. The prevention of these diseases
   typically relies on accurate risk prediction and pharmacotherapy of
   modifiable risk factors, which represent two complementary aspects of
   cardiovascular care that remain incompletely optimized. For instance,
   most high-risk individuals remain undetected until experiencing their
   first clinical event^[56]3,[57]4, and even under currently optimal
   treatment conditions, there remains substantial residual risk that is
   incompletely captured by traditional risk factors^[58]5,[59]6. Both the
   development of effective prediction tools and the discovery of
   therapeutic targets could considerably improve treatment outcomes and
   enhance early detection for different cardiac diseases.

   The circulating proteome—a dynamic network that reflects genetic
   background as well as external factors such as environmental exposures
   and lifestyle alterations—may be leveraged for both risk prediction and
   disease risk modification. For instance, smaller studies have
   demonstrated that sparse protein-based risk scores can improve the
   prediction of cardiovascular events in certain
   populations^[60]7–[61]11. Furthermore, targeted analyses of specific
   biomarkers suggest that integrating proteomic and genetic data can
   nominate causal protein–disease associations and reveal actionable drug
   targets in the bloodstream^[62]12,[63]13. Nevertheless, protein–disease
   associations do not need to be causal to be usefully predictive and,
   conversely, associations that are causal and may generate therapeutic
   targets are not necessarily useful in the prediction of incident
   events. Whether population-scale, agnostic analyses of the circulating
   proteome can provide insights into both aspects of clinical care, that
   is, improve the prediction of first clinical events and reveal causal
   mediators for a range of cardiac disease subtypes, remains unclear.

   Here, we performed a proteomic analysis of cardiac diseases in 44,313
   unrelated individuals from the UK Biobank (UKB) Pharma Proteomics
   Project (PPP) (Table [64]1)^[65]14. The UKB-PPP is a population-based
   cohort with high-throughput proteomic profiling (using the Olink
   Explore 1536 platform) at study baseline and longitudinal follow-up for
   incident cardiac events (Fig. [66]1). In primary analyses, we used
   multivariable-adjusted time-to-event models to test the associations of
   1,459 circulating proteins with incident coronary artery disease, heart
   failure, atrial fibrillation and aortic stenosis. While these models
   can identify single protein–disease associations, they cannot be used
   to infer causal protein–disease associations and do not provide useful
   measures of predictive performance. Therefore, in a first set of
   downstream analyses, we leveraged Mendelian randomization (MR) to infer
   causal roles among the identified proteins to prioritize therapeutic
   targets. Next, in a separate set of downstream analyses, we constructed
   protein-based risk scores and evaluated whether these could improve
   disease prediction beyond the use of traditional risk factors. Other
   downstream analyses assessed sex differences in protein–disease
   associations and tested enrichment of certain biological pathways in
   proteins associated with different cardiac diseases.

Table 1.

   Baseline characteristics of UK Biobank Pharma Proteomics Project
   participants included in the present study (n = 44,313)
                                        UKB-PPP participants (n = 44,313)
   Age at blood draw, years             56.4 ± 8.2
   Female, n                            24,701 (55.7%)
   Race/ethnicity, n                    –
    Asian                               942 (2.1%)
    Black                               1,051 (2.4%)
    White                               41,481 (93.6%)
    Mixed                               300 (0.7%)
    Other                               539 (1.2%)
   Smoking status, n                    –
    Never                               24,709 (55.8%)
    Previous                            14,932 (33.7%)
    Current                             4,672 (10.5%)
   BMI, kg m^−2                         27.3 ± 4.7
   Blood pressure, mm Hg                –
    Systolic blood pressure             139.5 ± 19.6
    Diastolic blood pressure            82.3 ± 10.6
   Blood biochemistry, mg dl^−1         –
    Total cholesterol                   221.4 ± 43.8
    LDL cholesterol                     138.3 ± 33.4
    HDL cholesterol                     56.4 ± 14.7
    Triglycerides                       129.8 (92.0 to 188.1)
    Creatinine                          0.82 ± 0.19
   Type 2 diabetes mellitus, n          1,218 (2.7%)
   Medication use, n                    –
    Cholesterol-lowering medication use 6,544 (14.8%)
    Antihypertensive medication use     6,214 (14.0%)
   Townsend deprivation index           −2.08 (−3.63 to 0.70)

   Continuous variables are summarized as mean ± s.d. or median (IQR), as
   appropriate. Categorical variables are summarized as n (%).

Fig. 1. Visual representation of the study design and participant inclusion
and exclusion criteria.

   The present study tested the associations of circulating proteins with
   common cardiac diseases (coronary artery disease, heart failure, atrial
   fibrillation and aortic stenosis) in the UKB-PPP. Primary analyses
   tested the epidemiological associations of 1,459 circulating proteins
   with cardiac diseases in 44,313 UKB-PPP participants without these
   diseases at baseline. Secondary analyses performed cis-MR analyses,
   tested for sex-specific effects and trained and tested protein-based
   risk scores. a, Study design. b, Participant inclusion and exclusion
   criteria.

Results

Associations of circulating proteins with heart diseases

   Of 44,313 UKB-PPP participants, 4,610 (10.4%) experienced at least one
   cardiac event over a median (interquartile range (IQR)) follow-up of
   11.1 (10.4–11.8) years. Coronary artery disease had the highest
   cumulative incidence (6.2% (n = 2,729 of 44,313)), followed by atrial
   fibrillation (4.8% (n = 2,107 of 44,313)), heart failure (2.3%
   (n = 1,014 of 44,313)) and aortic stenosis (0.7% (n = 326 of 44,313))
   (Extended Data Fig. [70]1).

Extended Data Fig. 1. Cumulative incidence of coronary artery disease, heart
failure, atrial fibrillation, and aortic stenosis during follow-up.

   Cumulative incidence plots were constructed using the Kaplan–Meier
   method. Participants were followed for a median (interquartile range)
   of 11.1 (10.4–11.8) years.

   [73]Source data
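
   As a rough illustration of the Kaplan–Meier method named in this
   caption, the sketch below computes cumulative incidence as one minus
   the Kaplan–Meier survival estimate. The function name and the toy
   follow-up data are hypothetical and are not taken from the study; the
   sketch also ignores competing risks.

```python
def kaplan_meier_cumulative_incidence(times, events):
    """Return [(time, 1 - S(time))] at each distinct event time.

    times  : follow-up time per participant
    events : 1 if the outcome occurred at that time, 0 if censored
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        n_events = 0
        n_leaving = 0
        # Group all participants whose follow-up ends at time t.
        while i < len(data) and data[i][0] == t:
            n_events += data[i][1]
            n_leaving += 1
            i += 1
        if n_events > 0:
            survival *= 1.0 - n_events / at_risk
            curve.append((t, 1.0 - survival))
        at_risk -= n_leaving
    return curve


# Hypothetical toy data: four participants, one censored at time 2.
toy_curve = kaplan_meier_cumulative_incidence([1, 2, 3, 4], [1, 0, 1, 1])
```

   Censored participants leave the risk set without changing the
   survival estimate, which is why the step at time 3 in the toy data is
   larger than the step at time 1.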

   Primary analyses tested the associations of 1,459 circulating proteins
   (Supplementary Tables [74]1 and [75]2) with each of the incident heart
   diseases (coronary artery disease, heart failure, atrial fibrillation
   and aortic stenosis) using multivariable-adjusted Cox regression
   models. The correlation matrix of these circulating proteins is
   provided in Extended Data Fig. [76]2 and Supplementary Table [77]3.
   Primary analyses identified 820 protein–disease associations—reflecting
   441 unique proteins—at Bonferroni-corrected P < 0.05/5,836
   (P < 0.05/(1,459 tested proteins × 4 tested outcomes)) (Fig. [78]2 and
   Supplementary Table [79]4). Heart failure had the highest number of
   proteomic associations (n = 384), followed by coronary artery disease
   (n = 259), atrial fibrillation (n = 156) and aortic stenosis (n = 21).
   Among proteins with one or more significant associations, 261 (59.2%)
   were shared across multiple outcomes and 15 (3.4%) were shared across
   all four outcomes (Extended Data Fig. [80]3).
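
   The Bonferroni threshold used in these primary analyses follows
   directly from the number of proteins and outcomes tested. A minimal
   sketch of the arithmetic (variable and function names are
   illustrative):

```python
N_PROTEINS = 1459   # proteins tested in primary analyses
N_OUTCOMES = 4      # coronary artery disease, heart failure,
                    # atrial fibrillation, aortic stenosis
ALPHA = 0.05

n_tests = N_PROTEINS * N_OUTCOMES
bonferroni_threshold = ALPHA / n_tests


def is_bonferroni_significant(p_value, threshold=bonferroni_threshold):
    """True if a protein-disease association passes the primary threshold."""
    return p_value < threshold


print(n_tests, f"{bonferroni_threshold:.1e}")
```

   With 5,836 tests, the threshold is approximately 8.6 × 10^−6, the
   value quoted in the abstract.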

Extended Data Fig. 2. Correlations among circulating proteins measured at
baseline.

   All colored boxes represent Pearson correlation coefficients (r)
   indicating the correlations between the proteins that were measured in
   the final study cohort (N = 44,313). Red boxes indicate positive
   correlations between proteins (r > 0), whereas blue boxes indicate
   negative correlations between proteins (r < 0). Pearson correlation
   coefficients are provided in Supplementary Table [83]3. Each row and
   each column represents one circulating protein. Proteins were
   clustered using a hierarchical cluster analysis based on the “complete
   linkage method”. Hierarchical clustering was performed using the
   hclust() function in R. The heat plot was constructed using the
   pheatmap() function (pheatmap package^[84]21 in R).

   [85]Source data

Fig. 2. Associations of circulating protein levels with incident coronary
artery disease, heart failure, atrial fibrillation and aortic stenosis.

   Miami plots visualize the associations of all 1,459 Olink proteins with
   coronary artery disease, heart failure, atrial fibrillation and aortic
   stenosis, tested using multivariable-adjusted Cox proportional hazards
   models ([88]Methods). The y axis indicates the −log[10](P) value for
   each association, multiplied by 1 if the association was positive
   (β > 0) or −1 if the association was negative (β < 0). The x axis
   indicates the genomic position of each protein’s encoding gene.
   Protein–disease associations with Bonferroni-corrected two-sided
   P < 0.05 (P < 0.05/5,836 or ~8.6 × 10^−6) are shown in blue (if the
   protein was associated with more than one outcome) or green (if the
   protein was not associated with more than one outcome). The probability
   density functions show the distributions of the strongest
   protein–disease associations in cases (dark blue) versus controls
   (light blue) for each outcome. These analyses included 44,313 UKB-PPP
   participants, among whom 2,729 experienced coronary artery disease,
   1,014 heart failure, 2,107 atrial fibrillation and 326 aortic stenosis
   events during follow-up.

   [89]Source data
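
   The signed y-axis value described in this caption can be sketched as
   follows. The function name is hypothetical, and returning zero when
   β = 0 is an assumption (the caption defines only β > 0 and β < 0).

```python
import math


def miami_y(p_value, beta):
    """Signed -log10(P): above the axis for beta > 0, below for beta < 0."""
    if not 0.0 < p_value <= 1.0:
        raise ValueError("p_value must be in (0, 1]")
    if beta > 0:
        sign = 1.0
    elif beta < 0:
        sign = -1.0
    else:
        sign = 0.0  # assumption: a null effect is plotted on the axis
    return sign * -math.log10(p_value)


# Height of the Bonferroni significance line (0.05 / 5,836 tests).
significance_line = miami_y(0.05 / 5836, beta=1.0)
```

   Here β would be the log hazard ratio per s.d. from the Cox model, so
   a protein with HR > 1 is drawn above the axis and one with HR < 1
   below it.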

Extended Data Fig. 3. Venn diagram showing the number of distinct and shared
protein associations across outcomes.

   All 441 proteins that were associated with one or more outcomes at
   Bonferroni-corrected P < 0.05 are represented in this graph.

   [92]Source data

   The strongest protein–disease associations (by P value) were observed
   for atrial fibrillation, with N-terminal pro-B-type natriuretic peptide
   (NT-proBNP) and B-type natriuretic peptide (NPPB, also known as BNP)
   yielding hazard ratios (HRs) of 1.74 (95% confidence interval (CI)
   1.67–1.81; P = 8.7 × 10^−173) and 1.62 (95% CI 1.54–1.69;
   P = 5.6 × 10^−95), respectively, for each s.d. increase in circulating
   protein levels (which were log[2]-transformed before analysis).
   NT-proBNP also showed the second strongest association with heart
   failure, with an HR of 1.57 (95% CI 1.48–1.66; P = 5.4 × 10^−56) per
   s.d. The biomarker most strongly associated with heart failure was WAP
   four-disulfide core domain protein 2 (WFDC2), a fibroblast-derived
   mediator of fibrosis also known as human epididymis protein 4
   (HE4)^[93]15, with an HR of 1.62 (95% CI 1.54–1.72; P = 4.1 × 10^−65)
   per s.d. The proteins most strongly associated with incident coronary
   artery disease were growth differentiation factor 15 (GDF15; HR 1.31
   (95% CI 1.26–1.36) per s.d.; P = 2.0 × 10^−45) and matrix
   metalloproteinase-12 (MMP12; HR 1.29 (95% CI 1.24–1.34) per s.d.;
   P = 1.1 × 10^−39); those most strongly associated with aortic stenosis
   were GDF15 (HR 1.44 (95% CI 1.29–1.60) per s.d.; P = 2.7 × 10^−11) and
   WFDC2 (HR 1.40 (95% CI 1.26–1.55) per s.d.; P = 7.3 × 10^−10). The
   distributions of the proteins with the strongest associations,
   stratified by incident cases versus controls, are shown in Fig. [94]2.

   To gain insights into biological pathways associated with the
   identified proteins, we carried the 820 observed protein–disease
   associations forward for pathway enrichment analysis using the Gene
   Ontology resource^[95]16 via Enrichr^[96]17. The highest-scoring
   pathways for coronary artery disease and heart failure included
   inflammatory and immune-related processes involving
   leukocyte/lymphocyte chemotaxis and cellular response to cytokines
   (Extended Data Fig. [97]4 and Supplementary Table [98]5). Proteins
   associated with incident coronary artery disease or heart failure were
   also enriched for apoptosis-related proteins such as those from the
   tumor necrosis factor (TNF) receptor family. Proteins associated with
   aortic stenosis demonstrated enrichment for peptidase inhibitor
   activity, consistent with recent work suggesting an important role for
   certain peptidases in the progression of calcific aortic
   stenosis^[99]18. Furthermore, according to data from the Human Protein
   Atlas^[100]19, proteins associated with at least one cardiac outcome
   were more often actively secreted to the bloodstream (34.0% (n = 150 of
   441 proteins)) than those without any significant associations (16.2%
   (n = 165 of 1,018); chi-squared test, P = 5.4 × 10^−14).
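
   The secretion comparison above is a 2 × 2 contingency test. The
   sketch below recomputes it from the reported counts, assuming a
   Pearson chi-squared test with Yates continuity correction. The
   correction is an assumption, since the text states only that a
   chi-squared test was used.

```python
import math


def chi2_2x2_yates(a, b, c, d):
    """Chi-squared test (1 d.f., Yates-corrected) for the table [[a, b], [c, d]]."""
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    # Continuity correction: shrink |ad - bc| by n/2, never below zero.
    diff = max(abs(a * d - b * c) - n / 2.0, 0.0)
    stat = n * diff ** 2 / (row1 * row2 * col1 * col2)
    # For 1 degree of freedom, the upper tail is erfc(sqrt(stat / 2)).
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value


# Rows: associated with >= 1 outcome (441) vs no association (1,018).
# Columns: actively secreted vs not secreted.
stat, p_value = chi2_2x2_yates(150, 441 - 150, 165, 1018 - 165)
```

   Under this assumption the P value lands close to the reported
   5.4 × 10^−14; without the continuity correction it would be somewhat
   smaller.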

Extended Data Fig. 4. Top biological processes, molecular functions, and
cellular components enriched among proteins associated with coronary artery
disease, heart failure, atrial fibrillation, and aortic stenosis.

   Top biological functions, molecular pathways, and cellular components
   were queried using the Gene Ontology resource^[103]22,[104]23 via
   Enrichr^[105]24. Enrichment tests were performed against a background
   gene set that included the genes corresponding to all 1,459 proteins
   tested in primary analyses. Gene sets with a false discovery
   rate-adjusted two-sided P < 0.05 were considered statistically
   significant. Bright colors indicate statistical significance, whereas
   dull colors indicate no statistical significance. All P-values shown
   were unadjusted for multiple comparisons.

   [106]Source data

Cis-Mendelian randomization analyses

   Next, we performed MR analyses to infer causal effects of the
   identified proteins on coronary artery disease, heart failure, atrial
   fibrillation and aortic stenosis. As the use of cis-protein
   quantitative trait loci (cis-pQTLs; genetic variants associated with
   circulating protein levels that map near the protein-encoding gene)
   facilitates adherence to the assumptions of MR^[107]20,[108]21, we only
   used variants within a 200-kilobase range of the protein-encoding gene
   to construct our genetic instruments. Of 441 unique
   Bonferroni-significant proteins in primary analyses, 430 (97.5%;
   corresponding to 802 protein–disease associations) had at least one
   valid cis-pQTL (±200 kilobases, P < 5 × 10^−6, R^2 < 0.1)
   (Supplementary Table [109]6). F-statistics were >10 for all proteins
   other than myoglobin, which was excluded from cis-MR analyses to
   minimize the risk of weak instrument bias. Median (IQR) F-statistics
   and R^2 estimates (representing phenotypic variance explained by
   genetic instruments) were 1,515 (454–4,050) and 3.2% (1.0–7.9%),
   respectively (Supplementary Table [110]7). Consistent with the use of
   cis-pQTLs^[111]21, Steiger filtering did not identify any variants
   explaining more variance in the outcome than the exposure
   (Supplementary Table [112]8).

   Of 801 protein–disease associations examined in cis-MR analyses, 76
   (9.5%; representing 69 of 429 (16.1%) proteins) showed suggestive
   evidence of causality with P < 0.05 (Fig. [113]3 and Supplementary
   Table [114]9). Because it is routinely recommended to evaluate cis-MR
   findings across P value and R^2 thresholds^[115]13, we performed
   multiple sensitivity analyses (using genetic variants at
   P < 5 × 10^−4/< 5 × 10^−6/< 5 × 10^−8 and
   R^2 < 0.001/< 0.01/< 0.1/< 0.2) to evaluate the robustness of the
   observed genetic associations (Supplementary Table [116]10). We further
   performed MR-Egger (Supplementary Table [117]10), one-sample MR
   (Supplementary Table [118]11) and multivariable-adjusted MR adjusting
   for proteins with shared pQTLs (Supplementary Tables [119]12
   and[120]13). A total of 40 of 76 (52.6%) genetic protein–disease
   associations were robust across all sensitivity analyses (directionally
   consistent across all MR models without evidence of horizontal
   pleiotropy). Genetic and observational analyses showed directional
   consistency for 17 of 40 (42.5%) robust genetic associations,
   corresponding to 2.1% of all protein–disease pairs and 4.0% of unique
   proteins tested in cis-MR analyses. These protein–disease associations
   all had positive effect estimates, implying that increased protein
   concentrations may promote cardiac disease risk and that lowering them
   might reduce risk. Furthermore, proteins were considered to be druggable for
   14 of 17 (82.4%) robustly and directionally consistent protein–disease
   associations (Supplementary Table [121]14)^[122]22.
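
   The primary cis-MR estimates use the Wald ratio for single-variant
   instruments and the inverse-variance weighted (IVW) method for
   instruments with two or more variants. The sketch below shows both in
   their simplest form. It assumes a fixed-effect IVW model with
   uncorrelated variants and a first-order standard error for the Wald
   ratio; the authors' software may handle residual correlation between
   variants differently. All function names and input values are
   hypothetical.

```python
import math


def wald_ratio(beta_exposure, beta_outcome, se_outcome):
    """Single-variant estimate: variant-outcome over variant-protein effect."""
    estimate = beta_outcome / beta_exposure
    se = se_outcome / abs(beta_exposure)  # first-order approximation
    return estimate, se


def ivw(betas_exposure, betas_outcome, ses_outcome):
    """Fixed-effect IVW estimate across variants (assumed uncorrelated)."""
    num = sum(bx * by / se ** 2
              for bx, by, se in zip(betas_exposure, betas_outcome, ses_outcome))
    den = sum(bx ** 2 / se ** 2
              for bx, se in zip(betas_exposure, ses_outcome))
    return num / den, math.sqrt(1.0 / den)


def two_sided_p(estimate, se):
    """Two-sided P value from a normal approximation."""
    return math.erfc(abs(estimate / se) / math.sqrt(2.0))


# Hypothetical summary statistics for two cis-pQTLs of one protein.
est, se = ivw([0.5, 0.4], [0.05, 0.04], [0.01, 0.01])
odds_ratio_per_sd = math.exp(est)  # log(OR) per s.d. -> OR per s.d.
```

   With a single variant, the IVW formula reduces to the Wald ratio,
   which is why the two methods can be reported side by side.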

Fig. 3. Associations of genetically predicted protein levels with coronary
artery disease, heart failure, atrial fibrillation and aortic stenosis.

   The volcano plots visualize the genetic associations of all proteins
   identified in primary analyses with their corresponding outcomes, by
   plotting each association’s −log[10](P) against the corresponding
   log(OR) per s.d. increase in genetically predicted protein levels. All
   analyses represent cis-MR analyses using the IVW (for instruments with
   two or more variants) or Wald ratio method (for instruments with one
   variant). Genetic instruments were constructed using cis-variants
   associated with circulating protein levels at P < 5 × 10^−6 clumped at
   R^2 < 0.1. Associations with two-sided P < 0.05 (not corrected for
   multiple comparisons) are shown in yellow (if the primary cis-MR
   analysis was directionally consistent with the observational analysis)
   or red (if the primary cis-MR analysis was not directionally consistent
   with the observational analysis). Bright colors and protein labels
   indicate robustness against sensitivity analyses ([125]Methods),
   whereas dull colors indicate no robustness against sensitivity
   analyses. OR, odds ratio.

   [126]Source data

   Because proprotein convertase subtilisin/kexin type 9 (PCSK9) is an
   established causal biomarker and therapeutic target for coronary artery
   disease^[127]23, we used this protein as a positive control for our
   cis-MR analyses (despite PCSK9 not reaching Bonferroni significance for
   coronary artery disease in epidemiological analyses). Each s.d.
   increase in genetically predicted PCSK9 was associated with 1.23-fold
   odds of coronary artery disease (95% CI 1.17–1.30; P = 6.0 × 10^−16),
   supporting a cis-MR strategy for the identification of potential causal
   protein–disease associations. The proteins with the strongest
   epidemiological associations did not generally show strong genetic
   associations with cardiac diseases. For instance, the genetic
   associations of neither WFDC2 nor GDF15 reached nominal significance
   (Supplementary Table [128]9). Genetically predicted MMP12, which was
   among the most strongly associated proteins for incident coronary
   artery disease in epidemiological analyses, was only modestly
   associated with a protective effect on coronary artery disease risk in
   primary cis-MR analyses (OR 0.97 (95% CI 0.95–1.00) per s.d.;
   P = 0.046).

   The strongest robust genetic associations were observed for spondin-1
   (SPON1) and adrenomedullin (ADM) with atrial fibrillation. Consistent
   with their observational associations, higher genetically predicted
   levels of SPON1 (OR 1.11 (95% CI 1.05–1.17) per s.d.; P = 2.9 × 10^−4)
   and ADM (OR 1.23 (95% CI 1.11–1.35) per s.d.; P = 5.4 × 10^−5) were
   associated with a greater risk of atrial fibrillation. Notably,
   colocalization analyses suggested shared causal genetic variants
   between these two proteins and atrial fibrillation (posterior
   probability for shared causal variants [H[4]] > 0.80; Supplementary
   Table [129]15). The Kunitz-type protease inhibitor 1 (SPINT1; also
   known as hepatocyte growth factor activator inhibitor type 1) and
   asialoglycoprotein receptor 1 (ASGR1) had the strongest directionally
   concordant and robust genetic associations for coronary artery disease
   (OR 1.09 (95% CI 1.03–1.23) per s.d.; P = 7.9 × 10^−3) and heart
   failure (OR 1.13 (95% CI 1.03–1.49) per s.d.; P = 2.4 × 10^−2),
   respectively. For aortic stenosis, the latent-transforming growth
   factor β-binding protein 2 (LTBP2) was the only protein with a robust
   and directionally consistent genetic association (OR 1.24 (95% CI
   1.03–1.49) per s.d.; P = 2.4 × 10^−2).

Sex-specific protein–disease associations

   Because previous work suggested sex differences in the concentrations
   of cardiovascular biomarkers^[130]24, we hypothesized (a priori) that
   certain proteins were differentially associated with cardiac disease
   risk in men versus women. Therefore, we tested the
   multivariable-adjusted associations of all 1,459 proteins with cardiac
   diseases in men (n = 19,612) versus women (n = 24,701). A total of 467
   protein–disease associations met the primary significance threshold
   (P < 0.05/5,836) for men versus 314 for women (Supplementary Table
   [131]16). Protein–disease associations (for all 1,459 tested
   biomarkers) showed strong correlation between sexes, indicated by a
   Pearson correlation coefficient (r) of 0.71 (Extended Data Fig.
   [132]5). The correlation between sexes was strongest for heart failure
   (r = 0.79), whereas it was comparatively weaker for aortic stenosis
   (r = 0.47).

Extended Data Fig. 5. Correlation between the effect sizes of protein–disease
associations in male vs. female participants.

   The scatter plots depict the correlation between the protein–disease
   associations’ effect sizes (that is, log[HR]) in female vs. male
   participants. HR indicates hazard ratio. All estimates were calculated
   using multivariable-adjusted Cox proportional hazards models, adjusted
   for age, age², self-reported race/ethnicity, the first ten principal
   components of genetic ancestry, smoking, normalized Townsend
   deprivation index, body mass index, systolic blood pressure,
   antihypertensive medication use, total cholesterol, high-density
   lipoprotein cholesterol, cholesterol-lowering medication use, serum
   creatinine, and prevalent type 2 diabetes. In addition, we included the
   cardiac outcomes that were not tested (for example, heart failure,
   atrial fibrillation, and aortic stenosis for incident coronary artery
   disease models) as time-varying covariates. The labeled protein–disease
   associations represent proteins that were associated with the indicated
   outcome at two-sided P < 0.05/5,836 (that is, Bonferroni-adjusted) in one
   sex without nominal significance (two-sided unadjusted P > 0.05) in the
   other sex. In addition, all proteins indicated in color had suggestive
   evidence for interaction by sex (two-sided unadjusted
   P[interaction] < 0.05).

   [135]Source data

   We formally tested for sex interactions across all protein–disease
   associations reaching significance (P < 0.05/5,836) in at least one sex
   (n = 566) (Fig. [136]4 and Supplementary Table [137]16). Six
   protein–disease associations had a Bonferroni-significant
   (P < 0.05/566) sex-differential effect for atrial fibrillation,
   including T cell surface glycoprotein CD1c (CD1C;
   P[interaction] = 6.9 × 10^−5), cyclic ADP-ribose hydrolase (CD38;
   P[interaction] = 3.3 × 10^−6), cathepsin L2 (CTSV;
   P[interaction] = 7.3 × 10^−5), NT-proBNP
   (P[interaction] = 3.7 × 10^−5), paired immunoglobulin-like type 2
   receptor β (PILRB; P[interaction] = 4.1 × 10^−5) and WFDC2
   (P[interaction] = 7.7 × 10^−5). We also observed a sex-differential
   effect for chymotrypsin C (CTRC; P[interaction] = 1.9 × 10^−5) on
   coronary artery disease. To test whether these associations differed
   between premenopausal and postmenopausal women, we performed
   association analyses in female participants stratified by menopause
   status and tested the interaction between these proteins and menopause
   status on cardiac outcome risk (Supplementary Fig. [138]1). These
   analyses revealed that the association of CD38 with atrial fibrillation
   was stronger in premenopausal than in postmenopausal women
   (P[interaction] = 4.4 × 10^−2), although CD38 was positively associated
   with atrial fibrillation risk in both groups. There were no other
   significant interactions between circulating proteins and menopause
   status on cardiac disease risk, suggesting that the identified
   sex-differential effects are not strongly affected by menopause status.
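
   The interaction P values above come from interaction terms in the
   Cox models. As a simpler approximation, and not the authors' method,
   two sex-stratified hazard ratios can be compared with a z-test on the
   difference in log(HR), treating the strata as independent. The
   function names and the hazard ratios below are hypothetical.

```python
import math

Z_975 = 1.959964  # 97.5th percentile of the standard normal


def se_from_ci(lower, upper):
    """Standard error of log(HR) recovered from a 95% confidence interval."""
    return (math.log(upper) - math.log(lower)) / (2.0 * Z_975)


def sex_difference_test(hr_men, ci_men, hr_women, ci_women):
    """z statistic and two-sided P for log(HR_men) - log(HR_women)."""
    diff = math.log(hr_men) - math.log(hr_women)
    se = math.sqrt(se_from_ci(*ci_men) ** 2 + se_from_ci(*ci_women) ** 2)
    z = diff / se
    return z, math.erfc(abs(z) / math.sqrt(2.0))


# Hypothetical sex-stratified estimates for one protein-disease pair.
z, p_interaction = sex_difference_test(1.60, (1.45, 1.77), 1.20, (1.08, 1.33))
bonferroni_for_interactions = 0.05 / 566  # threshold used in the text
```

   Because this treats the two strata as separate fits, it will not
   reproduce the model-based interaction P values exactly; it is meant
   only to show what a sex-differential effect measures.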

Fig. 4. Sex-specific protein–disease associations and protein-by-sex
interactions for coronary artery disease, heart failure, atrial fibrillation
and aortic stenosis.

   Lollipop plots depict the differences in effect sizes between male and
   female participants (log(HR)[males] − log(HR)[females]) for all tested
   protein–disease associations. Bright colors with labels represent
   proteins with two-sided P < 0.05/5,836 (Bonferroni-corrected) in one
   sex without nominal significance (two-sided P > 0.05) in the other sex;
   dull colors represent proteins with P < 0.05/5,836 in one sex and at
   least nominal significance (two-sided P < 0.05) in the other sex. In
   addition, all proteins indicated in color had suggestive evidence for
   interaction by sex (two-sided P[interaction] < 0.05). Forest plots
   depict the sex-stratified protein–disease associations (purple for men,
   pink for women) for the five proteins with the strongest sex–protein
   interactions. In these forest plots, central points indicate the HR of
   the indicated protein (per s.d.) with the indicated outcome stratified
   by sex (with corresponding 95% CIs). P[interaction] indicates the P
   value for the interaction term between ‘sex’ and the indicated protein
   on the corresponding outcome. All associations were tested using
   multivariable-adjusted Cox proportional hazards models ([141]Methods)
   in 19,612 male and 24,701 female participants.

   [142]Source data

Protein-based prediction of cardiac diseases

   We next derived and tested the predictive accuracy of protein-based
   risk scores in addition to clinical risk factors in the UKB-PPP. We
   constructed protein-based, clinical, and combined (using proteomic and
   clinical variables) risk scores in the training set (80%; n = 35,450)
   using least absolute shrinkage and selection operator (LASSO)
   regression with tenfold cross-validation. Protein-based risk scores
   (using all 1,459 tested proteins as input) included 64 proteins for
   coronary artery disease, 38 for heart failure, 92 for atrial
   fibrillation and 21 for aortic stenosis (Supplementary Table [143]17
   and Extended Data Fig. [144]6). The prediction models’ highest-weighted
   biomarkers were largely overlapping with those showing the strongest
   associations in primary analyses.
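
   LASSO produces sparse scores because its L1 penalty sets many
   coefficients exactly to zero, which is how 1,459 input proteins can
   reduce to a few dozen per outcome. The sketch below illustrates that
   mechanism with coordinate descent on a linear model. It is a
   deliberate simplification: it assumes centered inputs, a fixed
   penalty and a continuous outcome, and it omits both the
   cross-validated choice of penalty and the exact model form used in
   the study. All names and data are hypothetical.

```python
def soft_threshold(value, lam):
    """Shrink value toward zero by lam; values within [-lam, lam] become 0."""
    if value > lam:
        return value - lam
    if value < -lam:
        return value + lam
    return 0.0


def lasso_coordinate_descent(X, y, lam, n_iter=100):
    """Minimize (1/2n)*||y - Xb||^2 + lam*||b||_1 (no intercept)."""
    n, p = len(y), len(X[0])
    coef = [0.0] * p
    for _ in range(n_iter):
        for j in range(p):
            rho = 0.0  # covariance of feature j with the partial residual
            z = 0.0    # sum of squares of feature j
            for i in range(n):
                pred_without_j = sum(X[i][k] * coef[k]
                                     for k in range(p) if k != j)
                rho += X[i][j] * (y[i] - pred_without_j)
                z += X[i][j] ** 2
            coef[j] = soft_threshold(rho / n, lam) / (z / n)
    return coef


def risk_score(row, coef):
    """Weighted sum of protein levels using the fitted coefficients."""
    return sum(x * b for x, b in zip(row, coef))


# Toy data: feature 0 drives the outcome, feature 1 is unrelated.
X = [[-1.5, 1.0], [-0.5, -1.0], [0.5, -1.0], [1.5, 1.0]]
y = [-3.0, -1.0, 1.0, 3.0]
coef = lasso_coordinate_descent(X, y, lam=0.5)
n_selected = sum(1 for b in coef if b != 0.0)
```

   In the toy data the unrelated feature receives a coefficient of
   exactly zero and drops out of the score, while the informative
   feature is retained with a shrunken weight.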

Extended Data Fig. 6. Protein weights for the primary protein-based
prediction models of coronary artery disease, heart failure, atrial
fibrillation, and aortic stenosis.

   Each bar indicates the protein weights (that is, the absolute value of
   the corresponding regression coefficients).

   [147]Source data

   Analyses in the testing cohort (20%; n = 8,863) revealed that
   protein-based risk scores effectively stratified the risk of incident
   events across outcomes (Fig. [148]5a–c). The protein-based risk scores
   were strong independent predictors of incident events in
   multivariable-adjusted Cox regression models, with HRs of 2.19 (95% CI
   1.87–2.55; P = 3.1 × 10^−23) per s.d. increase for coronary artery
   disease, 2.49 (95% CI 2.10–2.95; P = 1.4 × 10^−25) for heart failure,
   2.39 (95% CI 2.13–2.69; P = 7.5 × 10^−48) for atrial fibrillation and
   2.70 (95% CI 1.65–4.42; P = 7.5 × 10^−5) for aortic stenosis. The top
   versus bottom quintile of protein-based risk scores was associated with
   HRs of 8.15 (95% CI 4.07–16.30; P = 3.04 × 10^−9) for coronary artery
   disease, 12.85 (95% CI 3.90–42.31; P = 2.67 × 10^−5) for heart failure,
   6.85 (95% CI 3.40–13.80; P = 7.52 × 10^−8) for atrial fibrillation and
   2.70 (95% CI 0.45–16.13; P = 0.28) for aortic stenosis.

Fig. 5. Risk stratification and prediction of incident coronary artery
disease, heart failure, atrial fibrillation and aortic stenosis by
protein-based risk scores in the UKB-PPP.

   a, Distributions of protein-based risk scores in cases and controls. b,
   Cumulative incidence of each outcome (calculated using the Kaplan–Meier
   method) by protein-based score quintiles. c, Incidence rate estimates
   according to protein-based score deciles on a logarithmically scaled y
   axis. d, Accuracies of the clinical, proteomic and combined risk scores
   in predicting the indicated outcomes (quantified using the ROC AUC)
   with corresponding 95% CI. For a, the vertical lines indicate the
   protein-based risk score values corresponding to a false positive rate
   (FPR) of 5.0%; the detection rates (DRs) indicate ‘exact’ detection
   rates, calculated as the unadjusted proportions of cases with a positive
   test result at the corresponding protein-based risk score threshold. For
   c, incidence rate estimates are not displayed if the incidence in a
   protein score percentile bin was zero. All analyses were performed in
   the UKB-PPP testing set
   (n = 8,863). During a median (IQR) follow-up of 11.1 (10.4–11.8) years,
   566 participants in the UKB-PPP testing set experienced coronary artery
   disease events, 203 experienced heart failure, 432 atrial fibrillation
   and 59 aortic stenosis.

   [151]Source data

   Distributions of protein-based risk scores in individuals who did and
   did not experience an event during follow-up are shown in Fig. [152]5a.
   ROC curve analyses revealed that adding proteomic data improved the
   prediction of incident coronary artery disease, heart failure and
   atrial fibrillation (Fig. [153]5d). The increment in predictive
   accuracy compared to the clinical model—quantified using the area under
   the ROC curve (AUC)—was most pronounced for atrial fibrillation (AUC
   0.801 (95% CI 0.779–0.822) versus 0.749 (95% CI 0.727–0.772); DeLong
   test: P = 2.0 × 10^−10) and heart failure (AUC 0.799 (95% CI
   0.769–0.830) versus 0.732 (95% CI 0.698–0.766); P = 1.7 × 10^−6),
   followed by coronary artery disease (AUC 0.757 (95% CI 0.738–0.776)
   versus 0.734 (95% CI 0.714–0.754); P = 1.4 × 10^−4). There was no
   statistically significant difference for aortic stenosis (AUC 0.803
   (95% CI 0.754–0.852) versus 0.789 (95% CI 0.738–0.840); P = 0.35). For
   a false positive rate of 5% (where a test score incorrectly classifies
   5% of controls as positive), the protein-based risk scores for coronary
   artery disease, heart failure, atrial fibrillation and aortic stenosis
   yielded exact detection rates (the proportions of cases that were
   correctly classified as positive; also known as true positive rates or
   sensitivities) of 22.6%, 33.5%, 33.3% and 15.3%, respectively (Fig.
   [154]5a). The corresponding exact detection rates of the combined risk
   scores for a false positive rate of 5% were 21.0%, 35.0%, 35.9% and
   18.6%. Using a more stringent false positive rate cutoff of 1%, the
   corresponding true positive rates were 8.1%, 9.4%, 10.6% and 5.1%; the
   probabilities of experiencing an event during follow-up (given a
   positive test result) for these were 34.8%, 34.7%, 18.0% and 3.6%,
   respectively (Supplementary Table [155]18).
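
   The AUC, detection-rate and predictive-value quantities reported above
   can be illustrated with a short Python sketch. It is not the authors'
   code: the function names, the rank-based AUC estimator, the
   quantile-based choice of cutoff and the toy scores are all assumptions
   made for illustration, and the DeLong comparison of two AUCs is not
   reproduced here.

```python
import numpy as np


def roc_auc(scores, is_case):
    """Rank-based AUC: probability that a random case outscores a random control."""
    scores = np.asarray(scores, dtype=float)
    is_case = np.asarray(is_case, dtype=bool)
    diff = scores[is_case][:, None] - scores[~is_case][None, :]
    # Ties between a case and a control count as half a concordant pair.
    return float((np.sum(diff > 0) + 0.5 * np.sum(diff == 0)) / diff.size)


def detection_summary(scores, is_case, fpr=0.05):
    """Detection rate and positive predictive value at a target false positive rate."""
    scores = np.asarray(scores, dtype=float)
    is_case = np.asarray(is_case, dtype=bool)
    # Cutoff chosen so that roughly `fpr` of controls score above it (assumption).
    cutoff = float(np.quantile(scores[~is_case], 1.0 - fpr))
    positive = scores > cutoff
    tp = int(np.sum(positive & is_case))
    fp = int(np.sum(positive & ~is_case))
    return {
        "cutoff": cutoff,
        "false_positive_rate": fp / int(np.sum(~is_case)),
        "detection_rate": tp / int(np.sum(is_case)),
        # Unadjusted probability of being a case given a positive test.
        "ppv": tp / (tp + fp) if (tp + fp) > 0 else float("nan"),
    }


# Synthetic toy data (hypothetical): 100 controls and 4 cases.
toy_scores = np.concatenate([np.arange(100.0), [10.0, 50.0, 95.5, 120.0]])
toy_is_case = np.concatenate([np.zeros(100, dtype=bool), np.ones(4, dtype=bool)])
toy_summary = detection_summary(toy_scores, toy_is_case, fpr=0.05)
toy_auc = roc_auc(toy_scores, toy_is_case)
```

   In this sketch the positive predictive value depends on how common the
   outcome is in the cohort, which is consistent with the low value
   reported for aortic stenosis despite a comparable false positive rate.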

   To evaluate the performance of the protein-based scores for coronary
   artery disease, heart failure and atrial fibrillation in an external
   cohort, we tested the accuracies of these scores in the Women’s Health
   Initiative (WHI). A total of 1,083 WHI participants who provided blood
   samples at the WHI-Long Life Study (LLS) visit, with data on 552
   circulating protein analytes (measured using six Olink Target 96
   assays), were included (Extended Data Fig. [156]7 and Supplementary
   Table [157]19). Among the 552 available protein analytes, there were
   518 unique proteins that were also measured in the UKB-PPP (Supplementary
   Table [158]20) and were used to retrain the proteomic models in the
   UKB-PPP training set (Supplementary Table [159]21). ROC curve analyses
   in the UKB-PPP testing set demonstrated that the retrained proteomic
   scores (based on the proteins that were overlapping between the UKB-PPP
   and the WHI-LLS) improved the prediction of incident events with
   increments that were similar to those observed using the scores that
   were constructed using the full protein set (based on all 1,459
   circulating proteins measured in the UKB-PPP) (Supplementary Fig.
   [160]2). Similarly, analyses in the WHI-LLS showed that the combined
   models (based on both clinical and proteomic variables) were associated
   with a significantly better detection of coronary artery disease (AUC
   0.664 (95% CI 0.612–0.716) versus 0.599 (95% CI 0.543–0.656);
   P = 1.4 × 10^−2), heart failure (AUC 0.720 (95% CI 0.683–0.777) versus
   0.636 (95% CI 0.583–0.689); P = 6.6 × 10^−4) and atrial fibrillation
   (AUC 0.673 (95% CI 0.631–0.714) versus 0.589 (95% CI 0.546–0.632);
   P = 6.7 × 10^−7) compared to the clinical models (Extended Data Fig.
   [161]8).

Extended Data Fig. 7. WHI-LLS participant inclusion and exclusion criteria
for external validation analyses.

   External validation analyses tested the performance of protein-based
   risk scores to predict incident coronary artery disease, heart failure,
   and atrial fibrillation in 1,083 participants from the Women’s Health
   Initiative who attended the Long Life Study (WHI-LLS).

Extended Data Fig. 8. Risk (A–C) stratification and (D) prediction of
incident coronary artery disease, heart failure, atrial fibrillation, and
aortic stenosis by protein-based risk scores in the WHI-LLS.

   The indicated plots depict (A) the distributions of protein-based risk
   scores in cases and controls; (B) the cumulative incidence of each
   outcome (calculated using the Kaplan–Meier method) by protein-based
   score quintiles; (C) incidence rate estimates according to
   protein-based score deciles on a logarithmically scaled Y axis; and (D)
   the accuracies of the clinical, proteomic, and combined risk scores in
   predicting the indicated outcomes (quantified using the area under the
   receiver-operating characteristic curve [AUC] with corresponding 95%
   confidence intervals [CIs]). For (A), the vertical lines indicate the
   protein-based risk score values corresponding to a false positive rate
   (FPR) of 5.0%; the detection rates (DRs) indicate the “exact” detection
   rates, calculated as the unadjusted proportions of cases with a
   positive test result at the corresponding protein-based risk score
   threshold. For (C), incidence rate estimates are not displayed if the
   incidence of the indicated outcome in a protein score percentile bin
   was zero. All analyses were performed in the Women’s Health Initiative
   Long Life Study (WHI-LLS; n = 1,083). During a median (interquartile
   range) follow-up of 8.3 (5.6–8.9) years, 85 participants in the WHI-LLS
   cohort experienced coronary artery disease events, 100 experienced
   heart failure, and 182 atrial fibrillation.

   [166]Source data

   Given the disproportionately high weights for NT-proBNP in the
   protein-based risk scores for atrial fibrillation and heart failure
   (Supplementary Table [167]17), we further evaluated the performance of
   models including NT-proBNP alone versus those incorporating all other
   biomarkers in predicting these outcomes in the UKB-PPP testing set. We
   also excluded NPPB from the latter set of protein-based risk scores as
   NPPB and NT-proBNP are encoded by the same gene and released in the
   circulation in equimolar quantities^[168]25. Compared to the score
   based on clinical factors alone (0.749 (95% CI 0.727–0.772)), inclusion
   of NT-proBNP significantly improved the prediction of atrial
   fibrillation (AUC 0.788 (95% CI 0.766–0.811); P = 1.1 × 10^−6),
   resulting in a greater increment in predictive accuracy than the score
   incorporating all proteins other than NT-proBNP and NPPB (AUC 0.777
   (95% CI 0.756–0.799); P = 9.2 × 10^−7) (Extended Data Fig. [169]9). In
   contrast, for heart failure, the score incorporating all proteins
   except NT-proBNP and NPPB was associated with a greater improvement in
   predictive accuracy versus the clinical score (AUC 0.786 (95% CI
   0.754–0.818) versus 0.732 (95% CI 0.698–0.766); P = 1.4 × 10^−5) than
   the score incorporating NT-proBNP alone (AUC 0.756 (95% CI
   0.722–0.790); P = 0.07) (Extended Data Fig. [170]10).

Extended Data Fig. 9. Risk prediction of incident atrial fibrillation by risk
scores incorporating NT-proBNP and all proteins except NT-proBNP and NPPB.

   The receiver-operating characteristics curves depict the accuracy of
   the clinical, proteomic, and combined risk scores in predicting atrial
   fibrillation events in the UKB-PPP testing set (n = 8,863). Areas under
   the curve (AUCs) and corresponding 95% confidence intervals (95% CIs)
   quantify the performance of each model. Models with multiple candidate
   features were constructed using logistic least absolute shrinkage and
   selection operator (LASSO) models; the combined models included all
   clinical predictors (see Methods) as well as the indicated biomarkers
   (that is, NT-proBNP or all proteins except NT-proBNP and NPPB) as
   potential covariates in the final model. Participants were followed for
   a median (interquartile range) of 11.1 (10.4–11.8) years.

   [173]Source data

Extended Data Fig. 10. Risk prediction of incident heart failure by risk
scores incorporating NT-proBNP and all proteins except NT-proBNP and NPPB.

   The receiver-operating characteristics curves depict the accuracy of
   the clinical, proteomic, and combined risk scores in predicting heart
   failure events in the UKB-PPP testing set (n = 8,863). Areas under the
   curve (AUCs) and corresponding 95% confidence intervals (95% CIs)
   quantify the performance of each model. Models with multiple candidate
   features were constructed using logistic least
   absolute shrinkage and selection operator (LASSO) models; the combined
   models included all clinical predictors (see Methods) as well as the
   indicated biomarkers (that is, NT-proBNP or all proteins except
   NT-proBNP and NPPB) as potential covariates in the final model.
   Participants were followed for a median (interquartile range) of 11.1
   (10.4–11.8) years.

   [176]Source data

Discussion

   In a population-based cohort of ~45,000 middle-aged adults with
   circulating protein measurements and longitudinal follow-up, we
   characterized the proteomic architecture of incident coronary artery
   disease, heart failure, atrial fibrillation and aortic stenosis. We
   identified 820 significant protein–disease associations with important
   roles (potentially mediating or marking disease presence) for
   natriuretic peptides (for example, NT-proBNP), inflammatory mediators
   (for example, MMP12) and apoptosis-related factors (for example, GDF15)
   as predictors of cardiac diseases. Genetic analyses suggested causal or
   mediating roles—either protective or deleterious—for a substantial
   proportion of biomarkers identified in observational analyses.
   Sex-based analyses suggested generally preserved associations between
   men and women, albeit with varying weights of prediction including
   several biomarkers with strong sex interactions. Finally, we
   constructed sparse protein-based risk scores that improved the
   prediction of cardiac disease development in the general population.
   Our findings provide insights into the biology of cardiac diseases,
   with implications for the prediction of incident cardiovascular
   diagnoses and potential for targeted prevention and treatment of these
   conditions.

   The findings from this study offer insights into potential causal roles
   of proteins associated with incident cardiac diseases. We found that 4%
   of proteins identified in primary analyses (and tested in cis-MR
   analyses) had putative causal associations that were directionally
   concordant with those derived from epidemiological models (primary
   analyses); however, we also identified many proteins (more than 5%)
   with genetic associations that were robust across sensitivity analyses
   yet directionally discordant with epidemiological estimates
   (‘opposite’). By systematically integrating observational and genetic
   data, our study corroborates and extends previous studies reporting
   similar discrepancies between genetic and epidemiological associations
   for selected proteins^[177]13,[178]26. For instance, consistent with
   previous research^[179]13,[180]26, primary cis-MR analyses revealed a
   protective effect of genetically predicted MMP12 on coronary artery
   disease, although observational analyses indicated strong associations
   of higher MMP12 levels with the same outcome. Whether these seemingly
   discordant observations reflect inherent differences between disease
   onset versus progression or compensatory response to subclinical
   disease (where protein levels increase before the onset of symptoms,
   potentially acting as adaptive or compensatory mechanisms to mitigate
   damage caused by the underlying disease) requires further
   investigation. Nevertheless, several proteins had consistent genetic
   and observational effects. One example of concordant observations was
   SPON1, for which higher levels (both measured and genetically
   predicted) were associated with increased atrial fibrillation risk.
   SPON1 is an extracellular protein expressed in tissues such as the
   heart and brain that has been implicated in Alzheimer’s dementia
   through its role in amyloid-β precursor protein processing^[181]27.
   Previous protein-focused analyses in patients with heart failure showed
   that the presence of atrial fibrillation was associated with activation
   of amyloid-β-related pathways, with SPON1 as one of the most strongly
   upregulated proteins in those with atrial fibrillation^[182]28. These
   data, together with colocalization findings indicating shared causal
   variants for SPON1 and atrial fibrillation, collectively suggest that
   SPON1 not only marks presence of a pro-arrhythmic substrate, but could
   also represent an upstream therapeutic target for preventing and/or
   treating atrial fibrillation. Given the paucity of identified robust
   biomarkers mediating atrial fibrillation risk, more data are needed
   regarding SPON1 and its role in arrhythmogenesis.

   Biomarkers identified in proteomic analyses are often markers of
   already established disease, rather than mediators of disease biology.
   In this regard, our analyses identified several inflammation- and
   apoptosis-related proteins that strongly predicted cardiac disease risk
   but were unlikely to be causal biomarkers. WFDC2 (also known
   as HE4)—a profibrotic protease inhibitor with a potential role in
   natural immunity^[183]15—emerged as the strongest proteomic predictor
   of heart failure. Previous research in hospitalized heart failure
   patients demonstrated associations of circulating WFDC2 with disease
   severity as well as kidney function^[184]29. As WFDC2 is expressed
   exclusively in noncardiovascular tissues such as the respiratory tract,
   male and female genitourinary system and kidneys^[185]30, it is likely
   that the strong associations of WFDC2 with cardiac outcomes stem from
   peripheral organ responses rather than indicating direct cardiac
   dysfunction or vascular damage. Similarly, GDF15—another pleiotropic
   protein expressed across multiple organ systems^[186]31—was the
   strongest biomarker for coronary artery disease. As a member of the
   transforming growth factor-β (TGF-β) superfamily, GDF15 is upregulated
   in response to external stressors (for example, inflammation, hypoxia
   and oxidative stress) and is believed to reflect the cumulative impact
   of both acute and chronic exposure to cellular stressors^[187]31.
   Recent data suggest GDF15 as an independent prognostic biomarker for
   individuals with established atherosclerotic cardiovascular
   disease^[188]32. Nevertheless, cis-MR analyses detected no evidence of
   causality in the associations of these proteins with cardiac diseases.
   Assuming that these analyses had adequately strong genetic instruments
   and sufficient power, our findings collectively suggest that
   inflammation- and apoptosis-related biomarkers such as WFDC2 and GDF15
   represent early disease markers without causal involvement in the
   pathogenesis of cardiac diseases, consistent with their pleiotropic and
   nonspecific effects in response to tissue damage across organs.

   In addition, our findings demonstrate that circulating proteins can
   provide information beyond clinical risk factors to predict cardiac
   events. Risk scores integrating proteomic and clinical data led to
   better prediction of coronary artery disease, heart failure and atrial
   fibrillation than those based on clinical parameters alone, both
   internally in the UKB-PPP and externally in the WHI-LLS. Nevertheless,
   the clinical-proteomic scores yielded detection rates ranging from 35%
   to 50% for these conditions at a false positive rate of 10%. As
   cardiovascular prevention (for example, through statins) is offered at
   progressively lower cardiovascular risk thresholds, it is unlikely that
   proteomic scores will be an effective standalone test to screen for
   allocation of primary prevention therapies in people without known risk
   drivers^[189]33,[190]34. Nevertheless, our analyses demonstrate that
   protein-based risk scores confer information that is not captured by
   clinical risk factors and may also provide biological insights. For
   example, in addition to confirming the established association of
   natriuretic peptide elevation with so-called ‘pre-heart
   failure’^[191]35, ROC analyses revealed that NT-proBNP was a better
   predictor of atrial fibrillation than all other proteins together
   (except NPPB or BNP). These results extend previous work demonstrating
   strong associations of circulating NT-proBNP with incident atrial
   fibrillation^[192]36 and align with recent data from the LOOP trial,
   suggesting that individuals with elevated NT-proBNP levels may derive
   more clinical benefit from atrial fibrillation screening than those
   with lower levels^[193]37. Collectively, these findings provide support
   for the use of NT-proBNP as a biomarker for atrial fibrillation in the
   general population.

   Another finding from this study was evidence for biological sex
   differences underlying cardiac disease risk in men and women. The
   strongest sex interaction across all tested proteins was observed for
   CD38, which was significantly more strongly associated with incident
   atrial fibrillation in female than in male participants. CD38 is a
   glycoprotein expressed across various immune cells including
   lymphocytes and plasma cells^[194]38. Previous research suggests that
   CD38 is causally implicated in autoimmune diseases such as rheumatoid
   arthritis and systemic lupus erythematosus^[195]38,[196]39. As a
   history of autoimmune diseases represents a risk factor for atrial
   fibrillation that affects women more strongly than men^[197]40, it
   is possible that the observed sex differences for CD38 reflect a
   more important role for immunity-related pathways in women.
   Furthermore, some of the largest differences in protein–disease
   associations between sexes were observed for aortic stenosis. For
   instance, we identified several sex-specific senescence-associated
   biomarkers (for example, IGFBP7 and TNF) associated with incident
   aortic stenosis in female, but not male, participants. Previous
   histological work in aortic stenosis patients revealed distinct tissue
   composition differences between men and women, with women showing less
   valvular calcification but more fibrosis than men^[198]41. These
   findings indicate that fibrosis-related proteins are likely stronger
   markers for aortic stenosis in women than in men.

   While this study benefits from a large sample size and the use of
   state-of-the-art proteomic profiling methods, findings must be
   interpreted in the context of limitations. First, the strength and
   quantity of protein–disease associations for each outcome were
   influenced by statistical power and, consequently, the number of cases
   per outcome. Conditions with lower incidence rates during follow-up
   (such as aortic stenosis) had fewer proteomic associations. Second, the
   study population was predominantly white, precluding generalization to
   other races/ethnicities. Third, causal inference using MR relies on the
   validity of the underlying instrumental variable assumptions. This
   study utilized a robust cis-MR framework (facilitating the adherence to
   these assumptions^[199]20,[200]21) and tested the robustness of the
   genetic associations through many sensitivity analyses. Nevertheless,
   prioritized therapeutic targets remain to be evaluated in animal
   experiments and eventually human trials. Fourth, not all proteins
   identified in primary analyses had strong cis-pQTLs, precluding
   adequate cis-MR analyses. Genetic instrument strength also varied
   across proteins. Instruments with more variants have greater power to
   detect statistically significant genetic protein–disease associations,
   potentially leading to an underestimation of associations for
   instruments with fewer variants. Additionally, there are no established
   power calculation methods for two-sample MR analyses with binary
   outcomes (such as those performed in this study) beyond the use of
   F-statistics to evaluate genetic instrument strength. Nevertheless, we
   minimized type II error by adopting a lenient P value threshold
   (P < 0.05) to indicate statistical significance for primary cis-MR
   analyses and prioritizing genetic protein–disease associations that
   were robust to many sensitivity analyses. Finally, external validation
   of protein-based risk scores was performed using a restricted set of
   proteins that only included those that were measured in both the
   UKB-PPP and WHI-LLS. Therefore, the external validation results do not
   fully reflect the predictive ability of protein-based risk scores
   constructed using the full set of proteins evaluated in primary
   analyses. This limitation partially reflects the rapid evolution of
   large-scale proteomics research, characterized by increasingly
   extensive but imperfectly overlapping proteomic assays across platforms
   used in different cohorts. Nevertheless, our external validation
   approach still found that a limited panel of proteins measured in both
   the UKB-PPP and WHI-LLS significantly improved the prediction of
   incident coronary artery disease, heart failure and atrial fibrillation
   in both cohorts.

   Leveraging a population-based cohort of ~45,000 participants, this
   study characterized the circulating proteome associated with incident
   coronary artery disease, heart failure, atrial fibrillation and aortic
   stenosis. The study findings support new applications for established
   biomarkers (for example, atrial fibrillation surveillance using
   NT-proBNP) and identify strong and potentially useful predictors of
   cardiac diseases (for example, WFDC2 for heart failure). These results
   lay a foundation for future investigations to uncover disease
   mechanisms and assess the clinical utility of protein-based prevention
   strategies for cardiac diseases.

Methods

Study design and participants

   The study design is illustrated in Fig. [201]1. The UKB is a
   population-based cohort of ~500,000 volunteers aged 40–69 years at the
   time of study enrollment, recruited from 22 assessment centers across
   the United Kingdom during 2006–2010^[202]42. At enrollment,
   participants provided informed consent; underwent physical examination;
   provided details on sociodemographic characteristics, lifestyle
   factors, medical history, and medication use; and donated blood
   samples. Follow-up for incident events occurred via linkage to
   electronic health records through March 2020.

   The UKB-PPP is a precompetitive consortium of 13 biopharmaceutical
   companies funding the generation of blood-based proteomic data in a
   subset of UKB participants^[203]14,[204]43. Once the data are released,
   the sponsors have no direct role in research using these data, as is the
   case for the present work. The UKB-PPP includes 54,306 participants, of
   whom 46,673 (85.9%) were randomly selected from baseline, 6,385 (11.8%)
   were preselected by UKB-PPP consortium members based on certain
   characteristics of interest (for example, disease status or genetic
   ancestry) and 1,268 (2.3%) were selected because they attended multiple
   visits of the COVID-19 case–control imaging study^[205]14. We
   considered 52,705 participants with baseline proteomic data passing
   quality control for inclusion in the present study (Fig. [206]1).
   Participants were excluded if they had missing data for >10% of assay
   measurements or if they had missing data on self-reported
   race/ethnicity or genetic ancestry. We also excluded individuals
   inferred to be related (closer than third degree; kinship coefficient
   >0.0884) and those with self-reported or physician-ascertained coronary
   artery disease, heart failure, atrial fibrillation or aortic stenosis
   at baseline (see below for disease definitions).

   The UKB was approved by the North West Multi-center Research Ethics
   Committee. All analyses were conducted under UKB application no. 7089.
   The Mass General Brigham Institutional Review Board approved the
   secondary use of these data.

Protein measurements and proteomic data processing

   Blood samples donated by UKB-PPP study participants underwent proteomic
   profiling using the Olink Explore 1536 platform (Olink Proteomics),
   which measures 1,472 protein analytes across four different panels (the
   Cardiometabolic, Inflammation, Neurology and Oncology panels)
   representing 1,463 unique proteins (Supplementary Table
   [207]1)^[208]44. In brief, Olink uses proximity extension assay
   technology, whereby antibody pairs with conjugated oligonucleotides
   bind their target proteins in a pairwise manner. When an antibody pair
   has bound its target, complementary oligonucleotides undergo
   hybridization and, subsequently, extension by DNA polymerase. These DNA
   sequences—or tags—are then amplified through PCR amplification, which
   can be quantified using next-generation sequencing. For each assay and
   each sample, normalized protein expression values are calculated as the
   log2-transformed ratio of sequence read counts to the counts of the
   extension control, corrected for plate and batch
   effects^[209]14,[210]43.

   For proteins that were measured by multiple panels (TNF, IL-6 and
   CXCL8), we only evaluated data from the panel with the highest
   detectability per protein and, if necessary, the largest number of
   protein measurements exceeding the respective limit of detection (the
   Cardiometabolic panel for TNF and Oncology panel for IL-6 and CXCL8).
   We further excluded proteins with >10% missingness in the final study
   cohort (CTSS and NPM1 from the Neurology panel, PCOLCE from the
   Cardiometabolic panel and TACSTD2 from the Oncology panel;
   Supplementary Table [211]2) and imputed the remaining 1.1% of missing
   protein values using k-nearest neighbors (k = 10) via the impute.knn()
   function (impute package^[212]45 in R)^[213]43. The remaining 1,459
   protein markers underwent z-score transformation before analysis.
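
   The missingness filter and z-score transformation described above can
   be sketched as follows. This is a minimal Python illustration, not the
   authors' R pipeline: the function names and the samples-by-proteins
   matrix layout are assumptions, and the k-nearest-neighbor imputation
   step (performed with impute.knn() in the study) is deliberately left
   out.

```python
import numpy as np


def keep_low_missing(values, max_missing=0.10):
    """Mask of columns (proteins) whose missing fraction does not exceed max_missing."""
    values = np.asarray(values, dtype=float)
    return np.isnan(values).mean(axis=0) <= max_missing


def zscore_columns(values):
    """Center each column to mean 0 and scale to unit sample s.d., ignoring NaNs."""
    values = np.asarray(values, dtype=float)
    mean = np.nanmean(values, axis=0)
    sd = np.nanstd(values, axis=0, ddof=1)
    return (values - mean) / sd


# Hypothetical toy matrix: rows are participants, columns are proteins.
toy = np.array([[1.0, 4.0], [2.0, np.nan], [3.0, 6.0]])
toy_mask = keep_low_missing(toy, max_missing=0.10)  # second column has 1/3 missing
toy_z = zscore_columns(toy[:, toy_mask])
```

   In the study itself, imputation sits between these two steps, so the
   z-scores are computed on a complete matrix rather than with NaN-aware
   summaries as in this sketch.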

Covariate ascertainment

   Demographic characteristics, medical history, medication use and health
   behaviors were systematically assessed upon enrollment in the UKB.
   Self-reported race/ethnicity was collected at baseline and used as a
   binary variable (white versus nonwhite) in analyses. Smoking was
   dichotomized as ever (current or past) smoking versus no history of
   smoking. Type 2 diabetes was defined by self-report or qualifying
   International Classification of Diseases (ICD) codes. The Townsend
   deprivation index—an area-level score that incorporates data on home
   ownership, automobile ownership, employment and household
   overcrowding—was used as a composite measure of material
   deprivation^[214]46. Townsend deprivation index scores were
   inverse-rank normalized and z-score transformed before analysis.
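
   A rank-based inverse normal transformation of the kind applied to the
   Townsend deprivation index can be sketched in Python as below. The
   text does not state which rank offset or tie-handling rule was used,
   so the Blom offset (3/8) and average ranks for ties are assumptions,
   as are the function names.

```python
from statistics import NormalDist


def average_ranks(values):
    """1-based ranks, with tied values sharing the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def inverse_rank_normalize(values, offset=0.375):
    """Map ranks to standard normal quantiles (Blom offset assumed by default)."""
    n = len(values)
    standard_normal = NormalDist()
    return [
        standard_normal.inv_cdf((rank - offset) / (n - 2 * offset + 1))
        for rank in average_ranks(values)
    ]


# Hypothetical deprivation scores for three participants.
toy_normalized = inverse_rank_normalize([3.0, 1.0, 2.0])
```

   The output depends only on the ordering of the input values, which is
   why the transformation is robust to the skew typical of area-level
   deprivation scores.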

   Anthropometric data, physical measurements and blood samples were
   obtained by trained study staff^[215]42. Body mass index (BMI) was
   calculated from standing height and weight measured at baseline. After
   a 5-min period of seated rest, blood pressure was measured using an
   electronic monitor (Omron 705 IT, OMRON Healthcare) on two separate
   occasions with a 1-min interval in between; the mean was calculated and
   used for analysis when both measurements were available. Total
   cholesterol, high-density lipoprotein (HDL) cholesterol and creatinine
   concentrations were quantified in baseline blood samples (AU5800,
   Beckman Coulter).

   Missing values for BMI (missing for n = 717; 1.6%), systolic blood
   pressure (n = 2,210; 5.0%), total cholesterol (n = 1,995; 4.5%), HDL
   cholesterol (n = 5,591; 12.6%), serum creatinine (n = 2,010; 4.5%) and
   normalized Townsend deprivation index (n = 53; 0.1%) were imputed using
   linear regression models incorporating sex, age, race/ethnicity and the
   first ten principal components of genetic ancestry as predictors.
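
   Regression-based imputation of this kind can be illustrated with the
   following Python sketch. It assumes ordinary least squares fitted on
   participants with observed values and applied to those with missing
   values; the function name and the toy data are hypothetical, and the
   predictor matrix stands in for sex, age, race/ethnicity and the
   principal components.

```python
import numpy as np


def impute_by_regression(target, predictors):
    """Fill NaNs in `target` with predictions from a linear model on `predictors`."""
    target = np.array(target, dtype=float)  # copy so the input is left untouched
    design = np.column_stack([np.ones(len(target)), np.asarray(predictors, dtype=float)])
    observed = ~np.isnan(target)
    # Fit on rows with an observed target, then predict the missing rows.
    coef, *_ = np.linalg.lstsq(design[observed], target[observed], rcond=None)
    target[~observed] = design[~observed] @ coef
    return target


# Hypothetical toy example: one predictor, one missing covariate value.
toy_predictors = np.array([[0.0], [1.0], [2.0], [3.0]])
toy_target = np.array([1.0, 3.0, np.nan, 7.0])
toy_imputed = impute_by_regression(toy_target, toy_predictors)
```

   Observed values are returned unchanged; only the missing entries are
   replaced by fitted values.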

Outcome ascertainment

   Follow-up for incident outcomes occurred through linkage to national
   health records until March 2020. Incident events were defined by the
   occurrence of (1) at least one qualifying ICD-9 or ICD-10 code for a
   corresponding in- or outpatient diagnosis (as either a primary or
   secondary disease diagnosis); or (2) at least one Office of Population
   Censuses and Surveys Classification of Surgical Operations and
   Procedures (OPCS) code for a qualifying procedure (for example,
   coronary artery revascularization for coronary artery disease). The
   specific codes used to define each outcome are listed in Supplementary
   Table [216]22 (refs. ^[217]47,[218]48).

Proteomic association analyses

   Primary analyses tested the associations of circulating protein levels
   with incident cardiac events using Cox proportional hazards models
   adjusted for age, age², sex, self-reported race/ethnicity, the first
   ten principal components of genetic ancestry, smoking, normalized
   Townsend deprivation index, BMI, systolic blood pressure,
   antihypertensive medication use, total cholesterol, HDL cholesterol,
   cholesterol-lowering medication use, serum creatinine (as a measure of
   kidney function) and prevalent type 2 diabetes. In addition, to
   increase the specificity of the detected protein associations for a
   given disease (for example, coronary artery disease), we included the
   other cardiac outcomes (for example, heart failure, atrial fibrillation
   and aortic stenosis) as time-varying covariates using the tmerge()
   function in R (survival package)^[219]49. Bonferroni-corrected
   P < 0.05/5,836 (P < 0.05/(1,459 tested proteins × 4 tested outcomes) or
   ~8.6 × 10^−6) indicated statistical significance for the primary
   analyses. To illustrate the distributions of the most strongly
   associated proteins in individuals who experienced incident events
   versus those who did not, we constructed probability density functions
   showing the distributions of the strongest protein–disease associations
   in cases versus controls for each outcome using the ggplot2 package in
   R^[220]50.
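
   The significance threshold quoted above follows directly from the
   number of tests. The short Python check below reproduces the
   arithmetic; the function name is illustrative.

```python
def bonferroni_threshold(alpha, n_proteins, n_outcomes):
    """Return the number of tests and the per-test significance threshold."""
    n_tests = n_proteins * n_outcomes
    return n_tests, alpha / n_tests


# 1,459 proteins tested against 4 outcomes at a family-wise alpha of 0.05.
n_tests, threshold = bonferroni_threshold(0.05, 1459, 4)
```

   Here n_tests is 5,836 and the threshold rounds to 8.6 × 10^−6, matching
   the value used in the primary analyses.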

Pathway enrichment analyses

   Pathway enrichment analyses evaluated whether certain protein groups
   representing biologically distinct pathways were disproportionately up-
   or downregulated in individuals with incident cardiac events. Top
   biological functions, molecular pathways and cellular components were
   queried for each outcome using the Gene Ontology resource^[221]16 via
   Enrichr^[222]17. Enrichment tests were performed against a background
   gene set including the genes corresponding to all 1,459 proteins tested
   in primary analyses. Gene sets with a false discovery rate-adjusted
   P < 0.05 were considered statistically significant.

Main cis-Mendelian randomization analyses

   We performed two-sample MR analyses to explore the causal roles of
   proteins that were statistically significantly associated with one or
   more cardiac outcomes in epidemiological models (primary analyses).
   These analyses tested the associations of protein quantitative trait
   loci (pQTLs; genetic variants associated with circulating protein
   levels) with coronary artery disease, heart failure, atrial
   fibrillation and aortic stenosis. We obtained pQTL data from 35,571
   UKB-PPP participants who had their circulating proteomes profiled using
   the Olink Explore 1536 platform^[223]14. FinnGen (freeze 9;
   [224]https://r9.finngen.fi/) was used for genetic association data for
   coronary artery disease (cases of total participants: n = 43,518 of
   377,277), heart failure (n = 27,304 of 377,277), atrial
   fibrillation/flutter (n = 45,766 of 237,690) and operated calcific
   aortic stenosis (n = 9,153 of 377,277). Genetic association data were
   obtained from FinnGen rather than larger meta-GWASs (which often
   included the UKB) to avoid sample overlap between the exposure and
   outcome cohorts, which increases the risk of weak instrument bias in
   two-sample MR leading to inflated type I error rates^[225]51. All
   genetic data were derived from individuals of European ancestry, and
   there was no overlap between the exposure and outcome study cohorts.

   Because the use of cis-pQTLs (pQTLs that map near the protein-encoding
   gene) facilitates adherence to the assumptions of MR^[226]20,[227]21,
   we only used variants within a 200-kilobase range of the
   protein-encoding gene to construct our genetic instruments. We used a
   relaxed P value threshold for instrument selection (P < 5 × 10^−6)
   relative to the conventional genome-wide threshold (P < 5 × 10^−8) to
   increase the number of genetic instruments and optimize power, because
   the cis-regions for the assayed proteins represent only a small
   fraction of the genome and are expected to be enriched for
   associations. All
   cis-pQTLs with P < 5 × 10^−6 were clumped into largely independent loci
   (linkage disequilibrium R^2 < 0.1) using PLINK^[228]52. Linkage
   disequilibrium information was obtained from the European panel of
   phase 3 of the 1000 Genomes Project^[229]53.

   To minimize the risk of weak instrument bias, we only performed cis-MR
   analyses for genetic instruments with F-statistics >10. F-statistics
   were obtained by performing linear regression analyses of a protein’s
   genetic risk score (as the independent variable) against the measured
   levels of that protein (as the dependent variable) in the UKB-PPP.
   Genetic risk scores were calculated as weighted allele scores using the
   ‘clumping and thresholding’ method, applying the same P value and
   linkage disequilibrium R^2 thresholds as those used in our primary
   cis-MR analyses (P < 5 × 10^−6 and R^2 < 0.1). All scores were
   calculated using genotype array data; for proteins where genetic risk
   score calculation failed, F-statistics were estimated using summary
   statistics as equation ([230]1):
   $$F = \frac{n - k - 1}{k} \times \frac{R^{2}}{1 - R^{2}} \qquad (1)$$

   where n indicates the sample size of the original genome-wide
   association study, k the number of variants included in the genetic
   instrument and R^2 the variance in the exposure explained by the
   genetic variants^[231]54.
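
   A minimal sketch of equation (1) in Python is given below for
   illustration. The function name and the example inputs (three variants
   explaining 1% of the variance in 35,571 participants) are hypothetical
   and are not results from this study:

```python
def f_statistic(n: int, k: int, r2: float) -> float:
    """Summary-statistic F-statistic: ((n - k - 1) / k) * (R^2 / (1 - R^2))."""
    if k < 1 or n <= k + 1:
        raise ValueError("need k >= 1 and n > k + 1")
    if not 0.0 <= r2 < 1.0:
        raise ValueError("r2 must lie in [0, 1)")
    return ((n - k - 1) / k) * (r2 / (1.0 - r2))


# Hypothetical instrument: 3 variants explaining 1% of protein variance.
f_value = f_statistic(n=35571, k=3, r2=0.01)
passes_filter = f_value > 10   # the F > 10 rule used to limit weak instruments
print(round(f_value, 1), passes_filter)
```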

   Depending on the number of cis-pQTLs included in a protein’s genetic
   instrument, we used different MR methods to infer causal
   effects^[232]13. The inverse-variance-weighted (IVW) method was used
   with fixed effects for genetic instruments with two to three cis-pQTLs
   and with multiplicative random effects for those with more than three
   cis-pQTLs. The Wald ratio estimator was used for genetic instruments
   with only one cis-pQTL. We adjusted for between-variant correlation
   structure in all IVW models to avoid inflated estimates caused by
   residual correlation^[233]55,[234]56.
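
   The choice of estimator by instrument size can be sketched as follows.
   This is a simplified Python illustration, not the authors' code: it
   assumes uncorrelated variants (so the correlation adjustment described
   above is omitted), uses a first-order standard error for the Wald
   ratio, and caps the random-effects scaling factor at 1, which is one
   common convention. All function names are hypothetical:

```python
import math
from typing import Sequence, Tuple


def wald_ratio(bx: float, by: float, se_by: float) -> Tuple[float, float]:
    """Single-variant estimate: outcome effect divided by exposure effect."""
    if bx == 0:
        raise ValueError("exposure effect must be non-zero")
    return by / bx, se_by / abs(bx)


def ivw(bx: Sequence[float], by: Sequence[float], se_by: Sequence[float],
        random_effects: bool = False) -> Tuple[float, float]:
    """Inverse-variance-weighted estimate for uncorrelated variants."""
    weights = [1.0 / s ** 2 for s in se_by]
    denom = sum(w * x * x for w, x in zip(weights, bx))
    beta = sum(w * x * y for w, x, y in zip(weights, bx, by)) / denom
    se = math.sqrt(1.0 / denom)
    if random_effects and len(bx) > 1:
        # Cochran's Q measures heterogeneity between variant-specific estimates.
        q = sum(w * (y - beta * x) ** 2 for w, x, y in zip(weights, bx, by))
        se *= max(1.0, math.sqrt(q / (len(bx) - 1)))
    return beta, se


def cis_mr_estimate(bx: Sequence[float], by: Sequence[float],
                    se_by: Sequence[float]) -> Tuple[float, float]:
    """Pick the estimator from the number of cis-pQTLs in the instrument."""
    k = len(bx)
    if k == 0:
        raise ValueError("instrument has no variants")
    if k == 1:
        return wald_ratio(bx[0], by[0], se_by[0])
    # Fixed effects for 2-3 variants, multiplicative random effects for > 3.
    return ivw(bx, by, se_by, random_effects=(k > 3))
```

   Calling cis_mr_estimate with one variant returns the Wald ratio, with
   two or three variants the fixed-effects IVW estimate and with more than
   three variants the random-effects IVW estimate.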

Sensitivity cis-Mendelian randomization analyses

   Because it is routinely recommended to evaluate the robustness of
   cis-MR estimates using multiple sensitivity analyses^[235]57, we
   performed additional analyses using different MR approaches and
   instrument selection parameters. First, we evaluated the possibility of
   reverse causation affecting our analyses by performing Steiger
   filtering to exclude variants explaining more variance in the outcome
   (cardiac diseases) than the exposure (circulating protein levels).
   Second, as cis-MR analyses often rely on pQTLs that may be residually
   correlated with each other, we carried out sensitivity analyses testing
   genetic instruments that were constructed using a range of linkage
   disequilibrium R^2 thresholds (R^2 < 0.001, R^2 < 0.01, R^2 < 0.1 and
   R^2 < 0.2). Third, because the primary genetic instruments were
   constructed using subgenome-wide significant cis-pQTLs, we verified the
   robustness of our genetic associations against different P value
   thresholds (P < 5 × 10^−4, P < 5 × 10^−6 and P < 5 × 10^−8). Fourth, we
   calculated effect estimates using the MR-Egger method to account for
   horizontal pleiotropy (effects of the genetic instruments on the
   outcome through pathways other than the exposure of interest). Fifth,
   we performed one-sample cis-MR analyses to test the associations of the
   prioritized proteins’ genetic risk scores with cardiac diseases in an
   external UKB sample (see below). Sixth, to account for the possibility
   that a certain protein’s genetic instrument could affect the outcomes
   through one or more other proteins, we calculated
   multivariable-adjusted cis-MR estimates that were adjusted for the
   genetic instruments of all prioritized proteins significantly
   associated with the tested protein’s genetic risk score (see below).

   Genetic risk scores were calculated from genetic association data from
   the UKB-PPP as weighted allele scores using the ‘clumping and
   thresholding’ method, applying the same P value and linkage
   disequilibrium R^2 thresholds as those used in the primary cis-MR
   analyses (P < 5 × 10^−6 and R^2 < 0.1). One-sample cis-MR was performed
   in UKB participants who were not included in the UKB-PPP, were free of
   cardiac diseases at baseline, and had no missing covariates
   (n = 407,230). Associations of the circulating proteins’ genetic risk
   scores with cardiac outcomes were tested using Cox regression models
   adjusted for age, age², sex, race/ethnicity and the first ten principal
   components of genetic ancestry. In addition, we interrogated the
   possibility that a certain protein’s genetic instrument was also
   associated with other proteins’ circulating levels (proteins with
   shared pQTLs). To investigate this, we tested the associations of
   genetic risk scores for all proteins with putative causal associations
   in the primary cis-MR analyses. Linear regression models adjusted for
   age, age², sex, race/ethnicity and the first ten principal components
   of genetic ancestry were employed for these analyses. For proteins with
   genetic instruments that were significantly associated with one or more
   other proteins (‘correlated proteins’), we then calculated
   multivariable-adjusted cis-MR estimates in the independent UKB sample
   (n = 407,230) using Cox regression models adjusted for age, age², sex,
   race/ethnicity, the first ten principal components of genetic ancestry
   and the genetic risk scores of all ‘correlated’ proteins.

   Sensitivity analyses were performed for all genetic protein–disease
   associations with at least nominal significance (unadjusted P < 0.05)
   in primary cis-MR analysis. Genetic associations were considered robust
   if (1) the effect estimates were directionally consistent across all
   primary and sensitivity analyses and (2) MR-Egger suggested no
   horizontal pleiotropy (P ≥ 0.05 for the intercept test or P < 0.05 for
   the intercept test with P < 0.05 for the causal test). MR analyses were
   performed using the TwoSampleMR and MendelianRandomization packages in
   R^[236]58,[237]59. Druggability profiles of proteins with robust
   genetic associations were extracted from a published list of druggable
   genes^[238]22.
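
   The two robustness criteria stated above can be written as a small
   decision rule. The Python sketch below is an illustration under the
   assumption that each analysis yields one signed effect estimate and
   that MR-Egger provides one P value for the intercept and one for the
   causal estimate; the function names are hypothetical:

```python
from typing import Sequence

ALPHA = 0.05


def directionally_consistent(estimates: Sequence[float]) -> bool:
    """True if all effect estimates share the same (non-zero) sign."""
    if len(estimates) == 0:
        return False
    return all(e > 0 for e in estimates) or all(e < 0 for e in estimates)


def egger_criterion_met(intercept_p: float, causal_p: float) -> bool:
    """No evidence of pleiotropy, or a significant causal estimate despite it."""
    return intercept_p >= ALPHA or causal_p < ALPHA


def is_robust(estimates: Sequence[float], intercept_p: float,
              causal_p: float) -> bool:
    """Combine criteria (1) and (2) for one genetic protein-disease association."""
    return (directionally_consistent(estimates)
            and egger_criterion_met(intercept_p, causal_p))
```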

Colocalization analyses

   We performed colocalization analyses to test for shared causal variants
   between the prioritized proteins’ cis loci (from the UKB-PPP) and
   corresponding cardiac outcomes (from FinnGen). Analyses considered all
   variants that were present in the protein and outcome summary
   statistics within ±200 kilobases of each biomarker’s protein-encoding
   region. Colocalization analyses were performed using the coloc.abf()
   function (coloc package^[239]60 in R). All colocalization analysis
   results were expressed as test statistics representing the posterior
   probabilities of five hypotheses: H[0], neither trait has an
   association with a genetic variant in the region; H[1], only the
   indicated protein has an association with a genetic variant in the
   region; H[2], only the indicated cardiac disease has an association
   with a genetic variant in the region; H[3], both traits are associated
   but with different causal variants; and H[4], both traits are
   associated and share a single causal variant. A posterior probability
   for H[4] > 0.80 indicated strong colocalization evidence.
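
   To illustrate how the five posterior probabilities relate to each
   other, the following Python sketch combines per-variant log approximate
   Bayes factors under the single-causal-variant model. It is not the
   coloc implementation: the prior values (p1 = p2 = 1 × 10^−4,
   p12 = 1 × 10^−5) are assumptions, since the priors are not stated
   here, and the sums are computed without the log-space arithmetic a
   production implementation would use for numerical stability:

```python
import math
from typing import List, Sequence


def coloc_posteriors(labf_protein: Sequence[float],
                     labf_disease: Sequence[float],
                     p1: float = 1e-4, p2: float = 1e-4,
                     p12: float = 1e-5) -> List[float]:
    """Posterior probabilities of H0..H4 from per-variant log Bayes factors."""
    if len(labf_protein) != len(labf_disease) or len(labf_protein) == 0:
        raise ValueError("both traits need the same non-empty variant set")
    abf1 = [math.exp(x) for x in labf_protein]
    abf2 = [math.exp(x) for x in labf_disease]
    s1, s2 = sum(abf1), sum(abf2)
    s12 = sum(a * b for a, b in zip(abf1, abf2))   # same variant in both traits
    support = [
        1.0,                          # H0: no association with either trait
        p1 * s1,                      # H1: protein only
        p2 * s2,                      # H2: disease only
        p1 * p2 * (s1 * s2 - s12),    # H3: both traits, different variants
        p12 * s12,                    # H4: both traits, one shared variant
    ]
    total = sum(support)
    return [s / total for s in support]


# Hypothetical three-variant region with the same lead variant for both traits.
shared = coloc_posteriors([20.0, 0.0, 0.0], [15.0, 0.0, 0.0])
strong_colocalization = shared[4] > 0.80
```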

Sex-stratified association analyses

   Sex-stratified analyses tested the associations of circulating protein
   levels with incident coronary artery disease, heart failure, atrial
   fibrillation and aortic stenosis in self-reported female and male
   participants separately. These analyses were performed using Cox
   proportional hazards models adjusted for age, age², self-reported
   race/ethnicity, the first ten principal components of genetic ancestry,
   smoking, normalized Townsend deprivation index, BMI, systolic blood
   pressure, antihypertensive medication use, total cholesterol, HDL
   cholesterol, cholesterol-lowering medication use, serum creatinine and
   prevalent type 2 diabetes. Coronary artery disease, heart failure,
   atrial fibrillation and aortic stenosis were included as time-varying
   covariates. The difference in effect size for the protein–disease
   associations was quantified by subtracting the natural logarithm of the
   HR in females from the natural logarithm of the HR in males
   (log(HR)[males] − log(HR)[females]).

   We tested all protein–disease associations reaching significance
   (P < 0.05/5,836) in at least one sex for protein-by-sex interactions.
   These analyses were performed by fitting an interaction term
   (sex × circulating protein levels) in Cox proportional hazards models
   adjusted for sex, age, age², self-reported race/ethnicity, the first
   ten principal components of genetic ancestry, smoking, normalized
   Townsend deprivation index, BMI, systolic blood pressure,
   antihypertensive medication use, total cholesterol, HDL cholesterol,
   cholesterol-lowering medication use, serum creatinine, prevalent type 2
   diabetes and circulating levels of the tested protein.

Construction of protein-based prediction models

   We constructed protein-based risk scores to predict incident cardiac
   events in the UKB-PPP. We created three risk scores for each cardiac
   outcome using logistic LASSO regression, including (1) a score based on
   clinical risk factors; (2) a score based on circulating proteins; and
   (3) a combined score (that is, using clinical risk factors and
   circulating proteins). The clinically evaluable variables used as
   covariates in primary analyses (age, sex, self-reported race/ethnicity,
   smoking, BMI, systolic blood pressure, antihypertensive medication use,
   total cholesterol, HDL cholesterol, cholesterol-lowering medication
   use, serum creatinine and type 2 diabetes) were fed into LASSO models
   for the clinical risk scores. Circulating levels of the 1,459 proteins
   tested in primary analyses were fed into LASSO models for the
   protein-based risk scores.

   The study cohort was randomly divided into a training (80%; n = 35,450)
   and testing (20%; n = 8,863) set. All clinical, proteomic and combined
   prediction scores were constructed using LASSO regression for variable
   selection and regularization. In brief, LASSO is a regularized
   regression method that selects informative variables (for example,
   proteins or clinical risk factors) from high-dimensional and correlated
   datasets while shrinking the regression coefficients of less
   informative variables to zero. We used tenfold cross-validation to tune
   the regularization parameter (λ; the parameter that controls the
   strength of shrinkage and variable selection) for each LASSO model.
   During the cross-validation procedure, multiple LASSO models are
   iteratively constructed for each set of predictors (clinical risk
   factors, proteins or both) using different values for λ, with each λ
   corresponding to a certain number of variables included in the
   prediction model. A higher λ value corresponds to fewer predictive
   variables in the regression model. The accuracy of each LASSO model
   (with its respective λ value) was quantified using the ROC AUC.

   We used the ‘one standard error rule’ to determine the optimal λ for
   all proteomic and combined prediction models. This approach reduces the
   complexity of prediction models that are derived from high-dimensional
   datasets by selecting the largest λ (which corresponds to the smallest
   number of predictive covariates) for which the AUC is within one
   standard error of the highest AUC value during the cross-validation
   process. For models based solely on clinical risk factors, the λ
   corresponding to the highest AUC was used, considering that these risk
   scores were derived from a specific set of risk factors rather than a
   high-dimensional dataset.
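
   The λ selection logic described above can be sketched in a few lines
   of Python. The function name and the cross-validation values in the
   example are hypothetical; the study used the glmnet package in R:

```python
from typing import Sequence


def select_lambda(lambdas: Sequence[float], mean_auc: Sequence[float],
                  se_auc: Sequence[float], one_se_rule: bool = True) -> float:
    """Choose lambda from cross-validated AUCs.

    With one_se_rule=True, return the largest lambda whose mean AUC lies
    within one standard error of the best AUC; otherwise return the lambda
    with the highest mean AUC.
    """
    best = max(range(len(lambdas)), key=lambda i: mean_auc[i])
    if not one_se_rule:
        return lambdas[best]
    cutoff = mean_auc[best] - se_auc[best]
    eligible = [lam for lam, auc in zip(lambdas, mean_auc) if auc >= cutoff]
    return max(eligible)


# Hypothetical cross-validation results for four candidate lambda values.
lambdas = [0.1, 0.05, 0.01, 0.005]
mean_auc = [0.70, 0.745, 0.75, 0.748]
se_auc = [0.01, 0.01, 0.01, 0.01]

lambda_proteomic = select_lambda(lambdas, mean_auc, se_auc)                   # one-SE rule
lambda_clinical = select_lambda(lambdas, mean_auc, se_auc, one_se_rule=False) # highest AUC
```

   In this example the one-SE rule picks a larger λ (a sparser model) than
   the highest-AUC rule, which is the intended trade-off for the
   high-dimensional proteomic and combined models.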

Evaluation of protein-based prediction models

   The performance of each prediction model was evaluated internally in
   the testing set of the UKB-PPP cohort (see above) and externally in the
   WHI-LLS dataset (see below). Performance was assessed by ROC curve
   analysis, and the DeLong test was used to evaluate differences between
   AUCs. We
   also calculated the detection rates for each model at false positive
   rates of 0.01, 0.05 and 0.1. ‘Exact’ detection rates were calculated as
   the proportion of affected individuals with positive test results (the
   number of true positives divided by the number of true positives plus
   false negatives). ‘Approximated’ detection rates were calculated using
   equation ([240]2):
   $$\mathrm{DR} = 1 - \Phi\left(\Phi^{-1}(1 - \mathrm{FPR}) - \frac{\mu_{\mathrm{cases}} - \mu_{\mathrm{controls}}}{\sigma_{\mathrm{controls}}}\right) \qquad (2)$$

   where DR is the detection rate; Φ the cumulative distribution function
   of the normal distribution with 0 as mean and (σ[cases]/σ[controls]) as
   s.d.; Φ^−1 the inverse cumulative distribution function of the standard
   normal distribution; μ[cases] the mean of the cases; μ[controls] the
   mean of the controls; σ[cases] the s.d. of the cases; σ[controls] the
   s.d. of the controls; and FPR the false positive rate.
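
   Both detection rates can be illustrated with the Python sketch below.
   The 'approximated' rate follows equation (2) directly. For the 'exact'
   rate, the way the score threshold is chosen is an assumption: here it
   is the highest cut-off at which the share of controls scoring above it
   does not exceed the target false positive rate. The simulated scores
   are hypothetical and are not study data:

```python
import random
from statistics import NormalDist, mean, stdev
from typing import Sequence


def approximate_detection_rate(cases: Sequence[float],
                               controls: Sequence[float], fpr: float) -> float:
    """Equation (2): detection rate from Gaussian summaries of the scores."""
    shift = (mean(cases) - mean(controls)) / stdev(controls)
    z = NormalDist().inv_cdf(1.0 - fpr) - shift
    # Phi has mean 0 and s.d. equal to sigma_cases / sigma_controls.
    return 1.0 - NormalDist(0.0, stdev(cases) / stdev(controls)).cdf(z)


def exact_detection_rate(cases: Sequence[float],
                         controls: Sequence[float], fpr: float) -> float:
    """True positives / (true positives + false negatives) at a given FPR."""
    ranked = sorted(controls)
    n_false_pos = int(fpr * len(ranked))            # controls allowed above cut-off
    threshold = ranked[len(ranked) - n_false_pos - 1]
    return sum(score > threshold for score in cases) / len(cases)


# Hypothetical risk scores: cases shifted upward by one control s.d.
rng = random.Random(0)
sim_controls = [rng.gauss(0.0, 1.0) for _ in range(5000)]
sim_cases = [rng.gauss(1.0, 1.0) for _ in range(5000)]

for rate in (0.01, 0.05, 0.1):
    print(rate,
          round(exact_detection_rate(sim_cases, sim_controls, rate), 3),
          round(approximate_detection_rate(sim_cases, sim_controls, rate), 3))
```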

   In addition, we constructed Kaplan–Meier plots to visualize the
   cumulative incidence of each outcome during follow-up according to
   proteomic risk score quintiles. We also tested the association of each
   risk score (as a continuous variable) with its corresponding outcome
   using Cox regression models adjusted for age, age², self-reported
   race/ethnicity, the first ten principal components of genetic ancestry,
   smoking, normalized Townsend deprivation index, BMI, systolic blood
   pressure, antihypertensive medication use, total cholesterol, HDL
   cholesterol, cholesterol-lowering medication use, serum creatinine and
   prevalent type 2 diabetes. Coronary artery disease, heart failure,
   atrial fibrillation and aortic stenosis were included as time-varying
   covariates. We used the glmnet^[241]61 and pROC^[242]62 packages in R
   to construct and test all risk scores.

External validation of protein-based prediction models

   External validation analyses were performed in the WHI^[243]63—a
   prospective study of women recruited at 40 centers across the United
   States from 1993 to 1998—for coronary artery disease, heart failure and
   atrial fibrillation (all outcomes for which proteomic data
   significantly improved prediction in the UKB-PPP). A subset of WHI
   participants were invited for the LLS, which consisted of a one-time
   in-person study visit (between March 2012 and May 2013) including a
   blood draw, clinical evaluation and assessment of functional status. A
   total of 1,333 WHI-LLS participants underwent proteomic profiling.
   After excluding participants with missing values for >10% of measured
   proteins, missing data on time between enrollment and time of blood
   donation, or a history of heart disease, we included data from 1,083
   WHI-LLS participants (Extended Data Fig. [244]7 and Supplementary Table
   [245]19).

   Proteomic profiling was performed using six Olink Target 96 platforms
   (the Cardiometabolic, Cardiovascular II, Cardiovascular III,
   Inflammation, Neurology and Oncology III panels), measuring a total of
   552 protein analytes representing 518 unique proteins that were also
   measured in the UKB-PPP (Supplementary Table [246]20). Because only a
   subset of proteins was measured in both the WHI-LLS and the UKB-PPP,
   all proteomic and combined models were retrained in the UKB-PPP using
   only the subset of proteins that was measured in both the WHI-LLS and
   the UKB-PPP.

   WHI-LLS participants underwent follow-up for coronary artery disease
   (defined as a composite of fatal and nonfatal myocardial infarction
   using a standardized adjudication process^[247]64), heart failure
   (defined as probable or definite congestive heart failure
   hospitalization using a standardized adjudication process^[248]65) and
   atrial fibrillation (defined as a composite of hospitalized and
   outpatient atrial fibrillation using self-report). WHI-LLS participants
   underwent follow-up through February 2022, resulting in a median (IQR)
   follow-up duration of 8.4 (6.1–8.9) years after blood draw. The
   performance of each prediction model was evaluated by ROC curve
   analysis.

Statistical analyses

   All tests were two-sided. Data analysis was performed using R (v.4.1.0;
   R Project for Statistical Computing) unless otherwise specified.

Reporting summary

   Further information on research design is available in the [249]Nature
   Portfolio Reporting Summary linked to this article.

Supplementary information

   Supplementary Information: Supplementary Figs. 1 and 2.
   Reporting Summary.
   Supplementary Tables 1, 2 and 4–22.
   Supplementary Table 3.

Source data

   Statistical source data are provided for Figs. 2, 3, 4a,b and 5a–d and
   for Extended Data Figs. 1–6, 8a–d, 9 and 10.

Acknowledgements