Abstract
Genomics can provide insight into the etiology of type 2 diabetes and
its comorbidities, but assigning functionality to non-coding variants
remains challenging. Polygenic scores, which aggregate variant effects,
can uncover mechanisms when paired with molecular data. Here, we test
polygenic scores for type 2 diabetes and cardiometabolic comorbidities
for associations with 2,922 circulating proteins in the UK Biobank. The
genome-wide type 2 diabetes polygenic score associates with 617
proteins, of which 75% also associate with another cardiometabolic
score. Partitioned type 2 diabetes scores, which capture distinct
disease biology, associate with 342 proteins (20% unique). In this
work, we identify key pathways (e.g., complement cascade), potential
therapeutic targets (e.g., FAM3D in type 2 diabetes), and biomarkers of
diabetic comorbidities (e.g., EFEMP1 and IGFBP2) through causal
inference, pathway enrichment, and Cox regression of clinical trial
outcomes. Our results are available via an interactive portal
([64]https://public.cgr.astrazeneca.com/t2d-pgs/v1/).
Subject terms: Type 2 diabetes, Personalized medicine
__________________________________________________________________
Here, the authors leveraged genetic and proteomic information from the
UK Biobank and two randomized controlled trials to characterize how
polygenic disease risk for T2D and related comorbidities perturbs the
plasma proteome and glean therapeutic insights.
Introduction
Diabetes mellitus is a complex, multifactorial metabolic disorder
diagnosed via a single clinical feature, hyperglycaemia^[65]1,[66]2.
Among the major diagnoses of diabetes mellitus, type 2 diabetes (T2D)
has the largest worldwide disease burden^[67]1. To date, large-scale
genome-wide association studies (GWAS) for T2D^[68]3,[69]4 have helped
uncover important T2D biology, such as the link between C2CD4A and
beta-cell dysfunction^[70]5,[71]6. However, many T2D risk variants
identified via GWAS have small effect sizes and tend to map to
non-protein-coding regions, which makes it challenging to uncover their
downstream biological effects. Polygenic risk scores (PRS), also known
as polygenic scores (PGS), aggregate the small effects of these
variants and can be utilised for risk prediction and stratification,
including for T2D^[72]7,[73]8.
Pairing PGS with molecular data can provide a powerful approach to
understanding the downstream biological effects of polygenic risk to
disease. This can help elucidate key pathways relevant to disease
pathophysiology and potentially find therapeutic targets that may be
missed by traditional disease-gene mapping approaches. For example,
studies by Ritchie et al. and Steffen et al. tested the association
between PGS and protein expression levels, uncovering molecular
mechanisms underlying polygenic risk of cardiometabolic
diseases^[74]9,[75]10. Furthermore, PGS and proteomics information,
coupled with causal inference methodology such as mediation,
complements approaches that use individual genetic variants as
instruments, such as Mendelian randomisation (MR). This is particularly
crucial for proteins that lack the genetic variation (i.e., large
enough effect size and/or frequency) needed to serve as MR instruments.
T2D is highly heterogeneous, with affected individuals having different
degrees of insulin resistance and beta-cell dysfunction^[76]11. In
addition, the vast majority of T2D patients have at least one
additional comorbidity, such as hypertension, obesity or
hyperlipidaemia^[77]12, resulting from the complex interplay between
the pathophysiology of T2D, adiposity, and the other comorbidities. A
single T2D PGS is unlikely to capture this heterogeneity. To address
this, partitioned polygenic scores (pPS), derived from the genetic
clustering of GWAS-identified T2D variants, have been developed to
capture biological processes underlying T2D genetic risk^[78]13,[79]14.
These pPS improve the prediction of clinical outcomes in patients with
T2D compared to a conventional genome-wide T2D PGS^[80]15. Based on
these observations, we hypothesise that leveraging genome-wide and
partitioned T2D PGS, cardiometabolic PGS, and large-scale proteomics
data will identify proteins and pathways that lead to the development
of T2D or comorbidity and discover molecular mechanisms where these
processes intersect.
In this work, we interrogate how PGS for T2D (including five
partitioned T2D scores^[81]13: beta cell, lipodystrophy, liver lipid,
obesity, proinsulin) and its cardiometabolic complications (coronary
artery disease - CAD, chronic kidney disease - CKD and adiposity
ascertained as body mass index - BMI) perturb the plasma proteome using
data from the UK Biobank Pharma Proteomics Project
(UKB-PPP)^[82]16,[83]17. We further perform causal inference in the UK
Biobank and survival analysis for cardiorenal outcomes in randomised
controlled trials (RCTs) using the PGS-associated proteins to identify
potential therapeutic targets and biomarkers of T2D comorbidities. By
leveraging a large-scale biobank and two RCTs, our study provides
insights into the etiology of T2D and its comorbidities that may have
translational potential.
Results
Study design
This study leverages proteogenomic data from three studies: the UK
Biobank^[84]18 (UKB), Exenatide Study of Cardiovascular Event
Lowering^[85]19 (EXSCEL), and Dapagliflozin Effect on Cardiovascular
Events–Thrombolysis in Myocardial Infarction 58^[86]20
(DECLARE–TIMI 58). The UKB is a population-level biobank with 14 years
of follow-up time, with proteomic data available as part of the UK
Biobank Pharma Proteomics Project (UKB-PPP). EXSCEL and DECLARE-TIMI 58
were both cardiovascular outcome trials in patients with T2D with mean
follow-up times of 3.2 and 4.2 years, respectively (see Methods for a
full cohort description). We first tested T2D and cardiometabolic PGS
for association with circulating proteins in the UKB, followed by
further analyses to identify putatively causative proteins among the
set of PGS-associated protein biomarkers. Then, to assess the
association between PGS-associated proteins and common comorbidities,
we used cardiorenal outcomes in EXSCEL and DECLARE-TIMI 58. See
Fig. [87]1 for an overview of our workflow, Methods, and Supplementary
Data [88]4 and [89]5 for descriptions of all PGS.
Fig. 1. Study workflow.
[90]Fig. 1
[91]Open in a new tab
The centre of the flowchart corresponds to the primary objectives or
aims of the study (in Blue). Branching off each objective are short
summaries of the analyses corresponding to that objective (in Red).
Associations of T2D PGS with protein expression levels
To determine how polygenic risk for T2D impacts the circulating
proteome, we tested PGS for T2D, including the partitioned scores, for
association with circulating proteins in the UKB-PPP. The genome-wide
T2D PGS (PGS[T2D_gw]) was associated with 648 proteins in the UKB
discovery set (Table [92]1); of these, 617 replicated in the UKB
replication set (FDR < 5%). The proteins that were among the top 1% in
terms of variance (R^2) explained by the PGS[T2D_gw] include PON3, CKB,
APOF, and IGFBP2 (Fig. [93]2A). The partitioned T2D scores and the
PGS[T2D_gwas] (i.e., the PGS derived from GWAS-significant variants)
were associated with fewer proteins compared to the genome-wide T2D
score (PGS[T2D_gw]) (Supplementary Figs. [94]2–[95]6 and Supplementary
Data [96]6). Despite comprising fewer variants than PGS[T2D_gw], three
of the partitioned T2D scores were significantly associated with
proteins that were not associated with the PGS[T2D_gw]:
PGS[T2D_beta_cell] (76% of its associations), PGS[T2D_liver_lipid]
(55%), and PGS[T2D_lipodsytrophy] (8%). When comparing the beta
coefficients of the different T2D PGS for the circulating proteins, we
found that they were negatively correlated between PGS[T2D_gw] and
PGS[T2D_liver_lipid] (Pearson’s r = − 0.19, p = 5.9×10^-25) and between
PGS[T2D_gw] and PGS[T2D_proinsulin] (Pearson’s r = − 0.07,
p = 3.8 × 10^−^4, Fig. [97]2C). While their effects on the circulating
proteome differ, the actual PGS[T2D_gw] and PGS[T2D_liver_lipid] scores
are positively correlated (see Supplementary Fig. [98]7). Overall, this
suggests that the partitioned T2D scores capture protein associations
representing perturbations in specific T2D-related biological pathways
that may be obscured when variant effects are aggregated in a
genome-wide score (PGS[T2D_gw]).
Table 1.
Summary of polygenic score (PGS) associations
PGS Category PGS Top protein assoc. N significant proteins N
significant after pQTL/BMI adjustment % Unique assoc.
T2D (Genome-wide) PGS[T2D_gw] IGSF9 (+) 617 (648) 381 (310) 18
T2D (GWAS-significant only) PGS[T2D_gwas] PAM (−) 61 (88) 51 (61) 2
Partitioned T2D polygenic scores PGS[T2D_beta_cell] ABO (+) 21 (30) 6
(4) 29
PGS[T2D_lipodystrophy] CD300LG (−) 97 (137) 96 (119) 4
PGS[T2D_liver_lipid] LEPR (−) 306 (392) 13 (9) 22
PGS[T2D_obesity] LEP (+) 3 (15) 2 (11) 0
PGS[T2D_proinsulin] FOLR3 (−) 0 (0) 0 (0) −
Other cardiometabolic scores PGS[CAD] GRN (+) 130 (159) 102 (78) 6
PGS[CKD] HLA-A (+) 556 (584) 534 (515) 29
PGS[BMI] LEP (+) 626 (694) 625 (662) 11
[99]Open in a new tab
PGS Category: descriptive information for the PGS. PGS: Name of tested
polygenic score. Top protein association: protein with the lowest
p-value in the combined UKB-PPP cohort (discovery + replication) for
each PGS. N significant proteins: number of significant proteins after
replication. The brackets specify the total number of proteins that
were significant in the discovery subset. N significant after pQTL/BMI
adjustment: number of replicated proteins remaining significant after
pQTL (protein quantitative trait loci) and BMI (body mass index)
adjustments. The brackets specify the total number of proteins that
were significant after adjustment in the discovery subset. For the
PGS[BMI] and the PGS[T2D_obesity], we did not also adjust for BMI. %
Unique assoc.: percentage of significant PGS-associated proteins that
were only associated with that PGS.
Fig. 2. Polygenic score (PGS) associations with circulating proteins.
[100]Fig. 2
[101]Open in a new tab
A Volcano plot of PGS[T2D_gw]-protein beta coefficients (obtained from
linear regression) and the unadjusted -log[10] p-values (two-sided),
with the colour indicating the magnitude of the -log10 p-values.
Labelled proteins are among the top 1% in terms of variance (R^2)
explained by the PGS[T2D_gw]. B Beta-beta plot of PGS[T2D_gw] beta
coefficients on circulating proteins with (y-axis) and without (x-axis)
BMI adjustment. The diagonal is dashed grey, while the regression line
is solid grey. Each point represents a protein; light blue points
indicate replicated proteins that remained significant with the
adjustment, red points indicate replicated proteins that were no longer
significant after the adjustment, and dark blue points indicate
proteins that did not significantly replicate prior to adjusting for
BMI or pQTLs. C Pearson’s correlations of PGS beta coefficients from
the regression on circulating protein levels. Red indicates pairs of
PGS with positively correlated effect sizes; blue indicates negatively
correlated effect sizes. “*” indicates correlations with a
p-value < 0.05 and “**” indicates correlations with a p-value < 0.001
(a Bonferroni correction for 45 comparisons). P-values are unadjusted
and two-sided t test as the test statistic follows a t distribution. D
Bar plot indicating the overlap between proteins significantly
associated with the T2D PGS and the other cardiometabolic PGS. The
x-axis is the PGS label, and the y-axis is the percentage of
PGS[T2D_gw]-associated proteins that are also associated with another
PGS (e.g., over 60% of proteins were also associated with the
PGS[BMI]). E Beta-beta plot of PGS[T2D_gw] effect sizes on circulating
proteins with (y-axis) and without (x-axis) pQTL adjustment, with the
same definitions as panel (B) albeit for a pQTL adjustment.
Impact of adiposity on T2D score-protein associations
T2D and adiposity are highly interconnected^[102]21, and as such,
protein-T2D score associations can be confounded or mediated by
adiposity. To explore which protein associations are independent of
adiposity, we added one of three anthropometric measures capturing
obesity as covariates to our PGS-protein models: body mass index (BMI),
waist circumference (WC), and waist-hip ratio (WHR). Out of the 648
proteins significantly associated with the PGS[T2D_gw] in the discovery
set, 314 remained significant after adjustment for BMI (Fig. [103]2B),
demonstrating the close interplay between T2D genetic risk, BMI, and
circulating protein levels. Adjusting for WC or WHR instead of BMI
produced similar results (see Supplementary Fig. [104]8 and
Supplementary Data [105]7), with associations for 301 and 360 proteins
remaining significant, respectively.
Next, to explore the extent to which adiposity is the mechanism by
which the T2D scores exert their influence on the plasma proteome, we
performed a mediation analysis with the T2D scores as the exposure, BMI
as the mediator, and protein expression levels as the outcome (see
Supplementary Data [106]7, Methods). For 94 of the 617 proteins
significantly associated with the PGS[T2D_gw], the indirect effect and
total effects were significant (p-value < 4.3 × 10^−6), but the direct
effect of the PGS was not. This suggests that for these proteins, such
as ADM, LEP and TNF, BMI mediated the bulk of the PGS[T2D_gw]’s effect
on the circulating levels. For 518 proteins, the direct effect of
PGS[T2D_gw] remained significant, but BMI still mediated part of the
PGS[T2D_gw]’s effect on the proteins: the proportion mediated by BMI
ranged from relatively small (0.05 for MANSC1) to relatively large
(0.82 for FABP4), with a median of 0.38 (see Supplementary
Fig. [107]8C). For 12 proteins, the indirect and direct effects were in
the opposite direction. One of these proteins, TIMP4, has been
previously reported to have a discordant relationship with adiposity
and T2D^[108]21. Among the partitioned T2D scores, it was only
PGS[T2D_obesity] that the bulk of the effect was mediated via BMI.
Cardiometabolic score associations
We then tested a selection of cardiometabolic PGS representing common
T2D comorbidities (CAD, CKD, and BMI) for their association with
protein expression levels (Panel A from Supplementary
Figs. [109]9–[110]11). This analysis aimed to assess proteomic
signatures that are specific to and shared across polygenic risk for
T2D and its comorbidities. The proteins associated with the PGS[T2D_gw]
were frequently associated with one or more cardiometabolic PGS (see
Fig. [111]2D). Overall, 64% were also significantly associated with the
PGS[BMI], 33% were associated with the PGS[CKD], and 17% were also
associated with the PGS[CAD]. The beta coefficients of the various
comorbidity scores on the circulating protein levels were all
positively correlated with that of the PGS[T2D_gw] (p-value < 0.05,
Fig. [112]2C), supporting the epidemiological observation that
cardiometabolic diseases are highly interconnected^[113]22,[114]23.
However, for a subset of proteins, the directions of the effect of the
T2D[T2D_gw] and a comorbidity PGS on circulating levels were opposite
(e.g., the effect of PGS[T2D_gw] and PGS[CKD] on MANSC4 levels, see
Supplementary Data [115]8), possibly indicating more complicated
relationships such as compensatory responses^[116]21,[117]24.
Genetic ancestry and PGS-protein associations
To examine whether our findings are robust and portable across
ancestries, we compared and contrasted PGS-protein association patterns
across five genetically predicted ancestry groups from the UKB-PPP
cohort (see Methods). As over 92% of the UKB-PPP cohort is of European
ancestry, we compared beta coefficients (rather than p-values) of the
PGS-protein associations between the different ancestry groups. The
correlations of the PGS-protein beta coefficients between genetically
predicted European and non-European ancestries were less than 1
(Pearson’s r), reflecting the fact that translatability of PGS and, by
extension, their proteomic associations, across ancestries is a
potential concern (see Supplementary Figs. [118]12–[119]15 and
Supplementary Data [120]6). However, the beta coefficients were notably
more correlated across the ancestry groups when we restricted our
comparison to the statistically significant PGS-protein associations,
suggesting that our replication strategy was effective and that
significant associations are more likely to be robust and translatable
across ancestries.
Polygenicity of PGS-protein associations and widespread effects of GCKR
A key motivation of our PGS-based approach is that it can help detect
proteomic associations that may otherwise be undetectable via
single-variant analysis (i.e., protein quantitative trait loci or pQTL
analysis). To evaluate whether the PGS-protein associations we detected
were driven by a single pQTL tagged by the PGS or represent the
cumulative, polygenic effect of variants in the PGS, we additionally
tested the PGS-protein associations by including previously reported
pQTLs as covariates in the regression model. After adjusting for both
cis and trans pQTLs (see Methods), most protein-PGS[T2D] associations
remained significant (612 out of 617; see Fig. [121]2E, Table [122]1,
Supplementary Data [123]6), demonstrating that these associations were
indeed polygenic in nature. This was largely true for all evaluated
PGS. However, for the PGS[T2D_liver_lipid] and the PGS[T2D_beta_cell],
most protein associations were not significant after pQTL adjustment
(Supplementary Fig. [124]2C and [125]4C). In the case of the
PGS[T2D_liver_lipid], the index variant at the GCKR locus (rs1260326)
explained the bulk of its association signature (280 out of 306
proteins), highlighting the pleiotropic effect of this variant on
circulating proteins. In contrast, most proteins associated with the
PGS[T2D_beta_cell] score were explained by 7 different pQTLs (four from
the ABO locus, one from the NCR3LG1 locus, and one from the FGFBP3
locus).
Causal inference using Mendelian randomisation and cis instruments
To assess if a PGS-protein association represents forward causality
(i.e., the PGS perturbs a protein’s expression level that leads to
disease) and, therefore, characterise putatively causal proteins for
T2D and its key comorbidities (BMI, CAD, and CKD), we utilised
two-sample Mendelian randomisation (MR) with statistically independent
pQTLs as instruments. We applied four conventional MR methods (IVW,
median, weighted median, and MR-Egger; see Methods) to those proteins
with sufficient genetic instruments.
Six proteins had significant causal evidence with regards to T2D
(median p-value across the four MR methods <0.05, FDR-adjusted) without
evidence of pleiotropy (MR-Egger intercept p-value > 0.05): MANSC4,
GLRX5, NUCB2, PAM, ARG1, and NCR3LG1, (see Fig. [126]3A and
Supplementary Data [127]9), with PAM and MANSC4 significant after a
more stringent Bonferroni correction (p-value < 9.2 × 10^−6). For BMI,
14 proteins were significant using cis instruments (see Supplementary
Fig. [128]16A), with NADK remaining significant after a Bonferroni
correction (p-value < 8.7 × 10^−^6). For CAD, we identified eight
proteins using cis instruments (FDR-adjusted p-value < 0.05; see
Fig. [129]3D), with FES, LPA and PCSK9 significant after a Bonferroni
correction. For CKD, we identified one protein using cis instruments,
UMOD (with UMOD significant after a Bonferroni correction; see
Supplementary Fig. [130]16D). Finally, we conducted an MR analysis for
cardiovascular and renal outcomes in UKB participants with T2D. Using
cis instruments, we identified two proteins: UMOD and CCN4 (FDR
p-value < 0.05; see Supplementary Fig. [131]17A).
Fig. 3. Two-sample Mendelian randomisation analysis in the UK Biobank.
[132]Fig. 3
[133]Open in a new tab
A Type 2 diabetes (T2D) Mendelian randomisation (MR) with cis
instruments for each protein as the exposure. B T2D MR with both cis
and trans instruments for each protein as the exposure. C Cis
colocalization using T2D and protein quantitative loci (pQTL)
genome-wide association study (GWAS) information. D Coronary artery
disease (CAD) MR with cis instruments for each protein as the exposure.
E CAD MR with both cis and trans instruments for each protein as the
exposure. F CAD colocalization using CAD and pQTL GWAS information. In
the MR plots, four conventional MR methods are displayed (simple
median, weighted median, IVW, MR-Egger), plus an additional MR method
called MR-Link-2. All proteins displayed in this figure had a median
p-value across the four conventional MR methods < 0.05 (FDR-adjusted)
and no pleiotropy as detected by MR-Egger (MR-Egger intercept
p-value > 0.05). For panels (A, B, D, and E), the points represent the
causal estimate obtained by each MR method (in the log-odds scale) and
error bars represent the 95% confidence interval calculated using the
standard error for each MR estimate. Note that MR-Link-2 estimates are
on a different scale than the other MR methods but show consistency in
the direction of effect. Finally, “*” signifies proteins with
colocalization evidence. For panels A through F, the exposure (protein)
GWAS had a sample size of 34,557 European-ancestry UK Biobank
participants, and the outcome GWAS (T2D, CAD) had a sample size of
409,048 non-overlapping European-ancestry UK Biobank participants.
We also employed a pleiotropy-robust method designed for cis molecular
data, MR-Link-2^[134]25 (see Methods). As MR with only one locus can be
underpowered and can present challenges for pleiotropy estimation, this
MR method complements our conventional MR analysis. MR-Link-2 and the
conventional MR were largely concordant, though MR-Link-2 did identify
an additional 12 proteins that were not found using the other MR
methods (see Supplementary Data [135]10). Ultimately, using cis
instruments, we found evidence of a causal link for 44 proteins, 70% of
which were associated with at least one PGS.
Mendelian randomisation with the addition of trans instruments
After adding trans instruments in our MR models, we found 4 more
proteins with significant causal evidence for T2D (LGALS4, MENT, PDIA5,
and PPM1B, FDR-adjusted p-value < 0.05), with LGAL34 significant after
Bonferroni correction (p-value < 8.7 × 10^−^6). For BMI, we identified
an additional 7 proteins (see Supplementary Fig. [136]16B), four of
which were also significant after a Bonferroni correction (APOBR,
PRSS53, GYS1, RAP1A). For CAD, six additional proteins were associated
with CAD (FDR p-value < 0.05; see Fig. [137]3E), with FURIN and LMOD1
significant after a Bonferroni correction (p-value < 8.7 × 10^−^6). No
additional proteins were added after the inclusion of trans instruments
for CKD. For the comorbidities in UK Biobank participants with T2D, we
found that SELE increased the risk of myocardial infarction.
Multivariable Mendelian randomisation with BMI and T2D
For BMI and T2D, due to their high intercorrelation (see above), we
also employed multivariable MR (MVMR)^[138]26. By using genetic
variants that impact both BMI and protein expression levels as MR
instruments, we can estimate the effects of both exposures, even if
they are related (i.e., through mediation). We identified 55 loci (cis
regions of proteins) with sufficient instrument strength and without
evidence of pleiotropy, to simultaneously evaluate both BMI and protein
exposures on T2D (see Supplementary Data [139]11). Using this approach,
we found 11 with nominal evidence (p-value < 0.05) that the BMI
exposure (3 proteins), the protein exposure (7 proteins), or both (1
protein) lead to T2D risk, with one, FAM171B, significant after a
multiple testing correction (FDR p-value < 0.05; see Supplementary
Fig. [140]17B and C). Including trans instruments allowed us to
evaluate 19 additional proteins with MVMR. Of these, 7 had nominal
evidence (p-value < 0.05) that the BMI exposure (4 proteins, including
TNF), the protein exposure (2 proteins), or both (1 protein) conferred
T2D risk, with UROD significant after an FDR correction.
Colocalization of T2D and comorbidity GWAS signals with UK Biobank pQTLs
We performed statistical colocalization to test whether pQTL and
T2D/comorbidity GWAS signals arise from the same causal variant, thus
providing additional support for our MR analyses. The cis regions of 9
proteins colocalized with T2D GWAS signals (see Fig. [141]3C), 31
proteins for BMI GWAS signals (Supplementary Fig. [142]16C), 14
proteins for CAD (Fig. [143]3F), and 3 for CKD (Supplementary
Fig. [144]16E). For trans pQTLs, 70 had evidence for
colocalization with T2D GWAS signals, corresponding to 502 proteins,
suggesting that trans effects plays a large part in the genetic risk
for complex disease as previously proposed^[145]27 (see Supplementary
Data [146]9). Similarly, 117 trans pQTLs had evidence for
colocalization with BMI GWAS signals. For the comorbidities in UKB
participants with T2D, one signal was colocalized (UMOD and CKD).
The mediation of PGS effects by circulating proteins in the UKB
An orthogonal approach to MR for inferring the causal pathway is
mediation, which tests whether a protein mediates the effect of a PGS
on incident disease risk. Mediation can be used to support MR findings,
and it can also provide information on directionality for proteins that
lack the requisite number of genetic instruments for MR. Among the UKB
participants with proteomics data, 2081 were diagnosed with T2D during
14 years of follow-up time. In our mediation analysis, we modelled the
PGS as the exposure, an individual protein as the mediator, and either
incident T2D, CKD, or CAD as the outcome. Ultimately, 520 of the 617
PGS[T2D_gw]-associated proteins significantly mediated the effect of
the PGS on incident T2D risk. After adjusting for BMI, this is reduced
to 238 proteins (Supplementary Fig. [147]18A and Supplementary
Data [148]12). We also performed mediation using the partitioned scores
(Supplementary Fig. [149]18B): 83 proteins mediated the effect of the
PGS[T2D_lipodystrophy] score (74 after BMI adjustment) and 6 mediated
the effect of the PGS[T2D_beta_cell] score (4 after BMI adjustment).
Finally, we also performed mediation analysis for the PGS[CKD] with
incident CKD and the PGS[CAD] with incident CAD, finding 470 (422 after
BMI adjustment) and 33 (12 after BMI adjustment) mediating proteins,
respectively (Supplementary Fig. [150]18A).
Reverse Mendelian randomisation with cardiometabolic traits
Next, we employed reverse MR to identify instances of reverse
causation, i.e., where the cardiometabolic trait (T2D, CAD, CKD, BMI)
alters the protein expression levels (Methods). If a PGS-associated
protein is implicated in reverse MR, this suggests that the developing
disease state affects the levels of the protein. After applying a
Bonferroni correction as in our primary PGS-protein association
analysis (p-value < 4.3 × 10^−^6 across 4 MR methods), our reverse MR
analysis suggested that circulating levels of 38 proteins were
influenced by T2D, including GDF15 (Supplementary Data [151]13), and
several other proteins strongly associated with the PGS[T2D_gw] (e.g.,
APOF, PON3, PRCP). In the case of GDF15, elevated serum levels have
been reported in T2D and it is known to play a role in regulating food
intake and metabolism^[152]28. However, none of the proteins identified
in the forward MR analysis for T2D were implicated in our reverse MR.
It is worth noting that proteins identified via reverse MR could still
be causal for other comorbidities or influence T2D risk via feedback
mechanisms.
For the other cardiometabolic disorders, we found 3 proteins influenced
by CAD risk (MMP12, CNTN4, LGALS4), 8 by CKD risk, and 539 by BMI
(Supplementary Data [153]13), using reverse MR analysis. The high
number of proteins influenced by BMI genetic risk indicates that the
levels of many circulating proteins are impacted by adiposity levels.
For CKD and CAD, none of the proteins we identified in our forward MR
were associated in the reverse MR, though this was the case for 7 of
the 21 BMI-associated proteins (CXCL16, LGALS3, GALNT10, PSCA, IL12RB2,
ITGAL, SERPINA7). However, when we overlaid our mediation results with
the reverse MR results, a proportion of mediating proteins had evidence
for reverse causality, ranging from 2% of proteins for CKD and PGS[CKD]
to 7% for the PGS[T2D_gw].
Time-to-event analyses of PGS-associated proteins in EXSCEL and DECLARE
As PGS-associated proteins could have clinical relevance as biomarkers
for diabetic comorbidities and, in some cases, could be causally
linked, we tested their association with cardiovascular and renal
clinical trial endpoints in the placebo arms of EXSCEL and DECLARE-TIMI
58 (DECLARE).
In EXSCEL, the baseline levels of 241 proteins were significantly
associated with time to a major adverse cardiovascular event (MACE),
time to hospitalisation for heart failure (HHF), or time to the renal
outcome (see Methods). Of these 241 proteins, 50% were significantly
associated with the PGS[T2D_gw], 72% were significantly associated with
the PGS[CKD], and 9% were significantly associated with the PGS[CAD] in
the UKB PGS analysis. After adjusting for clinical risk factors, 118
proteins remained significant (see Fig. [154]4A and Supplementary
Data [155]14). Notably, 3 proteins were also independent of NT-proBNP
for time to HHF (MARCO, EGFR, and EFEMP1), as well as 20 proteins for
MACE. In DECLARE, 64 proteins associated with an outcome in EXSCEL were
available for replication, of which 20 were significantly replicated
using the corresponding outcome in DECLARE (Fig. [156]4B and
Supplementary Data [157]15). When adjusting for clinical risk factors,
6 proteins were significantly associated with either MACE or HHF
(FDR-adjusted p-value < 0.05), while for the time to DECLARE’s
composite renal outcome, IGFBP6 was nominally significant
(p-value = 0.03; Fig. [158]4C–E). EFEMP1, SPON1, and CST3 remained
significantly associated with the HHF endpoint after adjusting for both
clinical risk factors and NT-proBNP.
Fig. 4. Association of proteins with clinical trial outcomes and survival
analyses.
[159]Fig. 4
[160]Open in a new tab
A Summary of results in EXSCEL, with the x-axis corresponding to the
three different models used (see Methods) and the y-axis corresponding
to the number of proteins significant for each of the three outcomes
(Bonferroni p-value < 0.05). B Summary of the replication results in
DECLARE, with the x-axis corresponding to the three different models
used and the y-axis corresponding to the number of proteins significant
for each of the three outcomes (FDR p-value < 0.05). C Results from the
survival analysis of the study-specific renal outcome in the placebo
arms in EXSCEL and DECLARE. All displayed proteins replicated in
DECLARE for the base model (with age, sex, age^2, age*sex, and genetic
PCs 1–10 as covariates). D Results from the survival analysis of the
major adverse cardiovascular event (MACE) outcome in the placebo arms
in EXSCEL and DECLARE. All displayed proteins replicated in DECLARE for
the base model. E Results from the survival analysis of the
hospitalisation for heart failure (HHF) outcome in the placebo arms in
EXSCEL and DECLARE. For panels (C–E), the points represent the hazard
ratios and error bars represent the 95% confidence interval obtained
using the standard error for each hazard ratio. Note that all displayed
proteins replicated in DECLARE for the base model. In panels (C–E),
EXSCEL had a sample size of 1407 study participants from the placebo
arm with available proteomics information, while DECLARE had a sample
size of 497 study participants from the placebo arm with available
proteomics information.
Both trials featured repeat measurements with proteins assayed at
baseline and a second timepoint (12 months for EXSCEL, 6 months for
DECLARE). We evaluated proteins using the same procedure as with the
baseline measurements (see Supplementary Fig. [161]19). In general,
while the results were similar between time points, hazard ratios were
systemically larger for the renal outcome in EXSCEL and the composite
renal outcome in DECLARE (see Supplementary Fig. [162]20A and B).
Additional proteins were significant for MACE, HHF, and the
trial-specific respective renal outcome (e.g., TFF3) at the second
timepoint. For the renal outcome in EXSCEL and the composite renal
outcome in EXSCEL, this trend is partially ameliorated through the
adjustment of baseline measurements (see Supplementary Fig. [163]20G
and H). Overall, larger hazard ratios at the second time point could be
indicative of cumulative exposure to a causal protein or reverse
causality.
PGS mediation in EXSCEL and DECLARE
Building on our endpoint analysis in the clinical trials, we sought to
identify which protein biomarkers had evidence for causality. First, we
evaluated whether any PGS were associated with clinical outcomes and,
if so, whether they were associated with any circulating proteins in
the trials (see Supplementary Data [164]16–[165]18). We then performed
mediation when both conditions were satisfied. In EXSCEL, we modelled
PGS[CAD] as the exposure, PGS[CAD]-associated proteins as the mediator,
and MACE as the outcome. We found evidence that the PGS[CAD] mediates
its effect through the circulating protein levels of C9, LBP, ITIH4,
APOM, and HS6ST2 (Supplementary Fig. [166]21A and Supplementary
Data [167]19). In DECLARE, the PGS[BMI] and PGS[CAD] were significantly
associated with HHF and MACE, respectively. In the DECLARE mediation
analysis, we did not identify any proteins that mediated the PGS[CAD]’s
effect. However, for the PGS[BMI] and HHF, we found evidence for 63
proteins mediating its effect (FDR-adjusted p-value < 0.05; see
Supplementary Fig. [168]21B and Supplementary Data [169]20). When
including BMI in the models, 17 proteins still appeared to mediate the
PGS[BMI] effect, though at a nominal significance level (p < 0.05;
Supplementary Fig. [170]21C and Supplementary Data [171]20).
Notable Pathways enriched across multiple PGS-protein sets
Next, to identify pathways underlying polygenic risk for
cardiometabolic disorders, we performed pathway enrichment analysis of
the PGS-associated proteins. The PGS[T2D_gw] protein set was
significantly enriched for 12 pathways after p-value adjustment (see
Methods), of which 3 were also enriched in the PGS[CAD] protein set
(Supplementary Fig. [172]22). Pathways that are significantly enriched
in both the PGS[T2D_gw] and PGS[CAD] protein sets may reflect shared
causal mechanisms, such as those involving the complement system
(Supplementary Figs. [173]23). Proteins in the complement and
coagulation cascades pathway were identified as causal for
cardiovascular disease in MR (C1R and C1S) and mediation (18
complement-related proteins mediated the effect of PGS[T2D_gw] and 4
complement-related proteins mediated the effect of the PGS[CAD]). In
the clinical trials, C9 mediated the effect of the PGS[CAD] on the MACE
outcome. In terms of biomarkers, C2 and CD59 from the complement
cascade and PLAUR from the coagulation cascade were associated with the
time to cardiovascular outcomes in EXSCEL and DECLARE (Supplementary
Fig. [174]23B and D).
While the insulin-like growth factor binding proteins (IGFBPs) pathway
was not enriched in the PGS[CAD] or the PGS[CKD] associated protein
sets, it was enriched in the PGS[T2D_gw], PGS[BMI], and the
PGS[T2D_liver_lipid] protein association sets. In addition, its
constituent proteins were identified in many of our analyses for both
T2D and its comorbidities. IGFBP2 was among the most significant
protein associations for the PGS[T2D_gw], while IGFBP4 and IGFBP6 were
strongly associated with the PGS[CKD] (Fig. [175]5A). While causal
effects for IGFBPs on the tested diseases were not supported in our MR
analysis, IGFBP2 and IGFBP6 were implicated with T2D and CKD,
respectively, using mediation. In the clinical trials, many IGF-related
proteins were significantly associated with both cardiovascular and
renal outcomes (Fig. [176]5B and C).
Fig. 5. IGF regulation by IGFBPs pathway.
[177]Fig. 5
[178]Open in a new tab
A PGS associations with IGF binding proteins. A single asterisk (*)
indicates the association was nominally significant, while two (**)
indicates significance using FDR and three (***) indicates significance
using a Bonferroni correction from linear regression of circulating
protein levels in the UK Biobank. B Associations of proteins in this
pathway with clinical trial outcomes in EXSCEL using Cox proportional
hazards regression, adjusting for demographic covariates and clinical
risk factors (see Methods). C Associations of proteins in this pathway
with clinical trial outcomes in DECLARE using Cox proportional hazards
regression, adjusting for demographic covariates and clinical risk
factors (see Methods). For panels B and C, the dashed line corresponds
the p-value threshold where FDR < 5% (when applied to the proteins in
this pathway). Note that panels (B and C) display unadjusted, two-sided
p-values obtained from Cox proportional hazards regression.
Discussion
In this study, we defined the intersection of polygenic risk for T2D,
its cardiometabolic comorbidities, and the plasma proteome by
leveraging proteogenomic data across a population-scale biobank and two
RCTs. As variants identified through T2D GWAS are often intergenic with
small effect sizes, understanding their functional consequences can be
challenging. By aggregating GWAS variants into PGS (including T2D,
pathway-specific, and comorbidity PGS) and testing their association
with large-scale proteomic data, our study provides an enhanced
understanding of proteomic signature of T2D as well as causal proteins
and pathways that may mediate T2D risk. Furthermore, our PGS-directed
approach has the added benefit of evaluating more proteins for
causality than analyses such as Mendelian randomisation (MR) that rely
on individual variants (only 69% of the Olink-assayed proteins have
sufficient pQTLs to perform MR). Our study has potential implications
in highlighting biomarkers of T2D comorbidities, providing therapeutic
targets for T2D, and uncovering the biological underpinning of
cardiometabolic diseases.
Our study yielded insights into the proteomic consequences of polygenic
risk for T2D and its comorbidities. The PGS[T2D_gwas], despite its
derivation from the same summary statistics as the PGS[T2D_gw], was
strongly associated with proteins identified by single locus methods,
such as PAM, while the PGS[T2D_gw]’s associations appeared to be more
polygenic (Fig. [179]2). Given the causal links between adiposity and
T2D, nearly all of the PGS[T2D_gw] protein associations are at least
partially mediated by BMI, though the contribution of BMI can range
from relatively small (e.g., 7% of PGS[T2D_gw]’s effect on MANSC4) to
relatively large (e.g., 90% of the PGS[T2D_gw]’s effect on LEP).
Notably, the partitioned T2D scores, apart from PGS[T2D_obesity], were
less influenced by adiposity (Supplementary Figs. [180]2–[181]6). The
partitioned T2D scores also seemed to capture unique aspects of T2D
biology (189 proteins were associated with the partitioned scores but
not the overall T2D score, i.e., PGS[T2D_gw]), thus, serving as proof
of concept for the development of such scores. For example, the effect
of the PGS[T2D_beta_cell] score on incident T2D was mediated by FAM3D,
a causal relationship not detected by the overall PGS[T2D_gw]. FAM3D is
thought to be involved in glucose regulation^[182]29. Further, we found
the beta coefficients of PGS[T2D_liver_lipid] and the PGS[T2D_gw] on
the plasma proteome were frequently in the opposite direction, even
though the scores themselves are positively correlated (Fig. [183]2 and
Supplementary Fig. [184]7). This is consistent with the known
discordant effects of PGS[T2D_liver_lipid] and the PGS[T2D_gw] on
clinical biomarkers such as plasma triglycerides^[185]13. It seems
plausible that the GCKR locus (index variant: rs1260236), captured by
PGS[T2D_liver_lipid], drives a specific subtype of T2D with unique
phenotypic consequences^[186]13,[187]15.
To identify PGS-protein associations that are attributable to forward
causation, and discover potential therapeutic targets, we performed
causal inference using Mendelian randomisation (MR) and mediation for
the relevant clinical endpoints (i.e., T2D, BMI, and the T2D
comorbidities CAD and CKD). For T2D, we found evidence using both MR
and cis colocalization for 5 proteins, although two of them, ABO and
APOE, likely harbour pleiotropic effects. PAM’s involvement in the
etiology of T2D has been described in previous studies^[188]30. Less is
known about MANSC4 and NCR3LG1, but we note that MANSC4 has previously
been identified via MR^[189]31. Among MR-significant proteins without
colocalization evidence are NUCB2, ARG1, and LGALS4, all of which
either play a role in glucose regulation^[190]32, insulin
resistance^[191]33 or have been previously linked to
diabetes^[192]34,[193]35. For CAD, we identified 7 proteins with
significant MR and cis colocalization evidence (see Fig. [194]3). Among
these, LPA and PSCK9’s role in the etiology of cardiovascular disease
is well-established^[195]36–[196]38, while evidence in the literature
for FURIN and FES continues to grow^[197]39–[198]41. Notably, ERBB4,
with significant MR estimates and cis colocalization for BMI (IVW:
− 0.26, 95% CI: − 0.378 to − 0.138), has previously been linked to
metabolic disorders and obesity^[199]42–[200]44. With an alternative
method, MR-Link-2, we identified additional proteins, such as FGF5 with
CAD, and confirmed many of the associations we found with conventional
MR. However, this approach also found more pleiotropic effects than our
primary MR analysis (see Supplementary Data [201]9 and [202]10). As
MR-Link-2 is designed to be pleiotropy-robust, we did not filter
potentially pleiotropic instruments in advance as we did for the other
MR analyses. In the case of some loci (such as LPA), the observed
pleiotropic effect could be caused by statistically independent pQTLs
in linkage disequilibrium^[203]25.
Of the 56 proteins identified via MR, 40 were significantly associated
with at least one PGS. For the three traits evaluated via mediation
(CKD, CAD, T2D), 49% of MR-identified proteins also significantly
mediated the effect of a PGS on incident disease. Proteins with causal
evidence based on MR, promising as biomarkers for cardiovascular/renal
comorbidities, or among the top 5 (by variance explained) of mediating
proteins for each score are summarised in Supplementary Data [204]21,
along with their druggability and gene expression information obtained
from DrugnomAI^[205]45,[206]46.
We aimed to identify putative causal proteins and pathways that link
T2D with its cardiovascular and renal comorbidities. To this end, we
performed MR in patients with T2D as well as survival analysis and
mediation analyses in EXSCEL and DECLARE-TIMI 58. With MR, we found
that CCN4 (WISP1) and SELE conveyed risk for the development of
cardiovascular outcomes while UMOD conveyed risk for renal
comorbidities. UMOD’s role in CKD has previously been supported by
MR^[207]47, while CCN4 and SELE (E-selectin) are thought to be involved
in T2D progression^[208]48 and vascular inflammation^[209]49,
respectively. Several proteins implicated in our survival analysis,
such as TFF3 and EFEMP1, could potentially explain mechanisms
underlying the development of comorbidities^[210]50,[211]51. We found 6
proteins that mediated the PGS[CAD]’s effect on MACE in EXSCEL (C9,
LBP, ITIH4, APOM, CES1, and HS6ST2), providing supporting evidence that
they might serve as biomarkers for cardiovascular disease in patients
with T2D^[212]52–[213]54.
Pathway enrichment of PGS-associated proteins can inform whether the
proteins act collectively via certain biological pathways and can help
guide therapeutic development (Supplementary Fig. [214]22). Pathways
shared across multiple cardiometabolic PGS-protein sets could also
reveal mechanisms relevant to the development of these comorbidities.
Pathways involving the complement system, notably, were enriched in
both the PGS[CAD] and PGS[T2D_gw] protein sets. For example, C1R and
C1S, which activate the C1 complex in the classical complement
pathway^[215]55, were significantly associated with the PGS[T2D_gw] in
addition to their MR and colocalization with CAD. Therapies targeting
the complement system have been developed or proposed for a wide range
of diseases, including inflammatory kidney disorders and cardiovascular
disease^[216]56,[217]57. The insulin-like growth factor binding
proteins (IGFBPs) pathway was enriched in both the PGS[T2D_gw] and
PGS[BMI] protein sets. IGFBP2 was a biomarker for both cardiovascular
and renal outcomes in EXSCEL and DECLARE-TIMI 58. Interestingly, lower
levels of IGFBP2 are associated with incident T2D while higher levels
are associated with incident CKD (see Fig. [218]5A), as has previously
been observed^[219]58,[220]59. While the exact physiological role of
IGFBPs in the pathogenesis of cardiometabolic disease requires further
investigation, we provided evidence of a shared role in both adiposity
traits and cardiometabolic disease^[221]60–[222]62.
Our study has limitations. First, the trans-ancestry portability
problem is commonly observed in biomedical research, including with
PGS, though we sought to mitigate this by employing multi-ancestry PGS
and trans-ancestry analyses. Second, since our study included UKB data,
we selected scores that did not make use of the UKB (e.g., relatively
older and smaller GWAS) to avoid overfitting, which may attenuate our
theoretical statistical power. Third, despite excluding subjects with
prevalent cardiometabolic diagnoses, the PGS-related analyses were not
immune to reverse causality as several proteins strongly associated
with the PGS[T2D_gw] were implicated in our reverse MR analysis. It is
also possible that we are capturing instances of feedback mechanisms.
Fourth, defining phenotypes in the UKB via electronic health records
could have an impact on analyses due to potential issues such as
misclassification error^[223]63. Fifth, regression analyses including
both T2D and BMI could exhibit collider bias as T2D and adiposity are
causally linked, and as such, we do not consider any model that is
significant after BMI adjustment when it is not significant without the
BMI adjustment. Collider bias could also impact the analyses in the
clinical trials as all participants were selected on a particular
characteristic^[224]64 (i.e., T2D with cardiovascular risk factors);
however, adjusting for the risk factors as we have done could address
this. Sixth, this study used data derived from circulating proteins,
which may not necessarily represent the causative tissue for the
diseases under examination. Seventh, while the cardiovascular outcomes
in EXSCEL and DECLARE are aligned, the renal endpoints differ between
the two trials (see Methods). This, in addition to differing
eligibility criteria, could contribute to the effect size differences
we observed in the survival analyses for the renal outcomes. Finally,
the PGS, and by extension the mediation analyses using the PGS, could
be impacted by horizontal pleiotropy as the PGS by design is not
limited to variation reflecting a single exposure^[225]9. However, the
partitioned scores help reduce this to a single mechanism. Our
mediation sensitivity analyses likely reflect this as many of our
mediation models were not robust to potential mediator-mediator
confounding (see Methods, Supplementary Data [226]12). The same
limitation applies to MR analyses of complex traits and protein levels
using trans pQTLs, though we have sought to minimise this confounding
by testing for pleiotropy in our analytical framework.
Overall, we leveraged data from both a population-based setting and
clinical trials to elucidate the proteomic signatures of polygenic risk
for T2D and its comorbidities; provide causal evidence for the
associated proteins; identify proteins and biological pathways that
connect T2D with its comorbidities; and provide evidence for existing
therapeutic and potentially new target opportunities. We also developed
an interactive portal that allows users to interrogate and download the
results of our analyses
([227]https://public.cgr.astrazeneca.com/t2d-pgs/v1/).
Methods
Cohort description
The UK Biobank (UKB) is a deeply phenotyped population-based cohort
comprised of approximately 500,000 subjects with array genotyping,
exome/whole genome sequencing data and linkage to electronic health
care record data with over 14 years of follow-up time^[228]18. The UK
Biobank Pharma Proteomics Project (UKB-PPP) is a private-public
partnership that assayed 2923 unique proteins (2922 after excluding one
protein without sufficient measurements) in a subset of 54,306 UKB
participants using the Olink Explore proteomics
platform^[229]16,[230]17. The UK Biobank operates under approval from
the North West Multi-centre Research Ethics Committee (MREC). All UKB
participants provided informed consent.
Exenatide Study of Cardiovascular Event Lowering (EXSCEL) examined the
cardiovascular effects of once-weekly exenatide, a glucagon-like
peptide-1 (GLP-1) agonist, in T2D patients with a median follow-up time
of 3.2 years^[231]19. In a subset of trial participants (N = 2823),
both genotyping and SomaScan proteomics data were generated (see
Supplementary Data [232]1). All participants provided written informed
consent, and the trial protocol was approved by ethics committees at
each of the trial’s participating sites. EXSCEL can be found on
ClinicalTrials.gov ([233]NCT01144338).
Dapagliflozin Effect on Cardiovascular Events–Thrombolysis in
Myocardial Infarction 58 (DECLARE-TIMI 58) was a phase 3 RCT that
examined the cardiovascular effect of dapagliflozin, an inhibitor of
sodium-glucose co-transporter-2 (SGLT2), in patients with T2D with
multiple risk factors for or established atherosclerotic cardiovascular
disease with a median follow-up time of 4.2 years^[234]20. Similar to
EXSCEL, for a subset of participants both genotyping and Olink
proteomics data were available (N = 915). All participants provided
written informed consent, and the trial protocol was approved by the
institutional review board at each of the trial’s participating sites.
DECLARE-TIMI 58 can be found on ClinicalTrials.gov ([235]NCT01730534).
See Extended Methods in Supplementary Information for a description of
genotyping and proteomics quality control. Supplementary Fig. [236]1
shows the proteomic intersection between the three cohorts with
proteomics data. We note that a previous study has found that SomaScan
and Olink measurements to be moderately correlated^[237]65.
UK Biobank phenotype definitions
For the UKB, we used ICD10 codes and clinically meaningful “Union”
phenotypes constructed by merging relevant ICD10 codes (release from
Feb 2022) as detailed in Wang et al. ^[238]66. For identifying
prevalent cardiometabolic conditions that needed to be excluded from
analyses, we used the following ICD10 codes: E10-E14 (any diabetes
diagnosis), N18 (CKD), and I20-I25 (ischaemic heart disease, also known
as CAD). Incident cases were defined using ICD10 codes (E11 for T2D,
N18 for CKD, I25 for CAD) when the earliest date of diagnosis
(determined by fields 41270, 40001, 40002) occurred after baseline
(when a sample was donated for proteomics). Note that relatively few
cases were diagnosed within 30 days of sample collection (Supplementary
Data [239]2).
Randomised controlled trials (RCTs) phenotype definitions
We used three trial outcomes that were available in both trials, i.e.,
time to the composite cardiovascular outcome (major cardiovascular
events or MACE, comprising of cardiovascular death, nonfatal myocardial
infarction, or non-fatal ischaemic stroke), time to hospitalisation for
heart failure (HHF), time to the renal outcome (for EXSCEL, two
consecutive measurements of eGFR < 30 ml/min/1.73 m^2 and for
DECLARE-TIMI 58, a composite renal outcome comprising of a sustained
decrease of 40% or more in eGFR to < 60 ml/min/1.73 m^2, new end-stage
renal disease, or death from renal causes). Note that the competing
risk of death was addressed through censoring and the use of Cox
regression^[240]67. DECLARE-TIMI 58 cardiovascular endpoints were
adjudicated by independent adjudication committees. See Supplementary
Data [241]3 for a description of phenotypes.
Polygenic score estimation
To avoid overfitting when estimating the PGS in the UKB, we retrieved
genome-wide association study (GWAS) summary statistics from external
studies^[242]43,[243]68–[244]73 that did not contain UKB participants
(Supplementary Data [245]4), maximising both sample size and diversity.
We trained genome-wide T2D, CAD, BMI, and CKD PGS using PRS-CS^[246]74
(when only one GWAS per trait was available) or PRS-CSx^[247]75 with
these GWAS summary statistics. We ran PRS-CSx and PRS-CS with phi set
to ‘auto’. For the partitioned T2D polygenic scores (pPS), we obtained
the variants and cluster weights generated by Udler et al.
corresponding to five distinct genetic clusters, i.e., beta cell,
lipodystrophy, liver lipids, obesity, and proinsulin^[248]13. To enable
comparisons between genome-wide and GWAS-significant PGS, we also
generated a set of GWAS-significant T2D summary statistics using the
clump procedure implemented in PLINK v1.9 (p-value < 5 × 10^−8,
R^2 < 0.1, 250 kilobase window around each index variant). All PGS for
UKB, UKB-PPP, EXSCEL, and DECLARE-TIMI 58 were estimated using post-QC
imputed data and PLINK v2.00a4LM^[249]76 (see Supplementary
Data [250]5).
PGS validation
To validate the PGS, we tested their associations with traits defined
using ICD10 codes and/or biomarkers (Supplementary Data [251]2) in all
unrelated UKB-PPP participants (resolved to the 2^nd degree using KING
2.3.0^[252]77) and using R and adjusting for age, age^2, sex, age*sex,
age^2*sex, UKB centre, array, and PCs 1-20. We also used KING 2.3.0,
with the 1000 Genomes Project (1KGP), as a reference, to predict
genetic ancestry. KING first performs principal component analysis on
the 1KGP cohort, followed by projecting the UKB-PPP samples into the
1KGP PC space. Then, it uses a support vector machine and the 1KGP
super-population labels to train a model and predict the ancestry of
the target cohort. We tested the PGS for association with its target
trait in the genetically predicted subsets (European, South Asian, East
Asian, African, and Admixed American, all with a posterior
probability > 0.9) and in the entire multi-ancestry cohort. For the
CAD, BMI, and T2D, PRS-CSx inferred three sets of PGS (a “European”
PGS, an “East Asian” PGS, and a meta-analysis of the two). In these
cases, we retained the PGS with the strongest association with CAD,
BMI, and T2D in the full cohort analysis.
PGS and protein associations in UK Biobank
For testing PGS for association with protein expression levels, we
modelled our analysis on the same internal replication structure as
used in the UKB-PPP consortium pQTL GWAS. First, we restricted the
analysis to unrelated participants (resolved to the 2^nd degree)
without a baseline diagnosis of diabetes or a relevant cardiometabolic
condition (any Diabetes diagnosis, CAD, CKD) at data collection (UKB’s
baseline timepoint; N = 44,381) to reduce confounding due to reverse
causality as previously suggested by Ritchie et al. ^[253]9. Then, we
stratified the cohort into the consortium-identified discovery subset
consisting of European-ancestry participants (N = 29,496) and the
replication subset of remaining pan-ancestry participants (N = 14,885).
We tested each PGS for association with protein expression levels in
the discovery subset using linear regression in R, using the same
covariates (age, age^2, sex, age*sex, age^2*sex, batch, UKB centre,
array, time to analysis, genetic PCs 1-20) as the UKB-PPP consortium
pQTL GWAS^[254]16. Significant PGS-protein associations after a
Bonferroni correction accounting for 10 scores and 2922 proteins
(p-value < 1.7 × 10^-6) were then moved forward to be tested in the
replication subset. To account for the difference in sample sizes in
the two subsets, we applied an FDR correction to our replication
analysis separately to each PGS^[255]9. Associations were considered
internally replicated with an FDR-adjusted p-value < 0.05.
To assess whether a PGS-protein association was driven by a single
locus, we repeated the PGS-protein association analyses after adjusting
for independent cis and trans pQTLs obtained from the UKB-PPP
consortium’s pQTL GWAS. Similarly, we repeated PGS-protein association
analyses after adjusting for BMI, waist-hip ratio, and waist
circumference to describe the influence of adiposity on the PGS
associations. Finally, to explore the impact of genetic ancestry on the
transferability of PGS-protein associations, we tested PGS-protein
associations after stratifying the UKB-PPP cohort by the KING-predicted
ancestry labels described above. We then compared and contrasted
PGS-protein association patterns in the genetically predicted European
ancestry subset with that of the other four predicted
super-populations. As the UKB-PPP cohort is over 92% European ancestry,
we used beta coefficients rather than p-values and compared the beta
coefficients of the PGS on circulating protein levels using Pearson’s r
and the slope of the regression line fitted to the beta coefficients.
Mediation of PGS-protein associations by adiposity
For the PGS[T2D_gw]-protein associations that were no longer
significant, we performed mediation analysis to investigate if BMI
mediated the effect of the PGS on the circulating protein levels. We
used the 'mediate' function from mediation R package^[256]78 (4.5.0) to
perform mediation with the PGS as the exposure, BMI as the mediator,
and the circulating protein levels as the outcome. We performed a
sensitivity analysis using the 'medsens' function from the same R
package.
GWAS and pQTL summary statistics for Mendelian randomisation
We leveraged statistically independent pQTLs identifed by the UKB-PPP
consortium from the European ancestry 35K discovery subset to use as
instruments for our two-sample Mendelian randomisation (MR) analysis.
The UKB-PPP pQTL GWAS^[257]16 used a minor allele count cut-off of 50,
a minimum INFO score of 0.7, and a study-wide significant threshold
p-value < 3.4 × 10^−11. The pQTLs were defined as cis if they were
within +/− 1MB of a gene’s transcription start site. For this same set
of pQTLs, we also extracted summary statistics using the pan-ancestry
combined UKB-PPP cohort for our two-sample trans-ancestry MR analysis.
We then generated trait-level GWAS summary statistics using the UKB
after excluding the UKB-PPP participants to avoid sample overlap.
Individuals were retained for the GWAS if they passed an internal set
of quality control criteria including sex concordance, heterozygosity,
ploidy, were assigned a continental PEDDY-predicted^[258]79 ancestry
with a probability ≥ 0.90, and were not included in the UKB-PPP study.
For the generation of summary statistics for Mendelian randomisation,
we used E11 for T2D, N18 for CKD, the union term of 120-I25 for CAD,
and body mass index (BMI) measured at baseline. For traits in UKB
patients with T2D, we used the ICD10 codes in Supplementary
Data [259]2, which contains a complete description of UKB phenotype
definitions and cohort sample size (pre-QC). We utilised REGENIE for
the GWAS, which employs a two-step approach as previously
described^[260]80. For the first step, we used quality-controlled
genotyped variants (MAF > 1%, genotyping rate >99%, HWE
p-value > 10^−15, < 10% missingness and LD pruning using 1000 variant
windows, 100 sliding windows and r^2 < 0.8), while for the second step
we used imputed variants with a MAC > 50 and an INFO score > 0.7. For
traits assessed in the full cohort (T2D, BMI, CKD, CAD), we performed
each GWAS within each of the predicted continental ancestries,
adjusting for age, sex, and genetic PCs 1-10. For traits assessed in
UKB participants with T2D, we performed each GWAS within the
predicted European-ancestry subset, adjusting for age, sex, and genetic
PCs 1-10.
2-sample Mendelian randomisation (MR)
We performed a two-sample Mendelian randomisation (MR) analysis to
conduct a proteome-wide scan for proteins suspected to play a potential
causal role in the development of T2D or T2D comorbidity. For our
primary MR analysis, we utilised the European ancestry GWAS data. To
identify weak instruments, we calculated the F-statistic^[261]81 for
each instrument (F-statistic =
[MATH:
β2/Se2
:MATH]
) and removed instruments with an F-statistic < 10. We also removed
instruments that were pQTLs for 5 or more proteins to reduce
pleiotropy. We performed MR on all proteins with 3 or more cis pQTLs
using the MendelianRandomization^[262]82 R package (version 0.7.0) and
applying the simple and weighted median^[263]83, IVW^[264]84, and
MR-Egger^[265]85 methods. We then repeated the analysis using both cis
and trans instruments. We addressed violations of the pleiotropy
assumption by flagging results with an MR-Egger intercept
p-value < 0.05^[266]9. We then took the median p-value across all MR
methods to identify significant associations. For traits available in
the entire UKB cohort, we reported both Bonferroni and FDR levels of
significance, while for traits in patients with T2D (ICD10 code E11),
we only applied an FDR correction due to the smaller sample sizes.
Multi-variable Mendelian randomisation
As T2D and adiposity are closely intertwined, we performed
multivariable MR (MVMR) using the MVMR R package^[267]26, BMI GWAS data
from the FinnGen cohort (r11)^[268]86 and pQTL data from the UKB-PPP
discovery cohort (European ancestry) as exposures, and T2D GWAS data
from the European-ancestry subset of the UKB (excluding UKB-PPP
participants) as the outcome. The FinnGen study is a large-scale
genomics initiative that has analyzed over 500,000 Finnish biobank
samples and correlated genetic variation with health data to understand
disease mechanisms and predispositions. The project is a collaboration
between research organisations and biobanks within Finland and
international industry partners. MVMR reports both instrument strength
and pleiotropy estimations, and we retained analyses with the
conditional F-statistic > 10 for both exposures and no evidence of
pleiotropy (Q-statistic p-value > 0.05). Like our primary MR analysis,
we repeated MVMR using cis only and cis plus trans instruments. We then
applied an FDR correction on the results.
Mendelian randomisation using cis molecular data and MR-Link-2
We also performed cis MR analysis using MR-Link-2^[269]25, a
methodology developed explicitly for MR using GWAS information obtained
cis molecular data. We again use the pQTL GWAS data from the UKB-PPP as
the exposure and the UKB GWAS data as the outcome, though in this case
taking the entire cis region as input. MR-Link-2 requires genotype data
for LD, and for this, we used the UKB-PPP participants from the GWAS
data (the European ancestry discovery cohort).
Mendelian randomisation sensitivity analyses including statistical
colocalization
For sensitivity analyses, we repeated MR analyses using pan-ancestry
summary statistics from the full UKB-PPP cohort and the meta-analysed
trans-ancestry GWAS MR analyses (via a fixed-effected meta-analysis).
We also conducted a colocalization analysis for any variant with GWAS
p-value < 1 × 10^−^6 for both the trait and the protein using the coloc
package^[270]87 (version 5.1.0.1) in R 4.2.2 on a +/− 250 kilobase
window centred on the variant. We used the summary statistics-based
method (coloc.abf) with the default priors (p1: 10^−4, p2: 10^-4, p12:
10^-5) and considered the GWAS variant and the pQTL to have strong
colocalization evidence following recommended criteria
(PP.H3 + PP.H4 > 0.99 and PP.H4/PP.H3 > 5)^[271]9,[272]87,[273]88. For
loci located in the cis region of an Olink-assayed protein, we also
used the coloc.susie^[274]89 method with LD matrices computed with the
imputed UKB-PPP data and PLINK v1.90b6.18^[275]76,[276]90.
Mediation analyses of PGS, circulating proteins, and incident disease in UKB
In this framework, we set the PGS as the exposure, the protein as the
mediator, and incident disease as the outcome. We utilised incident
cases (prevalent cases were excluded), the same set of covariates as
the UKB PGS-protein association analysis (see above), and the Medflex R
package^[277]9,[278]91 (version 0.6-10) to perform mediation analysis
with natural effects models. We considered a mediation model to be
significant if the mediation (indirect) p-value and the total p-value
were both significant after a Bonferroni correction
(p-value < 1.4 × 10^−6); the direct effect p-value was allowed to be
not significant as it is possible that the PGS’s effect is primarily
mediated through the tested protein. As a sensitivity analysis, we
performed mediation for both all participants and only European
ancestry; we filtered out proteins that were not significant in both
scenarios.
To evaluate robustness to potential mediator-mediator confounding, we
employed the mediation R package (4.5.0)^[279]78. First, we again
performed mediation with the PGS as exposure, the protein as a
mediator, and the incident disease as the outcome using the 'mediate'
function. Then, we performed the sensitivity analysis using the
'medsens' function, which performs a simulation-based sensitivity
analysis. As the PGS is fixed at birth, it is assumed that there will
not be any PGS-mediator confounding.
Reverse Mendelian randomisation
We obtained MR instruments for reverse MR from the European-ancestry
T2D, CKD, BMI, and CAD GWAS data that we generated for our forward MR
analysis, plus an additional model for CAD using only the ICD10 code
I25. We performed LD pruning using PLINK v1.90b6.18’s clumping
procedure with a 500 KB window size, a minimum p-value threshold of
5 × 10^−^8, and R^2 < 0.001 using UKB-PPP as the LD reference. We
matched each instrument with variants from the pQTL GWAS data. We then
performed reverse MR using the MendelianRandomization R package in the
same manner as described above.
Association and mediation of proteins with trial outcomes in EXSCEL and
DECLARE-TIMI 58
We tested proteins for their association with the time to EXSCEL
outcomes with the survival package in R 3.6.1
([280]https://github.com/therneau/survival) and replicated significant
associations in DECLARE-TIMI 58. In both cases, we restricted analyses
to the placebo arm. We used Cox proportional hazards regression and
adjusted for age, sex, age*sex, and genetic PCs 1-10. For both EXSCEL
and DECLARE-TIMI 58, two time points were available for proteomics:
baseline plus 12 months for EXSCEL, and baseline plus 6 months for
DECLARE. In both studies, we performed the proportional hazards
regression analysis three times, using the baseline measurements, the
repeat measurement, and the repeat measurement with baseline included
as an additional covariate, respectively, as exposures. For EXSCEL,
each SomaLogic aptamer was tested separately and reported. In the case
of statistically significant associations in the clinical trial
time-to-event analyses, (FDR p-value < 0.05 as well as a more stringent
Bonferroni-corrected p-value in EXSCEL), we assessed the proteins using
more comprehensive models that included clinical risk factors to
evaluate their suitability as a biomarker (see “Description of risk
factors” section below). For all models, we tested for the proportional
hazards assumption using the cox.zph() function and flagged any model
with a global p-value < 0.05 as violating model assumptions.
Description of risk factors included in clinical trial time-to-event analyses
In the case of statistically significant associations in the clinical
trial time to event analyses, we assessed the proteins using more
comprehensive models to evaluate their suitability as a biomarker.
These models included both the above covariates along with additional
risk factors specific to the outcome of interest. For heart failure
outcomes, we also adjusted for coronary artery disease, atrial
fibrillation, baseline eGFR, baseline UACR (diagnosis of albuminuria
for EXSCEL), and prior heart failure^[281]92. For MACE, we also
adjusted for hypertension, hypercholesterolaemia, BMI at baseline,
smoking (current, or smoking cessation ≤3 months), and atherosclerotic
disease (prior myocardial infarction, percutaneous coronary
intervention/coronary artery bypass grafting, cerebrovascular
accident/transient ischaemic attack, or peripheral arterial
disease)^[282]93. Note, for DECLARE, the prior cardiovascular disease
and multiple risk factors variables were used as they captured the MACE
risk factors. For renal outcomes, we adjusted for atherosclerotic
cardiovascular disease at baseline, heart failure at baseline, systolic
blood pressure at baseline, T2D duration, HbA1c at baseline, eGFR at
baseline, urine albumin-to-creatinine ratio (ACR) at baseline, and
haemoglobin at baseline^[283]94. Note, for EXSCEL, certain analytes
were only assessed intermittently, resulting in sample size loss when
adjusting for these covariates. Consequently, we replaced ACR with a
baseline diagnosis of albuminuria and performed a sensitivity analysis
with and without haemoglobin. Despite the substantial difference in
sample sizes, the hazard ratios were highly correlated (r = 0.91,
p-value < 2.2 × 10^−^16), thus we omitted the haemoglobin adjustment
for EXSCEL. Finally, for HHF and MACE, we included a third model that
also adjusted for NT-proBNP to identify proteins that were independent
of this biomarker as NT-proBNP was profiled in both EXSCEL and DECLARE.
Mediation analysis in EXSCEL and DECLARE
In the scenarios where a PGS is associated with both a clinical trial
outcome and a protein (see Extended Methods), and the protein is in
turn also associated with the same outcome, we performed mediation
analysis using the Medflex R package (version 0.6-10) in the same
manner as above, with the indirect p-value adjusted using the FDR
approach. To maximise sample size for mediation, we made use of the
full proteomics cohort and included the treatment arm as a covariate in
addition to age, sex, age*sex, and genetic PCs 1-10.
Pathway enrichment
For all sets of proteins identified by the PGS-protein analyses, we
used the gProfiler^[284]95 tool
([285]https://biit.cs.ut.ee/gprofiler/gost) to test if these sets of
encoded genes were enriched for KEGG, WikiPathways, or REACTOME
pathways. Note that we restricted the statistical domain to the genes
whose protein products are captured by the Olink panels.
Reporting summary
Further information on research design is available in the [286]Nature
Portfolio Reporting Summary linked to this article.
Supplementary information
[287]Supplementary Information^ (11.9MB, pdf)
[288]41467_2025_56695_MOESM2_ESM.pdf^ (157.5KB, pdf)
Description of Additional Supplementary Files
[289]Supplementary Data 1-21^ (16.9MB, xlsx)
[290]Reporting Summary^ (3.5MB, pdf)
[291]Transparent Peer Review file^ (214.2KB, pdf)
Source data
[292]Source Data^ (591.6KB, xlsx)
Acknowledgements