Abstract
Venous thromboembolism is a life-threatening vascular event with high
prevalence and genetic determinants. PWAS has become a popular strategy
to identify therapeutic targets of complex diseases. However, the
current PWAS model only considers the linear relationship between
protein and disease. Here, we propose a novel non-linear PWAS pipeline
and identify 43 proteins exhibiting non-linear associations with venous
thromboembolism in the UK Biobank, of which eight proteins cannot be
captured by linear PWAS. We further conduct prospective cohort
replication in the UK Biobank Pharma Proteomics Project, and replicate
eight proteins with similar non-linear trends, including ULBP2, IL18BP,
MAN1A2, CCL25, ICAM2, LGALS4, VSIG2 and ABO. Pathway enrichment
analysis suggests that the identified non-linear proteins are involved
in endothelium development, fluid shear stress and atherosclerosis
pathways. In summary, we develop a novel non-linear PWAS analysis
pipeline, and identify 43 non-linear proteins with venous
thromboembolism, highlighting the importance of incorporating
non-linear analysis in PWAS.
Subject terms: Predictive markers, Proteome informatics, Haematological
diseases, Diagnostic markers, Cardiovascular diseases
__________________________________________________________________
Venous thromboembolism is a serious condition influenced by genetics,
but current protein-based analyses often miss complex patterns. Here,
the authors show that a new non-linear analysis method reveals 43
associated proteins, including eight missed by traditional models,
offering deeper insight into disease mechanisms.
Introduction
Venous thromboembolism (VTE) is a cardiovascular disease caused by an
imbalance in the regulation of hemostasis, leading to the pathologic
coagulation of blood and the formation of vascular thrombosis.
According to the location of the thrombus, VTE can be classified as
deep vein thrombosis (DVT), which occurs in the deep veins (e.g., legs
and trunk), or as pulmonary embolism (PE), when a thrombus obstructs
the pulmonary arteries^[42]1. The annual incidence of VTE in the
general population is approximately 100 cases per 100,000 individuals
(1‰)^[43]2, and more than 600,000 individuals in the United States and
Europe die due to VTE each year^[44]3.
VTE is a complex disease influenced by both environmental and genetic
factors. Twin studies have estimated the heritability of VTE to be
approximately 50%, suggesting that a large proportion of VTE risk is
genetically driven^[45]4,[46]5. Large-scale genome-wide association
studies (GWAS) have identified hundreds of loci associated with
VTE^[47]6–[48]9. However, most of the susceptibility loci identified by
GWAS are located in non-coding regions or span multiple genes, which
limits their ability to reflect the causal roles of genes and proteins.
As key molecular regulators, proteins have become the primary targets
in most drug development efforts^[49]10. Therefore, protein-wide
association studies (PWAS) are now widely performed to explore how
proteins are involved in the occurrence and progression of
diseases^[50]11. Previous PWAS have identified several proteins,
including F2, F11, ABO and PROC, as potential diagnostic and
therapeutic targets for VTE using the FUSION pipeline followed by
causal inference^[51]12, SERPINE1, VWF, EPHB4, etc., as causally
associated proteins with VTE through prospective designs and Mendelian
randomization^[52]13, and CFHR5 as a VTE associated biomarker based on
case-control designs^[53]14.
Currently, almost all PWAS studies are designed based on the assumption
of linearity, in which proteins are assumed to affect diseases only in
a linear pattern^[54]12–[55]14. However, many studies have indicated
that the relationships between proteins and diseases may follow
non-linear patterns. For example, a cohort study showed that women with
moderate levels of the protein TFF3 had the highest risk of recurrent
ovarian cancer compared to women with either low or high levels^[56]15.
Another two review studies suggested that the protein p53 functions as
a “cellular rheostat” in cancer, favoring antioxidant functions at low
stress levels while switching to a prooxidant role under high
stress^[57]16,[58]17. The similar non-linear associations were also
reported between tau protein and neurodegenerative
diseases^[59]18–[60]20.
Therefore, to identify the dose-response relationships between proteins
and diseases, non-linear patterns should be considered in PWAS. In this
study, we proposed a novel non-linear PWAS analysis pipeline based on
genetically predicted protein levels. We then applied this pipeline to
identify the 43 proteins non-linearly associated with VTE in the UK
biobank (UKB). Eight of these proteins were replicated with similar
dose-response curves in the prospective cohort of the UK Biobank Pharma
Proteomics Project (UKB-PPP). Finally, we performed pathway enrichment
analysis to elucidate the potential non-linear biological pathway for
VTE.
Results
Overview of study design
The overview of the study design is illustrated in Fig. [61]1. We first
genetically predicted protein levels of 378,474 Europeans in the UKB
using the prediction models developed by Xu et al.^[62]21. To validate
the predicted proteins, we calculated the Pearson correlation
coefficient (r) between the predicted protein levels and directly
measured protein levels in the UKB-PPP^[63]22, and only proteins with
R^2 > 0.01^[64]21,[65]23 were retained for subsequent analyses
(Fig. [66]1A).
Fig. 1. Overview of study design.
[67]Fig. 1
[68]Open in a new tab
A Predict the protein levels of European individuals in UKB using the
weights of pQTL from INTERVAL study. Predicted proteins with validated
R^2 > 0.01 in UKB-PPP were retained for the subsequent analyses. B Two
patterns of the non-linear associations between protein levels and VTE.
Pattern I: The protein has non-linear trends with disease (U-shape or
L-shape, the blue curve), while the linear trend can be also fitted
(the red line) because of the misleading statistical significance.
Pattern II: the proteins have non-linear trends, while the linear trend
is not significant due to the weak test power. The purple triangle
represents the reference OR value. C Linear and non-linear PWAS with
VTE by the logistic regression and RCS model using predicted proteins,
respectively. The gray area of the Venn diagram represents Patterns I
and II. D Cohort replication and pathway enrichment for proteins with
non-linear trends, created in BioRender. Kong, Y. (2025)
[69]https://BioRender.com/j9ioqmi. Figure 1 is a schematic diagram and
is not derived from real data.
Based on statistical theory, applying linear regression to variables
that exhibit non-linear relationships may lead to reduced test power,
or even misleading statistical significance^[70]24,[71]25. Therefore,
we considered two patterns for non-linear relationships between protein
levels and the disease. In the first pattern, protein levels indeed
have non-linear trends with the disease (e.g., U-shape or L-shape), but
a misleadingly significant linear association may still be observed
(Pattern I in Fig. [72]1B). In the second pattern, the relationship is
truly non-linear, while the linear association appears non-significant
due to reduced test power (Pattern II in Fig. [73]1B). To identify
these two patterns, we conducted both linear and non-linear PWAS of the
disease using the genetically predicted proteins, employing logistic
regression for the linear analysis and restricted cubic splines (RCS)
for the non-linear analysis. The two patterns can be distinguished by
combining results of linear and non-linear PWAS (Fig. [74]1C). Finally,
we extracted prospective cohort data in UKB-PPP to replicate the
proteins identified by the non-linear PWAS. We also performed gene-set
enrichment analyses on these non-linear proteins, to explain the
potential biological pathways for the non-linear patterns
(Fig. [75]1D).
Predict and validate protein levels in UKB
We first extracted 15,456,785 genetic variants from 378,474 Europeans
in the UKB, including 22,103 individuals with a history of VTE and
356,371 controls after quality control (Table [76]S1, details shown in
Methods section). Based on prediction models constructed by Xu et
al.^[77]21, we genetically predicted 308 Olink proteins and 2384
SomaScan proteins for these 378,474 Europeans. After quality control,
measured levels of 295 Olink proteins and 944 SomaScan proteins were
also available for 31,013 Europeans in the UKB-PPP (Table [78]S1,
details shown in Methods section). We then calculated Pearson
correlation coefficients between these overlapped genetically predicted
and measured protein levels in the UKB-PPP. Prediction performance was
shown in Fig. [79]S1. For Olink proteins, 237 had an R^2 greater than
0.01, among which 97 had R^2 > 0.1 and 5 had R^2 > 0.5. For SomaScan
proteins, 408 had an R^2 greater than 0.01, among which 151 had
R^2 > 0.1 and 9 had R^2 > 0.5. In total, 537 unique well predicted
proteins (R^2 > 0.01), including 237 Olink proteins (Supplementary
Data [80]1) and 408 SomaScan proteins (Supplementary Data [81]2), were
retained for subsequent analysis. It should be noted that multiple
aptamers may target the same SomaScan protein, so that a SomaScan
protein may have more than one measurement^[82]26. We listed these
special SomaScan proteins in Supplementary Data [83]2, and
distinguished them by superscripts of asterisk and circumflex in the
study.
Linear PWAS identified 64 proteins for VTE
There were 64 unique proteins whose predicted levels were linearly
associated with VTE at a significance level of P < 9.3 × 10^−5
(0.05/537) using Bonferroni adjustment, including 28 Olink proteins and
48 SomaScan proteins (Fig. [84]2A), of which 12 proteins (red labels in
Fig. [85]2A) were significant in both Olink and SomaScan platforms,
such as SELE, LIFR and VWF. We extracted the Z-statistics of these 12
proteins from each platform (Fig. [86]2B). The scatter plot showed a
good concordance in linear PWAS results between the Olink and SomaScan
platforms, with a Pearson correlation coefficient being 0.976
(P = 1.2 × 10^−8). Previous linear PWAS based on GWAS and pQTL summary
statistics for 257–1348 proteins have identified 21 proteins casually
associated with VTE^[87]12–[88]14, among which nine proteins were
validated in our study. The effect directions of these nine proteins
were also consistent with our linear PWAS results (risk proteins:
KLKB1, GP6, F11, ABO and VWF; protective proteins: THBS2, EFEMP1, EPHB4
and PROC)^[89]12,[90]13. Notably, several proteins previously reported
to be strongly associated with VTE, such as F2, F5 and SERPINC1, were
not validated in our study. This is because we performed PWAS based on
genetically predicted protein levels, and these specific proteins were
not well predicted and thus excluded in the external validation.
Fig. 2. Linear and non-linear PWAS results for VTE.
[91]Fig. 2
[92]Open in a new tab
A Miami plot for linear PWAS from Olink platform (top) and SomaScan
platform (bottom). The x axis is the chromosomal location of proteins,
and the y axis denotes −log10 P values from two-sided Wald tests in the
logistic regression. The red labels mark the significant proteins
present in both Olink and SomaScan platforms, while the gray labels
mark the significant proteins present in either Olink or SomaScan
platforms. The dashed line is the significant threshold at 9.3 × 10^−5
after Bonferroni adjustment. B Scatter plot for linear PWAS. The x and
y axes are Z statistics of linear PWAS from SomaScan and Olink
platforms, respectively. The Pearson correlation of scatters is shown
in the lower right corner. C Miami plot for non-linear PWAS. The y axis
denotes −log10 P values from two-sided overall significance test in the
RCS model. The blue labels mark the significant proteins present in
both Olink and SomaScan platforms, while the gray labels mark the
significant proteins present in either Olink or SomaScan platforms. The
purple dots represent the proteins that passed the two-sided
non-linearity test. For clear display, y axis scales in (A) and (C) are
compressed by the square root. D Scatter plot for non-linear PWAS. The
x and y axes are
[MATH: χ2
:MATH]
statistics of significance test in non-linear PWAS from SomaScan and
Olink platforms, respectively. Source data are provided as a [93]Source
Data file.
Non-linear PWAS identified 43 proteins for VTE
Our non-linear PWAS using the RCS model identified 34 Olink proteins
and 44 SomaScan proteins associated with VTE (P < 9.3 × 10^−5 for the
overall significance test, Fig. [94]2C). Among them, 23 Olink proteins
and 27 SomaScan proteins showed significant non-linear associations
(P < 0.05 for the non-linearity test, purple dots in the Fig. [95]2C,
the set of significance levels were shown in the Methods section). The
list of these 23 Olink proteins and 27 SomaScan proteins with
significant non-linear trends were presented in Table [96]1. After
removing duplicate proteins across Olink and SomaScan platforms, 43
unique proteins were indicated to have non-linear associations with
VTE, 26 of which were not reported by previous study, such as TEK,
CCL25 and VSIG2. 13 proteins were significant in both Olink and
SomaScan platforms, and their P values of the overall significance test
were highly correlated, with a Pearson correlation coefficient being
0.989 (P = 1.4 × 10^−10) (Fig. [97]2D). The CVs of these protein
measurements in Olink and SomaScan were listed in Supplementary
Data [98]3^[99]22.
Table 1.
The 43 unique proteins that have significant non-linear associations
with VTE
Protein^a Position^b Linear^c Non-linear^d Platform Pattern Known
loci^e
OR (95% CI) P P[RCS] P[Non–linear]
TIE1 chr1:43300982–43323108 1.12 (1.09–1.15) 4.76 × 10^−56
8.42 × 10^−60 3.48 × 10^−6 SomaScan I Not reported
F3 chr1:94529173–94541759 0.97 (0.95–1.00) 9.01 × 10^−5 9.51 × 10^−13
4.55 × 10^−10 Olink I 28827327
MAN1A2 chr1:117367449–117528872 1.09 (1.06–1.12) 1.74 × 10^−38
3.65 × 10^−50 7.47 × 10^−13 SomaScan I 36658437
PEAR1 chr1:156893698–156916429 0.97 (0.94–1.00) 5.11 × 10^−5
3.61 × 10^−20 3.93 × 10^−18 SomaScan I 28002340
SELE chr1:169722640–169764705 0.85 (0.83–0.87) 1.10 × 10^−121
3.54 × 10^−134 1.43 × 10^−14 Olink I 21980494
0.86 (0.83–0.88) 2.47 × 10^−108 1.90 × 10^−120 1.88 × 10^−13 SomaScan
LGALS8 chr1:236518000–236552981 0.96 (0.93–0.98) 1.12 × 10^−10
1.07 × 10^−17 1.98 × 10^−10 Olink I Not reported
MERTK chr2:111898607–112029561 0.96 (0.93–0.98) 9.66 × 10^−11
1.37 × 10^−10 4.87 × 10^−2 Olink I Not reported
TFPI chr2:187464230–187565760 0.93 (0.90–0.95) 1.16 × 10^−27
1.37 × 10^−26 3.26 × 10^−2 SomaScan I 36154123
ALPI chr2:232456125–232460753 0.91 (0.89–0.94) 3.85 × 10^−38
6.19 × 10^−40 1.37 × 10^−3 SomaScan I Not reported
FAM3D chr3:58633946–58666834 1.09 (1.06–1.12) 7.94 × 10^−32
9.76 × 10^−32 2.52 × 10^−2 SomaScan I Not reported
KDR chr4:55078481–55125595 0.89 (0.87–0.92) 1.03 × 10^−60 3.42 × 10^−82
3.01 × 10^−25 SomaScan I Not reported
F11 chr4:186266189–186289681 1.11 (1.08–1.14) 4.35 × 10^−52
4.04 × 10^−50 2.64 × 10^−2 SomaScan I 31420334
LIFR chr5:38474668–38608354 0.92 (0.90–0.95) 2.68 × 10^−32
2.32 × 10^−123 5.22 × 10^−97 Olink I 36658437
0.86 (0.83–0.88) 7.21 × 10^−95 4.89 × 10^−121 7.77 × 10^−28 SomaScan
FLT4 chr5:180601506–180649624 0.94 (0.91–0.96) 3.03 × 10^−20
2.91 × 10^−20 2.10 × 10^−2 SomaScan I 31765479
ULBP2 chr6:149942014–149949235 1.05 (1.02–1.08) 9.98 × 10^−14
1.03 × 10^−16 5.77 × 10^−5 Olink I Not reported
PDGFA chr7:497258–520296 1.04 (1.01–1.07) 4.42 × 10^−9 9.27 × 10^−9
4.37 × 10^−2 Olink I Not reported
1.05 (1.02–1.08) 9.07 × 10^−10 2.17 × 10^−9 3.92 × 10^−2 SomaScan
EPHB4 chr7:100802565–100827523 0.92 (0.90–0.95) 2.65 × 10^−33
1.78 × 10^−128 1.19 × 10^−100 Olink I 35285134
ANGPT1 chr8:107249482–107498055 1.04 (1.01–1.07) 2.76 × 10^−9
8.51 × 10^−10 5.29 × 10^−3 SomaScan I 36154123
ABO chr9:133233278–133276024 1.15 (1.12–1.18) 7.90 × 10^−91
9.34 × 10^−110 4.29 × 10^−23 SomaScan I 28373160
CBLIF chr11:59829273–59845499 0.95 (0.93–0.98) 1.63 × 10^−11
2.88 × 10^−12 6.80 × 10^−3 Olink I 35285134
IL18BP chr11:71998613–72007315 1.07 (1.04–1.10) 8.98 × 10^−20
4.22 × 10^−22 2.48 × 10^−4 Olink I Not reported
RELT^ chr11:73376399–73397474 0.97 (0.95–1.00) 5.47 × 10^−5
6.54 × 10^−6 5.85 × 10^−3 SomaScan I Not reported
ESAM* chr11:124752583–124762290 1.03 (1.00–1.06) 8.73 × 10^−6
4.80 × 10^−13 4.16 × 10^−9 SomaScan I Not reported
ESAM^ 1.03 (1.01–1.06) 2.08 × 10^−6 1.12 × 10^−13 3.72 × 10^−9
VWF chr12:5948877–6124770 1.19 (1.16–1.22) 1.58 × 10^−134
4.67 × 10^−154 3.36 × 10^−24 Olink I 36658437
1.17 (1.14–1.20) 7.78 × 10^−111 2.20 × 10^−143 1.15 × 10^−36 SomaScan
ISLR2* chr15:74100311–74138540 0.95 (0.92–0.97) 1.12 × 10^−14
1.94 × 10^−54 3.53 × 10^−43 SomaScan I Not reported
ISLR2^ 0.96 (0.93–0.98) 4.42 × 10^−10 2.05 × 10^−40 1.81 × 10^−33
CDH5 chr16:66366622–66404784 0.95 (0.93–0.98) 1.04 × 10^−12
1.59 × 10^−71 6.12 × 10^−63 Olink I Not reported
0.95 (0.93–0.98) 2.65 × 10^−11 4.15 × 10^−66 6.43 × 10^−59 SomaScan
LGALS9 chr17:27629798–27649560 0.96 (0.94–0.99) 3.29 × 10^−8
7.06 × 10^−8 4.06 × 10^−2 Olink I Not reported
ICAM2 chr17:64002594–64020634 0.91 (0.88–0.93) 4.12 × 10^−45
6.85 × 10^−59 1.03 × 10^−15 Olink I Not reported
0.87 (0.85–0.90) 1.67 × 10^−81 8.69 × 10^−82 1.12 × 10^−2 SomaScan
PECAM1 chr17:64319415–64413776 0.92 (0.89–0.94) 2.29 × 10^−36
8.32 × 10^−121 2.20 × 10^−91 Olink I Not reported
CD209 chr19:7739993–7747564 1.17 (1.14–1.20) 5.21 × 10^−114
7.77 × 10^−136 1.21 × 10^−25 SomaScan I 35285134
ICAM1 chr19:10271093–10286615 0.89 (0.86–0.92) 1.78 × 10^−56
5.90 × 10^−145 4.39 × 10^−94 SomaScan I 31676865
LGALS4 chr19:38801671–38812945 0.85 (0.82–0.87) 5.79 × 10^−127
5.93 × 10^−151 9.80 × 10^−28 Olink I Not reported
OXT chr20:3071620–3072517 1.04 (1.01–1.06) 3.16 × 10^−7 2.82 × 10^−7
2.59 × 10^−2 SomaScan I Not reported
THBD chr20:23045633–23049672 1.05 (1.02–1.08) 1.96 × 10^−11
1.30 × 10^−11 5.99 × 10^−3 Olink I 26888256
FAM3B* chr21:41304212–41357727 1.04 (1.01–1.07) 2.13 × 10^−8
1.02 × 10^−11 9.76 × 10^−6 SomaScan I Not reported
FAM3B^ 1.04 (1.01–1.07) 1.64 × 10^−9 5.05 × 10^−17 4.27 × 10^−10
CTRC chr1:15438442–15449242 1.03 (1.00–1.06) 1.08 × 10^−4 1.14 × 10^−12
4.56 × 10^−9 Olink II Not reported
PROC chr2:127418427–127429242 0.99 (0.96–1.01) 4.76 × 10^−2
4.62 × 10^−10 4.99 × 10^−10 SomaScan II 36658437
GFRAL chr6:55327469–55402493 1.03 (1.00–1.06) 1.65 × 10^−4 5.29 × 10^−8
5.23 × 10^−5 SomaScan II Not reported
TEK chr9:27109141–27230174 1.00 (0.97–1.02) 5.43 × 10^−1 9.36 × 10^−16
1.68 × 10^−16 Olink II Not reported
1.01 (0.98–1.03) 3.22 × 10^−1 3.61 × 10^−17 1.07 × 10^−17 SomaScan
DKK1 chr10:52314281–52318042 1.02 (0.99–1.05) 2.53 × 10^−3 9.65 × 10^−6
1.87 × 10^−4 Olink II Not reported
VSIG2 chr11:124747474–124752255 0.99 (0.97–1.02) 4.31 × 10^−1
2.52 × 10^−8 6.83 × 10^−9 Olink II Not reported
CCL25 chr19:8052318–8062660 0.98 (0.95–1.00) 6.73 × 10^−4 5.25 × 10^−6
5.38 × 10^−4 Olink II Not reported
LDLR chr19:11089462–11133820 1.00 (0.97–1.02) 6.33 × 10^−1 2.51 × 10^−5
6.69 × 10^−6 Olink II Not reported
[100]Open in a new tab
^aThe superscript symbols ‘*’ and ‘^’ are only used to distinguish
SomaScan proteins with multiple aptamers. Details on the corresponding
aptamers can be found in Supplementary Data [101]2.
^bThe position of protein-coding genes, GRCh38 assembly.
^cThe linear result obtained by two-sided Wald tests in the logistic
regression. The Bonferroni significance is 9.3 × 10^−5.
^d
[MATH:
PRCS :MATH]
is the P value of the two-sided overall significance test, and
[MATH:
PNon−<
/mo>linear
:MATH]
is the P value of the two-sided non-linearity test in the RCS model.
^ePMID for the known loci. Known loci are defined as those with clear
evidence linking the protein or gene to VTE, or where significant GWAS
loci for VTE are identified within 1MB upstream or downstream of the
protein.
Identify two non-linear patterns
Combining the linear and non-linear PWAS results in Table [102]1, we
can classify proteins into Patterns I and II. Briefly, the proteins
with significant
[MATH:
PLine<
/mi>ar :MATH]
,
[MATH:
PRCS :MATH]
and
[MATH:
PNon−<
/mo>linear
:MATH]
, such as VWF, CDH5 and F3, were categorized as Pattern I. The proteins
with significant
[MATH:
PRCS :MATH]
and
[MATH:
PNon−<
/mo>linear
:MATH]
but non-significant
[MATH:
PLine<
/mi>ar :MATH]
, such as TEK, CCL25 and VSIG2, were classified as Pattern II. Among
the VTE-associated proteins, 29 proteins were identified with only
linear trends, 35 proteins were assigned to Pattern I and eight
proteins were assigned to Pattern II (Fig. [103]3A). Moreover, seven
proteins in Pattern II including GFRAL, CTRC, DKK1, TEK, VSIG2, CCL25
and LDLR were newly identified as being associated with VTE,
highlighting that our proposed non-linear analysis pipeline can uncover
additional associations that may be missed under traditional linear
assumption. Specifically, both the Olink and SomaScan platforms
identified proteins ICAM2, CDH5, LIFR, PDGFA, SELE and VWF as Pattern
I, and TEK as Pattern II (Fig. [104]3B). This demonstrates that the two
non-linear patterns remain robust across results derived from different
proteomic platforms.
Fig. 3. Venn diagram.
[105]Fig. 3
[106]Open in a new tab
A Venn diagram of significant proteins in linear and non-linear PWAS.
The red numbers denote the number of elements in each region. B Venn
diagram of significant proteins in linear (SomaScan), linear (Olink),
non-linear (Olink) and non-linear (SomaScan) PWAS. Source data are
provided as a [107]Source Data file.
Dose-response relationship between protein levels and VTE risk
The dose-response curves between the levels of 43 non-linear protein
and VTE risk, as identified by our PWAS pipeline, were displayed in
Fig. [108]4A and Figs. [109]S3–[110]S5, with eight proteins
successfully replicated in the subsequent cohort analysis
(Fig. [111]4A). The dose-response curves of proteins ULBP2, IL18BP and
MAN1A2 were J-shaped, characterized by a flat or slowly increasing
trend at low doses, followed by a rapid increase after exceeding a
threshold. The curve of protein CCL25 exhibited an L-shape, with a
sharp decrease at low doses that leveled off or declined more gradually
beyond a threshold. The curves of proteins ICAM2, VSIG2 and LGALS4
followed a U-shape, with risk decreasing at low doses, increasing at
high doses, and reaching the lowest risk at a moderate dose. The
dose-response curve of protein ABO was more complex, remaining
relatively flat at both low and high doses while increasing at moderate
doses.
Fig. 4. Dose-response curves between seven replicated proteins and VTE in
PWAS and cohort analysis.
[112]Fig. 4
[113]Open in a new tab
A Dose-response curves of proteins ULBP2, IL18BP, MAN1A2, CCL25, ICAM2,
LGALS4, VSIG2 and ABO in PWAS. B Corresponding dose-response curves in
cohort replication. The x axis is the predicted (measured) protein
levels after rank-based inverse normal transformation, the y axis on
the right denotes the frequency of their histogram, and the y axis on
the left is OR (HR) of VTE risk estimated by logistic regression (cox
regression) and RCS model. The red curve represents linear fit and 95%
CI. The blue curve represents non-linear fit and 95% CI. The central
measure of these curves represents the predicted mean values generated
by the linear (non-linear) model. The histogram represents the
distribution of the predicted (measured) protein levels. The dashed
line is the reference line of OR (HR).
Based on the statistics of linear and non-linear PWAS, ULBP2, IL18BP,
MAN1A2, ICAM2, LGALS4 and ABO were classified as Pattern I, and CCL25
and VSIG2 belonged to Pattern II. Although some proteins displayed
similar curve shapes, they may belong to different patterns (e.g.,
ICAM2, LGALS4 and VSIG2), which may be due to the imbalanced
distribution in their non-linear components. We noted that the
threshold points for Pattern I often distant from the normalized mean
0, resulting in sparse data in one portion of the curve. For example,
ICAM2 and LGALS4 had thresholds around 1 SD, leading to sparse
distributions in the high-dose range. When conducting linear
regression, an imbalanced distribution may reduce the weight of either
non-linear component in the ordinary least square estimation, leading
to misleading statistical significance. In this case, although the
linear PWAS can still detect these significant proteins, the non-linear
association between proteins and VTE would be ignored. In contrast to
Pattern I, threshold points in Pattern II were typically near the mean
0 (e.g., VSIG2). In this case, linear models lack power to detect
associations, causing them to be overlooked in traditional linear PWAS.
Prospective cohort study replicates 8 proteins in 43 non-linear proteins
Among the 43 non-linear proteins identified by our PWAS pipeline, 18
proteins can be replicated at a nominal significance level in the
prospective cohort analysis, and 14 proteins remained significant after
Bonferroni adjustment (0.05/43 = 1.1 × 10^−3). Among these, eight
proteins exhibited consistent dose-response curves in both the PWAS and
cohort analyses. The dose-response cures of these eight proteins were
shown in Fig. [114]4B. The non-linear trends of proteins ULBP2, IL18BP,
MANIA2, ICAM2 and ABO can be perfectly replicated in the prospective
cohort. For the remaining three proteins CCL25, VSIG2 and LGALS4, some
minor discrepancies were observed in the curve shapes, but these
differences are reasonable given the substantially smaller sample size
of the prospective cohort. The dose-response curves of the other 35
non-linear proteins analyzed in the cohort were shown in
Figs. [115]S6–[116]S8.
Enrichment analysis of VTE associated proteins
Gene-set enrichment analyses were conducted to further elucidate the
biological pathways of the identified non-linear proteins. As a
benchmark, we first performed enrichment analyses on the 64 proteins
with linear trends. The top three enriched GO pathways
(Fig. [117]5ALinear) were Enzyme-linked receptor protein signaling
pathway (P[adjust] = 8.8 × 10^−5, GO:0007167), Regulation of body fluid
levels (P[adjust] = 0.006, GO:0050878) and Blood coagulation
(P[adjust] = 0.006, GO:0007596) for biological process (BP); Cell
periphery (P[adjust] = 0.005, GO:0071944), Membrane (P[adjust] = 0.002,
GO:0016020) and Plasma membrane (P[adjust] = 0.002, GO:0005886) for
cellular component (CC). The top three enriched KEGG pathways
(Fig. [118]5BLinear) included Focal adhesion (P[adjust] = 0.001, path:
hsa04510), Rap1 signaling pathway (P[adjust] = 0.001, path: hsa04015)
and Ras signaling pathway (P[adjust] = 0.009, path: hsa04014). In
contrast, for the 43 proteins with non-linear trends, enriched GO
pathways (Fig. [119]5ANon-linear) only included endothelium development
(P[adjust] = 0.031, GO:0003158), and KEGG pathways
(Fig. [120]5BNon-linear) only included fluid shear stress and
atherosclerosis (P[adjust] = 0.043, path:hsa04510).
Fig. 5. Gene-set enrichment analyses for VTE associated proteins.
[121]Fig. 5
[122]Open in a new tab
a Dot plot shows the top five BP, CC and MF terms of GO function for 64
proteins with linear trends and 43 proteins with non-linear trends on
VTE. b Dot plot shows the top ten enriched KEGG pathway terms for 64
proteins with linear trends and 43 proteins with non-linear trends on
VTE. Dot color represents the statistical significance of enrichment
analysis based on P values after FDR adjustment, while dot size
represents the count of proteins annotated to each term. Source data
are provided as a [123]Source Data file.
Discussion
In this study, we proposed a novel non-linear PWAS analysis pipeline.
This flexible pipeline does not rely on large-scale individual-level
proteomic data, and can borrow public resources to complete the
analysis. We further suggested two non-linear association patterns,
highlighting the potential limitations for directly using linear PWAS
when non-linear relationships exist. We recommended that our non-linear
pipeline could be used as a complementary approach to linear PWAS, to
reduce the risk of omission and misleadingness. Applying this pipeline
to VTE, we identified 43 proteins with non-linear associations to VTE
risk, including 26 unreported findings. Among them, eight proteins can
be replicated in the prospective cohort analysis, further supporting
the robustness of our findings. These non-linear associations provided
novel insights into the development of VTE.
Among these replicated proteins, ULBP2 primarily binds to NKG2D, a
receptor on the surface of NK cells, and regulates immune response. Its
aberrant expression may lead to an inflammatory response and immune
imbalance, which promotes thrombus formation^[124]27. IL18BP is a
modulator of the proinflammatory cytokine IL-18, which promotes
leukocyte and platelet recruitment, endothelial dysfunction and
coagulation triggering. IL18BP may affect VTE by mediating IL-18 in the
vasculature^[125]28,[126]29. MAN1A2 regulates the glycosylation
process, and abnormalities in glycosylation can enhance adhesion on
specific cell surfaces, particularly endothelial cells, thereby
increasing the risk of VTE^[127]30. CCL25 can attract T lymphocytes
expressing its receptor CCR9 to sites of inflammation, including
vascular endothelium, and may contribute to enhanced recruitment of
immune cells to the vessel wall, promoting local inflammation and
thrombus formation^[128]31. ICAM2 plays a crucial role in mediating
cell-cell adhesion processes. Endothelial ICAM2 mediates angiogenesis
and the lack of ICAM2 expression results in impaired angiogenesis both
in vitro and in vivo, increasing the risk of thrombosis^[129]32.
Meanwhile, overexpression of ICAM2 also promotes thrombosis by inducing
platelet and leukocyte adhesion to vascular endothelial cells^[130]33.
VSIG2 primarily implicated in immune modulation and cell-cell
interactions, could impact the initiation of VTE by affecting the
inflammatory response associated with endothelial dysfunction^[131]34.
LGALS4, also known as Galectin-4, has been reported to regulate
endothelial cell adhesion molecule expression, inflammation response
and leukocyte recruitment, which are critical events in the initiation
of VTE^[132]35,[133]36. ABO protein levels determine an individual’s
ABO blood group, which non-linearly affects the risk of VTE by
mediating the levels of VIII and VWF in different blood
groups^[134]37,[135]38. Among above replicated proteins, the non-linear
trends of ICAM2 and ABO can be interpreted by existing functional
annotations, and others are new findings that need more confirmation.
It should be noted that most of the non-linear associated proteins are
functionally annotated to cell adhesion and endothelium development,
which is consistent with our gene-set enrichment results. Endothelial
cell development is also linked to fluid shear stress, a KEGG pathway
identified in our analysis. The association between fluid shear stress
and VTE is generally considered to be non-linear: excessive fluid shear
stress may damage vascular endothelial cells, triggering an
inflammatory response and promoting thrombosis, while insufficient
fluid shear may make vascular endothelial cells more susceptible to
platelet and leukocyte adhesion, promoting thrombosis^[136]39,[137]40.
In addition, endothelial cell development influences multiple processes
associated with thrombosis (e.g., release of coagulation factors,
inflammatory response, and immune response), which increases the
likelihood that endothelial cell development plays a non-linear role in
the pathogenesis of VTE^[138]41.
For the non-linear PWAS analysis pipeline, we did not directly use the
measured protein levels from UKB-PPP. Instead, we used genetically
predicted protein levels based on the genetic data from UKB. This
strategy has two advantages. First, although UKB-PPP includes measured
protein levels for 54,219 samples, only 28,318 samples can be retained
after quality control (QC) in our study. In contrast, we can
genetically predict protein levels for 408,812 Europeans in UKB, and
even after QC, 378,474 individuals can still be retained in the
analysis. This approach allows us to utilize a larger sample size for
subsequent non-linear PWAS compared to directly using measured
proteins, which is especially valuable for ethnic populations with
small sample sizes of measured proteins (e.g., African^[139]42, East
Asian^[140]43 and Arabs^[141]44). Second, measured plasma protein
levels are influenced by various environmental factors, such as
diet^[142]45, behavioral habits^[143]22, and the individual physical
state at the time of measurement. These factors may confound the
analysis of linear and non-linear relationships between protein levels
and diseases. Here, we used genetically predicted protein levels, which
is isolated from environmental confounders. To further validate our
method, we conducted additional analyses comparing our linear PWAS
pipeline with the traditional PWAS method (regression using measured
protein levels, followed by Mendelian randomization to strengthen
causal inference). The results showed that our linear PWAS pipeline was
generally consistent with the Mendelian randomization results
(Supplementary Note [144]3 and Figs. [145]S10 and [146]S11).
We suggested two non-linear patterns based on our analysis pipeline.
Pattern I indicates the proteins that indeed have non-linear trends
with VTE (e.g., U-shape or L-shape), while also showing significant
linear trends. In this case, linear PWAS would identify these proteins,
but it might overlook the non-linear relationship underlying them. More
seriously, the misleading regression coefficients from linear PWAS
could lead to incorrect conclusions. Pattern II represents the proteins
with non-linear trends where the linear trend is insignificant. In this
case, if only linear PWAS is conducted, these proteins would be omitted
from the analysis. Overall, we recommend our proposed non-linear PWAS
pipeline as a complement to linear PWAS, to reduce the risk of omission
and misleadingness in linear PWAS.
We recommend the following directions to further research the
non-linear proteins identified in this study. Firstly, such non-linear
relationships indicate that the effect of proteins on diseases changes
significantly at a specific threshold. This threshold effect may
involve complex feedback regulatory loops^[147]46, or networks of
protein-protein interactions on diseases^[148]47. Studying this
threshold effect may reveal triggering mechanisms and progression
patterns of diseases. Secondly, Dose-response curves help determine the
optimal protein level to reduce disease risks, which can inform the
appropriate drug dose in the subsequent clinical trials and support the
development of protein-targeted drugs. Thirdly, known non-linear
relationships can help to build more accurate predictive models of
diseases. Researchers can enhance the predictive performance by
introducing quadratic or interaction terms of known non-linear proteins
into the model, instead of just assuming all relationships are
linear^[149]48.
In defining VTE cases, we did not consider multiple VTE events for a
single patient, as the number of recurrent VTE events may be
incompletely recorded in the UKB. Our primary focus was on genetic
susceptibility to VTE, and once an individual has experienced a VTE
event, indicating the underlying susceptibility, regardless of the
number of occurrences. VTE can occur in various clinical settings, such
as acute disease, trauma, pregnancy, exposure to drugs, surgery,
cancer, inflammatory diseases and so on. To address this issue, we
conducted a sensitivity analysis focused solely on unprovoked VTE cases
driven by genetic factors. The results of the sensitivity analysis
showed that our PWAS results were minimally affected by these clinical
factors (Supplementary Note [150]2 and Fig. [151]S9). Considering
anticoagulant use may also influence incident VTE, we performed
sensitivity analyses by adjusting for it in the prospective cohort. The
results showed this factor did not significantly influence incident VTE
in our study (HR = 0.755, 95% CI: 0.357–1.592, P = 0.439), which may be
attributed to the limited sample size using anticoagulants and the
potential misclassification from baseline-only self-reported medication
data in the UKB.
There are also several limitations deserve discussion. First, our PWAS
analysis pipeline was based on genetically predicted protein levels.
Some important proteins may be eliminated in the validation step, due
to the limited power of the protein prediction model or the limited
sample size of the protein reference population. In this study, we
excluded these biased predicted proteins to minimize the risk of
misleading associations. Second, in our study, VTE cases were defined
based on ICD-10 codes, which may introduce non-differential
misclassification of VTE outcomes when the sensitivity and specificity
of ICD-10 codes for VTE diagnosis were relatively low. Such
misclassification could reduce the statistical power of our models and
potentially lead to true protein associations being classified as
non-significant. Third, the mechanisms of VTE vary by sex and age
groups, with oestrogen playing a major role in young women and genetic
factors being more prominent in young men. Future studies stratifying
by sex and age groups may provide additional valuable insights into
VTE. Finally, only eight out of 43 non-linear proteins identified in
our study were replicated in the prospective cohort analysis, which may
be attributed to the limited sample size of the prospective replication
analysis. The non-linear trend of these proteins should be further
validated in larger populations or through experimental approaches in
future studies.
Methods
Ethics
We obtained genotype, protein and phenotype data from the UKB and
UKB-PPP, which received ethics approval from the North West Centre for
Research Ethics Committee. The application number of UKB and UKB-PPP in
our study was 88159.
Quality control in UK Biobank
The target population in this study is individuals of European ancestry
in UKB. The UKB cohort is a population-based volunteer longitudinal
cohort of ~500,000 individuals recruited across the United Kingdom. We
used the imputed genotypes from UKB. Details of array, sample
processing and quality control have been described
elsewhere^[152]49,[153]50. Briefly, we extracted a European ancestry
subset, including 408,812 samples who self-identified as white British
and had very similar genetic ancestry based on the principal component
analysis of the genotypes (Field ID 22006). The variants stratify the
following criterion were excluded by using PLINK^[154]51: (1) minor
allele frequency (MAF) < 0.001 (we set a looser criterion than 0.005 of
the original prediction model^[155]21 here to ensure enough variants
can be included in our cohort), (2) Hardy–Weinberg equilibrium test P
value < 1.0 × 10^−12, (3) missing genotype rate >0.05, or (4)
imputation accuracy score <0.3. We also randomly removed 30,338
individuals in related pairs with kinship coefficient ≥0.0884 (second
degree or greater)^[156]52 using PLINK^[157]51. Therefore, 378,474
individuals of European ancestry and 15,456,785 variants were retained
for the following study.
VTE definition in UK biobank
VTE phenotype in UKB (application number 88159) was derived from the
first occurrence of any code mapped to 3-character ICD-10 (category ID
1712), which was generated by mapping ICD-10 code in the death register
records, read code information in the primary care data, ICD-9 and
ICD-10 codes in the hospital inpatient records (e.g., diagnosis or
operation code), and self-reported medical conditions reported at the
baseline or subsequent visit to UKB assessment center. VTE cases were
defined as having at least one of the following ICD-10 diagnosis code:
(1) ‘I26’ pulmonary embolism (Field ID 131308), (2) ‘I80’ thrombosis,
phlebitis and thrombophlebitis (Field ID 131396), and (3) ‘I82’ other
venous thromboembolism and thrombosis (Field ID 131400)^[158]53. The
remaining individuals were included in the control group. Covariates
included sex (Field ID 22001), array batch (Field ID 22000) and top ten
ancestry principal components. Finally, we retained 378,474 Europeans,
including 22,103 VTE cases and 356,371 controls.
Protein levels prediction in UK Biobank
In this study, we directly used the protein prediction models
constructed by Xu et al.^[159]21, which were based on the INTERVAL
study^[160]54. For aptamer-based SomaScan assay (v.3), the INTERVAL
study profiled 2384 plasma proteins for 3175 participants of European
ancestry^[161]26,[162]55. For the Olink assay, they measured 308 plasma
protein abundance of around 4822 Europeans using four Olink panels,
namely inflammation-1 (INF-1), cardiovascular II (CVD-2),
cardiovascular III (CVD-3) and neurology (NEUR)^[163]56. Xu et al.
constructed the prediction model for these Olink and SomaScan proteins
using Bayesian ridge regression^[164]21.
The existing prediction models have provided pQTLs and their weights
for each protein. We matched pQTLs in UKB and predicted protein levels
by calculating the inner product of pQTLs and the corresponding
weights. Finally, we obtained 308 predicted Olink proteins and 2384
SomaScan proteins for 22,103 VTE cases and 356,371 controls. The flow
chart of protein levels prediction was shown in the left panel of
Fig. [165]S2.
Validation and selection of predicted protein in UKB-PPP
UKB-PPP is a proteomics profiling data in UKB, which collected blood
plasma samples from 54,219 UKB participants using the antibody-based
Olink Explore 3072 proximity extension assay, measuring 2,941 proteins
(Field ID 30900)^[166]22. After excluding non-random baseline and
individuals in batch 7 due to its mixed samples from both UKB sets and
COVID-19 sets^[167]22, 40,502 samples were retained. We matched
individuals and proteins between the predicted data in UKB and measured
data in UKB-PPP based on their corresponding ID and protein-coding gene
names, respectively. Then 31,013 samples, 295 Olink proteins and 944
SomaScan proteins were matched. The Pearson correlations of these
matched proteins between predicted data and measured data were
calculated to reflect the accuracy of our protein predictions. For
measured protein levels, we further used their residuals after
adjusting covariates sex (Field ID 22001), baseline age (Field ID
21022), array batch (Field ID 22000) and top ten ancestry principal
components (Field ID 22009). Both predicted and measured protein levels
were rank-based inverse normal transformed before calculating Pearson
correlations^[168]57. Finally, only matched proteins with R^2 > 0.01
(i.e., Pearson correlation >0.1)^[169]21,[170]23 can pass the external
validation and be included in the following PWAS analysis. The flow
chart of predicted protein validation was shown in the middle panel of
Fig. [171]S2.
Linear PWAS for VTE
We performed linear PWAS for the well-predicted proteins and the VTE
risk in 378,474 Europeans from UKB using the logistic regression. The
adjusted covariates only included sex (Field ID 22001), array batch
(Field ID 22000) and top ten ancestry principal components (Field ID
22009), as further adjustment may induce collider bias in PWAS
(adjusting downstream variables of the predicted protein can create new
associations between the predicted protein and unknown environmental
confounders)^[172]58. We excluded baseline age as covariates in PWAS,
because the predicted protein levels represent an individual’s lifelong
average levels. It is unreasonable to use baseline age to adjust the
associations of lifelong average predicted protein levels with
diseases. All predicted protein levels were rank-based inverse normal
transformed before analysis^[173]57. Considering 108 overlapped
proteins between predicted Olink and SomaScan proteins, there were 537
unique well-predicted proteins in our study. The unique proteins were
defined by having different protein-coding gene names. The significance
level of linear PWAS was set to be 0.05/537 = 9.3 × 10^−5 based on
Bonferroni adjustment.
Non-linear PWAS using RCS
We used restricted cubic splines^[174]59 with four knots at the 5th,
35th, 65th, and 95th centiles to flexibly model the non-linear
association between well-predicted proteins and VTE risk in 378,474
Europeans from UKB by R package “rms”. The covariates were the same as
the above linear PWAS. All predicted protein levels were rank-based
inverse normal transformed before analysis^[175]57. We tested the
significance of protein levels by likelihood ratio test comparing the
model with only covariates, against the model with covariates and
protein levels. The significance level was set to be 9.3 × 10^−5 after
Bonferroni adjustment. The non-linearity of protein levels was tested
by the likelihood ratio test comparing RCS model, against the logistic
model that considers protein levels as a linear term^[176]60,[177]61.
Only the proteins that passed the overall significance test would be
performed for the non-linearity test. So, the significance level of the
non-linearity test was set to be 0.05. Finally, we plotted the
dose-response curves of the odds ratio of VTE risk against standardized
protein levels for each protein.
Prospective cohort replication in UKB-PPP
To replicate the 43 proteins obtained by non-linear PWAS, we conducted
Cox regression using the prospective cohort data in UKB-PPP. We
selected 31,013 matched samples and 43 corresponding measured proteins
in UKB-PPP. We excluded individuals with VTE before baseline (n = 990)
and individuals with missing covariates (n = 1705). Then 28,318
samples, 768 of whom had emerging VTE events during the follow-up
period, were retained in the cohort analysis (Table [178]S1). Other
27,550 samples with censored events include: (1) 25,601 samples did not
suffer from VTE until the end of follow-up, (2) 1882 samples were lost
to follow-up, and (3) 67 samples died before suffering from VTE. We
used Cox regression to replicate the linear association between
measured protein level and risk of VTE, and restricted cubic splines to
replicate their non-linear associations. The adjusted covariates
included sex (Field ID 22001), baseline age (Field ID 21022), BMI
(Field ID 21001), Townsend deprivation index (Field ID 22189), smoke
status (Field ID 20116), drink status (Field ID 20117), physical status
(Field ID 6164), systolic blood pressure (Field ID 4080), Oily fish
intake (Field ID 1329), processed meat intake (Field ID 1349) and top
ten ancestry principal components (Field ID 22009). Details on the
selection of covariates can be found in Supplementary Note [179]1. All
measured protein levels were rank-based inverse normal transformed
before analysis^[180]57. The significance level of proteins was set to
be 0.05/43 = 1.1 × 10^−3 after Bonferroni adjustment. Considering the
sample size of the prospective cohort is much smaller than that of
predicted data in UKB, we also listed the proteins that can be
replicated by the nominal significance level (P = 0.05). The
significance level of the non-linearity test was 0.05 as the non-linear
PWAS. The flow chart of cohort replication was shown in the right panel
of Fig. [181]S2.
Pathway enrichment analysis
We performed Gene Ontology (GO) and Kyoto Encyclopedia of Genes and
Genomes (KEGG) enrichment analysis to explore the biological pathways
on 64 proteins with linear trends and 43 proteins with non-linear
trends, respectively, using the R package “clusterProfiler”^[182]62.
The background gene of enrichment analysis was set as the list of all
537 unique well-predicted proteins. The P values were adjusted by false
discovery rate (FDR) using the Benjamini-Hochberg method.
Reporting summary
Further information on research design is available in the [183]Nature
Portfolio Reporting Summary linked to this article.
Supplementary information
[184]Supplementary information^ (8.7MB, pdf)
[185]41467_2025_61874_MOESM2_ESM.docx^ (14.8KB, docx)
Description of Additional Supplementary Files
[186]Supplementary Data 1-3^ (67.9KB, xlsx)
[187]Reporting Summary^ (91.2KB, pdf)
[188]Transparent Peer Review file^ (4.1MB, pdf)
Source data
[189]Source Data^ (176.3KB, xlsx)
Acknowledgements