Abstract

   Venous thromboembolism is a life-threatening vascular event with high
   prevalence and genetic determinants. PWAS has become a popular strategy
   to identify therapeutic targets of complex diseases. However, the
   current PWAS model only considers the linear relationship between
   protein and disease. Here, we propose a novel non-linear PWAS pipeline
   and identify 43 proteins exhibiting non-linear associations with venous
   thromboembolism in the UK Biobank, of which eight proteins cannot be
   captured by linear PWAS. We further conduct prospective cohort
   replication in the UK Biobank Pharma Proteomics Project, and replicate
   eight proteins with similar non-linear trends, including ULBP2, IL18BP,
   MAN1A2, CCL25, ICAM2, LGALS4, VSIG2 and ABO. Pathway enrichment
   analysis suggests that the identified non-linear proteins are involved
   in endothelium development, fluid shear stress and atherosclerosis
   pathways. In summary, we develop a novel non-linear PWAS analysis
   pipeline, and identify 43 non-linear proteins with venous
   thromboembolism, highlighting the importance of incorporating
   non-linear analysis in PWAS.

   Subject terms: Predictive markers, Proteome informatics, Haematological
   diseases, Diagnostic markers, Cardiovascular diseases
     __________________________________________________________________

   Venous thromboembolism is a serious condition influenced by genetics,
   but current protein-based analyses often miss complex patterns. Here,
   the authors show that a new non-linear analysis method reveals 43
   associated proteins, including eight missed by traditional models,
   offering deeper insight into disease mechanisms.

Introduction

   Venous thromboembolism (VTE) is a cardiovascular disease caused by an
   imbalance in the regulation of hemostasis, leading to the pathologic
   coagulation of blood and the formation of vascular thrombosis.
   According to the location of the thrombus, VTE can be classified as
   deep vein thrombosis (DVT), which occurs in the deep veins (e.g., legs
   and trunk), or as pulmonary embolism (PE), when a thrombus obstructs
   the pulmonary arteries^[42]1. The annual incidence of VTE in the
   general population is approximately 100 cases per 100,000 individuals
   (1‰)^[43]2, and more than 600,000 individuals in the United States and
   Europe die due to VTE each year^[44]3.

   VTE is a complex disease influenced by both environmental and genetic
   factors. Twin studies have estimated the heritability of VTE to be
   approximately 50%, suggesting that a large proportion of VTE risk is
   genetically driven^[45]4,[46]5. Large-scale genome-wide association
   studies (GWAS) have identified hundreds of loci associated with
   VTE^[47]6–[48]9. However, most of the susceptibility loci identified by
   GWAS are located in non-coding regions or span multiple genes, which
   limits their ability to reflect the causal roles of genes and proteins.
   As key molecular regulators, proteins have become the primary targets
   in most drug development efforts^[49]10. Therefore, protein-wide
   association studies (PWAS) are now widely performed to explore how
   proteins are involved in the occurrence and progression of
   diseases^[50]11. Previous PWAS have identified several proteins,
   including F2, F11, ABO and PROC, as potential diagnostic and
   therapeutic targets for VTE using the FUSION pipeline followed by
   causal inference^[51]12, SERPINE1, VWF, EPHB4, etc., as causally
   associated proteins with VTE through prospective designs and Mendelian
   randomization^[52]13, and CFHR5 as a VTE associated biomarker based on
   case-control designs^[53]14.

   Currently, almost all PWAS studies are designed based on the assumption
   of linearity, in which proteins are assumed to affect diseases only in
   a linear pattern^[54]12–[55]14. However, many studies have indicated
   that the relationships between proteins and diseases may follow
   non-linear patterns. For example, a cohort study showed that women with
   moderate levels of the protein TFF3 had the highest risk of recurrent
   ovarian cancer compared to women with either low or high levels^[56]15.
   Another two review studies suggested that the protein p53 functions as
   a “cellular rheostat” in cancer, favoring antioxidant functions at low
   stress levels while switching to a prooxidant role under high
   stress^[57]16,[58]17. The similar non-linear associations were also
   reported between tau protein and neurodegenerative
   diseases^[59]18–[60]20.

   Therefore, to identify the dose-response relationships between proteins
   and diseases, non-linear patterns should be considered in PWAS. In this
   study, we proposed a novel non-linear PWAS analysis pipeline based on
   genetically predicted protein levels. We then applied this pipeline to
   identify the 43 proteins non-linearly associated with VTE in the UK
   biobank (UKB). Eight of these proteins were replicated with similar
   dose-response curves in the prospective cohort of the UK Biobank Pharma
   Proteomics Project (UKB-PPP). Finally, we performed pathway enrichment
   analysis to elucidate the potential non-linear biological pathway for
   VTE.

Results

Overview of study design

   The overview of the study design is illustrated in Fig. [61]1. We first
   genetically predicted protein levels of 378,474 Europeans in the UKB
   using the prediction models developed by Xu et al.^[62]21. To validate
   the predicted proteins, we calculated the Pearson correlation
   coefficient (r) between the predicted protein levels and directly
   measured protein levels in the UKB-PPP^[63]22, and only proteins with
   R^2 > 0.01^[64]21,[65]23 were retained for subsequent analyses
   (Fig. [66]1A).

Fig. 1. Overview of study design.

   [67]Fig. 1
   [68]Open in a new tab

   A Predict the protein levels of European individuals in UKB using the
   weights of pQTL from INTERVAL study. Predicted proteins with validated
   R^2 > 0.01 in UKB-PPP were retained for the subsequent analyses. B Two
   patterns of the non-linear associations between protein levels and VTE.
   Pattern I: The protein has non-linear trends with disease (U-shape or
   L-shape, the blue curve), while the linear trend can be also fitted
   (the red line) because of the misleading statistical significance.
   Pattern II: the proteins have non-linear trends, while the linear trend
   is not significant due to the weak test power. The purple triangle
   represents the reference OR value. C Linear and non-linear PWAS with
   VTE by the logistic regression and RCS model using predicted proteins,
   respectively. The gray area of the Venn diagram represents Patterns I
   and II. D Cohort replication and pathway enrichment for proteins with
   non-linear trends, created in BioRender. Kong, Y. (2025)
   [69]https://BioRender.com/j9ioqmi. Figure 1 is a schematic diagram and
   is not derived from real data.

   Based on statistical theory, applying linear regression to variables
   that exhibit non-linear relationships may lead to reduced test power,
   or even misleading statistical significance^[70]24,[71]25. Therefore,
   we considered two patterns for non-linear relationships between protein
   levels and the disease. In the first pattern, protein levels indeed
   have non-linear trends with the disease (e.g., U-shape or L-shape), but
   a misleadingly significant linear association may still be observed
   (Pattern I in Fig. [72]1B). In the second pattern, the relationship is
   truly non-linear, while the linear association appears non-significant
   due to reduced test power (Pattern II in Fig. [73]1B). To identify
   these two patterns, we conducted both linear and non-linear PWAS of the
   disease using the genetically predicted proteins, employing logistic
   regression for the linear analysis and restricted cubic splines (RCS)
   for the non-linear analysis. The two patterns can be distinguished by
   combining results of linear and non-linear PWAS (Fig. [74]1C). Finally,
   we extracted prospective cohort data in UKB-PPP to replicate the
   proteins identified by the non-linear PWAS. We also performed gene-set
   enrichment analyses on these non-linear proteins, to explain the
   potential biological pathways for the non-linear patterns
   (Fig. [75]1D).

Predict and validate protein levels in UKB

   We first extracted 15,456,785 genetic variants from 378,474 Europeans
   in the UKB, including 22,103 individuals with a history of VTE and
   356,371 controls after quality control (Table [76]S1, details shown in
   Methods section). Based on prediction models constructed by Xu et
   al.^[77]21, we genetically predicted 308 Olink proteins and 2384
   SomaScan proteins for these 378,474 Europeans. After quality control,
   measured levels of 295 Olink proteins and 944 SomaScan proteins were
   also available for 31,013 Europeans in the UKB-PPP (Table [78]S1,
   details shown in Methods section). We then calculated Pearson
   correlation coefficients between these overlapped genetically predicted
   and measured protein levels in the UKB-PPP. Prediction performance was
   shown in Fig. [79]S1. For Olink proteins, 237 had an R^2 greater than
   0.01, among which 97 had R^2 > 0.1 and 5 had R^2 > 0.5. For SomaScan
   proteins, 408 had an R^2 greater than 0.01, among which 151 had
   R^2 > 0.1 and 9 had R^2 > 0.5. In total, 537 unique well predicted
   proteins (R^2 > 0.01), including 237 Olink proteins (Supplementary
   Data [80]1) and 408 SomaScan proteins (Supplementary Data [81]2), were
   retained for subsequent analysis. It should be noted that multiple
   aptamers may target the same SomaScan protein, so that a SomaScan
   protein may have more than one measurement^[82]26. We listed these
   special SomaScan proteins in Supplementary Data [83]2, and
   distinguished them by superscripts of asterisk and circumflex in the
   study.

Linear PWAS identified 64 proteins for VTE

   There were 64 unique proteins whose predicted levels were linearly
   associated with VTE at a significance level of P < 9.3 × 10^−5
   (0.05/537) using Bonferroni adjustment, including 28 Olink proteins and
   48 SomaScan proteins (Fig. [84]2A), of which 12 proteins (red labels in
   Fig. [85]2A) were significant in both Olink and SomaScan platforms,
   such as SELE, LIFR and VWF. We extracted the Z-statistics of these 12
   proteins from each platform (Fig. [86]2B). The scatter plot showed a
   good concordance in linear PWAS results between the Olink and SomaScan
   platforms, with a Pearson correlation coefficient being 0.976
   (P = 1.2 × 10^−8). Previous linear PWAS based on GWAS and pQTL summary
   statistics for 257–1348 proteins have identified 21 proteins casually
   associated with VTE^[87]12–[88]14, among which nine proteins were
   validated in our study. The effect directions of these nine proteins
   were also consistent with our linear PWAS results (risk proteins:
   KLKB1, GP6, F11, ABO and VWF; protective proteins: THBS2, EFEMP1, EPHB4
   and PROC)^[89]12,[90]13. Notably, several proteins previously reported
   to be strongly associated with VTE, such as F2, F5 and SERPINC1, were
   not validated in our study. This is because we performed PWAS based on
   genetically predicted protein levels, and these specific proteins were
   not well predicted and thus excluded in the external validation.

Fig. 2. Linear and non-linear PWAS results for VTE.

   [91]Fig. 2
   [92]Open in a new tab

   A Miami plot for linear PWAS from Olink platform (top) and SomaScan
   platform (bottom). The x axis is the chromosomal location of proteins,
   and the y axis denotes −log10 P values from two-sided Wald tests in the
   logistic regression. The red labels mark the significant proteins
   present in both Olink and SomaScan platforms, while the gray labels
   mark the significant proteins present in either Olink or SomaScan
   platforms. The dashed line is the significant threshold at 9.3 × 10^−5
   after Bonferroni adjustment. B Scatter plot for linear PWAS. The x and
   y axes are Z statistics of linear PWAS from SomaScan and Olink
   platforms, respectively. The Pearson correlation of scatters is shown
   in the lower right corner. C Miami plot for non-linear PWAS. The y axis
   denotes −log10 P values from two-sided overall significance test in the
   RCS model. The blue labels mark the significant proteins present in
   both Olink and SomaScan platforms, while the gray labels mark the
   significant proteins present in either Olink or SomaScan platforms. The
   purple dots represent the proteins that passed the two-sided
   non-linearity test. For clear display, y axis scales in (A) and (C) are
   compressed by the square root. D Scatter plot for non-linear PWAS. The
   x and y axes are
   [MATH: <msup><mrow><mi>χ</mi></mrow><mrow><mn>2</mn></mrow></msup>
   :MATH]
   statistics of significance test in non-linear PWAS from SomaScan and
   Olink platforms, respectively. Source data are provided as a [93]Source
   Data file.

Non-linear PWAS identified 43 proteins for VTE

   Our non-linear PWAS using the RCS model identified 34 Olink proteins
   and 44 SomaScan proteins associated with VTE (P < 9.3 × 10^−5 for the
   overall significance test, Fig. [94]2C). Among them, 23 Olink proteins
   and 27 SomaScan proteins showed significant non-linear associations
   (P < 0.05 for the non-linearity test, purple dots in the Fig. [95]2C,
   the set of significance levels were shown in the Methods section). The
   list of these 23 Olink proteins and 27 SomaScan proteins with
   significant non-linear trends were presented in Table [96]1. After
   removing duplicate proteins across Olink and SomaScan platforms, 43
   unique proteins were indicated to have non-linear associations with
   VTE, 26 of which were not reported by previous study, such as TEK,
   CCL25 and VSIG2. 13 proteins were significant in both Olink and
   SomaScan platforms, and their P values of the overall significance test
   were highly correlated, with a Pearson correlation coefficient being
   0.989 (P = 1.4 × 10^−10) (Fig. [97]2D). The CVs of these protein
   measurements in Olink and SomaScan were listed in Supplementary
   Data [98]3^[99]22.

Table 1.

   The 43 unique proteins that have significant non-linear associations
   with VTE
   Protein^a Position^b Linear^c Non-linear^d Platform Pattern Known
   loci^e
   OR (95% CI) P P[RCS] P[Non–linear]
   TIE1 chr1:43300982–43323108 1.12 (1.09–1.15) 4.76 × 10^−56
   8.42 × 10^−60 3.48 × 10^−6 SomaScan I Not reported
   F3 chr1:94529173–94541759 0.97 (0.95–1.00) 9.01 × 10^−5 9.51 × 10^−13
   4.55 × 10^−10 Olink I 28827327
   MAN1A2 chr1:117367449–117528872 1.09 (1.06–1.12) 1.74 × 10^−38
   3.65 × 10^−50 7.47 × 10^−13 SomaScan I 36658437
   PEAR1 chr1:156893698–156916429 0.97 (0.94–1.00) 5.11 × 10^−5
   3.61 × 10^−20 3.93 × 10^−18 SomaScan I 28002340
   SELE chr1:169722640–169764705 0.85 (0.83–0.87) 1.10 × 10^−121
   3.54 × 10^−134 1.43 × 10^−14 Olink I 21980494
   0.86 (0.83–0.88) 2.47 × 10^−108 1.90 × 10^−120 1.88 × 10^−13 SomaScan
   LGALS8 chr1:236518000–236552981 0.96 (0.93–0.98) 1.12 × 10^−10
   1.07 × 10^−17 1.98 × 10^−10 Olink I Not reported
   MERTK chr2:111898607–112029561 0.96 (0.93–0.98) 9.66 × 10^−11
   1.37 × 10^−10 4.87 × 10^−2 Olink I Not reported
   TFPI chr2:187464230–187565760 0.93 (0.90–0.95) 1.16 × 10^−27
   1.37 × 10^−26 3.26 × 10^−2 SomaScan I 36154123
   ALPI chr2:232456125–232460753 0.91 (0.89–0.94) 3.85 × 10^−38
   6.19 × 10^−40 1.37 × 10^−3 SomaScan I Not reported
   FAM3D chr3:58633946–58666834 1.09 (1.06–1.12) 7.94 × 10^−32
   9.76 × 10^−32 2.52 × 10^−2 SomaScan I Not reported
   KDR chr4:55078481–55125595 0.89 (0.87–0.92) 1.03 × 10^−60 3.42 × 10^−82
   3.01 × 10^−25 SomaScan I Not reported
   F11 chr4:186266189–186289681 1.11 (1.08–1.14) 4.35 × 10^−52
   4.04 × 10^−50 2.64 × 10^−2 SomaScan I 31420334
   LIFR chr5:38474668–38608354 0.92 (0.90–0.95) 2.68 × 10^−32
   2.32 × 10^−123 5.22 × 10^−97 Olink I 36658437
   0.86 (0.83–0.88) 7.21 × 10^−95 4.89 × 10^−121 7.77 × 10^−28 SomaScan
   FLT4 chr5:180601506–180649624 0.94 (0.91–0.96) 3.03 × 10^−20
   2.91 × 10^−20 2.10 × 10^−2 SomaScan I 31765479
   ULBP2 chr6:149942014–149949235 1.05 (1.02–1.08) 9.98 × 10^−14
   1.03 × 10^−16 5.77 × 10^−5 Olink I Not reported
   PDGFA chr7:497258–520296 1.04 (1.01–1.07) 4.42 × 10^−9 9.27 × 10^−9
   4.37 × 10^−2 Olink I Not reported
   1.05 (1.02–1.08) 9.07 × 10^−10 2.17 × 10^−9 3.92 × 10^−2 SomaScan
   EPHB4 chr7:100802565–100827523 0.92 (0.90–0.95) 2.65 × 10^−33
   1.78 × 10^−128 1.19 × 10^−100 Olink I 35285134
   ANGPT1 chr8:107249482–107498055 1.04 (1.01–1.07) 2.76 × 10^−9
   8.51 × 10^−10 5.29 × 10^−3 SomaScan I 36154123
   ABO chr9:133233278–133276024 1.15 (1.12–1.18) 7.90 × 10^−91
   9.34 × 10^−110 4.29 × 10^−23 SomaScan I 28373160
   CBLIF chr11:59829273–59845499 0.95 (0.93–0.98) 1.63 × 10^−11
   2.88 × 10^−12 6.80 × 10^−3 Olink I 35285134
   IL18BP chr11:71998613–72007315 1.07 (1.04–1.10) 8.98 × 10^−20
   4.22 × 10^−22 2.48 × 10^−4 Olink I Not reported
   RELT^ chr11:73376399–73397474 0.97 (0.95–1.00) 5.47 × 10^−5
   6.54 × 10^−6 5.85 × 10^−3 SomaScan I Not reported
   ESAM* chr11:124752583–124762290 1.03 (1.00–1.06) 8.73 × 10^−6
   4.80 × 10^−13 4.16 × 10^−9 SomaScan I Not reported
   ESAM^ 1.03 (1.01–1.06) 2.08 × 10^−6 1.12 × 10^−13 3.72 × 10^−9
   VWF chr12:5948877–6124770 1.19 (1.16–1.22) 1.58 × 10^−134
   4.67 × 10^−154 3.36 × 10^−24 Olink I 36658437
   1.17 (1.14–1.20) 7.78 × 10^−111 2.20 × 10^−143 1.15 × 10^−36 SomaScan
   ISLR2* chr15:74100311–74138540 0.95 (0.92–0.97) 1.12 × 10^−14
   1.94 × 10^−54 3.53 × 10^−43 SomaScan I Not reported
   ISLR2^ 0.96 (0.93–0.98) 4.42 × 10^−10 2.05 × 10^−40 1.81 × 10^−33
   CDH5 chr16:66366622–66404784 0.95 (0.93–0.98) 1.04 × 10^−12
   1.59 × 10^−71 6.12 × 10^−63 Olink I Not reported
   0.95 (0.93–0.98) 2.65 × 10^−11 4.15 × 10^−66 6.43 × 10^−59 SomaScan
   LGALS9 chr17:27629798–27649560 0.96 (0.94–0.99) 3.29 × 10^−8
   7.06 × 10^−8 4.06 × 10^−2 Olink I Not reported
   ICAM2 chr17:64002594–64020634 0.91 (0.88–0.93) 4.12 × 10^−45
   6.85 × 10^−59 1.03 × 10^−15 Olink I Not reported
   0.87 (0.85–0.90) 1.67 × 10^−81 8.69 × 10^−82 1.12 × 10^−2 SomaScan
   PECAM1 chr17:64319415–64413776 0.92 (0.89–0.94) 2.29 × 10^−36
   8.32 × 10^−121 2.20 × 10^−91 Olink I Not reported
   CD209 chr19:7739993–7747564 1.17 (1.14–1.20) 5.21 × 10^−114
   7.77 × 10^−136 1.21 × 10^−25 SomaScan I 35285134
   ICAM1 chr19:10271093–10286615 0.89 (0.86–0.92) 1.78 × 10^−56
   5.90 × 10^−145 4.39 × 10^−94 SomaScan I 31676865
   LGALS4 chr19:38801671–38812945 0.85 (0.82–0.87) 5.79 × 10^−127
   5.93 × 10^−151 9.80 × 10^−28 Olink I Not reported
   OXT chr20:3071620–3072517 1.04 (1.01–1.06) 3.16 × 10^−7 2.82 × 10^−7
   2.59 × 10^−2 SomaScan I Not reported
   THBD chr20:23045633–23049672 1.05 (1.02–1.08) 1.96 × 10^−11
   1.30 × 10^−11 5.99 × 10^−3 Olink I 26888256
   FAM3B* chr21:41304212–41357727 1.04 (1.01–1.07) 2.13 × 10^−8
   1.02 × 10^−11 9.76 × 10^−6 SomaScan I Not reported
   FAM3B^ 1.04 (1.01–1.07) 1.64 × 10^−9 5.05 × 10^−17 4.27 × 10^−10
   CTRC chr1:15438442–15449242 1.03 (1.00–1.06) 1.08 × 10^−4 1.14 × 10^−12
   4.56 × 10^−9 Olink II Not reported
   PROC chr2:127418427–127429242 0.99 (0.96–1.01) 4.76 × 10^−2
   4.62 × 10^−10 4.99 × 10^−10 SomaScan II 36658437
   GFRAL chr6:55327469–55402493 1.03 (1.00–1.06) 1.65 × 10^−4 5.29 × 10^−8
   5.23 × 10^−5 SomaScan II Not reported
   TEK chr9:27109141–27230174 1.00 (0.97–1.02) 5.43 × 10^−1 9.36 × 10^−16
   1.68 × 10^−16 Olink II Not reported
   1.01 (0.98–1.03) 3.22 × 10^−1 3.61 × 10^−17 1.07 × 10^−17 SomaScan
   DKK1 chr10:52314281–52318042 1.02 (0.99–1.05) 2.53 × 10^−3 9.65 × 10^−6
   1.87 × 10^−4 Olink II Not reported
   VSIG2 chr11:124747474–124752255 0.99 (0.97–1.02) 4.31 × 10^−1
   2.52 × 10^−8 6.83 × 10^−9 Olink II Not reported
   CCL25 chr19:8052318–8062660 0.98 (0.95–1.00) 6.73 × 10^−4 5.25 × 10^−6
   5.38 × 10^−4 Olink II Not reported
   LDLR chr19:11089462–11133820 1.00 (0.97–1.02) 6.33 × 10^−1 2.51 × 10^−5
   6.69 × 10^−6 Olink II Not reported
   [100]Open in a new tab

   ^aThe superscript symbols ‘*’ and ‘^’ are only used to distinguish
   SomaScan proteins with multiple aptamers. Details on the corresponding
   aptamers can be found in Supplementary Data [101]2.

   ^bThe position of protein-coding genes, GRCh38 assembly.

   ^cThe linear result obtained by two-sided Wald tests in the logistic
   regression. The Bonferroni significance is 9.3 × 10^−5.

   ^d
   [MATH:
   <msub><mrow><mi>P</mi></mrow><mrow><mi>R</mi><mi>C</mi><mi>S</mi></mrow
   ></msub> :MATH]
   is the P value of the two-sided overall significance test, and
   [MATH:
   <msub><mrow><mi>P</mi></mrow><mrow><mi>N</mi><mi>o</mi><mi>n</mi><mo>−<
   /mo><mi>l</mi><mi>i</mi><mi>n</mi><mi>e</mi><mi>a</mi><mi>r</mi></mrow>
   </msub> :MATH]
   is the P value of the two-sided non-linearity test in the RCS model.

   ^ePMID for the known loci. Known loci are defined as those with clear
   evidence linking the protein or gene to VTE, or where significant GWAS
   loci for VTE are identified within 1MB upstream or downstream of the
   protein.

Identify two non-linear patterns

   Combining the linear and non-linear PWAS results in Table [102]1, we
   can classify proteins into Patterns I and II. Briefly, the proteins
   with significant
   [MATH:
   <msub><mrow><mi>P</mi></mrow><mrow><mi>L</mi><mi>i</mi><mi>n</mi><mi>e<
   /mi><mi>a</mi><mi>r</mi></mrow></msub> :MATH]
   ,
   [MATH:
   <msub><mrow><mi>P</mi></mrow><mrow><mi>R</mi><mi>C</mi><mi>S</mi></mrow
   ></msub> :MATH]
   and
   [MATH:
   <msub><mrow><mi>P</mi></mrow><mrow><mi>N</mi><mi>o</mi><mi>n</mi><mo>−<
   /mo><mi>l</mi><mi>i</mi><mi>n</mi><mi>e</mi><mi>a</mi><mi>r</mi></mrow>
   </msub> :MATH]
   , such as VWF, CDH5 and F3, were categorized as Pattern I. The proteins
   with significant
   [MATH:
   <msub><mrow><mi>P</mi></mrow><mrow><mi>R</mi><mi>C</mi><mi>S</mi></mrow
   ></msub> :MATH]
   and
   [MATH:
   <msub><mrow><mi>P</mi></mrow><mrow><mi>N</mi><mi>o</mi><mi>n</mi><mo>−<
   /mo><mi>l</mi><mi>i</mi><mi>n</mi><mi>e</mi><mi>a</mi><mi>r</mi></mrow>
   </msub> :MATH]
   but non-significant
   [MATH:
   <msub><mrow><mi>P</mi></mrow><mrow><mi>L</mi><mi>i</mi><mi>n</mi><mi>e<
   /mi><mi>a</mi><mi>r</mi></mrow></msub> :MATH]
   , such as TEK, CCL25 and VSIG2, were classified as Pattern II. Among
   the VTE-associated proteins, 29 proteins were identified with only
   linear trends, 35 proteins were assigned to Pattern I and eight
   proteins were assigned to Pattern II (Fig. [103]3A). Moreover, seven
   proteins in Pattern II including GFRAL, CTRC, DKK1, TEK, VSIG2, CCL25
   and LDLR were newly identified as being associated with VTE,
   highlighting that our proposed non-linear analysis pipeline can uncover
   additional associations that may be missed under traditional linear
   assumption. Specifically, both the Olink and SomaScan platforms
   identified proteins ICAM2, CDH5, LIFR, PDGFA, SELE and VWF as Pattern
   I, and TEK as Pattern II (Fig. [104]3B). This demonstrates that the two
   non-linear patterns remain robust across results derived from different
   proteomic platforms.

Fig. 3. Venn diagram.

   [105]Fig. 3
   [106]Open in a new tab

   A Venn diagram of significant proteins in linear and non-linear PWAS.
   The red numbers denote the number of elements in each region. B Venn
   diagram of significant proteins in linear (SomaScan), linear (Olink),
   non-linear (Olink) and non-linear (SomaScan) PWAS. Source data are
   provided as a [107]Source Data file.

Dose-response relationship between protein levels and VTE risk

   The dose-response curves between the levels of 43 non-linear protein
   and VTE risk, as identified by our PWAS pipeline, were displayed in
   Fig. [108]4A and Figs. [109]S3–[110]S5, with eight proteins
   successfully replicated in the subsequent cohort analysis
   (Fig. [111]4A). The dose-response curves of proteins ULBP2, IL18BP and
   MAN1A2 were J-shaped, characterized by a flat or slowly increasing
   trend at low doses, followed by a rapid increase after exceeding a
   threshold. The curve of protein CCL25 exhibited an L-shape, with a
   sharp decrease at low doses that leveled off or declined more gradually
   beyond a threshold. The curves of proteins ICAM2, VSIG2 and LGALS4
   followed a U-shape, with risk decreasing at low doses, increasing at
   high doses, and reaching the lowest risk at a moderate dose. The
   dose-response curve of protein ABO was more complex, remaining
   relatively flat at both low and high doses while increasing at moderate
   doses.

Fig. 4. Dose-response curves between seven replicated proteins and VTE in
PWAS and cohort analysis.

   [112]Fig. 4
   [113]Open in a new tab

   A Dose-response curves of proteins ULBP2, IL18BP, MAN1A2, CCL25, ICAM2,
   LGALS4, VSIG2 and ABO in PWAS. B Corresponding dose-response curves in
   cohort replication. The x axis is the predicted (measured) protein
   levels after rank-based inverse normal transformation, the y axis on
   the right denotes the frequency of their histogram, and the y axis on
   the left is OR (HR) of VTE risk estimated by logistic regression (cox
   regression) and RCS model. The red curve represents linear fit and 95%
   CI. The blue curve represents non-linear fit and 95% CI. The central
   measure of these curves represents the predicted mean values generated
   by the linear (non-linear) model. The histogram represents the
   distribution of the predicted (measured) protein levels. The dashed
   line is the reference line of OR (HR).

   Based on the statistics of linear and non-linear PWAS, ULBP2, IL18BP,
   MAN1A2, ICAM2, LGALS4 and ABO were classified as Pattern I, and CCL25
   and VSIG2 belonged to Pattern II. Although some proteins displayed
   similar curve shapes, they may belong to different patterns (e.g.,
   ICAM2, LGALS4 and VSIG2), which may be due to the imbalanced
   distribution in their non-linear components. We noted that the
   threshold points for Pattern I often distant from the normalized mean
   0, resulting in sparse data in one portion of the curve. For example,
   ICAM2 and LGALS4 had thresholds around 1 SD, leading to sparse
   distributions in the high-dose range. When conducting linear
   regression, an imbalanced distribution may reduce the weight of either
   non-linear component in the ordinary least square estimation, leading
   to misleading statistical significance. In this case, although the
   linear PWAS can still detect these significant proteins, the non-linear
   association between proteins and VTE would be ignored. In contrast to
   Pattern I, threshold points in Pattern II were typically near the mean
   0 (e.g., VSIG2). In this case, linear models lack power to detect
   associations, causing them to be overlooked in traditional linear PWAS.

Prospective cohort study replicates 8 proteins in 43 non-linear proteins

   Among the 43 non-linear proteins identified by our PWAS pipeline, 18
   proteins can be replicated at a nominal significance level in the
   prospective cohort analysis, and 14 proteins remained significant after
   Bonferroni adjustment (0.05/43 = 1.1 × 10^−3). Among these, eight
   proteins exhibited consistent dose-response curves in both the PWAS and
   cohort analyses. The dose-response cures of these eight proteins were
   shown in Fig. [114]4B. The non-linear trends of proteins ULBP2, IL18BP,
   MANIA2, ICAM2 and ABO can be perfectly replicated in the prospective
   cohort. For the remaining three proteins CCL25, VSIG2 and LGALS4, some
   minor discrepancies were observed in the curve shapes, but these
   differences are reasonable given the substantially smaller sample size
   of the prospective cohort. The dose-response curves of the other 35
   non-linear proteins analyzed in the cohort were shown in
   Figs. [115]S6–[116]S8.

Enrichment analysis of VTE associated proteins

   Gene-set enrichment analyses were conducted to further elucidate the
   biological pathways of the identified non-linear proteins. As a
   benchmark, we first performed enrichment analyses on the 64 proteins
   with linear trends. The top three enriched GO pathways
   (Fig. [117]5ALinear) were Enzyme-linked receptor protein signaling
   pathway (P[adjust] = 8.8 × 10^−5, GO:0007167), Regulation of body fluid
   levels (P[adjust] = 0.006, GO:0050878) and Blood coagulation
   (P[adjust] = 0.006, GO:0007596) for biological process (BP); Cell
   periphery (P[adjust] = 0.005, GO:0071944), Membrane (P[adjust] = 0.002,
   GO:0016020) and Plasma membrane (P[adjust] = 0.002, GO:0005886) for
   cellular component (CC). The top three enriched KEGG pathways
   (Fig. [118]5BLinear) included Focal adhesion (P[adjust] = 0.001, path:
   hsa04510), Rap1 signaling pathway (P[adjust] = 0.001, path: hsa04015)
   and Ras signaling pathway (P[adjust] = 0.009, path: hsa04014). In
   contrast, for the 43 proteins with non-linear trends, enriched GO
   pathways (Fig. [119]5ANon-linear) only included endothelium development
   (P[adjust] = 0.031, GO:0003158), and KEGG pathways
   (Fig. [120]5BNon-linear) only included fluid shear stress and
   atherosclerosis (P[adjust] = 0.043, path:hsa04510).

Fig. 5. Gene-set enrichment analyses for VTE associated proteins.

   [121]Fig. 5
   [122]Open in a new tab

   a Dot plot shows the top five BP, CC and MF terms of GO function for 64
   proteins with linear trends and 43 proteins with non-linear trends on
   VTE. b Dot plot shows the top ten enriched KEGG pathway terms for 64
   proteins with linear trends and 43 proteins with non-linear trends on
   VTE. Dot color represents the statistical significance of enrichment
   analysis based on P values after FDR adjustment, while dot size
   represents the count of proteins annotated to each term. Source data
   are provided as a [123]Source Data file.

Discussion

   In this study, we proposed a novel non-linear PWAS analysis pipeline.
   This flexible pipeline does not rely on large-scale individual-level
   proteomic data, and can borrow public resources to complete the
   analysis. We further suggested two non-linear association patterns,
   highlighting the potential limitations for directly using linear PWAS
   when non-linear relationships exist. We recommended that our non-linear
   pipeline could be used as a complementary approach to linear PWAS, to
   reduce the risk of omission and misleadingness. Applying this pipeline
   to VTE, we identified 43 proteins with non-linear associations to VTE
   risk, including 26 unreported findings. Among them, eight proteins can
   be replicated in the prospective cohort analysis, further supporting
   the robustness of our findings. These non-linear associations provided
   novel insights into the development of VTE.

   Among these replicated proteins, ULBP2 primarily binds to NKG2D, a
   receptor on the surface of NK cells, and regulates immune response. Its
   aberrant expression may lead to an inflammatory response and immune
   imbalance, which promotes thrombus formation^[124]27. IL18BP is a
   modulator of the proinflammatory cytokine IL-18, which promotes
   leukocyte and platelet recruitment, endothelial dysfunction and
   coagulation triggering. IL18BP may affect VTE by mediating IL-18 in the
   vasculature^[125]28,[126]29. MAN1A2 regulates the glycosylation
   process, and abnormalities in glycosylation can enhance adhesion on
   specific cell surfaces, particularly endothelial cells, thereby
   increasing the risk of VTE^[127]30. CCL25 can attract T lymphocytes
   expressing its receptor CCR9 to sites of inflammation, including
   vascular endothelium, and may contribute to enhanced recruitment of
   immune cells to the vessel wall, promoting local inflammation and
   thrombus formation^[128]31. ICAM2 plays a crucial role in mediating
   cell-cell adhesion processes. Endothelial ICAM2 mediates angiogenesis
   and the lack of ICAM2 expression results in impaired angiogenesis both
   in vitro and in vivo, increasing the risk of thrombosis^[129]32.
   Meanwhile, overexpression of ICAM2 also promotes thrombosis by inducing
   platelet and leukocyte adhesion to vascular endothelial cells^[130]33.
   VSIG2 primarily implicated in immune modulation and cell-cell
   interactions, could impact the initiation of VTE by affecting the
   inflammatory response associated with endothelial dysfunction^[131]34.
   LGALS4, also known as Galectin-4, has been reported to regulate
   endothelial cell adhesion molecule expression, inflammation response
   and leukocyte recruitment, which are critical events in the initiation
   of VTE^[132]35,[133]36. ABO protein levels determine an individual’s
   ABO blood group, which non-linearly affects the risk of VTE by
   mediating the levels of VIII and VWF in different blood
   groups^[134]37,[135]38. Among above replicated proteins, the non-linear
   trends of ICAM2 and ABO can be interpreted by existing functional
   annotations, and others are new findings that need more confirmation.

   It should be noted that most of the non-linear associated proteins are
   functionally annotated to cell adhesion and endothelium development,
   which is consistent with our gene-set enrichment results. Endothelial
   cell development is also linked to fluid shear stress, a KEGG pathway
   identified in our analysis. The association between fluid shear stress
   and VTE is generally considered to be non-linear: excessive fluid shear
   stress may damage vascular endothelial cells, triggering an
   inflammatory response and promoting thrombosis, while insufficient
   fluid shear may make vascular endothelial cells more susceptible to
   platelet and leukocyte adhesion, promoting thrombosis^[136]39,[137]40.
   In addition, endothelial cell development influences multiple processes
   associated with thrombosis (e.g., release of coagulation factors,
   inflammatory response, and immune response), which increases the
   likelihood that endothelial cell development plays a non-linear role in
   the pathogenesis of VTE^[138]41.

   For the non-linear PWAS analysis pipeline, we did not directly use the
   measured protein levels from UKB-PPP. Instead, we used genetically
   predicted protein levels based on the genetic data from UKB. This
   strategy has two advantages. First, although UKB-PPP includes measured
   protein levels for 54,219 samples, only 28,318 samples can be retained
   after quality control (QC) in our study. In contrast, we can
   genetically predict protein levels for 408,812 Europeans in UKB, and
   even after QC, 378,474 individuals can still be retained in the
   analysis. This approach allows us to utilize a larger sample size for
   subsequent non-linear PWAS compared to directly using measured
   proteins, which is especially valuable for ethnic populations with
   small sample sizes of measured proteins (e.g., African^[139]42, East
   Asian^[140]43 and Arabs^[141]44). Second, measured plasma protein
   levels are influenced by various environmental factors, such as
   diet^[142]45, behavioral habits^[143]22, and the individual physical
   state at the time of measurement. These factors may confound the
   analysis of linear and non-linear relationships between protein levels
   and diseases. Here, we used genetically predicted protein levels, which
   is isolated from environmental confounders. To further validate our
   method, we conducted additional analyses comparing our linear PWAS
   pipeline with the traditional PWAS method (regression using measured
   protein levels, followed by Mendelian randomization to strengthen
   causal inference). The results showed that our linear PWAS pipeline was
   generally consistent with the Mendelian randomization results
   (Supplementary Note [144]3 and Figs. [145]S10 and [146]S11).

   We suggested two non-linear patterns based on our analysis pipeline.
   Pattern I indicates the proteins that indeed have non-linear trends
   with VTE (e.g., U-shape or L-shape), while also showing significant
   linear trends. In this case, linear PWAS would identify these proteins,
   but it might overlook the non-linear relationship underlying them. More
   seriously, the misleading regression coefficients from linear PWAS
   could lead to incorrect conclusions. Pattern II represents the proteins
   with non-linear trends where the linear trend is insignificant. In this
   case, if only linear PWAS is conducted, these proteins would be omitted
   from the analysis. Overall, we recommend our proposed non-linear PWAS
   pipeline as a complement to linear PWAS, to reduce the risk of omission
   and misleadingness in linear PWAS.

   We recommend the following directions to further research the
   non-linear proteins identified in this study. Firstly, such non-linear
   relationships indicate that the effect of proteins on diseases changes
   significantly at a specific threshold. This threshold effect may
   involve complex feedback regulatory loops^[147]46, or networks of
   protein-protein interactions on diseases^[148]47. Studying this
   threshold effect may reveal triggering mechanisms and progression
   patterns of diseases. Secondly, Dose-response curves help determine the
   optimal protein level to reduce disease risks, which can inform the
   appropriate drug dose in the subsequent clinical trials and support the
   development of protein-targeted drugs. Thirdly, known non-linear
   relationships can help to build more accurate predictive models of
   diseases. Researchers can enhance the predictive performance by
   introducing quadratic or interaction terms of known non-linear proteins
   into the model, instead of just assuming all relationships are
   linear^[149]48.

   In defining VTE cases, we did not consider multiple VTE events for a
   single patient, as the number of recurrent VTE events may be
   incompletely recorded in the UKB. Our primary focus was on genetic
   susceptibility to VTE, and once an individual has experienced a VTE
   event, indicating the underlying susceptibility, regardless of the
   number of occurrences. VTE can occur in various clinical settings, such
   as acute disease, trauma, pregnancy, exposure to drugs, surgery,
   cancer, inflammatory diseases and so on. To address this issue, we
   conducted a sensitivity analysis focused solely on unprovoked VTE cases
   driven by genetic factors. The results of the sensitivity analysis
   showed that our PWAS results were minimally affected by these clinical
   factors (Supplementary Note [150]2 and Fig. [151]S9). Considering
   anticoagulant use may also influence incident VTE, we performed
   sensitivity analyses by adjusting for it in the prospective cohort. The
   results showed this factor did not significantly influence incident VTE
   in our study (HR = 0.755, 95% CI: 0.357–1.592, P = 0.439), which may be
   attributed to the limited sample size using anticoagulants and the
   potential misclassification from baseline-only self-reported medication
   data in the UKB.

   There are also several limitations deserve discussion. First, our PWAS
   analysis pipeline was based on genetically predicted protein levels.
   Some important proteins may be eliminated in the validation step, due
   to the limited power of the protein prediction model or the limited
   sample size of the protein reference population. In this study, we
   excluded these biased predicted proteins to minimize the risk of
   misleading associations. Second, in our study, VTE cases were defined
   based on ICD-10 codes, which may introduce non-differential
   misclassification of VTE outcomes when the sensitivity and specificity
   of ICD-10 codes for VTE diagnosis were relatively low. Such
   misclassification could reduce the statistical power of our models and
   potentially lead to true protein associations being classified as
   non-significant. Third, the mechanisms of VTE vary by sex and age
   groups, with oestrogen playing a major role in young women and genetic
   factors being more prominent in young men. Future studies stratifying
   by sex and age groups may provide additional valuable insights into
   VTE. Finally, only eight out of 43 non-linear proteins identified in
   our study were replicated in the prospective cohort analysis, which may
   be attributed to the limited sample size of the prospective replication
   analysis. The non-linear trend of these proteins should be further
   validated in larger populations or through experimental approaches in
   future studies.

Methods

Ethics

   We obtained genotype, protein and phenotype data from the UKB and
   UKB-PPP, which received ethics approval from the North West Centre for
   Research Ethics Committee. The application number of UKB and UKB-PPP in
   our study was 88159.

Quality control in UK Biobank

   The target population in this study is individuals of European ancestry
   in UKB. The UKB cohort is a population-based volunteer longitudinal
   cohort of ~500,000 individuals recruited across the United Kingdom. We
   used the imputed genotypes from UKB. Details of array, sample
   processing and quality control have been described
   elsewhere^[152]49,[153]50. Briefly, we extracted a European ancestry
   subset, including 408,812 samples who self-identified as white British
   and had very similar genetic ancestry based on the principal component
   analysis of the genotypes (Field ID 22006). The variants stratify the
   following criterion were excluded by using PLINK^[154]51: (1) minor
   allele frequency (MAF) < 0.001 (we set a looser criterion than 0.005 of
   the original prediction model^[155]21 here to ensure enough variants
   can be included in our cohort), (2) Hardy–Weinberg equilibrium test P
   value < 1.0 × 10^−12, (3) missing genotype rate >0.05, or (4)
   imputation accuracy score <0.3. We also randomly removed 30,338
   individuals in related pairs with kinship coefficient ≥0.0884 (second
   degree or greater)^[156]52 using PLINK^[157]51. Therefore, 378,474
   individuals of European ancestry and 15,456,785 variants were retained
   for the following study.

VTE definition in UK biobank

   VTE phenotype in UKB (application number 88159) was derived from the
   first occurrence of any code mapped to 3-character ICD-10 (category ID
   1712), which was generated by mapping ICD-10 code in the death register
   records, read code information in the primary care data, ICD-9 and
   ICD-10 codes in the hospital inpatient records (e.g., diagnosis or
   operation code), and self-reported medical conditions reported at the
   baseline or subsequent visit to UKB assessment center. VTE cases were
   defined as having at least one of the following ICD-10 diagnosis code:
   (1) ‘I26’ pulmonary embolism (Field ID 131308), (2) ‘I80’ thrombosis,
   phlebitis and thrombophlebitis (Field ID 131396), and (3) ‘I82’ other
   venous thromboembolism and thrombosis (Field ID 131400)^[158]53. The
   remaining individuals were included in the control group. Covariates
   included sex (Field ID 22001), array batch (Field ID 22000) and top ten
   ancestry principal components. Finally, we retained 378,474 Europeans,
   including 22,103 VTE cases and 356,371 controls.

Protein levels prediction in UK Biobank

   In this study, we directly used the protein prediction models
   constructed by Xu et al.^[159]21, which were based on the INTERVAL
   study^[160]54. For aptamer-based SomaScan assay (v.3), the INTERVAL
   study profiled 2384 plasma proteins for 3175 participants of European
   ancestry^[161]26,[162]55. For the Olink assay, they measured 308 plasma
   protein abundance of around 4822 Europeans using four Olink panels,
   namely inflammation-1 (INF-1), cardiovascular II (CVD-2),
   cardiovascular III (CVD-3) and neurology (NEUR)^[163]56. Xu et al.
   constructed the prediction model for these Olink and SomaScan proteins
   using Bayesian ridge regression^[164]21.

   The existing prediction models have provided pQTLs and their weights
   for each protein. We matched pQTLs in UKB and predicted protein levels
   by calculating the inner product of pQTLs and the corresponding
   weights. Finally, we obtained 308 predicted Olink proteins and 2384
   SomaScan proteins for 22,103 VTE cases and 356,371 controls. The flow
   chart of protein levels prediction was shown in the left panel of
   Fig. [165]S2.

Validation and selection of predicted protein in UKB-PPP

   UKB-PPP is a proteomics profiling data in UKB, which collected blood
   plasma samples from 54,219 UKB participants using the antibody-based
   Olink Explore 3072 proximity extension assay, measuring 2,941 proteins
   (Field ID 30900)^[166]22. After excluding non-random baseline and
   individuals in batch 7 due to its mixed samples from both UKB sets and
   COVID-19 sets^[167]22, 40,502 samples were retained. We matched
   individuals and proteins between the predicted data in UKB and measured
   data in UKB-PPP based on their corresponding ID and protein-coding gene
   names, respectively. Then 31,013 samples, 295 Olink proteins and 944
   SomaScan proteins were matched. The Pearson correlations of these
   matched proteins between predicted data and measured data were
   calculated to reflect the accuracy of our protein predictions. For
   measured protein levels, we further used their residuals after
   adjusting covariates sex (Field ID 22001), baseline age (Field ID
   21022), array batch (Field ID 22000) and top ten ancestry principal
   components (Field ID 22009). Both predicted and measured protein levels
   were rank-based inverse normal transformed before calculating Pearson
   correlations^[168]57. Finally, only matched proteins with R^2 > 0.01
   (i.e., Pearson correlation >0.1)^[169]21,[170]23 can pass the external
   validation and be included in the following PWAS analysis. The flow
   chart of predicted protein validation was shown in the middle panel of
   Fig. [171]S2.

Linear PWAS for VTE

   We performed linear PWAS for the well-predicted proteins and the VTE
   risk in 378,474 Europeans from UKB using the logistic regression. The
   adjusted covariates only included sex (Field ID 22001), array batch
   (Field ID 22000) and top ten ancestry principal components (Field ID
   22009), as further adjustment may induce collider bias in PWAS
   (adjusting downstream variables of the predicted protein can create new
   associations between the predicted protein and unknown environmental
   confounders)^[172]58. We excluded baseline age as covariates in PWAS,
   because the predicted protein levels represent an individual’s lifelong
   average levels. It is unreasonable to use baseline age to adjust the
   associations of lifelong average predicted protein levels with
   diseases. All predicted protein levels were rank-based inverse normal
   transformed before analysis^[173]57. Considering 108 overlapped
   proteins between predicted Olink and SomaScan proteins, there were 537
   unique well-predicted proteins in our study. The unique proteins were
   defined by having different protein-coding gene names. The significance
   level of linear PWAS was set to be 0.05/537 = 9.3 × 10^−5 based on
   Bonferroni adjustment.

Non-linear PWAS using RCS

   We used restricted cubic splines^[174]59 with four knots at the 5th,
   35th, 65th, and 95th centiles to flexibly model the non-linear
   association between well-predicted proteins and VTE risk in 378,474
   Europeans from UKB by R package “rms”. The covariates were the same as
   the above linear PWAS. All predicted protein levels were rank-based
   inverse normal transformed before analysis^[175]57. We tested the
   significance of protein levels by likelihood ratio test comparing the
   model with only covariates, against the model with covariates and
   protein levels. The significance level was set to be 9.3 × 10^−5 after
   Bonferroni adjustment. The non-linearity of protein levels was tested
   by the likelihood ratio test comparing RCS model, against the logistic
   model that considers protein levels as a linear term^[176]60,[177]61.
   Only the proteins that passed the overall significance test would be
   performed for the non-linearity test. So, the significance level of the
   non-linearity test was set to be 0.05. Finally, we plotted the
   dose-response curves of the odds ratio of VTE risk against standardized
   protein levels for each protein.

Prospective cohort replication in UKB-PPP

   To replicate the 43 proteins obtained by non-linear PWAS, we conducted
   Cox regression using the prospective cohort data in UKB-PPP. We
   selected 31,013 matched samples and 43 corresponding measured proteins
   in UKB-PPP. We excluded individuals with VTE before baseline (n = 990)
   and individuals with missing covariates (n = 1705). Then 28,318
   samples, 768 of whom had emerging VTE events during the follow-up
   period, were retained in the cohort analysis (Table [178]S1). Other
   27,550 samples with censored events include: (1) 25,601 samples did not
   suffer from VTE until the end of follow-up, (2) 1882 samples were lost
   to follow-up, and (3) 67 samples died before suffering from VTE. We
   used Cox regression to replicate the linear association between
   measured protein level and risk of VTE, and restricted cubic splines to
   replicate their non-linear associations. The adjusted covariates
   included sex (Field ID 22001), baseline age (Field ID 21022), BMI
   (Field ID 21001), Townsend deprivation index (Field ID 22189), smoke
   status (Field ID 20116), drink status (Field ID 20117), physical status
   (Field ID 6164), systolic blood pressure (Field ID 4080), Oily fish
   intake (Field ID 1329), processed meat intake (Field ID 1349) and top
   ten ancestry principal components (Field ID 22009). Details on the
   selection of covariates can be found in Supplementary Note [179]1. All
   measured protein levels were rank-based inverse normal transformed
   before analysis^[180]57. The significance level of proteins was set to
   be 0.05/43 = 1.1 × 10^−3 after Bonferroni adjustment. Considering the
   sample size of the prospective cohort is much smaller than that of
   predicted data in UKB, we also listed the proteins that can be
   replicated by the nominal significance level (P = 0.05). The
   significance level of the non-linearity test was 0.05 as the non-linear
   PWAS. The flow chart of cohort replication was shown in the right panel
   of Fig. [181]S2.

Pathway enrichment analysis

   We performed Gene Ontology (GO) and Kyoto Encyclopedia of Genes and
   Genomes (KEGG) enrichment analysis to explore the biological pathways
   on 64 proteins with linear trends and 43 proteins with non-linear
   trends, respectively, using the R package “clusterProfiler”^[182]62.
   The background gene of enrichment analysis was set as the list of all
   537 unique well-predicted proteins. The P values were adjusted by false
   discovery rate (FDR) using the Benjamini-Hochberg method.

Reporting summary

   Further information on research design is available in the [183]Nature
   Portfolio Reporting Summary linked to this article.

Supplementary information

   [184]Supplementary information^ (8.7MB, pdf)
   [185]41467_2025_61874_MOESM2_ESM.docx^ (14.8KB, docx)

   Description of Additional Supplementary Files
   [186]Supplementary Data 1-3^ (67.9KB, xlsx)
   [187]Reporting Summary^ (91.2KB, pdf)
   [188]Transparent Peer Review file^ (4.1MB, pdf)

Source data

   [189]Source Data^ (176.3KB, xlsx)

Acknowledgements