Abstract Venous thromboembolism is a life-threatening vascular event with high prevalence and genetic determinants. PWAS has become a popular strategy to identify therapeutic targets of complex diseases. However, the current PWAS model only considers the linear relationship between protein and disease. Here, we propose a novel non-linear PWAS pipeline and identify 43 proteins exhibiting non-linear associations with venous thromboembolism in the UK Biobank, of which eight proteins cannot be captured by linear PWAS. We further conduct prospective cohort replication in the UK Biobank Pharma Proteomics Project, and replicate eight proteins with similar non-linear trends, including ULBP2, IL18BP, MAN1A2, CCL25, ICAM2, LGALS4, VSIG2 and ABO. Pathway enrichment analysis suggests that the identified non-linear proteins are involved in endothelium development, fluid shear stress and atherosclerosis pathways. In summary, we develop a novel non-linear PWAS analysis pipeline, and identify 43 non-linear proteins with venous thromboembolism, highlighting the importance of incorporating non-linear analysis in PWAS. Subject terms: Predictive markers, Proteome informatics, Haematological diseases, Diagnostic markers, Cardiovascular diseases __________________________________________________________________ Venous thromboembolism is a serious condition influenced by genetics, but current protein-based analyses often miss complex patterns. Here, the authors show that a new non-linear analysis method reveals 43 associated proteins, including eight missed by traditional models, offering deeper insight into disease mechanisms. Introduction Venous thromboembolism (VTE) is a cardiovascular disease caused by an imbalance in the regulation of hemostasis, leading to the pathologic coagulation of blood and the formation of vascular thrombosis. According to the location of the thrombus, VTE can be classified as deep vein thrombosis (DVT), which occurs in the deep veins (e.g., legs and trunk), or as pulmonary embolism (PE), when a thrombus obstructs the pulmonary arteries^[42]1. The annual incidence of VTE in the general population is approximately 100 cases per 100,000 individuals (1‰)^[43]2, and more than 600,000 individuals in the United States and Europe die due to VTE each year^[44]3. VTE is a complex disease influenced by both environmental and genetic factors. Twin studies have estimated the heritability of VTE to be approximately 50%, suggesting that a large proportion of VTE risk is genetically driven^[45]4,[46]5. Large-scale genome-wide association studies (GWAS) have identified hundreds of loci associated with VTE^[47]6–[48]9. However, most of the susceptibility loci identified by GWAS are located in non-coding regions or span multiple genes, which limits their ability to reflect the causal roles of genes and proteins. As key molecular regulators, proteins have become the primary targets in most drug development efforts^[49]10. Therefore, protein-wide association studies (PWAS) are now widely performed to explore how proteins are involved in the occurrence and progression of diseases^[50]11. Previous PWAS have identified several proteins, including F2, F11, ABO and PROC, as potential diagnostic and therapeutic targets for VTE using the FUSION pipeline followed by causal inference^[51]12, SERPINE1, VWF, EPHB4, etc., as causally associated proteins with VTE through prospective designs and Mendelian randomization^[52]13, and CFHR5 as a VTE associated biomarker based on case-control designs^[53]14. Currently, almost all PWAS studies are designed based on the assumption of linearity, in which proteins are assumed to affect diseases only in a linear pattern^[54]12–[55]14. However, many studies have indicated that the relationships between proteins and diseases may follow non-linear patterns. For example, a cohort study showed that women with moderate levels of the protein TFF3 had the highest risk of recurrent ovarian cancer compared to women with either low or high levels^[56]15. Another two review studies suggested that the protein p53 functions as a “cellular rheostat” in cancer, favoring antioxidant functions at low stress levels while switching to a prooxidant role under high stress^[57]16,[58]17. The similar non-linear associations were also reported between tau protein and neurodegenerative diseases^[59]18–[60]20. Therefore, to identify the dose-response relationships between proteins and diseases, non-linear patterns should be considered in PWAS. In this study, we proposed a novel non-linear PWAS analysis pipeline based on genetically predicted protein levels. We then applied this pipeline to identify the 43 proteins non-linearly associated with VTE in the UK biobank (UKB). Eight of these proteins were replicated with similar dose-response curves in the prospective cohort of the UK Biobank Pharma Proteomics Project (UKB-PPP). Finally, we performed pathway enrichment analysis to elucidate the potential non-linear biological pathway for VTE. Results Overview of study design The overview of the study design is illustrated in Fig. [61]1. We first genetically predicted protein levels of 378,474 Europeans in the UKB using the prediction models developed by Xu et al.^[62]21. To validate the predicted proteins, we calculated the Pearson correlation coefficient (r) between the predicted protein levels and directly measured protein levels in the UKB-PPP^[63]22, and only proteins with R^2 > 0.01^[64]21,[65]23 were retained for subsequent analyses (Fig. [66]1A). Fig. 1. Overview of study design. [67]Fig. 1 [68]Open in a new tab A Predict the protein levels of European individuals in UKB using the weights of pQTL from INTERVAL study. Predicted proteins with validated R^2 > 0.01 in UKB-PPP were retained for the subsequent analyses. B Two patterns of the non-linear associations between protein levels and VTE. Pattern I: The protein has non-linear trends with disease (U-shape or L-shape, the blue curve), while the linear trend can be also fitted (the red line) because of the misleading statistical significance. Pattern II: the proteins have non-linear trends, while the linear trend is not significant due to the weak test power. The purple triangle represents the reference OR value. C Linear and non-linear PWAS with VTE by the logistic regression and RCS model using predicted proteins, respectively. The gray area of the Venn diagram represents Patterns I and II. D Cohort replication and pathway enrichment for proteins with non-linear trends, created in BioRender. Kong, Y. (2025) [69]https://BioRender.com/j9ioqmi. Figure 1 is a schematic diagram and is not derived from real data. Based on statistical theory, applying linear regression to variables that exhibit non-linear relationships may lead to reduced test power, or even misleading statistical significance^[70]24,[71]25. Therefore, we considered two patterns for non-linear relationships between protein levels and the disease. In the first pattern, protein levels indeed have non-linear trends with the disease (e.g., U-shape or L-shape), but a misleadingly significant linear association may still be observed (Pattern I in Fig. [72]1B). In the second pattern, the relationship is truly non-linear, while the linear association appears non-significant due to reduced test power (Pattern II in Fig. [73]1B). To identify these two patterns, we conducted both linear and non-linear PWAS of the disease using the genetically predicted proteins, employing logistic regression for the linear analysis and restricted cubic splines (RCS) for the non-linear analysis. The two patterns can be distinguished by combining results of linear and non-linear PWAS (Fig. [74]1C). Finally, we extracted prospective cohort data in UKB-PPP to replicate the proteins identified by the non-linear PWAS. We also performed gene-set enrichment analyses on these non-linear proteins, to explain the potential biological pathways for the non-linear patterns (Fig. [75]1D). Predict and validate protein levels in UKB We first extracted 15,456,785 genetic variants from 378,474 Europeans in the UKB, including 22,103 individuals with a history of VTE and 356,371 controls after quality control (Table [76]S1, details shown in Methods section). Based on prediction models constructed by Xu et al.^[77]21, we genetically predicted 308 Olink proteins and 2384 SomaScan proteins for these 378,474 Europeans. After quality control, measured levels of 295 Olink proteins and 944 SomaScan proteins were also available for 31,013 Europeans in the UKB-PPP (Table [78]S1, details shown in Methods section). We then calculated Pearson correlation coefficients between these overlapped genetically predicted and measured protein levels in the UKB-PPP. Prediction performance was shown in Fig. [79]S1. For Olink proteins, 237 had an R^2 greater than 0.01, among which 97 had R^2 > 0.1 and 5 had R^2 > 0.5. For SomaScan proteins, 408 had an R^2 greater than 0.01, among which 151 had R^2 > 0.1 and 9 had R^2 > 0.5. In total, 537 unique well predicted proteins (R^2 > 0.01), including 237 Olink proteins (Supplementary Data [80]1) and 408 SomaScan proteins (Supplementary Data [81]2), were retained for subsequent analysis. It should be noted that multiple aptamers may target the same SomaScan protein, so that a SomaScan protein may have more than one measurement^[82]26. We listed these special SomaScan proteins in Supplementary Data [83]2, and distinguished them by superscripts of asterisk and circumflex in the study. Linear PWAS identified 64 proteins for VTE There were 64 unique proteins whose predicted levels were linearly associated with VTE at a significance level of P < 9.3 × 10^−5 (0.05/537) using Bonferroni adjustment, including 28 Olink proteins and 48 SomaScan proteins (Fig. [84]2A), of which 12 proteins (red labels in Fig. [85]2A) were significant in both Olink and SomaScan platforms, such as SELE, LIFR and VWF. We extracted the Z-statistics of these 12 proteins from each platform (Fig. [86]2B). The scatter plot showed a good concordance in linear PWAS results between the Olink and SomaScan platforms, with a Pearson correlation coefficient being 0.976 (P = 1.2 × 10^−8). Previous linear PWAS based on GWAS and pQTL summary statistics for 257–1348 proteins have identified 21 proteins casually associated with VTE^[87]12–[88]14, among which nine proteins were validated in our study. The effect directions of these nine proteins were also consistent with our linear PWAS results (risk proteins: KLKB1, GP6, F11, ABO and VWF; protective proteins: THBS2, EFEMP1, EPHB4 and PROC)^[89]12,[90]13. Notably, several proteins previously reported to be strongly associated with VTE, such as F2, F5 and SERPINC1, were not validated in our study. This is because we performed PWAS based on genetically predicted protein levels, and these specific proteins were not well predicted and thus excluded in the external validation. Fig. 2. Linear and non-linear PWAS results for VTE. [91]Fig. 2 [92]Open in a new tab A Miami plot for linear PWAS from Olink platform (top) and SomaScan platform (bottom). The x axis is the chromosomal location of proteins, and the y axis denotes −log10 P values from two-sided Wald tests in the logistic regression. The red labels mark the significant proteins present in both Olink and SomaScan platforms, while the gray labels mark the significant proteins present in either Olink or SomaScan platforms. The dashed line is the significant threshold at 9.3 × 10^−5 after Bonferroni adjustment. B Scatter plot for linear PWAS. The x and y axes are Z statistics of linear PWAS from SomaScan and Olink platforms, respectively. The Pearson correlation of scatters is shown in the lower right corner. C Miami plot for non-linear PWAS. The y axis denotes −log10 P values from two-sided overall significance test in the RCS model. The blue labels mark the significant proteins present in both Olink and SomaScan platforms, while the gray labels mark the significant proteins present in either Olink or SomaScan platforms. The purple dots represent the proteins that passed the two-sided non-linearity test. For clear display, y axis scales in (A) and (C) are compressed by the square root. D Scatter plot for non-linear PWAS. The x and y axes are [MATH: χ2 :MATH] statistics of significance test in non-linear PWAS from SomaScan and Olink platforms, respectively. Source data are provided as a [93]Source Data file. Non-linear PWAS identified 43 proteins for VTE Our non-linear PWAS using the RCS model identified 34 Olink proteins and 44 SomaScan proteins associated with VTE (P < 9.3 × 10^−5 for the overall significance test, Fig. [94]2C). Among them, 23 Olink proteins and 27 SomaScan proteins showed significant non-linear associations (P < 0.05 for the non-linearity test, purple dots in the Fig. [95]2C, the set of significance levels were shown in the Methods section). The list of these 23 Olink proteins and 27 SomaScan proteins with significant non-linear trends were presented in Table [96]1. After removing duplicate proteins across Olink and SomaScan platforms, 43 unique proteins were indicated to have non-linear associations with VTE, 26 of which were not reported by previous study, such as TEK, CCL25 and VSIG2. 13 proteins were significant in both Olink and SomaScan platforms, and their P values of the overall significance test were highly correlated, with a Pearson correlation coefficient being 0.989 (P = 1.4 × 10^−10) (Fig. [97]2D). The CVs of these protein measurements in Olink and SomaScan were listed in Supplementary Data [98]3^[99]22. Table 1. The 43 unique proteins that have significant non-linear associations with VTE Protein^a Position^b Linear^c Non-linear^d Platform Pattern Known loci^e OR (95% CI) P P[RCS] P[Non–linear] TIE1 chr1:43300982–43323108 1.12 (1.09–1.15) 4.76 × 10^−56 8.42 × 10^−60 3.48 × 10^−6 SomaScan I Not reported F3 chr1:94529173–94541759 0.97 (0.95–1.00) 9.01 × 10^−5 9.51 × 10^−13 4.55 × 10^−10 Olink I 28827327 MAN1A2 chr1:117367449–117528872 1.09 (1.06–1.12) 1.74 × 10^−38 3.65 × 10^−50 7.47 × 10^−13 SomaScan I 36658437 PEAR1 chr1:156893698–156916429 0.97 (0.94–1.00) 5.11 × 10^−5 3.61 × 10^−20 3.93 × 10^−18 SomaScan I 28002340 SELE chr1:169722640–169764705 0.85 (0.83–0.87) 1.10 × 10^−121 3.54 × 10^−134 1.43 × 10^−14 Olink I 21980494 0.86 (0.83–0.88) 2.47 × 10^−108 1.90 × 10^−120 1.88 × 10^−13 SomaScan LGALS8 chr1:236518000–236552981 0.96 (0.93–0.98) 1.12 × 10^−10 1.07 × 10^−17 1.98 × 10^−10 Olink I Not reported MERTK chr2:111898607–112029561 0.96 (0.93–0.98) 9.66 × 10^−11 1.37 × 10^−10 4.87 × 10^−2 Olink I Not reported TFPI chr2:187464230–187565760 0.93 (0.90–0.95) 1.16 × 10^−27 1.37 × 10^−26 3.26 × 10^−2 SomaScan I 36154123 ALPI chr2:232456125–232460753 0.91 (0.89–0.94) 3.85 × 10^−38 6.19 × 10^−40 1.37 × 10^−3 SomaScan I Not reported FAM3D chr3:58633946–58666834 1.09 (1.06–1.12) 7.94 × 10^−32 9.76 × 10^−32 2.52 × 10^−2 SomaScan I Not reported KDR chr4:55078481–55125595 0.89 (0.87–0.92) 1.03 × 10^−60 3.42 × 10^−82 3.01 × 10^−25 SomaScan I Not reported F11 chr4:186266189–186289681 1.11 (1.08–1.14) 4.35 × 10^−52 4.04 × 10^−50 2.64 × 10^−2 SomaScan I 31420334 LIFR chr5:38474668–38608354 0.92 (0.90–0.95) 2.68 × 10^−32 2.32 × 10^−123 5.22 × 10^−97 Olink I 36658437 0.86 (0.83–0.88) 7.21 × 10^−95 4.89 × 10^−121 7.77 × 10^−28 SomaScan FLT4 chr5:180601506–180649624 0.94 (0.91–0.96) 3.03 × 10^−20 2.91 × 10^−20 2.10 × 10^−2 SomaScan I 31765479 ULBP2 chr6:149942014–149949235 1.05 (1.02–1.08) 9.98 × 10^−14 1.03 × 10^−16 5.77 × 10^−5 Olink I Not reported PDGFA chr7:497258–520296 1.04 (1.01–1.07) 4.42 × 10^−9 9.27 × 10^−9 4.37 × 10^−2 Olink I Not reported 1.05 (1.02–1.08) 9.07 × 10^−10 2.17 × 10^−9 3.92 × 10^−2 SomaScan EPHB4 chr7:100802565–100827523 0.92 (0.90–0.95) 2.65 × 10^−33 1.78 × 10^−128 1.19 × 10^−100 Olink I 35285134 ANGPT1 chr8:107249482–107498055 1.04 (1.01–1.07) 2.76 × 10^−9 8.51 × 10^−10 5.29 × 10^−3 SomaScan I 36154123 ABO chr9:133233278–133276024 1.15 (1.12–1.18) 7.90 × 10^−91 9.34 × 10^−110 4.29 × 10^−23 SomaScan I 28373160 CBLIF chr11:59829273–59845499 0.95 (0.93–0.98) 1.63 × 10^−11 2.88 × 10^−12 6.80 × 10^−3 Olink I 35285134 IL18BP chr11:71998613–72007315 1.07 (1.04–1.10) 8.98 × 10^−20 4.22 × 10^−22 2.48 × 10^−4 Olink I Not reported RELT^ chr11:73376399–73397474 0.97 (0.95–1.00) 5.47 × 10^−5 6.54 × 10^−6 5.85 × 10^−3 SomaScan I Not reported ESAM* chr11:124752583–124762290 1.03 (1.00–1.06) 8.73 × 10^−6 4.80 × 10^−13 4.16 × 10^−9 SomaScan I Not reported ESAM^ 1.03 (1.01–1.06) 2.08 × 10^−6 1.12 × 10^−13 3.72 × 10^−9 VWF chr12:5948877–6124770 1.19 (1.16–1.22) 1.58 × 10^−134 4.67 × 10^−154 3.36 × 10^−24 Olink I 36658437 1.17 (1.14–1.20) 7.78 × 10^−111 2.20 × 10^−143 1.15 × 10^−36 SomaScan ISLR2* chr15:74100311–74138540 0.95 (0.92–0.97) 1.12 × 10^−14 1.94 × 10^−54 3.53 × 10^−43 SomaScan I Not reported ISLR2^ 0.96 (0.93–0.98) 4.42 × 10^−10 2.05 × 10^−40 1.81 × 10^−33 CDH5 chr16:66366622–66404784 0.95 (0.93–0.98) 1.04 × 10^−12 1.59 × 10^−71 6.12 × 10^−63 Olink I Not reported 0.95 (0.93–0.98) 2.65 × 10^−11 4.15 × 10^−66 6.43 × 10^−59 SomaScan LGALS9 chr17:27629798–27649560 0.96 (0.94–0.99) 3.29 × 10^−8 7.06 × 10^−8 4.06 × 10^−2 Olink I Not reported ICAM2 chr17:64002594–64020634 0.91 (0.88–0.93) 4.12 × 10^−45 6.85 × 10^−59 1.03 × 10^−15 Olink I Not reported 0.87 (0.85–0.90) 1.67 × 10^−81 8.69 × 10^−82 1.12 × 10^−2 SomaScan PECAM1 chr17:64319415–64413776 0.92 (0.89–0.94) 2.29 × 10^−36 8.32 × 10^−121 2.20 × 10^−91 Olink I Not reported CD209 chr19:7739993–7747564 1.17 (1.14–1.20) 5.21 × 10^−114 7.77 × 10^−136 1.21 × 10^−25 SomaScan I 35285134 ICAM1 chr19:10271093–10286615 0.89 (0.86–0.92) 1.78 × 10^−56 5.90 × 10^−145 4.39 × 10^−94 SomaScan I 31676865 LGALS4 chr19:38801671–38812945 0.85 (0.82–0.87) 5.79 × 10^−127 5.93 × 10^−151 9.80 × 10^−28 Olink I Not reported OXT chr20:3071620–3072517 1.04 (1.01–1.06) 3.16 × 10^−7 2.82 × 10^−7 2.59 × 10^−2 SomaScan I Not reported THBD chr20:23045633–23049672 1.05 (1.02–1.08) 1.96 × 10^−11 1.30 × 10^−11 5.99 × 10^−3 Olink I 26888256 FAM3B* chr21:41304212–41357727 1.04 (1.01–1.07) 2.13 × 10^−8 1.02 × 10^−11 9.76 × 10^−6 SomaScan I Not reported FAM3B^ 1.04 (1.01–1.07) 1.64 × 10^−9 5.05 × 10^−17 4.27 × 10^−10 CTRC chr1:15438442–15449242 1.03 (1.00–1.06) 1.08 × 10^−4 1.14 × 10^−12 4.56 × 10^−9 Olink II Not reported PROC chr2:127418427–127429242 0.99 (0.96–1.01) 4.76 × 10^−2 4.62 × 10^−10 4.99 × 10^−10 SomaScan II 36658437 GFRAL chr6:55327469–55402493 1.03 (1.00–1.06) 1.65 × 10^−4 5.29 × 10^−8 5.23 × 10^−5 SomaScan II Not reported TEK chr9:27109141–27230174 1.00 (0.97–1.02) 5.43 × 10^−1 9.36 × 10^−16 1.68 × 10^−16 Olink II Not reported 1.01 (0.98–1.03) 3.22 × 10^−1 3.61 × 10^−17 1.07 × 10^−17 SomaScan DKK1 chr10:52314281–52318042 1.02 (0.99–1.05) 2.53 × 10^−3 9.65 × 10^−6 1.87 × 10^−4 Olink II Not reported VSIG2 chr11:124747474–124752255 0.99 (0.97–1.02) 4.31 × 10^−1 2.52 × 10^−8 6.83 × 10^−9 Olink II Not reported CCL25 chr19:8052318–8062660 0.98 (0.95–1.00) 6.73 × 10^−4 5.25 × 10^−6 5.38 × 10^−4 Olink II Not reported LDLR chr19:11089462–11133820 1.00 (0.97–1.02) 6.33 × 10^−1 2.51 × 10^−5 6.69 × 10^−6 Olink II Not reported [100]Open in a new tab ^aThe superscript symbols ‘*’ and ‘^’ are only used to distinguish SomaScan proteins with multiple aptamers. Details on the corresponding aptamers can be found in Supplementary Data [101]2. ^bThe position of protein-coding genes, GRCh38 assembly. ^cThe linear result obtained by two-sided Wald tests in the logistic regression. The Bonferroni significance is 9.3 × 10^−5. ^d [MATH: PRCS :MATH] is the P value of the two-sided overall significance test, and [MATH: PNon−< /mo>linear :MATH] is the P value of the two-sided non-linearity test in the RCS model. ^ePMID for the known loci. Known loci are defined as those with clear evidence linking the protein or gene to VTE, or where significant GWAS loci for VTE are identified within 1MB upstream or downstream of the protein. Identify two non-linear patterns Combining the linear and non-linear PWAS results in Table [102]1, we can classify proteins into Patterns I and II. Briefly, the proteins with significant [MATH: PLine< /mi>ar :MATH] , [MATH: PRCS :MATH] and [MATH: PNon−< /mo>linear :MATH] , such as VWF, CDH5 and F3, were categorized as Pattern I. The proteins with significant [MATH: PRCS :MATH] and [MATH: PNon−< /mo>linear :MATH] but non-significant [MATH: PLine< /mi>ar :MATH] , such as TEK, CCL25 and VSIG2, were classified as Pattern II. Among the VTE-associated proteins, 29 proteins were identified with only linear trends, 35 proteins were assigned to Pattern I and eight proteins were assigned to Pattern II (Fig. [103]3A). Moreover, seven proteins in Pattern II including GFRAL, CTRC, DKK1, TEK, VSIG2, CCL25 and LDLR were newly identified as being associated with VTE, highlighting that our proposed non-linear analysis pipeline can uncover additional associations that may be missed under traditional linear assumption. Specifically, both the Olink and SomaScan platforms identified proteins ICAM2, CDH5, LIFR, PDGFA, SELE and VWF as Pattern I, and TEK as Pattern II (Fig. [104]3B). This demonstrates that the two non-linear patterns remain robust across results derived from different proteomic platforms. Fig. 3. Venn diagram. [105]Fig. 3 [106]Open in a new tab A Venn diagram of significant proteins in linear and non-linear PWAS. The red numbers denote the number of elements in each region. B Venn diagram of significant proteins in linear (SomaScan), linear (Olink), non-linear (Olink) and non-linear (SomaScan) PWAS. Source data are provided as a [107]Source Data file. Dose-response relationship between protein levels and VTE risk The dose-response curves between the levels of 43 non-linear protein and VTE risk, as identified by our PWAS pipeline, were displayed in Fig. [108]4A and Figs. [109]S3–[110]S5, with eight proteins successfully replicated in the subsequent cohort analysis (Fig. [111]4A). The dose-response curves of proteins ULBP2, IL18BP and MAN1A2 were J-shaped, characterized by a flat or slowly increasing trend at low doses, followed by a rapid increase after exceeding a threshold. The curve of protein CCL25 exhibited an L-shape, with a sharp decrease at low doses that leveled off or declined more gradually beyond a threshold. The curves of proteins ICAM2, VSIG2 and LGALS4 followed a U-shape, with risk decreasing at low doses, increasing at high doses, and reaching the lowest risk at a moderate dose. The dose-response curve of protein ABO was more complex, remaining relatively flat at both low and high doses while increasing at moderate doses. Fig. 4. Dose-response curves between seven replicated proteins and VTE in PWAS and cohort analysis. [112]Fig. 4 [113]Open in a new tab A Dose-response curves of proteins ULBP2, IL18BP, MAN1A2, CCL25, ICAM2, LGALS4, VSIG2 and ABO in PWAS. B Corresponding dose-response curves in cohort replication. The x axis is the predicted (measured) protein levels after rank-based inverse normal transformation, the y axis on the right denotes the frequency of their histogram, and the y axis on the left is OR (HR) of VTE risk estimated by logistic regression (cox regression) and RCS model. The red curve represents linear fit and 95% CI. The blue curve represents non-linear fit and 95% CI. The central measure of these curves represents the predicted mean values generated by the linear (non-linear) model. The histogram represents the distribution of the predicted (measured) protein levels. The dashed line is the reference line of OR (HR). Based on the statistics of linear and non-linear PWAS, ULBP2, IL18BP, MAN1A2, ICAM2, LGALS4 and ABO were classified as Pattern I, and CCL25 and VSIG2 belonged to Pattern II. Although some proteins displayed similar curve shapes, they may belong to different patterns (e.g., ICAM2, LGALS4 and VSIG2), which may be due to the imbalanced distribution in their non-linear components. We noted that the threshold points for Pattern I often distant from the normalized mean 0, resulting in sparse data in one portion of the curve. For example, ICAM2 and LGALS4 had thresholds around 1 SD, leading to sparse distributions in the high-dose range. When conducting linear regression, an imbalanced distribution may reduce the weight of either non-linear component in the ordinary least square estimation, leading to misleading statistical significance. In this case, although the linear PWAS can still detect these significant proteins, the non-linear association between proteins and VTE would be ignored. In contrast to Pattern I, threshold points in Pattern II were typically near the mean 0 (e.g., VSIG2). In this case, linear models lack power to detect associations, causing them to be overlooked in traditional linear PWAS. Prospective cohort study replicates 8 proteins in 43 non-linear proteins Among the 43 non-linear proteins identified by our PWAS pipeline, 18 proteins can be replicated at a nominal significance level in the prospective cohort analysis, and 14 proteins remained significant after Bonferroni adjustment (0.05/43 = 1.1 × 10^−3). Among these, eight proteins exhibited consistent dose-response curves in both the PWAS and cohort analyses. The dose-response cures of these eight proteins were shown in Fig. [114]4B. The non-linear trends of proteins ULBP2, IL18BP, MANIA2, ICAM2 and ABO can be perfectly replicated in the prospective cohort. For the remaining three proteins CCL25, VSIG2 and LGALS4, some minor discrepancies were observed in the curve shapes, but these differences are reasonable given the substantially smaller sample size of the prospective cohort. The dose-response curves of the other 35 non-linear proteins analyzed in the cohort were shown in Figs. [115]S6–[116]S8. Enrichment analysis of VTE associated proteins Gene-set enrichment analyses were conducted to further elucidate the biological pathways of the identified non-linear proteins. As a benchmark, we first performed enrichment analyses on the 64 proteins with linear trends. The top three enriched GO pathways (Fig. [117]5ALinear) were Enzyme-linked receptor protein signaling pathway (P[adjust] = 8.8 × 10^−5, GO:0007167), Regulation of body fluid levels (P[adjust] = 0.006, GO:0050878) and Blood coagulation (P[adjust] = 0.006, GO:0007596) for biological process (BP); Cell periphery (P[adjust] = 0.005, GO:0071944), Membrane (P[adjust] = 0.002, GO:0016020) and Plasma membrane (P[adjust] = 0.002, GO:0005886) for cellular component (CC). The top three enriched KEGG pathways (Fig. [118]5BLinear) included Focal adhesion (P[adjust] = 0.001, path: hsa04510), Rap1 signaling pathway (P[adjust] = 0.001, path: hsa04015) and Ras signaling pathway (P[adjust] = 0.009, path: hsa04014). In contrast, for the 43 proteins with non-linear trends, enriched GO pathways (Fig. [119]5ANon-linear) only included endothelium development (P[adjust] = 0.031, GO:0003158), and KEGG pathways (Fig. [120]5BNon-linear) only included fluid shear stress and atherosclerosis (P[adjust] = 0.043, path:hsa04510). Fig. 5. Gene-set enrichment analyses for VTE associated proteins. [121]Fig. 5 [122]Open in a new tab a Dot plot shows the top five BP, CC and MF terms of GO function for 64 proteins with linear trends and 43 proteins with non-linear trends on VTE. b Dot plot shows the top ten enriched KEGG pathway terms for 64 proteins with linear trends and 43 proteins with non-linear trends on VTE. Dot color represents the statistical significance of enrichment analysis based on P values after FDR adjustment, while dot size represents the count of proteins annotated to each term. Source data are provided as a [123]Source Data file. Discussion In this study, we proposed a novel non-linear PWAS analysis pipeline. This flexible pipeline does not rely on large-scale individual-level proteomic data, and can borrow public resources to complete the analysis. We further suggested two non-linear association patterns, highlighting the potential limitations for directly using linear PWAS when non-linear relationships exist. We recommended that our non-linear pipeline could be used as a complementary approach to linear PWAS, to reduce the risk of omission and misleadingness. Applying this pipeline to VTE, we identified 43 proteins with non-linear associations to VTE risk, including 26 unreported findings. Among them, eight proteins can be replicated in the prospective cohort analysis, further supporting the robustness of our findings. These non-linear associations provided novel insights into the development of VTE. Among these replicated proteins, ULBP2 primarily binds to NKG2D, a receptor on the surface of NK cells, and regulates immune response. Its aberrant expression may lead to an inflammatory response and immune imbalance, which promotes thrombus formation^[124]27. IL18BP is a modulator of the proinflammatory cytokine IL-18, which promotes leukocyte and platelet recruitment, endothelial dysfunction and coagulation triggering. IL18BP may affect VTE by mediating IL-18 in the vasculature^[125]28,[126]29. MAN1A2 regulates the glycosylation process, and abnormalities in glycosylation can enhance adhesion on specific cell surfaces, particularly endothelial cells, thereby increasing the risk of VTE^[127]30. CCL25 can attract T lymphocytes expressing its receptor CCR9 to sites of inflammation, including vascular endothelium, and may contribute to enhanced recruitment of immune cells to the vessel wall, promoting local inflammation and thrombus formation^[128]31. ICAM2 plays a crucial role in mediating cell-cell adhesion processes. Endothelial ICAM2 mediates angiogenesis and the lack of ICAM2 expression results in impaired angiogenesis both in vitro and in vivo, increasing the risk of thrombosis^[129]32. Meanwhile, overexpression of ICAM2 also promotes thrombosis by inducing platelet and leukocyte adhesion to vascular endothelial cells^[130]33. VSIG2 primarily implicated in immune modulation and cell-cell interactions, could impact the initiation of VTE by affecting the inflammatory response associated with endothelial dysfunction^[131]34. LGALS4, also known as Galectin-4, has been reported to regulate endothelial cell adhesion molecule expression, inflammation response and leukocyte recruitment, which are critical events in the initiation of VTE^[132]35,[133]36. ABO protein levels determine an individual’s ABO blood group, which non-linearly affects the risk of VTE by mediating the levels of VIII and VWF in different blood groups^[134]37,[135]38. Among above replicated proteins, the non-linear trends of ICAM2 and ABO can be interpreted by existing functional annotations, and others are new findings that need more confirmation. It should be noted that most of the non-linear associated proteins are functionally annotated to cell adhesion and endothelium development, which is consistent with our gene-set enrichment results. Endothelial cell development is also linked to fluid shear stress, a KEGG pathway identified in our analysis. The association between fluid shear stress and VTE is generally considered to be non-linear: excessive fluid shear stress may damage vascular endothelial cells, triggering an inflammatory response and promoting thrombosis, while insufficient fluid shear may make vascular endothelial cells more susceptible to platelet and leukocyte adhesion, promoting thrombosis^[136]39,[137]40. In addition, endothelial cell development influences multiple processes associated with thrombosis (e.g., release of coagulation factors, inflammatory response, and immune response), which increases the likelihood that endothelial cell development plays a non-linear role in the pathogenesis of VTE^[138]41. For the non-linear PWAS analysis pipeline, we did not directly use the measured protein levels from UKB-PPP. Instead, we used genetically predicted protein levels based on the genetic data from UKB. This strategy has two advantages. First, although UKB-PPP includes measured protein levels for 54,219 samples, only 28,318 samples can be retained after quality control (QC) in our study. In contrast, we can genetically predict protein levels for 408,812 Europeans in UKB, and even after QC, 378,474 individuals can still be retained in the analysis. This approach allows us to utilize a larger sample size for subsequent non-linear PWAS compared to directly using measured proteins, which is especially valuable for ethnic populations with small sample sizes of measured proteins (e.g., African^[139]42, East Asian^[140]43 and Arabs^[141]44). Second, measured plasma protein levels are influenced by various environmental factors, such as diet^[142]45, behavioral habits^[143]22, and the individual physical state at the time of measurement. These factors may confound the analysis of linear and non-linear relationships between protein levels and diseases. Here, we used genetically predicted protein levels, which is isolated from environmental confounders. To further validate our method, we conducted additional analyses comparing our linear PWAS pipeline with the traditional PWAS method (regression using measured protein levels, followed by Mendelian randomization to strengthen causal inference). The results showed that our linear PWAS pipeline was generally consistent with the Mendelian randomization results (Supplementary Note [144]3 and Figs. [145]S10 and [146]S11). We suggested two non-linear patterns based on our analysis pipeline. Pattern I indicates the proteins that indeed have non-linear trends with VTE (e.g., U-shape or L-shape), while also showing significant linear trends. In this case, linear PWAS would identify these proteins, but it might overlook the non-linear relationship underlying them. More seriously, the misleading regression coefficients from linear PWAS could lead to incorrect conclusions. Pattern II represents the proteins with non-linear trends where the linear trend is insignificant. In this case, if only linear PWAS is conducted, these proteins would be omitted from the analysis. Overall, we recommend our proposed non-linear PWAS pipeline as a complement to linear PWAS, to reduce the risk of omission and misleadingness in linear PWAS. We recommend the following directions to further research the non-linear proteins identified in this study. Firstly, such non-linear relationships indicate that the effect of proteins on diseases changes significantly at a specific threshold. This threshold effect may involve complex feedback regulatory loops^[147]46, or networks of protein-protein interactions on diseases^[148]47. Studying this threshold effect may reveal triggering mechanisms and progression patterns of diseases. Secondly, Dose-response curves help determine the optimal protein level to reduce disease risks, which can inform the appropriate drug dose in the subsequent clinical trials and support the development of protein-targeted drugs. Thirdly, known non-linear relationships can help to build more accurate predictive models of diseases. Researchers can enhance the predictive performance by introducing quadratic or interaction terms of known non-linear proteins into the model, instead of just assuming all relationships are linear^[149]48. In defining VTE cases, we did not consider multiple VTE events for a single patient, as the number of recurrent VTE events may be incompletely recorded in the UKB. Our primary focus was on genetic susceptibility to VTE, and once an individual has experienced a VTE event, indicating the underlying susceptibility, regardless of the number of occurrences. VTE can occur in various clinical settings, such as acute disease, trauma, pregnancy, exposure to drugs, surgery, cancer, inflammatory diseases and so on. To address this issue, we conducted a sensitivity analysis focused solely on unprovoked VTE cases driven by genetic factors. The results of the sensitivity analysis showed that our PWAS results were minimally affected by these clinical factors (Supplementary Note [150]2 and Fig. [151]S9). Considering anticoagulant use may also influence incident VTE, we performed sensitivity analyses by adjusting for it in the prospective cohort. The results showed this factor did not significantly influence incident VTE in our study (HR = 0.755, 95% CI: 0.357–1.592, P = 0.439), which may be attributed to the limited sample size using anticoagulants and the potential misclassification from baseline-only self-reported medication data in the UKB. There are also several limitations deserve discussion. First, our PWAS analysis pipeline was based on genetically predicted protein levels. Some important proteins may be eliminated in the validation step, due to the limited power of the protein prediction model or the limited sample size of the protein reference population. In this study, we excluded these biased predicted proteins to minimize the risk of misleading associations. Second, in our study, VTE cases were defined based on ICD-10 codes, which may introduce non-differential misclassification of VTE outcomes when the sensitivity and specificity of ICD-10 codes for VTE diagnosis were relatively low. Such misclassification could reduce the statistical power of our models and potentially lead to true protein associations being classified as non-significant. Third, the mechanisms of VTE vary by sex and age groups, with oestrogen playing a major role in young women and genetic factors being more prominent in young men. Future studies stratifying by sex and age groups may provide additional valuable insights into VTE. Finally, only eight out of 43 non-linear proteins identified in our study were replicated in the prospective cohort analysis, which may be attributed to the limited sample size of the prospective replication analysis. The non-linear trend of these proteins should be further validated in larger populations or through experimental approaches in future studies. Methods Ethics We obtained genotype, protein and phenotype data from the UKB and UKB-PPP, which received ethics approval from the North West Centre for Research Ethics Committee. The application number of UKB and UKB-PPP in our study was 88159. Quality control in UK Biobank The target population in this study is individuals of European ancestry in UKB. The UKB cohort is a population-based volunteer longitudinal cohort of ~500,000 individuals recruited across the United Kingdom. We used the imputed genotypes from UKB. Details of array, sample processing and quality control have been described elsewhere^[152]49,[153]50. Briefly, we extracted a European ancestry subset, including 408,812 samples who self-identified as white British and had very similar genetic ancestry based on the principal component analysis of the genotypes (Field ID 22006). The variants stratify the following criterion were excluded by using PLINK^[154]51: (1) minor allele frequency (MAF) < 0.001 (we set a looser criterion than 0.005 of the original prediction model^[155]21 here to ensure enough variants can be included in our cohort), (2) Hardy–Weinberg equilibrium test P value < 1.0 × 10^−12, (3) missing genotype rate >0.05, or (4) imputation accuracy score <0.3. We also randomly removed 30,338 individuals in related pairs with kinship coefficient ≥0.0884 (second degree or greater)^[156]52 using PLINK^[157]51. Therefore, 378,474 individuals of European ancestry and 15,456,785 variants were retained for the following study. VTE definition in UK biobank VTE phenotype in UKB (application number 88159) was derived from the first occurrence of any code mapped to 3-character ICD-10 (category ID 1712), which was generated by mapping ICD-10 code in the death register records, read code information in the primary care data, ICD-9 and ICD-10 codes in the hospital inpatient records (e.g., diagnosis or operation code), and self-reported medical conditions reported at the baseline or subsequent visit to UKB assessment center. VTE cases were defined as having at least one of the following ICD-10 diagnosis code: (1) ‘I26’ pulmonary embolism (Field ID 131308), (2) ‘I80’ thrombosis, phlebitis and thrombophlebitis (Field ID 131396), and (3) ‘I82’ other venous thromboembolism and thrombosis (Field ID 131400)^[158]53. The remaining individuals were included in the control group. Covariates included sex (Field ID 22001), array batch (Field ID 22000) and top ten ancestry principal components. Finally, we retained 378,474 Europeans, including 22,103 VTE cases and 356,371 controls. Protein levels prediction in UK Biobank In this study, we directly used the protein prediction models constructed by Xu et al.^[159]21, which were based on the INTERVAL study^[160]54. For aptamer-based SomaScan assay (v.3), the INTERVAL study profiled 2384 plasma proteins for 3175 participants of European ancestry^[161]26,[162]55. For the Olink assay, they measured 308 plasma protein abundance of around 4822 Europeans using four Olink panels, namely inflammation-1 (INF-1), cardiovascular II (CVD-2), cardiovascular III (CVD-3) and neurology (NEUR)^[163]56. Xu et al. constructed the prediction model for these Olink and SomaScan proteins using Bayesian ridge regression^[164]21. The existing prediction models have provided pQTLs and their weights for each protein. We matched pQTLs in UKB and predicted protein levels by calculating the inner product of pQTLs and the corresponding weights. Finally, we obtained 308 predicted Olink proteins and 2384 SomaScan proteins for 22,103 VTE cases and 356,371 controls. The flow chart of protein levels prediction was shown in the left panel of Fig. [165]S2. Validation and selection of predicted protein in UKB-PPP UKB-PPP is a proteomics profiling data in UKB, which collected blood plasma samples from 54,219 UKB participants using the antibody-based Olink Explore 3072 proximity extension assay, measuring 2,941 proteins (Field ID 30900)^[166]22. After excluding non-random baseline and individuals in batch 7 due to its mixed samples from both UKB sets and COVID-19 sets^[167]22, 40,502 samples were retained. We matched individuals and proteins between the predicted data in UKB and measured data in UKB-PPP based on their corresponding ID and protein-coding gene names, respectively. Then 31,013 samples, 295 Olink proteins and 944 SomaScan proteins were matched. The Pearson correlations of these matched proteins between predicted data and measured data were calculated to reflect the accuracy of our protein predictions. For measured protein levels, we further used their residuals after adjusting covariates sex (Field ID 22001), baseline age (Field ID 21022), array batch (Field ID 22000) and top ten ancestry principal components (Field ID 22009). Both predicted and measured protein levels were rank-based inverse normal transformed before calculating Pearson correlations^[168]57. Finally, only matched proteins with R^2 > 0.01 (i.e., Pearson correlation >0.1)^[169]21,[170]23 can pass the external validation and be included in the following PWAS analysis. The flow chart of predicted protein validation was shown in the middle panel of Fig. [171]S2. Linear PWAS for VTE We performed linear PWAS for the well-predicted proteins and the VTE risk in 378,474 Europeans from UKB using the logistic regression. The adjusted covariates only included sex (Field ID 22001), array batch (Field ID 22000) and top ten ancestry principal components (Field ID 22009), as further adjustment may induce collider bias in PWAS (adjusting downstream variables of the predicted protein can create new associations between the predicted protein and unknown environmental confounders)^[172]58. We excluded baseline age as covariates in PWAS, because the predicted protein levels represent an individual’s lifelong average levels. It is unreasonable to use baseline age to adjust the associations of lifelong average predicted protein levels with diseases. All predicted protein levels were rank-based inverse normal transformed before analysis^[173]57. Considering 108 overlapped proteins between predicted Olink and SomaScan proteins, there were 537 unique well-predicted proteins in our study. The unique proteins were defined by having different protein-coding gene names. The significance level of linear PWAS was set to be 0.05/537 = 9.3 × 10^−5 based on Bonferroni adjustment. Non-linear PWAS using RCS We used restricted cubic splines^[174]59 with four knots at the 5th, 35th, 65th, and 95th centiles to flexibly model the non-linear association between well-predicted proteins and VTE risk in 378,474 Europeans from UKB by R package “rms”. The covariates were the same as the above linear PWAS. All predicted protein levels were rank-based inverse normal transformed before analysis^[175]57. We tested the significance of protein levels by likelihood ratio test comparing the model with only covariates, against the model with covariates and protein levels. The significance level was set to be 9.3 × 10^−5 after Bonferroni adjustment. The non-linearity of protein levels was tested by the likelihood ratio test comparing RCS model, against the logistic model that considers protein levels as a linear term^[176]60,[177]61. Only the proteins that passed the overall significance test would be performed for the non-linearity test. So, the significance level of the non-linearity test was set to be 0.05. Finally, we plotted the dose-response curves of the odds ratio of VTE risk against standardized protein levels for each protein. Prospective cohort replication in UKB-PPP To replicate the 43 proteins obtained by non-linear PWAS, we conducted Cox regression using the prospective cohort data in UKB-PPP. We selected 31,013 matched samples and 43 corresponding measured proteins in UKB-PPP. We excluded individuals with VTE before baseline (n = 990) and individuals with missing covariates (n = 1705). Then 28,318 samples, 768 of whom had emerging VTE events during the follow-up period, were retained in the cohort analysis (Table [178]S1). Other 27,550 samples with censored events include: (1) 25,601 samples did not suffer from VTE until the end of follow-up, (2) 1882 samples were lost to follow-up, and (3) 67 samples died before suffering from VTE. We used Cox regression to replicate the linear association between measured protein level and risk of VTE, and restricted cubic splines to replicate their non-linear associations. The adjusted covariates included sex (Field ID 22001), baseline age (Field ID 21022), BMI (Field ID 21001), Townsend deprivation index (Field ID 22189), smoke status (Field ID 20116), drink status (Field ID 20117), physical status (Field ID 6164), systolic blood pressure (Field ID 4080), Oily fish intake (Field ID 1329), processed meat intake (Field ID 1349) and top ten ancestry principal components (Field ID 22009). Details on the selection of covariates can be found in Supplementary Note [179]1. All measured protein levels were rank-based inverse normal transformed before analysis^[180]57. The significance level of proteins was set to be 0.05/43 = 1.1 × 10^−3 after Bonferroni adjustment. Considering the sample size of the prospective cohort is much smaller than that of predicted data in UKB, we also listed the proteins that can be replicated by the nominal significance level (P = 0.05). The significance level of the non-linearity test was 0.05 as the non-linear PWAS. The flow chart of cohort replication was shown in the right panel of Fig. [181]S2. Pathway enrichment analysis We performed Gene Ontology (GO) and Kyoto Encyclopedia of Genes and Genomes (KEGG) enrichment analysis to explore the biological pathways on 64 proteins with linear trends and 43 proteins with non-linear trends, respectively, using the R package “clusterProfiler”^[182]62. The background gene of enrichment analysis was set as the list of all 537 unique well-predicted proteins. The P values were adjusted by false discovery rate (FDR) using the Benjamini-Hochberg method. Reporting summary Further information on research design is available in the [183]Nature Portfolio Reporting Summary linked to this article. Supplementary information [184]Supplementary information^ (8.7MB, pdf) [185]41467_2025_61874_MOESM2_ESM.docx^ (14.8KB, docx) Description of Additional Supplementary Files [186]Supplementary Data 1-3^ (67.9KB, xlsx) [187]Reporting Summary^ (91.2KB, pdf) [188]Transparent Peer Review file^ (4.1MB, pdf) Source data [189]Source Data^ (176.3KB, xlsx) Acknowledgements