Abstract

Background and aims
Primary biliary cholangitis (PBC) is a progressive chronic autoimmune cholestatic liver disease characterized by the destruction of small intrahepatic bile ducts, leading to biliary cirrhosis. Liver biopsy is required to diagnose antimitochondrial antibody (AMA)-negative patients. Therefore, novel biomarkers are needed for the non-invasive diagnosis of PBC. To identify novel biomarkers for PBC, we conducted a large-scale plasma proteome Mendelian randomization (MR) analysis.

Methods
A total of 21,593 protein quantitative trait loci (pQTLs) for 2297 circulating proteins were used and classified into four groups. MR analyses were conducted in each of the four groups separately. Results were obtained in a discovery cohort and replicated in a second, independent PBC cohort. Colocalization analysis and enrichment analysis were also conducted.

Results
Three plasma proteins (ficolin-1, CD40 and protein FAM177A1) were identified and replicated as being associated with PBC. All of them showed significant protective effects against PBC. An increase in ficolin-1 (OR=0.890 [0.843-0.941], p=3.50×10^-5), CD40 (OR=0.814 [0.741-0.895], p=1.96×10^-5) and protein FAM177A1 (OR=0.822 [0.754-0.897], p=9.75×10^-6) was associated with a lower risk of PBC. Ficolin-1 (PP4 = 0.994) and protein FAM177A1 (PP4 = 0.995) colocalized with the expression of the genes FCN1 and FAM177A1 in whole blood, respectively. Furthermore, CD40 (PP4 = 0.977) and protein FAM177A1 (PP4 = 0.897) strongly colocalized with PBC.

Conclusions
We expand the set of candidate biomarkers for PBC. In total, three plasma proteins (ficolin-1, CD40, and protein FAM177A1) were identified and replicated as being associated with PBC in MR analysis, and all of them showed significant protective effects against PBC. These proteins are potential biomarkers or drug targets for PBC.
Keywords: primary biliary cholangitis, plasma protein, ficolin-1 (FCN1), CD40, FAM177A1, biomarker

Introduction

Primary biliary cholangitis (PBC) is a progressive chronic autoimmune cholestatic liver disease characterized by the destruction of small intrahepatic bile ducts, leading to biliary cirrhosis (1). A systematic review of epidemiological studies suggested that the PBC incidence ranges from 0.3 to 5.8 per 1000 people and that prevalence rates are increasing over time (2). Although the specific etiology of PBC remains uncertain, previous studies have identified several triggers. Immunogenetic risk and epigenetic regulation of the epithelium and bile acid play important roles in the etiology of PBC (3). Immune-mediated biliary injury and the consequences of chronic cholestasis are the major pathogenic features of PBC. Antimitochondrial antibody (AMA) and alkaline phosphatase (ALP) are the main serological biomarkers for the diagnosis of PBC. Although AMA is crucial for the diagnosis of PBC, approximately 3%-5% of patients are AMA negative (4). Therefore, novel biomarkers are required to diagnose AMA-negative PBC and to serve as auxiliary biomarkers for AMA-positive PBC.

Recently published studies have identified several novel biomarkers for PBC. Bombaci et al. tested 1658 human plasma proteins and found that SPATA31A3 and GARP showed high reactivity in PBC sera (4). A multivariate analysis showed that an elevated level of immunoglobulin M contributes to the diagnosis of patients with seropositive AMA but normal ALP (5). Two retrospective studies conducted by Hayashi suggested that high serum levels of cytokeratin-18 fragment M30, growth arrest-specific gene 6 protein and Axl were associated with cirrhosis in PBC patients (6, 7). Anti-Sp100 and anti-gp210 have also been identified as related to PBC (1). Although several serum biomarkers have been found, many molecules remain untested.
Plasma proteins play key roles in a series of biological processes, including signaling, transportation, and inflammation (8). Plasma proteins can originate from any organ or cell, or even from the mother through the placenta (9). Therefore, they could serve as an important source of biomarkers (8). Recently, several genome-wide association studies (GWASs) of plasma proteins have identified protein quantitative trait loci (pQTLs) for thousands of plasma proteins (10–18). A pQTL is a genetic locus associated with the level of a protein and is represented by the most strongly associated single-nucleotide polymorphism (SNP) (8). Plasma pQTLs therefore serve as genetic proxies for the circulating levels of plasma proteins and provide an opportunity to test the causal effects of plasma proteins on PBC.

To evaluate the causal effects of plasma proteins on PBC and to determine potential biomarkers (risk and protective proteins against PBC), we carried out a large-scale plasma proteome Mendelian randomization (MR) analysis using plasma pQTLs as instrumental variables. MR is a powerful method to detect the causal effect of an exposure (plasma proteins) on an outcome (PBC) using genetic variants extracted from GWAS summary statistics as instrumental variables. Two-sample MR estimates the causal effect of exposure on outcome using genetic variants that are associated only with the exposure and affect the outcome only through the exposure. Compared to conventional randomized controlled trials, MR is more appropriate for detecting a long-term causal effect of risk or protective factors on the outcome because of the random assortment and lifelong effect of genetic variants. Compared to observational studies, MR can avoid environmental confounders and reverse causality because the genetic variants used in MR cannot be easily modified by the environment (19). Furthermore, its high efficiency and low cost make MR well suited to large-scale screening for causal relationships.
We therefore conducted two-sample MR analyses for the plasma proteome using pQTLs extracted from nine different GWASs (10–18). To reduce bias, MR analyses were conducted using different types of pQTLs. Furthermore, to reduce chance findings, all proteins were tested in a discovery cohort and validated in a second PBC cohort. Only proteins identified and validated in both cohorts were included in our results. Although MR is a powerful tool for detecting causal effects, the results can be confounded by linkage disequilibrium (LD). When exposure and outcome are affected by two different genetic variants that are in LD with each other, MR can yield false-positive results. Therefore, to exclude confounding by LD, colocalization analyses between proteins and PBC were conducted. Colocalization can determine whether two traits share causal variants in a single region. If the colocalization results provide strong evidence that exposure and outcome have distinct causal variants in a single region, the MR result is invalid and is removed from the results. Furthermore, to test the source of each plasma protein, colocalization analyses between expression quantitative trait loci (eQTLs) and pQTLs were conducted. Proteins identified by MR and colocalized with PBC are more likely to be drug targets (20). Finally, pathway enrichment analysis was conducted to determine the pathways involved in the pathogenesis of PBC. The enriched pathways suggest the molecular basis of the causal effects of the plasma proteins on PBC. This analysis aims to evaluate the causal effect of plasma proteins on PBC and to identify potential biomarkers for PBC.

Method

As described in the previous section, a large-scale plasma proteome MR analysis was carried out. The process is shown in Figure 1.

Figure 1. The flow chart shows the analysis process.
First, we pooled pQTLs from nine different studies and removed SNPs violating the MR assumptions. Second, we divided pQTLs into four groups and conducted MR analyses separately. Since the sentinel cis-pQTL is the most significant SNP in a region, group A was not LD clumped. Four proteins were identified and validated in the two cohorts. The colocalization analysis suggested that beta-mannosidase and PBC have different causal signals in a single region (PP3 = 1). Therefore, beta-mannosidase violated the assumptions of MR and was excluded from the results. Enrichment analysis was conducted on the other three proteins.

Data source

We extracted summary statistics of pQTLs for plasma proteins from nine different proteomic GWASs and pooled them using METAL (10–18, 21). In total, 51,799 individuals were included in our analysis. All participating individuals are of European ancestry. There is no overlap among the nine GWASs. Details of these studies are provided in ST1. The summary statistics of PBC were extracted from two different cohorts: a discovery cohort and a replication cohort. Only the proteins that were significant in the discovery cohort and replicated in the replication cohort were considered to be associated with PBC. The discovery summary statistics were obtained from the study of Cordell et al. (22). A total of 8021 European ancestry cases and 16,489 European ancestry controls participated in this GWAS. For replication, we extracted summary statistics from the FinnGen cohort (https://r6.finngen.fi/pheno/CHIRBIL_PRIM), which included 346 cases and 207,748 controls of European ancestry. Although FinnGen participants were drawn from nine different cohorts (https://finngen.gitbook.io/documentation/methods/cohort-description), the proportion of cases is relatively small. Because FinnGen was the only available cohort whose participants did not overlap with the discovery cohort, it was the only choice for replication.
All of the GWAS summary statistics used in this study are publicly available and freely downloadable. Ethics approval was obtained in the original studies.

Instrumental variable selection

pQTLs chosen for the MR analysis must meet the three assumptions of an instrumental variable (IV): (1) the IV is associated with the risk factor; (2) the IV is not associated with confounders; and (3) the IV influences the outcome only through the risk factor (23). To satisfy assumption one, only genome-wide significant (p<5×10^-8) pQTLs were selected as IVs. Moreover, IVs with an F statistic of less than 10 were regarded as weak IVs and were excluded from this study. To satisfy assumptions two and three, pQTLs from the MHC region (chr6:27477797-34448354 hg19), palindromic SNPs, and pleiotropic SNPs associated with more than 5 proteins were excluded. To further avoid pleiotropy, MR-PRESSO tests were conducted to identify and remove pleiotropic SNPs (24). Since coding variants may affect the measurement of proteins, we removed them from this study (8). The pQTL datasets from Hillary, Suhre, and Sun did not provide predicted variant consequences, so we annotated them using the Variant Effect Predictor (25). To further avoid bias, we divided the pQTLs into four groups: sentinel cis-pQTLs only (group A), sentinel cis-pQTLs combined with independent cis-pQTLs (group B), trans-pQTLs only (group C), and all pQTLs (group D). The pQTL with the lowest p value in a region was selected as the sentinel pQTL. Independent pQTLs were identified by conditional analysis using COJO (26). Because sentinel cis-pQTLs are the least likely to be pleiotropic and act most directly on the exposure, the results of group A were preferred in our analysis, and group B was the second choice. Trans-acting pQTLs (group C) are more prone to pleiotropy, but because pleiotropic pQTLs had been removed, the results of group C can still offer insight into the potential etiology of PBC.
The results of all pQTLs (group D) could reflect the total causal effect of exposures on outcome. Groups B, C, and D were LD clumped (r2<0.1); group A was not, because the sentinel cis-pQTL is already the most significant SNP in its region.

Mendelian randomization

MR analyses were carried out in each group. The Wald ratio was used when a protein had a single IV. When multiple IVs were available, the inverse-variance weighted (IVW) regression model was adopted as the main result. If heterogeneity was detected, the multiplicative random-effects IVW model was chosen; otherwise, the fixed-effects IVW model was preferred. In addition, Egger's regression and the weighted median method were also performed for reference.
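The estimators named above can be illustrated with a minimal Python sketch. It is not the pipeline used in this study: the function names, the toy effect sizes, and the first-order standard error for the Wald ratio are assumptions made for illustration. The multiplicative random-effects step is written in one common form, in which the fixed-effects standard error is inflated by the square root of Cochran's Q divided by its degrees of freedom and is never allowed to shrink.

```python
import math


def wald_ratio(beta_exp, beta_out, se_out):
    """Single-IV estimate: SNP-outcome effect divided by SNP-exposure effect.

    The standard error is a first-order approximation that ignores the
    uncertainty in the SNP-exposure effect (an assumption of this sketch).
    """
    estimate = beta_out / beta_exp
    se = abs(se_out / beta_exp)
    return estimate, se


def ivw(beta_exp, beta_out, se_out):
    """Inverse-variance weighted estimate over several independent IVs.

    Returns (estimate, fixed-effects SE, multiplicative random-effects SE,
    Cochran's Q).
    """
    ratios = [bo / be for be, bo in zip(beta_exp, beta_out)]
    # Weight of each ratio = inverse of its first-order variance.
    weights = [(be / so) ** 2 for be, so in zip(beta_exp, se_out)]
    w_sum = sum(weights)
    estimate = sum(w * r for w, r in zip(weights, ratios)) / w_sum
    se_fixed = math.sqrt(1.0 / w_sum)
    # Cochran's Q measures heterogeneity among the per-IV ratio estimates.
    q = sum(w * (r - estimate) ** 2 for w, r in zip(weights, ratios))
    df = len(ratios) - 1
    # Overdispersion factor, floored at 1 so the SE is never reduced.
    phi = max(1.0, q / df) if df > 0 else 1.0
    se_random = se_fixed * math.sqrt(phi)
    return estimate, se_fixed, se_random, q


def odds_ratio(estimate, se, z=1.96):
    """Convert a log-odds estimate to an odds ratio with a 95% interval."""
    return (math.exp(estimate),
            math.exp(estimate - z * se),
            math.exp(estimate + z * se))


# Hypothetical summary statistics for three IVs of one protein (toy values).
b_exp = [0.30, 0.25, 0.40]
b_out = [-0.05, -0.03, -0.06]
s_out = [0.020, 0.020, 0.025]

est, se_f, se_r, q = ivw(b_exp, b_out, s_out)
print("IVW estimate:", est, "fixed SE:", se_f, "random SE:", se_r, "Q:", q)
print("Odds ratio and 95% interval:", odds_ratio(est, se_f))
```

With a single instrument, `wald_ratio` would be used in place of `ivw`. Whether the fixed-effects or the random-effects standard error is reported would depend on a heterogeneity test based on Q, following the rule described above.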