Abstract Pancreatic Ductal Adenocarcinoma (PDAC) is the third leading cause of cancer-related deaths in the U.S. Both rare and common germline variants contribute to PDAC risk. Here, we fine-map and functionally characterize a common PDAC risk signal at chr1p36.33 (tagged by rs13303010) identified through a genome wide association study (GWAS). One of the fine-mapped SNPs, rs13303160 (OR = 1.23 (95% CI 1.15-1.32), P-value = 2.74×10^−9, LD r^2 = 0.93 with rs13303010 in 1000 G EUR samples) demonstrated allele-preferential gene regulatory activity in vitro and binding of JunB and JunD in vitro and in vivo. Expression Quantitative Trait Locus (eQTL) analysis identified KLHL17 as a likely target gene underlying the signal. Proteomic analysis identified KLHL17 as a member of the Cullin-E3 ubiquitin ligase complex with vimentin and nestin as candidate substrates for degradation in PDAC-derived cells. In silico differential gene expression analysis of high and low KLHL17 expressing GTEx pancreas samples suggested an association between lower KLHL17 levels (risk associated) and pro-inflammatory pathways. We hypothesize that KLHL17 may mitigate cell injury and inflammation by recruiting nestin and vimentin for ubiquitination and degradation thereby influencing PDAC risk. Subject terms: Cancer genetics, Gene regulation, Genetic variation __________________________________________________________________ Allele-preferential transcription factor binding can influence pancreatic ductal adenocarcinoma risk loci function. Here, the authors show allele-specific JunB and JunD binding at chr1p36.33 and propose a role for KLHL17 in protein homeostasis by mitigating inflammation. Introduction Pancreatic cancer is currently the third leading cause of cancer-related deaths and is expected to move to second place by 2030 in the United States^[57]1. Pancreatic ductal adenocarcinoma (PDAC) comprises over 90% of pancreatic cancer cases. While its survival rate has improved over the years, detection, prevention, and treatment of PDAC remains a challenge^[58]2. Epidemiological factors known to increase the risk of PDAC include Type 2 diabetes (T2D), pancreatitis, smoking, and obesity^[59]3. Additionally, both rare high-risk and common, low effect size germline variants are known to contribute to PDAC susceptibility^[60]4–[61]8. The Pancreatic Cancer Cohort Consortium and Pancreatic Cancer Case-Control Consortium have sought to identify common germline variants that influence risk of PDAC through genome-wide association studies (GWAS). Previous GWAS phases, PanScan I^[62]5, II^[63]6, and III^[64]7,[65]8 and PanC4^[66]9, have identified 17 independent risk signals for PDAC. A 2018 meta-analysis of these four studies (9040 cases and 12,496 controls) and TaqMan replication using samples from the PANcreatic Disease ReseArch (PANDoRA) consortium (2,497 cases and 4611 controls) uncovered five new risk loci^[67]4. One of the newly identified loci was at chr1p36.33. The initial meta-analysis of PanScan I, II, III and PanC4 identified chr1p36.33 (tagged by single nucleotide polymorphism (SNP) rs13303010; P-value = 7.3 × 10^−7; odds ratio (OR) = 1.20; 95% confidence interval (CI) = 1.12–1.29) as a suggestive risk locus for PDAC (Table [68]1). A meta-analysis including samples from the PANDoRA consortium (11,537 cases and 17,107 controls) improved the signal beyond the GWAS significance threshold (rs13303010, OR = 1.26, 95% CI 1.19–1.35, P-value = 8.36 × 10^−14) (Table [69]1)^[70]4, further supporting chr1p36.33 as a PDAC risk locus. Table 1. Summary statistics for rs13303010 on 1p36.33 PDAC risk locus ^aSNP, ^balleles, ^cchr, ^dlocation, and ^egene Study MAF Subjects P-value Allelic OR (95% CI) Controls Cases Controls Cases rs13303010 (G, A) 2018 Meta Analysis^4 12,496 9040 7.3 × 10^−7 1.20 (1.12–1.29) 1p36.33 (894,573) PANDoRA^4 0.10 0.14 4611 2497 6.0 × 10^−10 1.45 (1.33–1.57) NOC2L Meta Analysis + PANDoRA^4 17,107 11,537 8.36 × 10^−14 1.26 (1.19–1.35) UK Biobank 0.09 0.12 9399 1066 1.05 × 10^−5 1.42 (1.21–1.65) Meta Analysis^4 + UK Biobank 21,895 10,106 2.09 × 10^−10 1.24 (1.16–1.32) [71]Open in a new tab Summary statistics from previously published PDAC GWAS and for the new meta-analysis presented here including UK Biobank PDAC cases and controls for rs13303010 including minor allele frequencies in cases and controls, P-value, and odds ratio (OR) with the 95% confidence intervals (CI). Statistics are generated using a logistic regression model adjusted for covariates outlined in the Methods. P-values are not multiple-testing corrected. Source data are provided as a Source Data file. ^arsID; SNP ID number. ^bMinor, major alleles. ^cLocation. ^dPosition (Genome build hg19). ^eNearest gene. GWAS have been instrumental in estimating disease risk, identifying candidate genes, and uncovering novel pathways underlying disease development. However, functional characterization of GWAS loci is critical to pinpoint the biological mechanism underlying risk. Most GWAS signals map to non-coding, regulatory regions of the genome and are hypothesized to influence disease risk through allele-specific changes in gene expression^[72]10. Further, each locus is decorated with tens to hundreds of highly correlated variants complicating the identification of functional variant(s) and gene(s) underlying the risk signal. Statistical fine mapping and genomic assays (e.g., ATAC-seq and massively parallel reporter assays) have been beneficial for reducing the number of candidate functional variants to move forward for testing^[73]11–[74]13. Chromatin capture and expression quantitative trait locus (eQTL) analysis are valuable for identifying putative functional genes^[75]14,[76]15. While these methods greatly assist in the process of functionally characterizing GWAS risk loci, it is still a time-consuming process. In fact, the biological mechanisms underlying only a handful of the 22 published PDAC risk signals have been functionally characterized to date: chr5p15.33/TERT^[77]16, chr16q23.1/CTRB2^[78]17, and chr13q22.1/DIS3^[79]18. Interestingly, some overlap has been observed between GWAS loci for PDAC and associated epidemiological risk factors. A number of PDAC risk loci have common SNPs or colocalize with GWAS for traits that influence PDAC risk: chr1q32.1/NR5A2/T2D, chr2p13.3/ETAA1/T2D/waist-hip-ratio (obesity measure), chr8q24.21/MIR1208/T2D, chr9q34/ABO/T2D/body fat percentage, chr12q24.31/HNF1A/Maturity-onset Diabetes of the Young/T2D, chr16q23.1/BCAR1/T2D, chr18q21.32/GRP/T2D/BMI/obesity ([80]https://mvp-ukbb.finngen.fi/ and NHGRI GWAS Catalog^[81]19). Further pathway enrichment analysis of genes ±100 kb of PDAC GWAS risk loci indicate an enrichment of genes associated with Maturity-onset diabetes of the young (KEGG), Sequence-specific DNA-binding transcription factor activity (GO Molecular Function), and Cellular response to UV (GO Biological Process)^[82]4. Further, DEPICT enrichment analysis indicated that genes associated with GWAS risk loci are highly expressed in numerous gastrointestinal tissues^[83]4. Here, we apply fine-mapping methods to identify candidate functional variants for in vitro testing. We subsequently identify rs13303160 as a functional variant at chr1p36.33 likely mediating the expression of KLHL17 through allele-preferential binding of JunB and JunD transcription factors. Our work to characterize KLHL17’s function in PDAC risk suggests a role in mitigating acinar to ductal metaplasia (ADM) and epithelial to mesenchymal transition (EMT) through protein homeostasis. Results Fine-mapping of the chr1p36.33 PDAC risk locus To identify candidate functional variants at the chr1p36.33 risk locus, we first performed a meta-analysis using GWAS data from PanScan I-III^[84]5–[85]8, PanC4^[86]9 and an additional 1066 PDAC cases and 9399 controls from the UK Biobank (UKBB)^[87]20 resulting in 10,106 cases and 21,895 control subjects with imputed GWAS data. In this analysis, rs13303010 remained the most statistically significant SNP at chr1p36.33 (OR = 1.24, 95% CI 1.16–1.32, P-value = 2.09 × 10^−10) (Fig. [88]1, Table [89]1). Fig. 1. Overview of chr1p36.33 PDAC risk locus. [90]Fig. 1 [91]Open in a new tab a Locus Zoom plot of the variants identified in the meta-analysis, colors are indicative of the linkage disequilibrium (LD) r^2 in reference to the tag SNP (red 0.8 < r^2 ≤ 1.0; yellow 0.6 < r^2 < 0.8, green 0.4 < r^2 < 0.6, blue 0.2 < r^2 < 0.4, purple 0 < r^2 < 0.2), -log10 P-values are calculated using a logistic regression analysis and are not multiple-testing corrected. The gray-dashed line indicates the Bonferroni-corrected P-value significance threshold (5×10^-8) used in GWAS; b UCSC genome browser view of chr1p36.33 with ChromHMM and ATAC-seq annotations in PDAC and normal-derived duct epithelial cell lines^[92]24; c Zoomed in UCSC browser snapshot showing the candidate functional variants and nearby genes. ChromHMM states are indicated by color: weak enhancer 1 (light green), weak enhancer 2 (dark blue), active enhancer (light blue), active element (purple), active transcriptional start site (TSS, red), bivalent TSS (yellow), polycomb repressed (light purple), quiescent (dark green). Source data are provided as a Source Data file. To identify credible causal variants (CCVs) underlying this association signal, we applied fine-mapping approaches to the summary statistics for this meta-analysis. We implemented the Bayesian approach Sum of Single Effects Linear Regression (SuSiE)^[93]13 to identify credible sets of CCV with 90% confidence. SuSiE identified one credible set with five variants (Table [94]2). We also applied likelihood ratio (LLR < 1:100) and linkage disequilibrium (LD r^2 > 0.8) thresholds which identified two additional CCVs (Table [95]2). To be inclusive, we moved all seven variants forward for in vitro functional analysis. Table 2. Summary statistics and fine-mapping for the CCVs in UK Biobank meta-analysis ^aSNP, ^balleles, ^clocation OR (95% CI) P-value LD r2 LLR SuSiE PIP Chromatin accessibility rs13303010 (G, A) 1:894,573 1.24 (1.16-1.32) 2.09 × 10^−10 -- 1.00 0.27 Yes rs3935066 (G, A) 1:900,730 1.25 (1.17-1.35) 2.74 × 10^−10 0.89 1.30 0.22 No (>500 bp) rs113491766 (A, AG) 1:895,755 0.81 (0.76-0.86) 2.85 × 10^−10 1 1.35 0.20 ~56 bp away rs13303327 (G, A) 1:895,706 1.24 (1.16-1.32) 3.38 × 10^−10 1 1.60 0.18 ~110 bp away rs10465241 (C, T) 1:886,817 1.24 (1.16-1.32) 1.03 × 10^−9 0.90 4.74 0.06 ~22 bp rs10465242 (G, A) 1:886,788 1.23 (1.1.5-1.32) 2.23 × 10^−9 0.90 10.07 0.03 ~52 bp rs13303160 (G, A) 1:901,559 1.23 (1.15-1.32) 2.74 × 10^−9 0.93 12.31 0.02 Yes [96]Open in a new tab Summary statistics of the CCVs for the newest meta-analysis with PDAC UK Biobank cases and controls including OR and 95% CI, P-value, linkage disequilibrium (LD) r^2 reported relative to the tag SNP based on the European 1000Genomes reference, likelihood ratio (LLR), SuSiE posterior inclusion probability (PIP), and proximity to accessible chromatin. GWAS statistics are generated using a logistic regression model adjusted for covariates outlined in the Methods. P-values are not multiple-testing corrected. LLR was calculated using the GWAS P-values and a chi-squared test. Top five SNPs were identified in the SuSiE credible set of variants. All variants listed met the fine-mapping criteria for functional follow-up. Accessible chromatin is based on the ATAC-seq data generated in PDAC-derived or normal-derived pancreas epithelial cell lines^[97]24. Source data are provided as a Source Data file. ^arsID. ^bMinor, major alleles. ^cchr:position (Genome build hg19). Most GWAS variants are noncoding and thought to affect gene expression of target genes in an allele specific manner. As such, GWAS variants have been shown to be enriched in active gene regulatory elements indicated by posttranslational modifications of histones (H3K4me3, H3K4me1, H3K27ac) and accessible chromatin^[98]11,[99]21–[100]23. We examined the set of statistically fine-mapped variants in the context of maps of pancreas gene regulatory elements (using a ChromHMM 8-state model and ATAC-seq) we generated in PDAC and normal-derived pancreas cell lines^[101]24. All the fine-mapped variants at chr1p36.33 lie within active and bivalent transcriptional start sites and active enhancer elements (Fig. [102]1b, c). Two variants lie within regions of open chromatin (Fig. [103]1c, Table [104]2) lending further support for the fine-mapped variants as candidate functional SNPs influencing gene expression in cis or trans. Assessing allele-preferential binding and gene regulatory activity of candidate functional variants To identify functional variants underlying this GWAS signal, we first sought to identify variants that exhibited allele-preferential transcription factor (TF) binding. We tested the set of seven fine-mapped variants at chr1p36.33 in electrophoretic mobility shift assays (EMSAs) using nuclear extracts from the PANC-1 pancreatic cancer cell line. Three of the seven variants demonstrated allele-preferential protein binding: rs13303010, rs13303327, and rs13303160 (Fig. [105]2). The GWAS tag SNP, rs13303010, showed preferential binding to the protective alternate allele (A) (Fig. [106]2a). Additionally, we observed consistent allele-preferential binding with the protective alternate allele (A) at rs13303327 (Fig. [107]2b). The third variant, rs13303160, exhibited some preferential binding to the risk-increasing reference (G) allele (Fig. [108]2c). These preferential binding patterns were also observed with MIA PaCa-2 (Supplementary Fig. [109]1a–c) and HeLa nuclear lysate (Supplementary Fig. [110]1d, e). Fig. 2. Identification of allele-preferential binding and activity using EMSA and luciferase reporter assays. [111]Fig. 2 [112]Open in a new tab a–c Representative EMSA results with PANC-1 nuclear extract and fluorescently labeled 31 bp oligonucleotides with each variant, rs13303010 (n = 4 independent experiments), rs13303327 (n = 3 independent experiments), rs13303160 (n = 4 independent experiments), respectively, centered in the middle. Competitor is the same sequence with no fluorescent label in excess (50, 100X, indicated by the black triangles). Arrows denote the allele-preferential binding. d–f Luciferase reporter assays using rs13303010, rs13303327, and rs13330160, respectively and the surrounding sequence as a promoter to the luciferase gene in three cell lines, number of biological replicates are indicated below the cell line name. g Luciferase assay for rs13303160 and surrounding sequence as an enhancer upstream of a minimal promoter and luciferase gene in three cell lines. For EMSA and luciferase, the risk allele is colored in red. For luciferase, the forward (fwd) and reverse (rev) orientation of the sequence was used. Pink bars indicate the risk allele and blue bars the protective allele. Gray bars are the Empty Vector (EV) control. The number of biological replicates is indicated underneath the cell line in the figure for (d–g). Error bars represent the standard error of the mean (SEM) and significance was determined by an unpaired, two-tailed t-test. Source data are provided as a Source Data file. We next moved the three variants with allele-preferential binding forward to evaluate allele-preferential gene regulatory activity using luciferase reporter activity assays. As these variants lie near active transcriptional start sites (TSS) (Fig. [113]1c), we first tested allele-preferential promoter activity by placing 141-201 base pair (bp) sequences centered on the variant of interest (see Methods) into a luciferase vector without a minimal promoter (in the pGL4.14 vector). We observed strong promoter activity for the rs13303010 constructs compared to the empty vector (EV) but minimal allele-preferential luciferase activity in MIA PaCa-2, PANC-1 and HEK293T cells (Fig. [114]2d). The second SNP, rs13303327, demonstrated allele-preferential luciferase activity with the alternate (A) allele having a stronger regulatory effect (Fig. [115]2e). The third SNP, rs13303160, exhibited strong promoter activity with the alternate (A) allele demonstrating an allele-preferential effect in the reverse orientation in the MIA PaCa-2 and HEK293T cells (Fig. [116]2f). While rs13303010 and rs13303327 lie in a 1328 bp region between the TSS of the NOC2L and KLHL17 genes, rs13303160 is located 360 bp downstream of the 3’UTR of KLHL17 and 303 bp upstream of the TSS for PLEKHN1 (Fig. [117]1c), suggesting it could influence promoter and/or enhancer activity at this locus. We therefore tested the sequence surrounding rs13303160 as an enhancer upstream of a minimal promoter and the luciferase gene (in the pGL4.23 vector). As an enhancer element, the rs13303160 constructs demonstrated strong enhancer activity in all three cell lines with the alternate (A) allele exhibiting stronger activity (Fold change (FC) 1.5–2.8, P-value = 1 × 10^−6–0.02; Fig. [118]2g). Thus, through EMSA and luciferase assays, we narrowed the set of seven fine-mapped candidate functional variants down to two (rs13303327 and rs13303160) that each demonstrate allele-preferential protein binding and gene regulatory activity. Identifying allele-preferential protein binding To identify TFs potentially mediating the allele-preferential regulatory activity we observed, we performed an in silico TF motif analysis using PERFECTOS-APE^[119]25. This analysis predicts TF binding potential for both alleles of a variant and provides a P-value for the estimated strength of binding. We then calculated the fold-change between P-values as a proxy for the binding affinity change between alleles. For rs13303327, we identified several E74-like factors (ELF), members of the E-twenty-six (ETS) family of transcription factors, having a 13-24-fold difference in binding P-values between the A and G alleles (Table [120]3, Supplementary Table [121]1). The ELF transcription factors recognize the motif GGAA (Fig. [122]3a) which is disrupted by the risk allele-G at rs13303327 by replacing the last A with a G. Table 3. Predicted TFs with allelic binding preferences for rs13303327 and