Abstract

   Pancreatic Ductal Adenocarcinoma (PDAC) is the third leading cause of
   cancer-related deaths in the U.S. Both rare and common germline
   variants contribute to PDAC risk. Here, we fine-map and functionally
   characterize a common PDAC risk signal at chr1p36.33 (tagged by
   rs13303010) identified through a genome wide association study (GWAS).
   One of the fine-mapped SNPs, rs13303160 (OR = 1.23 (95% CI 1.15-1.32),
   P-value = 2.74×10^−9, LD r^2 = 0.93 with rs13303010 in 1000 G EUR
   samples) demonstrated allele-preferential gene regulatory activity in
   vitro and binding of JunB and JunD in vitro and in vivo. Expression
   Quantitative Trait Locus (eQTL) analysis identified KLHL17 as a likely
   target gene underlying the signal. Proteomic analysis identified KLHL17
   as a member of the Cullin-E3 ubiquitin ligase complex with vimentin and
   nestin as candidate substrates for degradation in PDAC-derived cells.
   In silico differential gene expression analysis of high and low KLHL17
   expressing GTEx pancreas samples suggested an association between lower
   KLHL17 levels (risk associated) and pro-inflammatory pathways. We
   hypothesize that KLHL17 may mitigate cell injury and inflammation by
   recruiting nestin and vimentin for ubiquitination and degradation
   thereby influencing PDAC risk.

   Subject terms: Cancer genetics, Gene regulation, Genetic variation
     __________________________________________________________________

   Allele-preferential transcription factor binding can influence
   pancreatic ductal adenocarcinoma risk loci function. Here, the authors
   show allele-specific JunB and JunD binding at chr1p36.33 and propose a
   role for KLHL17 in protein homeostasis by mitigating inflammation.

Introduction

   Pancreatic cancer is currently the third leading cause of
   cancer-related deaths and is expected to move to second place by 2030
   in the United States^[57]1. Pancreatic ductal adenocarcinoma (PDAC)
   comprises over 90% of pancreatic cancer cases. While its survival rate
   has improved over the years, detection, prevention, and treatment of
   PDAC remains a challenge^[58]2. Epidemiological factors known to
   increase the risk of PDAC include Type 2 diabetes (T2D), pancreatitis,
   smoking, and obesity^[59]3. Additionally, both rare high-risk and
   common, low effect size germline variants are known to contribute to
   PDAC susceptibility^[60]4–[61]8.

   The Pancreatic Cancer Cohort Consortium and Pancreatic Cancer
   Case-Control Consortium have sought to identify common germline
   variants that influence risk of PDAC through genome-wide association
   studies (GWAS). Previous GWAS phases, PanScan I^[62]5, II^[63]6, and
   III^[64]7,[65]8 and PanC4^[66]9, have identified 17 independent risk
   signals for PDAC. A 2018 meta-analysis of these four studies (9040
   cases and 12,496 controls) and TaqMan replication using samples from
   the PANcreatic Disease ReseArch (PANDoRA) consortium (2,497 cases and
   4611 controls) uncovered five new risk loci^[67]4. One of the newly
   identified loci was at chr1p36.33. The initial meta-analysis of PanScan
   I, II, III and PanC4 identified chr1p36.33 (tagged by single nucleotide
   polymorphism (SNP) rs13303010; P-value = 7.3 × 10^−7; odds ratio
   (OR) = 1.20; 95% confidence interval (CI) = 1.12–1.29) as a suggestive
   risk locus for PDAC (Table [68]1). A meta-analysis including samples
   from the PANDoRA consortium (11,537 cases and 17,107 controls) improved
   the signal beyond the GWAS significance threshold (rs13303010,
   OR = 1.26, 95% CI 1.19–1.35, P-value = 8.36 × 10^−14)
   (Table [69]1)^[70]4, further supporting chr1p36.33 as a PDAC risk
   locus.

Table 1.

   Summary statistics for rs13303010 on 1p36.33 PDAC risk locus
   ^aSNP, ^balleles, ^cchr, ^dlocation, and ^egene Study MAF Subjects
   P-value Allelic OR (95% CI)
   Controls Cases Controls Cases
   rs13303010 (G, A) 2018 Meta Analysis^4 12,496 9040 7.3 × 10^−7 1.20
   (1.12–1.29)
   1p36.33 (894,573) PANDoRA^4 0.10 0.14 4611 2497 6.0 × 10^−10 1.45
   (1.33–1.57)
   NOC2L Meta Analysis + PANDoRA^4 17,107 11,537 8.36 × 10^−14 1.26
   (1.19–1.35)
   UK Biobank 0.09 0.12 9399 1066 1.05 × 10^−5 1.42 (1.21–1.65)
   Meta Analysis^4 + UK Biobank 21,895 10,106 2.09 × 10^−10 1.24
   (1.16–1.32)
   [71]Open in a new tab

   Summary statistics from previously published PDAC GWAS and for the new
   meta-analysis presented here including UK Biobank PDAC cases and
   controls for rs13303010 including minor allele frequencies in cases and
   controls, P-value, and odds ratio (OR) with the 95% confidence
   intervals (CI). Statistics are generated using a logistic regression
   model adjusted for covariates outlined in the Methods. P-values are not
   multiple-testing corrected. Source data are provided as a Source Data
   file.

   ^arsID; SNP ID number.

   ^bMinor, major alleles.

   ^cLocation.

   ^dPosition (Genome build hg19).

   ^eNearest gene.

   GWAS have been instrumental in estimating disease risk, identifying
   candidate genes, and uncovering novel pathways underlying disease
   development. However, functional characterization of GWAS loci is
   critical to pinpoint the biological mechanism underlying risk. Most
   GWAS signals map to non-coding, regulatory regions of the genome and
   are hypothesized to influence disease risk through allele-specific
   changes in gene expression^[72]10. Further, each locus is decorated
   with tens to hundreds of highly correlated variants complicating the
   identification of functional variant(s) and gene(s) underlying the risk
   signal. Statistical fine mapping and genomic assays (e.g., ATAC-seq and
   massively parallel reporter assays) have been beneficial for reducing
   the number of candidate functional variants to move forward for
   testing^[73]11–[74]13. Chromatin capture and expression quantitative
   trait locus (eQTL) analysis are valuable for identifying putative
   functional genes^[75]14,[76]15. While these methods greatly assist in
   the process of functionally characterizing GWAS risk loci, it is still
   a time-consuming process. In fact, the biological mechanisms underlying
   only a handful of the 22 published PDAC risk signals have been
   functionally characterized to date: chr5p15.33/TERT^[77]16,
   chr16q23.1/CTRB2^[78]17, and chr13q22.1/DIS3^[79]18.

   Interestingly, some overlap has been observed between GWAS loci for
   PDAC and associated epidemiological risk factors. A number of PDAC risk
   loci have common SNPs or colocalize with GWAS for traits that influence
   PDAC risk: chr1q32.1/NR5A2/T2D, chr2p13.3/ETAA1/T2D/waist-hip-ratio
   (obesity measure), chr8q24.21/MIR1208/T2D, chr9q34/ABO/T2D/body fat
   percentage, chr12q24.31/HNF1A/Maturity-onset Diabetes of the Young/T2D,
   chr16q23.1/BCAR1/T2D, chr18q21.32/GRP/T2D/BMI/obesity
   ([80]https://mvp-ukbb.finngen.fi/ and NHGRI GWAS Catalog^[81]19).
   Further pathway enrichment analysis of genes ±100 kb of PDAC GWAS risk
   loci indicate an enrichment of genes associated with Maturity-onset
   diabetes of the young (KEGG), Sequence-specific DNA-binding
   transcription factor activity (GO Molecular Function), and Cellular
   response to UV (GO Biological Process)^[82]4. Further, DEPICT
   enrichment analysis indicated that genes associated with GWAS risk loci
   are highly expressed in numerous gastrointestinal tissues^[83]4.

   Here, we apply fine-mapping methods to identify candidate functional
   variants for in vitro testing. We subsequently identify rs13303160 as a
   functional variant at chr1p36.33 likely mediating the expression of
   KLHL17 through allele-preferential binding of JunB and JunD
   transcription factors. Our work to characterize KLHL17’s function in
   PDAC risk suggests a role in mitigating acinar to ductal metaplasia
   (ADM) and epithelial to mesenchymal transition (EMT) through protein
   homeostasis.

Results

Fine-mapping of the chr1p36.33 PDAC risk locus

   To identify candidate functional variants at the chr1p36.33 risk locus,
   we first performed a meta-analysis using GWAS data from PanScan
   I-III^[84]5–[85]8, PanC4^[86]9 and an additional 1066 PDAC cases and
   9399 controls from the UK Biobank (UKBB)^[87]20 resulting in 10,106
   cases and 21,895 control subjects with imputed GWAS data. In this
   analysis, rs13303010 remained the most statistically significant SNP at
   chr1p36.33 (OR = 1.24, 95% CI 1.16–1.32, P-value = 2.09 × 10^−10)
   (Fig. [88]1, Table [89]1).

Fig. 1. Overview of chr1p36.33 PDAC risk locus.

   [90]Fig. 1
   [91]Open in a new tab

   a Locus Zoom plot of the variants identified in the meta-analysis,
   colors are indicative of the linkage disequilibrium (LD) r^2 in
   reference to the tag SNP (red 0.8 < r^2 ≤ 1.0; yellow 0.6 < r^2 < 0.8,
   green 0.4 < r^2 < 0.6, blue 0.2 < r^2 < 0.4, purple 0 < r^2 < 0.2),
   -log10 P-values are calculated using a logistic regression analysis and
   are not multiple-testing corrected. The gray-dashed line indicates the
   Bonferroni-corrected P-value significance threshold (5×10^-8) used in
   GWAS; b UCSC genome browser view of chr1p36.33 with ChromHMM and
   ATAC-seq annotations in PDAC and normal-derived duct epithelial cell
   lines^[92]24; c Zoomed in UCSC browser snapshot showing the candidate
   functional variants and nearby genes. ChromHMM states are indicated by
   color: weak enhancer 1 (light green), weak enhancer 2 (dark blue),
   active enhancer (light blue), active element (purple), active
   transcriptional start site (TSS, red), bivalent TSS (yellow), polycomb
   repressed (light purple), quiescent (dark green). Source data are
   provided as a Source Data file.

   To identify credible causal variants (CCVs) underlying this association
   signal, we applied fine-mapping approaches to the summary statistics
   for this meta-analysis. We implemented the Bayesian approach Sum of
   Single Effects Linear Regression (SuSiE)^[93]13 to identify credible
   sets of CCV with 90% confidence. SuSiE identified one credible set with
   five variants (Table [94]2). We also applied likelihood ratio
   (LLR < 1:100) and linkage disequilibrium (LD r^2 > 0.8) thresholds
   which identified two additional CCVs (Table [95]2). To be inclusive, we
   moved all seven variants forward for in vitro functional analysis.

Table 2.

   Summary statistics and fine-mapping for the CCVs in UK Biobank
   meta-analysis
   ^aSNP, ^balleles, ^clocation OR (95% CI) P-value LD r2 LLR SuSiE PIP
   Chromatin accessibility
   rs13303010 (G, A) 1:894,573 1.24 (1.16-1.32) 2.09 × 10^−10 -- 1.00 0.27
   Yes
   rs3935066 (G, A) 1:900,730 1.25 (1.17-1.35) 2.74 × 10^−10 0.89 1.30
   0.22 No (>500 bp)
   rs113491766 (A, AG) 1:895,755 0.81 (0.76-0.86) 2.85 × 10^−10 1 1.35
   0.20 ~56 bp away
   rs13303327 (G, A) 1:895,706 1.24 (1.16-1.32) 3.38 × 10^−10 1 1.60 0.18
   ~110 bp away
   rs10465241 (C, T) 1:886,817 1.24 (1.16-1.32) 1.03 × 10^−9 0.90 4.74
   0.06 ~22 bp
   rs10465242 (G, A) 1:886,788 1.23 (1.1.5-1.32) 2.23 × 10^−9 0.90 10.07
   0.03 ~52 bp
   rs13303160 (G, A) 1:901,559 1.23 (1.15-1.32) 2.74 × 10^−9 0.93 12.31
   0.02 Yes
   [96]Open in a new tab

   Summary statistics of the CCVs for the newest meta-analysis with PDAC
   UK Biobank cases and controls including OR and 95% CI, P-value, linkage
   disequilibrium (LD) r^2 reported relative to the tag SNP based on the
   European 1000Genomes reference, likelihood ratio (LLR), SuSiE posterior
   inclusion probability (PIP), and proximity to accessible chromatin.
   GWAS statistics are generated using a logistic regression model
   adjusted for covariates outlined in the Methods. P-values are not
   multiple-testing corrected. LLR was calculated using the GWAS P-values
   and a chi-squared test. Top five SNPs were identified in the SuSiE
   credible set of variants. All variants listed met the fine-mapping
   criteria for functional follow-up. Accessible chromatin is based on the
   ATAC-seq data generated in PDAC-derived or normal-derived pancreas
   epithelial cell lines^[97]24. Source data are provided as a Source Data
   file.

   ^arsID.

   ^bMinor, major alleles.

   ^cchr:position (Genome build hg19).

   Most GWAS variants are noncoding and thought to affect gene expression
   of target genes in an allele specific manner. As such, GWAS variants
   have been shown to be enriched in active gene regulatory elements
   indicated by posttranslational modifications of histones (H3K4me3,
   H3K4me1, H3K27ac) and accessible chromatin^[98]11,[99]21–[100]23. We
   examined the set of statistically fine-mapped variants in the context
   of maps of pancreas gene regulatory elements (using a ChromHMM 8-state
   model and ATAC-seq) we generated in PDAC and normal-derived pancreas
   cell lines^[101]24. All the fine-mapped variants at chr1p36.33 lie
   within active and bivalent transcriptional start sites and active
   enhancer elements (Fig. [102]1b, c). Two variants lie within regions of
   open chromatin (Fig. [103]1c, Table [104]2) lending further support for
   the fine-mapped variants as candidate functional SNPs influencing gene
   expression in cis or trans.

Assessing allele-preferential binding and gene regulatory activity of
candidate functional variants

   To identify functional variants underlying this GWAS signal, we first
   sought to identify variants that exhibited allele-preferential
   transcription factor (TF) binding. We tested the set of seven
   fine-mapped variants at chr1p36.33 in electrophoretic mobility shift
   assays (EMSAs) using nuclear extracts from the PANC-1 pancreatic cancer
   cell line. Three of the seven variants demonstrated allele-preferential
   protein binding: rs13303010, rs13303327, and rs13303160 (Fig. [105]2).
   The GWAS tag SNP, rs13303010, showed preferential binding to the
   protective alternate allele (A) (Fig. [106]2a). Additionally, we
   observed consistent allele-preferential binding with the protective
   alternate allele (A) at rs13303327 (Fig. [107]2b). The third variant,
   rs13303160, exhibited some preferential binding to the risk-increasing
   reference (G) allele (Fig. [108]2c). These preferential binding
   patterns were also observed with MIA PaCa-2 (Supplementary
   Fig. [109]1a–c) and HeLa nuclear lysate (Supplementary Fig. [110]1d,
   e).

Fig. 2. Identification of allele-preferential binding and activity using EMSA
and luciferase reporter assays.

   [111]Fig. 2
   [112]Open in a new tab

   a–c Representative EMSA results with PANC-1 nuclear extract and
   fluorescently labeled 31 bp oligonucleotides with each variant,
   rs13303010 (n = 4 independent experiments), rs13303327 (n = 3
   independent experiments), rs13303160 (n = 4 independent experiments),
   respectively, centered in the middle. Competitor is the same sequence
   with no fluorescent label in excess (50, 100X, indicated by the black
   triangles). Arrows denote the allele-preferential binding. d–f
   Luciferase reporter assays using rs13303010, rs13303327, and
   rs13330160, respectively and the surrounding sequence as a promoter to
   the luciferase gene in three cell lines, number of biological
   replicates are indicated below the cell line name. g Luciferase assay
   for rs13303160 and surrounding sequence as an enhancer upstream of a
   minimal promoter and luciferase gene in three cell lines. For EMSA and
   luciferase, the risk allele is colored in red. For luciferase, the
   forward (fwd) and reverse (rev) orientation of the sequence was used.
   Pink bars indicate the risk allele and blue bars the protective allele.
   Gray bars are the Empty Vector (EV) control. The number of biological
   replicates is indicated underneath the cell line in the figure for
   (d–g). Error bars represent the standard error of the mean (SEM) and
   significance was determined by an unpaired, two-tailed t-test. Source
   data are provided as a Source Data file.

   We next moved the three variants with allele-preferential binding
   forward to evaluate allele-preferential gene regulatory activity using
   luciferase reporter activity assays. As these variants lie near active
   transcriptional start sites (TSS) (Fig. [113]1c), we first tested
   allele-preferential promoter activity by placing 141-201 base pair (bp)
   sequences centered on the variant of interest (see Methods) into a
   luciferase vector without a minimal promoter (in the pGL4.14 vector).
   We observed strong promoter activity for the rs13303010 constructs
   compared to the empty vector (EV) but minimal allele-preferential
   luciferase activity in MIA PaCa-2, PANC-1 and HEK293T cells
   (Fig. [114]2d). The second SNP, rs13303327, demonstrated
   allele-preferential luciferase activity with the alternate (A) allele
   having a stronger regulatory effect (Fig. [115]2e). The third SNP,
   rs13303160, exhibited strong promoter activity with the alternate (A)
   allele demonstrating an allele-preferential effect in the reverse
   orientation in the MIA PaCa-2 and HEK293T cells (Fig. [116]2f).

   While rs13303010 and rs13303327 lie in a 1328 bp region between the TSS
   of the NOC2L and KLHL17 genes, rs13303160 is located 360 bp downstream
   of the 3’UTR of KLHL17 and 303 bp upstream of the TSS for PLEKHN1
   (Fig. [117]1c), suggesting it could influence promoter and/or enhancer
   activity at this locus. We therefore tested the sequence surrounding
   rs13303160 as an enhancer upstream of a minimal promoter and the
   luciferase gene (in the pGL4.23 vector). As an enhancer element, the
   rs13303160 constructs demonstrated strong enhancer activity in all
   three cell lines with the alternate (A) allele exhibiting stronger
   activity (Fold change (FC) 1.5–2.8, P-value = 1 × 10^−6–0.02;
   Fig. [118]2g). Thus, through EMSA and luciferase assays, we narrowed
   the set of seven fine-mapped candidate functional variants down to two
   (rs13303327 and rs13303160) that each demonstrate allele-preferential
   protein binding and gene regulatory activity.

Identifying allele-preferential protein binding

   To identify TFs potentially mediating the allele-preferential
   regulatory activity we observed, we performed an in silico TF motif
   analysis using PERFECTOS-APE^[119]25. This analysis predicts TF binding
   potential for both alleles of a variant and provides a P-value for the
   estimated strength of binding. We then calculated the fold-change
   between P-values as a proxy for the binding affinity change between
   alleles.

   For rs13303327, we identified several E74-like factors (ELF), members
   of the E-twenty-six (ETS) family of transcription factors, having a
   13-24-fold difference in binding P-values between the A and G alleles
   (Table [120]3, Supplementary Table [121]1). The ELF transcription
   factors recognize the motif GGAA (Fig. [122]3a) which is disrupted by
   the risk allele-G at rs13303327 by replacing the last A with a G.

Table 3.

   Predicted TFs with allelic binding preferences for rs13303327 and