Abstract

Background

   Pancreatic cancer is a highly lethal cancer with limited diagnostic and
   therapeutic modalities.

Methods

   To begin to explore the genomic landscape of pancreatic cancer, we used
   massively parallel sequencing to catalog and compare transcribed
   regions and potential regulatory elements in two human cell lines
   derived from normal and cancerous pancreas.

Results

   By RNA-sequencing, we identified 2,146 differentially expressed genes
   in these cell lines that were enriched in cancer related pathways and
   biological processes that include cell adhesion, growth factor and
   receptor activity, signaling, transcription and differentiation. Our
   high throughput Chromatin immunoprecipitation (ChIP) sequence analysis
   furthermore identified over 100,000 regions enriched in epigenetic
   marks, showing either positive (H3K4me1, H3K4me3, RNA Pol II) or
   negative (H3K27me3) correlation with gene expression. Notably, an
   overall enrichment of RNA Pol II binding and depletion of H3K27me3
   binding were seen in the cancer derived cell line as compared to the
   normal derived cell line. By selecting genes for further assessment
   based on this difference, we confirmed enhanced expression of aldehyde
   dehydrogenase 1A3 (ALDH1A3) in two larger sets of pancreatic cancer
   cell lines and in tumor tissues as compared to normal derived tissues.

Conclusions

   As aldehyde dehydrogenase (ALDH) activity is a key feature of cancer
   stem cells, our results indicate that a member of the ALDH superfamily,
   ALDH1A3, may be upregulated in pancreatic cancer, where it could mark
   pancreatic cancer stem cells.

   Keywords: Pancreatic cancer, Transcriptome, Epigenome, Sequencing,
   ALDH1A3

Background

   Although pancreatic cancer is the tenth most commonly diagnosed cancer
   in the U.S., it is the fourth most common cause of cancer mortality
   with close to 37,000 deaths per year in the U.S. [[50]1]. The
   aggressive nature of this cancer is further emphasized by the dismal
   five-year survival rate of less than 5% [[51]2]. Pancreatic ductal
   adenocarcinoma (PDAC), the most common form of pancreatic cancer, is
   thought to arise through a multistep process involving intermediate
   precursor lesions known as pancreatic intraepithelial neoplasias
   (PanINs) [[52]3]. Genome-wide sequencing approaches have revealed a
   complex set of somatic alterations in pancreatic tumors such as single
   base mutations, amplifications, deletions and complex rearrangements
   that drive cancerous growth through specific signaling pathways
   [[53]4,[54]5]. The genes most frequently altered in pancreatic cancer
   are KRAS, TP53, CDKN2A and SMAD4, leading to an activation of growth
   promoting and cell survival pathways, and inactivation of tumor
   suppressors and apoptotic pathways [[55]4].

   In addition to an accumulation of somatic mutations that affect the DNA
   sequence directly, epigenetic events can lead to an increase in cell
   growth or survival and thereby contribute to tumor development [[56]6].
   DNA methylation and post-translational modification of histone proteins
   are epigenetic mechanisms that regulate gene expression in normal and
   cancerous cells. DNA methylation at specific genes can increase or
   decrease in cancer and is often associated with an inactivation of
   tumor suppressor genes [[57]7]. Post-translational modification of
   histones mediates appropriate gene expression by regulating access of
   the transcriptional machinery to the underlying DNA [[58]8]. Changes in
   DNA methylation have been investigated in pancreatic cancer
   [[59]9-[60]11] but genome wide maps of histone modification patterns
   have not been reported in this disease.

   To enhance our understanding of the pancreatic cancer genome, we
   catalogued transcribed sequences and potential functional elements in
   cell lines derived from normal and neoplastic pancreatic tissues by
   next generation RNA-sequencing and chromatin immunoprecipitation (ChIP)
   with massively parallel sequencing. Differentially expressed genes and
   pathways were identified, and gene expression was correlated to histone
   modification patterns that mark active and repressed chromatin.

Methods

Cell lines and cell culture

   Two human pancreatic cell lines were purchased from ATCC: hTERT-HPNE
   (derived from normal ductal pancreatic tissue, immortalized with the
   TERT gene; modal chromosome number = 46) [[61]12,[62]13] and PANC-1
   (derived from pancreatic ductal adenocarcinoma; modal chromosome number
   = 63) [[63]14]. The hTERT-HPNE cell line was cultured in 75% DMEM
   (ATCC), 25% Medium M3 Base (Incell Corporation), supplemented with 5%
   fetal bovine serum (FBS), 10 ng/ml human recombinant EGF, 5.5 mM
   D-glucose (1 g/L) and 750 ng/ml puromycin. The PANC-1 cell line was
   cultured in DMEM (ATCC) supplemented with 10% FBS. Cells were grown to
   log phase and harvested for RNA and ChIP sequencing experiments at
   approximately 60-80% confluency. The cell lines were harvested at
   passage 18 (PANC-1) and 23 (hTERT-HPNE) after purchase from ATCC.

RNA preparation

   Total RNA was isolated from ~5×10^6 cells using the mirVana™ miRNA
   isolation kit (Ambion) according to the manufacturer’s protocol
   (retaining the small RNA fraction in the pool of total RNA). DNAse
   treatment was performed with TURBO DNase (Ambion); RNA quantity was
   assessed by Nano Drop (Thermo Scientific) and RNA quality by using the
   Agilent 2100 Bioanalyzer (Agilent Technologies). RNA quality (RIN)
   scores for RNA prepared from cell lines were > 9.0.

ChIP and template preparation

   Chromatin immunoprecipitation (ChIP) was performed with the ChIP-IT™
   Express kit (Active Motif) according to the manufacturer’s protocol.
   Briefly, cells were grown to log phase and cross-linked with 1%
   formaldehyde for 10 minutes at room temperature. The fixation reaction
   was stopped by a glycine stop solution and cells were washed in PBS and
   homogenized in lysis buffer supplemented with protease inhibitors.
   Shearing was performed by sonication in a Diagenode Bioruptor (30-sec
   on/off pulses for 15–20 min at high setting) to obtain 200–500 bp DNA
   fragments. After centrifugation, the supernatant was used for chromatin
   immunoprecipitation (from ~1×10^7 cells each) using antibodies for:
   anti-mono-methylated histone H3 at lysine 4 (H3K4me1) (Abcam, ab8895),
   anti-tri-methylated histone H3 at lysine 4 (H3K4me3) (Abcam, ab12209),
   anti-tri-methylated histone H3 at lysine 27 (H3K27me3) (Abcam, ab6002)
   and anti-RNA polymerase II (Pol II) (Abcam, ab817). Input samples
   consisted of 10 μL sonicated chromatin. For each ChIP, 20 μg sheared
   chromatin was incubated overnight at 4°C on an end-to-end rotator, with
   8μg antibody, protein G magnetic beads and protease inhibitors.
   Finally, the ChIP and input DNA samples were reverse cross-linked,
   treated with proteinase K and purified with the QIAquick PCR
   Purification Kit (Qiagen).

Library preparation and sequencing

   For mRNA-seq sample preparation, the mRNA-seq Sample Prep Kit
   (Illumina) was used according to the manufacturer’s protocol. Briefly,
   1 μg total RNA was used for polyA mRNA selection using Sera-mag
   oligo(dT) beads, followed by thermal fragmentation at 94°C. First
   strand cDNA synthesis was performed using Superscript II reverse
   transcriptase and random primers. Second strand synthesis was performed
   using DNA Polymerase I followed by end repair with T4 DNA polymerase,
   Klenow DNA polymerase and T4 polynucleotide kinase (PNK). Finally, the
   double-stranded cDNA was ligated to Illumina paired-end (PE) adaptors
   and size selection performed for fragments in the 350±25 bp range.
   Libraries were amplified using Phusion DNA polymerase, followed by
   purification with the QIAquick PCR Purification Kit (Qiagen). Amplified
   libraries were diluted with Elution Buffer to a final concentration of
   10 nM and run at a concentration of 4–6 pM on two Genome Analyzer
   (GAII) lanes.

   MicroRNA-seq sample preparation was performed with the Small RNA v1.5
   sample prep Kit (Illumina) according to the manufacturer’s protocol.
   Briefly, 3′ and 5′ small RNA adapters were ligated to 10 μg total RNA
   followed by reverse transcription with Superscript II and amplification
   using Phusion DNA polymerase. This was followed by size selection of
   cDNA in the 100–108 bp range and measurement using a Bioanalyzer.
   Amplified miRNA-seq libraries were diluted with Elution Buffer to a
   final concentration of 10 nM and run at a concentration of 4–6 pM on
   one Genome Analyzer (GAIIx) lanes.

   For ChIP-seq sample preparation, pooled ChIP reactions (10 ng) were
   used to prepare a single library using the ChIP-seq Sample Prep Kit
   (Illumina) according to the manufacturer’s protocol. Briefly, DNA ends
   were repaired with T4 DNA polymerase, Klenow polymerase and T4 PNK. The
   Klenow fragment (3′ to 5′ exo minus) was then used to generate a
   protruding 3′ A base and adapter oligos were ligated using DNA ligase.
   ChIP DNA in the 200±25 bp range was recovered and amplified using
   Phusion polymerase. Library concentration was assessed using a
   Bioanalyzer. Amplified libraries were diluted with Elution Buffer to a
   final concentration of 10 nM and run at a concentration of 4–6 pM on
   2–3 Genome Analyzer (GAII or GAIIx) lanes. All next-generation
   sequencing was carried out at the National Cancer Institute’s Center
   for Cancer Research Sequencing Facility operated by the Advanced
   Technology Program of SAIC-Frederick, Inc. (Gaithersburg, MD).

Sequence alignment and analysis

   mRNA-sequencing: paired-end RNA-seq was repeated twice for each cell
   line to obtain reliable numbers of sequence reads (Additional file
   [64]1: Table S1A). The median number of reads (35 bp) generated per
   paired-end experiment was 8.9 M (8.6 to 11.2 M). Total reads were
   combined for each cell line: 19,778,468 for the hTERT-HPNE and
   17,769,249 for PANC-1 cells. No significant difference was observed in
   the number of reads generated between the two cell lines (Student’s
   t-test, P=0.53). Paired-end reads were mapped to the RefSeq database
   (National Center for Biotechnology Information (NCBI) build 37) using
   the Burrows-Wheeler Aligner (BWA) software with default parameters that
   allow up to 3 alignments for each read and up to 2 mismatches for the
   seed sequence (the first 25 bp of each read) [[65]15]. Reads that
   failed to map to RefSeq were mapped to the Ensembl database, which
   includes additional transcripts and pseudogenes [[66]16]. Remaining
   unmapped reads were mapped to the human genome assembly (NCBI build
   37). The analysis pipeline is shown in Additional file [67]2: Figure
   S1. Gene expression was calculated in reads per kilobase of exon per
   million mapped sequence reads (RPKM) [[68]17]. After filtering, we
   included 11,249 genes expressed above 1 RPKM in at least one cell line
   for further analysis. The edgeR package was used to identify genes that
   were differentially expressed in the two cell lines using total tag
   count values for each gene [[69]18]. Data was normalized using the
   weighted trimmed mean of log-ratio values (using hTERT-HPNE as a
   reference) to account for library size; statistical analysis was
   performed using the Fisher’s exact test. We considered genes to be
   differentially expressed between cell lines when the Benjamini-Hochberg
   false discovery rate (FDR) was less than 0.05 and the fold change
   difference in expression was equal to or greater than 3.

   miRNA-sequencing: After trimming adaptor sequences and filtering
   low-quality reads, a total of 9,262,094 and 18,006,865 single-end short
   (22 bp) sequence reads were obtained from PANC-1 and hTERT-HPNE cells,
   respectively (Additional file [70]1: Table S1B). The miRNAkey software
   [[71]19] was used to remove adaptor sequence from the 3′-ends of reads,
   to map raw reads to the miRBase mature database (release 16) [[72]20],
   calculate normalized RPKM expression values and quantify differentially
   expressed miRNAs. Two hundred and fifty eight miRNA genes were detected
   at 1 RPKM or higher in at least one of the cell lines. Genes were
   considered differentially expressed with a fold change difference ≥ 3
   and a Bonferroni corrected Chi-square P-value < 0.05.

   Pathway enrichment analysis: Gene set enrichment analysis of
   differentially expressed mRNA and miRNA genes was performed based on
   KEGG (Kyoto Encyclopedia of Genes and Genomes) [[73]21] and GO (Gene
   Ontology) [[74]22] annotations using GOseq, as it accounts for the bias
   of over-detection of differential expression for long and highly
   expressed transcripts [[75]23]. The Wallenius approximation method with
   Benjamini-Hochberg corrected P-values was used to determine enriched
   pathways; and an FDR < 0.05 was considered significant.

   ChIP-sequencing: ChIP-seq single-end experiments were run on 2–3 lanes
   to obtain an adequate number of reads (Additional file [76]1: Table
   S1C). Matching input DNA was sequenced to obtain a reference for each
   cell line. The total number of short (35 bp) reads generated per lane
   varied from 4,509,035 to 27,577,446 with a median of 19,279,525. Raw
   reads were combined for each histone modification mark and RNA Pol II
   as well as for input DNA for each of the cell lines. No significant
   difference was observed in the number of reads generated in the two
   cell lines for the different histone modification marks or for RNA Pol
   II (Student’s t-test, P-value range: 0.17-0.83). Alignment to the human
   genome assembly (NCBI build 37) was performed using Bowtie, allowing
   for up to 2 mismatches and only one (best) alignment [[77]24]. The
   SICER software [[78]25] was used to identify qualified peaks (islands)
   of histone and Pol II binding by comparing sequence reads from
   immunoprecipitated and input DNA. A consolidating window size of 200
   bps was used, a gap size of 200 bps (H3K4me1, H3K4me3 and Pol II) or
   600 bps (H3K27me3), effective genome size of 81%, ratio of enrichment
   between experimental data (PANC-1) and control (hTERT-HPNE) ≥3 and an
   FDR <0.05. Differentially enriched epigenetic marks across the two cell
   lines were identified by the SICER-df function. A cis-regulatory
   element annotation system (CEAS) was used to attain summary statistics
   on ChIP enrichment peaks based on location in promoters, gene bodies or
   nongenic regions using the RefSeq database [[79]26].

Integration of ChIP-seq and RNA-seq

   Genes were divided into quartiles based on digital expression levels
   (RPKM values) for each cell line. Global profiling curves were
   generated for genes in each of the quartile groups by plotting the read
   distributions (tags were binned into 25 bps bins and trimmed based on
   Poisson distribution) of different histone modification marks and RNA
   Pol II within 5,000 bp up- and downstream of transcription start sites
   (TSS) for RefSeq genes.

Genomic copy number alterations

   To assess if a bias existed in read mapping for ChIP-Seq because of
   chromosomal copy number gains or losses, we extracted DNA (Gentra
   Puregene Tissue Kit, Qiagen) from the hTERT-HPNE and PANC-1 cell lines,
   and genotyped at NCI’s Cancer Genomics Research Laboratory using the
   Illumina Human Hap 1M-Duo chip. The Illumina GenomeStudio software was
   used to compute affinity-normalized probe intensities (normalized
   genotype probe intensities, log R ratio (LRR) and B allele frequency
   (BAF)), quality scores and to call genotypes. Renormalized (quantile)
   LRR and BAF values were then analyzed using a custom software pipeline,
   R-Gada [[80]27], to detect whole-chromosome and segmental events as
   previously described [[81]28].

   We only observed chromosomal copy number alterations in the PANC-1 cell
   line but not in the hTERT-HPNE cells. Adjacent events were merged if
   they had an identical state and distance between segments was <1 Mb.
   Each event was then classified as copy number gain, loss or copy
   neutral loss of heterozygosity (CNLOH). Since we observed only 200
   (0.49%) regions enriched for RNA Pol II binding and 41 (0.57%) regions
   enriched for H3K27me3 binding in regions that were lost in PANC-1
   cells, these were excluded from further analysis. To estimate a
   possible bias in mapping, we then calculated the number of regions
   enriched for RNA Pol II and H3K27me3 binding in the PANC-1 cells per 1
   Mb for each autosomal chromosome for gain, CNLOH and copy neutral
   events. For interclass, unpaired comparisons, we used the Mann–Whitney
   U-test to assess differences in the distribution of genomic regions
   enriched for RNA Pol II and H3K27me3 binding in the PANC-1 cells.

   To assess if genes selected for replication by TaqMan expression
   analysis (see below) were located in regions of copy number gain or
   loss, we also genotyped the following cell lines and assessed
   chromosomal abnormalities as described above: AsPC-1, BxPC-3, Capan-1,
   CFPAC-1, Hs 766T, SU.86.86 and SW 1990.

Assessment of differential mRNA expression analysis using Real-Time qPCR

   An assessment of differential expression levels was attempted for 4
   genes using custom mRNA Taqman expression assays (Applied Biosystems):
   ALDH1A3 (Hs00167476_m1), ITGBL1 (Hs00191224_m1), NFE2L3 (Hs00852569_g1)
   and SEMA4B (Hs00384240_m1). Fresh frozen pancreatic tissue samples (10
   normal and 10 tumor derived) were obtained from the Mayo Clinic in
   Rochester MN (approved by the Institutional Review Board of the Mayo
   Clinic). All tumors were from patients diagnosed as adenocarcinoma of
   the pancreas, with tumor percentages ≥70%; all normal derived samples
   were adjacent to tumors (unmatched to tumors), confirmed by a
   pathologist to be normal by pathology and with ≥80% epithelial
   component. Eleven additional pancreatic cancer cell lines were
   purchased from ATCC and cultured according to their guidelines: AsPC-1,
   BxPC-3, Capan-1, Capan-2, CFPAC-1, Hs 766T, HPAF-II, MIA PaCa-2, Panc
   10.05, SU.86.86 and SW 1990. An additional 39 pancreatic cancer cell
   lines (see Additional file [82]3: Table S6 for names of cell lines and
   references) were kindly provided by Dr. Udo Rudloff, Surgery Branch,