Abstract Background Pancreatic cancer is a highly lethal cancer with limited diagnostic and therapeutic modalities. Methods To begin to explore the genomic landscape of pancreatic cancer, we used massively parallel sequencing to catalog and compare transcribed regions and potential regulatory elements in two human cell lines derived from normal and cancerous pancreas. Results By RNA-sequencing, we identified 2,146 differentially expressed genes in these cell lines that were enriched in cancer related pathways and biological processes that include cell adhesion, growth factor and receptor activity, signaling, transcription and differentiation. Our high throughput Chromatin immunoprecipitation (ChIP) sequence analysis furthermore identified over 100,000 regions enriched in epigenetic marks, showing either positive (H3K4me1, H3K4me3, RNA Pol II) or negative (H3K27me3) correlation with gene expression. Notably, an overall enrichment of RNA Pol II binding and depletion of H3K27me3 binding were seen in the cancer derived cell line as compared to the normal derived cell line. By selecting genes for further assessment based on this difference, we confirmed enhanced expression of aldehyde dehydrogenase 1A3 (ALDH1A3) in two larger sets of pancreatic cancer cell lines and in tumor tissues as compared to normal derived tissues. Conclusions As aldehyde dehydrogenase (ALDH) activity is a key feature of cancer stem cells, our results indicate that a member of the ALDH superfamily, ALDH1A3, may be upregulated in pancreatic cancer, where it could mark pancreatic cancer stem cells. Keywords: Pancreatic cancer, Transcriptome, Epigenome, Sequencing, ALDH1A3 Background Although pancreatic cancer is the tenth most commonly diagnosed cancer in the U.S., it is the fourth most common cause of cancer mortality with close to 37,000 deaths per year in the U.S. [[50]1]. The aggressive nature of this cancer is further emphasized by the dismal five-year survival rate of less than 5% [[51]2]. Pancreatic ductal adenocarcinoma (PDAC), the most common form of pancreatic cancer, is thought to arise through a multistep process involving intermediate precursor lesions known as pancreatic intraepithelial neoplasias (PanINs) [[52]3]. Genome-wide sequencing approaches have revealed a complex set of somatic alterations in pancreatic tumors such as single base mutations, amplifications, deletions and complex rearrangements that drive cancerous growth through specific signaling pathways [[53]4,[54]5]. The genes most frequently altered in pancreatic cancer are KRAS, TP53, CDKN2A and SMAD4, leading to an activation of growth promoting and cell survival pathways, and inactivation of tumor suppressors and apoptotic pathways [[55]4]. In addition to an accumulation of somatic mutations that affect the DNA sequence directly, epigenetic events can lead to an increase in cell growth or survival and thereby contribute to tumor development [[56]6]. DNA methylation and post-translational modification of histone proteins are epigenetic mechanisms that regulate gene expression in normal and cancerous cells. DNA methylation at specific genes can increase or decrease in cancer and is often associated with an inactivation of tumor suppressor genes [[57]7]. Post-translational modification of histones mediates appropriate gene expression by regulating access of the transcriptional machinery to the underlying DNA [[58]8]. Changes in DNA methylation have been investigated in pancreatic cancer [[59]9-[60]11] but genome wide maps of histone modification patterns have not been reported in this disease. To enhance our understanding of the pancreatic cancer genome, we catalogued transcribed sequences and potential functional elements in cell lines derived from normal and neoplastic pancreatic tissues by next generation RNA-sequencing and chromatin immunoprecipitation (ChIP) with massively parallel sequencing. Differentially expressed genes and pathways were identified, and gene expression was correlated to histone modification patterns that mark active and repressed chromatin. Methods Cell lines and cell culture Two human pancreatic cell lines were purchased from ATCC: hTERT-HPNE (derived from normal ductal pancreatic tissue, immortalized with the TERT gene; modal chromosome number = 46) [[61]12,[62]13] and PANC-1 (derived from pancreatic ductal adenocarcinoma; modal chromosome number = 63) [[63]14]. The hTERT-HPNE cell line was cultured in 75% DMEM (ATCC), 25% Medium M3 Base (Incell Corporation), supplemented with 5% fetal bovine serum (FBS), 10 ng/ml human recombinant EGF, 5.5 mM D-glucose (1 g/L) and 750 ng/ml puromycin. The PANC-1 cell line was cultured in DMEM (ATCC) supplemented with 10% FBS. Cells were grown to log phase and harvested for RNA and ChIP sequencing experiments at approximately 60-80% confluency. The cell lines were harvested at passage 18 (PANC-1) and 23 (hTERT-HPNE) after purchase from ATCC. RNA preparation Total RNA was isolated from ~5×10^6 cells using the mirVana™ miRNA isolation kit (Ambion) according to the manufacturer’s protocol (retaining the small RNA fraction in the pool of total RNA). DNAse treatment was performed with TURBO DNase (Ambion); RNA quantity was assessed by Nano Drop (Thermo Scientific) and RNA quality by using the Agilent 2100 Bioanalyzer (Agilent Technologies). RNA quality (RIN) scores for RNA prepared from cell lines were > 9.0. ChIP and template preparation Chromatin immunoprecipitation (ChIP) was performed with the ChIP-IT™ Express kit (Active Motif) according to the manufacturer’s protocol. Briefly, cells were grown to log phase and cross-linked with 1% formaldehyde for 10 minutes at room temperature. The fixation reaction was stopped by a glycine stop solution and cells were washed in PBS and homogenized in lysis buffer supplemented with protease inhibitors. Shearing was performed by sonication in a Diagenode Bioruptor (30-sec on/off pulses for 15–20 min at high setting) to obtain 200–500 bp DNA fragments. After centrifugation, the supernatant was used for chromatin immunoprecipitation (from ~1×10^7 cells each) using antibodies for: anti-mono-methylated histone H3 at lysine 4 (H3K4me1) (Abcam, ab8895), anti-tri-methylated histone H3 at lysine 4 (H3K4me3) (Abcam, ab12209), anti-tri-methylated histone H3 at lysine 27 (H3K27me3) (Abcam, ab6002) and anti-RNA polymerase II (Pol II) (Abcam, ab817). Input samples consisted of 10 μL sonicated chromatin. For each ChIP, 20 μg sheared chromatin was incubated overnight at 4°C on an end-to-end rotator, with 8μg antibody, protein G magnetic beads and protease inhibitors. Finally, the ChIP and input DNA samples were reverse cross-linked, treated with proteinase K and purified with the QIAquick PCR Purification Kit (Qiagen). Library preparation and sequencing For mRNA-seq sample preparation, the mRNA-seq Sample Prep Kit (Illumina) was used according to the manufacturer’s protocol. Briefly, 1 μg total RNA was used for polyA mRNA selection using Sera-mag oligo(dT) beads, followed by thermal fragmentation at 94°C. First strand cDNA synthesis was performed using Superscript II reverse transcriptase and random primers. Second strand synthesis was performed using DNA Polymerase I followed by end repair with T4 DNA polymerase, Klenow DNA polymerase and T4 polynucleotide kinase (PNK). Finally, the double-stranded cDNA was ligated to Illumina paired-end (PE) adaptors and size selection performed for fragments in the 350±25 bp range. Libraries were amplified using Phusion DNA polymerase, followed by purification with the QIAquick PCR Purification Kit (Qiagen). Amplified libraries were diluted with Elution Buffer to a final concentration of 10 nM and run at a concentration of 4–6 pM on two Genome Analyzer (GAII) lanes. MicroRNA-seq sample preparation was performed with the Small RNA v1.5 sample prep Kit (Illumina) according to the manufacturer’s protocol. Briefly, 3′ and 5′ small RNA adapters were ligated to 10 μg total RNA followed by reverse transcription with Superscript II and amplification using Phusion DNA polymerase. This was followed by size selection of cDNA in the 100–108 bp range and measurement using a Bioanalyzer. Amplified miRNA-seq libraries were diluted with Elution Buffer to a final concentration of 10 nM and run at a concentration of 4–6 pM on one Genome Analyzer (GAIIx) lanes. For ChIP-seq sample preparation, pooled ChIP reactions (10 ng) were used to prepare a single library using the ChIP-seq Sample Prep Kit (Illumina) according to the manufacturer’s protocol. Briefly, DNA ends were repaired with T4 DNA polymerase, Klenow polymerase and T4 PNK. The Klenow fragment (3′ to 5′ exo minus) was then used to generate a protruding 3′ A base and adapter oligos were ligated using DNA ligase. ChIP DNA in the 200±25 bp range was recovered and amplified using Phusion polymerase. Library concentration was assessed using a Bioanalyzer. Amplified libraries were diluted with Elution Buffer to a final concentration of 10 nM and run at a concentration of 4–6 pM on 2–3 Genome Analyzer (GAII or GAIIx) lanes. All next-generation sequencing was carried out at the National Cancer Institute’s Center for Cancer Research Sequencing Facility operated by the Advanced Technology Program of SAIC-Frederick, Inc. (Gaithersburg, MD). Sequence alignment and analysis mRNA-sequencing: paired-end RNA-seq was repeated twice for each cell line to obtain reliable numbers of sequence reads (Additional file [64]1: Table S1A). The median number of reads (35 bp) generated per paired-end experiment was 8.9 M (8.6 to 11.2 M). Total reads were combined for each cell line: 19,778,468 for the hTERT-HPNE and 17,769,249 for PANC-1 cells. No significant difference was observed in the number of reads generated between the two cell lines (Student’s t-test, P=0.53). Paired-end reads were mapped to the RefSeq database (National Center for Biotechnology Information (NCBI) build 37) using the Burrows-Wheeler Aligner (BWA) software with default parameters that allow up to 3 alignments for each read and up to 2 mismatches for the seed sequence (the first 25 bp of each read) [[65]15]. Reads that failed to map to RefSeq were mapped to the Ensembl database, which includes additional transcripts and pseudogenes [[66]16]. Remaining unmapped reads were mapped to the human genome assembly (NCBI build 37). The analysis pipeline is shown in Additional file [67]2: Figure S1. Gene expression was calculated in reads per kilobase of exon per million mapped sequence reads (RPKM) [[68]17]. After filtering, we included 11,249 genes expressed above 1 RPKM in at least one cell line for further analysis. The edgeR package was used to identify genes that were differentially expressed in the two cell lines using total tag count values for each gene [[69]18]. Data was normalized using the weighted trimmed mean of log-ratio values (using hTERT-HPNE as a reference) to account for library size; statistical analysis was performed using the Fisher’s exact test. We considered genes to be differentially expressed between cell lines when the Benjamini-Hochberg false discovery rate (FDR) was less than 0.05 and the fold change difference in expression was equal to or greater than 3. miRNA-sequencing: After trimming adaptor sequences and filtering low-quality reads, a total of 9,262,094 and 18,006,865 single-end short (22 bp) sequence reads were obtained from PANC-1 and hTERT-HPNE cells, respectively (Additional file [70]1: Table S1B). The miRNAkey software [[71]19] was used to remove adaptor sequence from the 3′-ends of reads, to map raw reads to the miRBase mature database (release 16) [[72]20], calculate normalized RPKM expression values and quantify differentially expressed miRNAs. Two hundred and fifty eight miRNA genes were detected at 1 RPKM or higher in at least one of the cell lines. Genes were considered differentially expressed with a fold change difference ≥ 3 and a Bonferroni corrected Chi-square P-value < 0.05. Pathway enrichment analysis: Gene set enrichment analysis of differentially expressed mRNA and miRNA genes was performed based on KEGG (Kyoto Encyclopedia of Genes and Genomes) [[73]21] and GO (Gene Ontology) [[74]22] annotations using GOseq, as it accounts for the bias of over-detection of differential expression for long and highly expressed transcripts [[75]23]. The Wallenius approximation method with Benjamini-Hochberg corrected P-values was used to determine enriched pathways; and an FDR < 0.05 was considered significant. ChIP-sequencing: ChIP-seq single-end experiments were run on 2–3 lanes to obtain an adequate number of reads (Additional file [76]1: Table S1C). Matching input DNA was sequenced to obtain a reference for each cell line. The total number of short (35 bp) reads generated per lane varied from 4,509,035 to 27,577,446 with a median of 19,279,525. Raw reads were combined for each histone modification mark and RNA Pol II as well as for input DNA for each of the cell lines. No significant difference was observed in the number of reads generated in the two cell lines for the different histone modification marks or for RNA Pol II (Student’s t-test, P-value range: 0.17-0.83). Alignment to the human genome assembly (NCBI build 37) was performed using Bowtie, allowing for up to 2 mismatches and only one (best) alignment [[77]24]. The SICER software [[78]25] was used to identify qualified peaks (islands) of histone and Pol II binding by comparing sequence reads from immunoprecipitated and input DNA. A consolidating window size of 200 bps was used, a gap size of 200 bps (H3K4me1, H3K4me3 and Pol II) or 600 bps (H3K27me3), effective genome size of 81%, ratio of enrichment between experimental data (PANC-1) and control (hTERT-HPNE) ≥3 and an FDR <0.05. Differentially enriched epigenetic marks across the two cell lines were identified by the SICER-df function. A cis-regulatory element annotation system (CEAS) was used to attain summary statistics on ChIP enrichment peaks based on location in promoters, gene bodies or nongenic regions using the RefSeq database [[79]26]. Integration of ChIP-seq and RNA-seq Genes were divided into quartiles based on digital expression levels (RPKM values) for each cell line. Global profiling curves were generated for genes in each of the quartile groups by plotting the read distributions (tags were binned into 25 bps bins and trimmed based on Poisson distribution) of different histone modification marks and RNA Pol II within 5,000 bp up- and downstream of transcription start sites (TSS) for RefSeq genes. Genomic copy number alterations To assess if a bias existed in read mapping for ChIP-Seq because of chromosomal copy number gains or losses, we extracted DNA (Gentra Puregene Tissue Kit, Qiagen) from the hTERT-HPNE and PANC-1 cell lines, and genotyped at NCI’s Cancer Genomics Research Laboratory using the Illumina Human Hap 1M-Duo chip. The Illumina GenomeStudio software was used to compute affinity-normalized probe intensities (normalized genotype probe intensities, log R ratio (LRR) and B allele frequency (BAF)), quality scores and to call genotypes. Renormalized (quantile) LRR and BAF values were then analyzed using a custom software pipeline, R-Gada [[80]27], to detect whole-chromosome and segmental events as previously described [[81]28]. We only observed chromosomal copy number alterations in the PANC-1 cell line but not in the hTERT-HPNE cells. Adjacent events were merged if they had an identical state and distance between segments was <1 Mb. Each event was then classified as copy number gain, loss or copy neutral loss of heterozygosity (CNLOH). Since we observed only 200 (0.49%) regions enriched for RNA Pol II binding and 41 (0.57%) regions enriched for H3K27me3 binding in regions that were lost in PANC-1 cells, these were excluded from further analysis. To estimate a possible bias in mapping, we then calculated the number of regions enriched for RNA Pol II and H3K27me3 binding in the PANC-1 cells per 1 Mb for each autosomal chromosome for gain, CNLOH and copy neutral events. For interclass, unpaired comparisons, we used the Mann–Whitney U-test to assess differences in the distribution of genomic regions enriched for RNA Pol II and H3K27me3 binding in the PANC-1 cells. To assess if genes selected for replication by TaqMan expression analysis (see below) were located in regions of copy number gain or loss, we also genotyped the following cell lines and assessed chromosomal abnormalities as described above: AsPC-1, BxPC-3, Capan-1, CFPAC-1, Hs 766T, SU.86.86 and SW 1990. Assessment of differential mRNA expression analysis using Real-Time qPCR An assessment of differential expression levels was attempted for 4 genes using custom mRNA Taqman expression assays (Applied Biosystems): ALDH1A3 (Hs00167476_m1), ITGBL1 (Hs00191224_m1), NFE2L3 (Hs00852569_g1) and SEMA4B (Hs00384240_m1). Fresh frozen pancreatic tissue samples (10 normal and 10 tumor derived) were obtained from the Mayo Clinic in Rochester MN (approved by the Institutional Review Board of the Mayo Clinic). All tumors were from patients diagnosed as adenocarcinoma of the pancreas, with tumor percentages ≥70%; all normal derived samples were adjacent to tumors (unmatched to tumors), confirmed by a pathologist to be normal by pathology and with ≥80% epithelial component. Eleven additional pancreatic cancer cell lines were purchased from ATCC and cultured according to their guidelines: AsPC-1, BxPC-3, Capan-1, Capan-2, CFPAC-1, Hs 766T, HPAF-II, MIA PaCa-2, Panc 10.05, SU.86.86 and SW 1990. An additional 39 pancreatic cancer cell lines (see Additional file [82]3: Table S6 for names of cell lines and references) were kindly provided by Dr. Udo Rudloff, Surgery Branch,