Abstract Introduction The potato (Solanum tuberosum) cultivar ‘Xin Daping’ is tetraploid with white skin and white flesh, while the cultivar ‘Hei Meiren’ is also tetraploid with purple skin and purple flesh. Comparative transcriptome analysis of white and purple cultivars was carried out using high-throughput RNA sequencing in order to further understand the mechanism of anthocyanin biosynthesis in potato. Methods and Results By aligning transcript reads to the recently published diploid potato genome and de novo assembly, 209 million paired-end Illumina RNA-seq reads from these tetraploid cultivars were assembled on to 60,930 transcripts, of which 27,754 (45.55%) are novel transcripts and 9393 alternative transcripts. Using a comparison of the RNA-sequence datasets, multiple versions of the genes encoding anthocyanin biosynthetic steps and regulatory transcription factors were identified. Other novel genes potentially involved in anthocyanin biosynthesis in potato tubers were also discovered. Real-time qPCR validation of candidate genes revealed good correlation with the transcriptome data. SNPs (Single Nucleotide Polymorphism) and indels were predicted and validated for the transcription factors MYB AN1 and bHLH1 and the biosynthetic gene anthocyanidin 3-O-glucosyltransferase (UFGT). Conclusions These results contribute to our understanding of the molecular mechanism of white and purple potato development, by identifying differential responses of biosynthetic gene family members together with the variation in structural genes and transcription factors in this highly heterozygous crop. This provides an excellent platform and resource for future genetic and functional genomic research. Introduction Potato (Solanum tuberosum L.) is one of the world’s most important food crops, and is recognized as a source of health-promoting antioxidants for the human diet [[45]1]. Potato tubers contain significant amounts of polyphenols. Anthocyanins are the predominant group of visible polyphenols that comprise the red, purple, and blue pigmentation of potato [[46]2–[47]4]. As well as playing important physiological roles in plant response to stress and acting as attractants for pollination and seed dispersal, anthocyanins have high free radical scavenging activity, anti-inflammatory and anti-microbial properties, and dietary consumption has been associated with a reduced incidence of cardiovascular diseases, cancers, neurodegenerative diseases, osteoporosis, or diabetes [[48]5–[49]7]. Pigmented potato genotypes have been shown to contain significantly higher levels of anthocyanins and antioxidant activity; especially cultivars with purple and red skin and/or flesh, compared to those with white and yellow tubers [[50]4–[51]7]. Therefore, a high anthocyanin potato has potential as a new cultivar with enhanced health benefits. In many plant species, the anthocyanin biosynthetic pathway and its regulation has been elucidated. The genes that encode flavonoid biosynthetic steps and regulatory genes have also been cloned [[52]8, [53]9]. The biosynthesis of anthocyanin pigments is regulated at the transcriptional level by a MYB-bHLH-WD40 (MBW) transcription factor (TF) complex [[54]10, [55]11], composed of TFs from the R2R3-MYB, basic Helix-Loop-Helix (bHLH) and WD40 classes [[56]12, [57]13]. MYB TFs can be classified into three subfamilies based on the number of highly conserved imperfect repeats in the DNA-binding domain including R3 MYB (MYB1R) with one repeat, R2R3 MYB with two repeats, and R1R2R3 MYB (MYB3R) with three repeats [[58]14]. Those associated with up-regulation of the anthocyanin pathway are R2R3 MYBs. The first plant protein studied with a bHLH domain was Lc which is involved in the control of flavonoid/anthocyanin biosynthesis in maize [[59]15]. The first two hundred amino acids of this bHLH protein are required to interact with the MYB transcription factor partner, while the C-terminal of the protein interacts with WD40 proteins [[60]16]. The C-terminal ACT-like domain facilitates binding of the MYB to the promoter [[61]17]. WD proteins have four to eight imperfect tandem repeats and interact with other proteins through the WD repeat region [[62]18, [63]19]. Genes involved in the regulation of the anthocyanin pathway have been identified in many different species, for example AN1 and AN2 in Petunia [[64]20, [65]21]; C1 and P1 in Zea mays [[66]22, [67]23]; Rosea1 and Delila in Antirrhinum [[68]24, [69]25]; PAP1 and TTG8 in Arabidopsis [[70]26]; MYB1 and MdMYB10 in apples [[71]27, [72]28]; and VvMYBA in grape [[73]29]. In apple, two candidate bHLH transcription cofactors (bHLH3 and bHLH33) are also needed for activating promoters of anthocyanin biosynthetic genes and MYB10 autoregulation [[74]30]. In potato, previous studies have found that the biosynthesis of anthocyanin pigments in the periderm of the tuber is controlled in part by three loci, D, P, and R. Genetic evidence based on the co-localization in the potato genetic map of R, P and D indicated these loci encode dihydroflavonol 4-reductase (DFR), flavonoid 3’, 5’-hydroxylase (F3’5’H) and an R2R3 MYB transcription factor designated AN1[[75]31]. The recent development of next-generation sequencing technologies can generate sequences on an unprecedented scale with a markedly reduced cost compared with traditional technologies [[76]32]. Next-generation RNA-sequencing (RNA-Seq) has rapidly replaced microarrays as an approach to profile transcriptomes in a high-throughput way [[77]33]. It allows detection of transcripts with low abundance, identifies novel transcript units, and reveals their differential expression between different samples [[78]34, [79]35]. To date, there have been no reports of using RNA-Seq technology to analyze the control of pigmentation in different potato cultivars. The availability of the complete genome sequence of the doubled monoploid Solanum tubersosum clone DM1-3 516R44 [[80]36] has made RNA-seq more informative. In this study, we used next-generation sequencing technology and bioinformatics tools to analyze the transcriptome of tetraploid white and purple potato cultivars, by both de novo assembly of transcripts and alignment to the published diploid potato genome to analyse differentially expressed genes related to anthocyanin biosynthesis between the cultivars and discovered new versions of the pathway genes and potential transcription factors. In addition, putative SNPs were identified and validated in AN1 bHLH1 and UFGT. According to the generated RNA-seq datasets, there are significant differences in expression among genes involved in anthocyanin biosynthesis. These RNA-seq datasets enhance the available transcriptome data of potato, and improves our understanding of variations in tuber color to identify important genes for anthocyanin biosynthesis in potato. Materials and Methods Plant material, RNA extraction, library construction and RNA sequencing A purple potato (Solanum tuberosum L.) cultivar ‘Hei Meiren’ (purple skin and purple flesh), a white potato cultivar ‘Xin Daping’ (white skin and white flesh) ([81]Fig 1), two red potato cultivars ‘Gannongshu NO.5’ (red skin and white flesh) and ‘Qinshu NO.9’ (red skin, white flesh and red vascular ring) ([82]S1 Fig) were grown in a greenhouse at Gansu Agricultural University, China. Three potato cultivars ‘Agria’ (white skin and white flesh), ‘Red Jackets’ (red skin and white flesh) and ‘Heather’ (purple skin and white flesh) ([83]S1 Fig) were bought in a local supermarket, New Zealand. Five fresh tubers (diameter 4-5cm) were harvested, and cleaned with sterilized water. Skin tissue was carefully separated from cortex tissue using a scalpel to minimize flesh contamination. The skin and flesh of these potatoes were then immediately frozen in liquid nitrogen and stored in a -80°C freezer for later use. Fig 1. Tubers of Solanum tuberosum cultivars. [84]Fig 1 [85]Open in a new tab (A) white skin and white flesh tetraploid cultivar ‘Xin Daping’, (B) purple skin and purple flesh tetraploid cultivar ‘Hei Meiren’. Total RNA was extracted from purple skin and purple flesh of ‘Hei Meiren’ and white skin and white flesh of ‘Xin Daping’ using the PureLink Plant RNA Reagent Kit (Invitrogen, USA) according to the manufacturer's instructions. The RNA was quality checked and quantified using a Nanodrop ND-1000 (Nanodrop Technologies, USA). Enrichment of mRNA, fragment interruption, addition of adapters, size selection, PCR amplification and RNA-Seq were performed by Beijing Genome Institute (BGI; Shenzhen, China). The mRNA was separated from total RNA using oligo (dT) magnetic beads (Invitrogen, USA), then fragmented into short fragments using fragmentation buffer. cDNA was synthesized using the mRNA fragments as templates. Short fragments were purified and resolved with EB buffer for end repair and single nucleotide A (adenine) addition. Short fragments (200±20bp) were connected to adapters, suitable fragments are selected for the PCR amplification as templates according to agarose gel electrophoresis results, then an Agilent 2100 Bioanalyzer and ABI StepOnePlus Real-Time PCR Systems were used in quantification and qualification of the sample library. In total, there were 4 RNA-Seq libraries constructed—purple skin (PS), purple flesh (PF), white skin (WS) and white flesh (WF)—using the same protocol. Finally, an aliquot from each of the libraries was barcoded, pooled and sequenced using Illumina HiSeq 2000 in paired-end (PE) mode. The remaining RNA was used for real-time quantitative PCR (qPCR) verification. The RNA-seq dataset is available at the NCBI Sequence Read Archive (SRA) with the accession number: SRP036626. RNA Data filtering and de novo transcriptome assembly Raw RNA sequencing reads passing BGI’s quality control (QC) were 100bp long in length. According to the FastQC ([86]http://www.bioinformatics.babraham.ac.uk/projects/fastqc/) report, the reads were further trimmed to 75bp long, with 15bp removed from 5’ and 10bp from 3’. Trimmed reads with phred scores less than 20 and uncertain bases N were discarded from the analysis. De novo transcriptome assembly was performed for each cultivar (‘Hei Meiren’ with PS and PF, ‘Xin Daping’ with WS and WF) using Trinity version r20131110 ([87]http://trinityrnaseq.sourceforge.net/). Transcriptome assembly completeness was assessed using cegma_v2.4.010312 ([88]http://korflab.ucdavis.edu/datasets/cegma/). These two assemblies were merged with PGSC_DM_v3.4 gene models downloaded from the Potato Genome Sequencing Consortium (PGSC, [89]http://solanaceae.plantbiology.msu.edu/pgsc_download.shtml, [[90]36]) to create the consensus potato transcriptome reference using EvidentialGene VERSION 2013.07.27 ([91]http://arthropods.eugenes.org/EvidentialGene/). Differentially gene expression test Cleaned reads from each of the 4 RNA-Seq libraries were mapped to the consensus potato transcriptome reference sequences using bowtie2-2.2.1 ([92]http://bowtie-bio.sourceforge.net/bowtie2/index.shtml) with a maximum fragment size of 500 and maximum of 1 mismatch in seed region of 20 bases. Raw read counts were extracted from the sorted alignment files. For differential expression test (DET), we compared three conditions: WF vs. PF; WS vs. WS; White (S & F) vs. Purple (S & F). For each comparison, genes with low read counts per million (cpm < 1) that didn’t represent sufficient statistic significance were excluded from the DET analysis. The libraries were digitally normalized after filtering and the DET was completed using EdgeR 3.24 ([93]http://www.bioconductor.org/packages/release/bioc/html/edgeR.html) [[94]37] and gplots 2.14 ([95]http://cran.r-project.org/web/packages/gplots/index.html) in R 3.0.1 ([96]http://www.r-project.org/) environment. Genes with FDR < 0.05 and the absolute value of logFC (log2 fold change) not less than 1.0 were considered as highly differentially expressed genes (DEGs). The gene expression unit was calculated using the RPKM [[97]38] method (Reads per kilobase transcriptome per million mapped reads). Annotation and predicted CDS Identified differentially expressed transcripts were annotated using MapMen Mercator Web Service ([98]http://mapman.gabipd.org/web/guest/app/mercator) by searching the Arabidopsis TAIR proteins (TAIR10), SwissProt/UniProt Plant Proteins, Clusters of orthologous eucaryotic genes database (KOG) and Use conserved domain database (CDD), with a value of blast cut-off of 80. The final description for each novel transcript that was assembled from our RNA-Seq data but not predicted in the PGSC gene set was based on a series of searches as well as the ratio lengths to the HSP (highest-scoring segment pairs) from Blast. The complete Open Reading Frame (ORF) was predicted using the BioPerl ([99]http://www.bioperl.org/). Gene ontology functional enrichment and pathway analysis of DEGs The DEGs were mapped to GO terms in the GO database ([100]http://www.geneontology.org/) to calculate gene numbers for every term. The hypergeometric test was then used to find significantly enriched GO terms in the input list of DEGs, based on 'GO::TermFinder' ([101]http://smd.stanford.edu/help/GO-TermFinder/GO_TermFinder_help.sht ml/). GO terms conforming to p-value through Bonferroni Correction ≤ 0.05 were defined as significantly enriched GO terms. KEGG [[102]39] ([103]http://www.genome.jp/kegg/) is used to perform pathway enrichment analysis of DEGs. MapMan software (version 3.6.0) was used to display expression profiles at the pathway level [[104]40]. The expression profiles of the metabolic pathways can be viewed by a discrete signal visualized using different types (blue and red) and intensity of color. Real-time PCR (qPCR) quantitative analysis For qPCR analysis, total RNA from skin and flesh of white cultivars ‘Xin Daping’ and ‘Agria’, purple cultivar ‘Hei Meiren’ and ‘Heather’, red cultivars ‘Gannongshu NO.5’, ‘Qinshu NO.9’ and ‘Red Jackets’ were extracted as described above. Each RNA sample was subjected to DNase digestion (Takara, Dalian, China) to remove any remaining DNA and first strand cDNA synthesis was carried out using oligo dT according to the manufacturer’s instructions (SuperScript III, Invitrogen, USA). cDNA was diluted twenty-fold and used as the template for qPCR. All the primer sequences for 4-coumarate-CoA ligase (4CL), flavanone 3 beta-hydroxylase (F3H), flavonoid 3'-monooxygenase (F3’H), chalcone synthase (CHS), dihydroflavonol 4-reductase (DFR), flavonoid 3',5'-hydroxylase (F3’5’H), leucoanthocyanidin dioxygenase/anthocyanidin synthase (LDOX/ANS), flavonol synthase (FLS), anthocyanidin 3-O-glucosyltransferase (UFGT), transcription factors AN1 ^816 and bHLH1 are listed in [105]S1 Table. qPCR DNA amplification and analysis was carried out using the LightCycler System (Roche LightCycler 480, Roche), with LightCycler software version 4. The LightCycler FastStart SYBR Green Master Mix (Roche) was used and 10 μl of total reaction volume applied to all the reactions following the manufacturer’s method. qPCR conditions were 5 min at 95°C, followed by 40 cycles of 5 s at 95°C, 5 s at 60°C, and 10 s at 72°C, followed by 65°C to 95°C melting curve detection. The qPCR efficiency of each gene was obtained by analyzing the standard curve of a cDNA serial dilution of that gene. Relative abundance was calculated with the ΔC[T] method using actin (accession number: [106]X55752) for template normalization. Potato actin is widely used as a reference gene in qPCR analysis [[107]19, [108]41, [109]42]. Error bars shown in qPCR data are technical replicates, representing the means ±SE of four replicate qPCR reactions. Statistical significance was determined by one-way ANOVA. SNPs and indel discovery We used the nucleotide sequences of the published UFGT1 ([110]KP096267) from white skin of ‘Xin Daping’ cultivar, AN1 ([111]AY841127) from red skin of ‘Y83-1’ cultivar and bHLH1 ([112]JX848660) from purple tuber of ‘Magic Molly’ as the reference sequences for SNPs and indels discovery. Cleaned reads from the white and purple genotypes were mapped back to the references using Bowtie 2 with mapping criteria ‘very sensitive’.