Abstract
Background
Polygonum minus is an herbal plant in the Polygonaceae family, which is rich in ethnomedicinal plants. The chemical composition and characteristic pungent fragrance of Polygonum minus have been extensively studied due to its culinary and medicinal properties. There are only a few transcriptome sequences available for species from this important family of medicinal plants. The limited genetic information in the public expressed sequence tag (EST) library hinders further study of the molecular mechanisms underlying secondary metabolite production.
Methods
In this study, we performed a hybrid assembly of 454 and Illumina sequencing reads from Polygonum minus root and leaf tissues, respectively, to generate a combined transcriptome library as a reference.
Results
A total of 34.37 million filtered and normalized reads were assembled into 188,735 transcripts with a total length of 136.67 Mbp. We performed a similarity search against all the publicly available genome sequences and found similarity matches for 163,200 (86.5%) of the Polygonum minus transcripts, largely from Arabidopsis thaliana (58.9%). Transcript abundance in the leaf and root tissues was estimated and validated through RT-qPCR of seven selected transcripts involved in the biosynthesis of phenylpropanoids and flavonoids. All the transcripts were annotated against KEGG pathways to profile transcripts related to the biosynthesis of secondary metabolites.
Discussion
This comprehensive transcriptome profile will serve as a useful sequence resource for molecular genetics and evolutionary research on secondary metabolite biosynthesis in the Polygonaceae family. The transcriptome assembly of Polygonum minus can be accessed at http://prims.researchfrontier.org/index.php/dataset/transcriptome.
Keywords: De novo assembly, Hybrid assembly, Illumina sequencing, RNA-seq, Persicaria minor, 454 sequencing
Introduction
Secondary metabolites are organic compounds that are non-vital but indirectly influence plant survival, development and growth. The three major groups of plant secondary metabolites, classified by chemical structure, are flavonoids and phenolic compounds, terpenoids, and nitrogen/sulfur-containing compounds. Plant secondary metabolites are important natural sources for the development of medicines and natural products. The myriad plant secondary metabolites reflect the diversity of plant species and their ecological roles, such as adaptation to different environments and defense against biotic stresses (Moore et al., 2014).
Polygonum is a genus in the Polygonaceae family with up to 300 species, many of which are important as traditional medicinal plants (Narasimhulu, Reddy & Mohamed, 2014). Polygonum minus Huds. (syn. Persicaria minor) is a culinary flavoring ingredient common in Southeast Asia and is also used as a remedy for maladies ranging from indigestion to poor eyesight (Christapher et al., 2015; George et al., 2014). The leaves of Polygonum minus contain high levels of essential oils (72.54%), mainly comprising the aliphatic aldehydes dodecanal (48.18%) and decanal (24.36%) (Yaacob, 1990). Less abundant aliphatic aldehydes and alcohols include 1-decanol, 1-dodecanol, undecanal, tetradecanal, 1-undecanol, nonanal, and 1-nonanol (Baharum et al., 2010). Furthermore, metabolite profiling of Polygonum minus leaf revealed many terpenoids and flavonoids with antioxidant activities (Baharum et al., 2010; Goh et al., 2016). The abundance of secondary metabolites in Polygonum minus has led to the establishment of a hairy root system for the production of plant secondary metabolites (Ashraf et al., 2014). β-caryophyllene was found to be the main sesquiterpene secreted into the hairy root culture media.
These studies showed the potential of developing Polygonum minus as a resource to produce natural products. While many secondary metabolites have been identified in Polygonum minus, their biosynthetic pathways remain unclear due to the limited genomic information available for the plant. Previously, a total of 3,352 expressed sequence tags (ESTs) were generated from standard cDNA libraries of Polygonum minus leaf, root and stem tissues (Roslan et al., 2012). That study indicated an abundance of flavonoid biosynthesis-related genes in the root tissue.
The emergence of next-generation sequencing has made transcriptomic analysis of plants possible with increasing speed and affordability. RNA sequencing (RNA-seq) allows novel gene discovery and the identification of transcripts of interest in various biological processes. This is especially suitable for the many non-model organisms with limited genomic information (Varshney et al., 2009; Ward, Ponnala & Weber, 2012). In the past few years, this platform has been repeatedly used to discover and identify genes involved in the biosynthesis of secondary metabolites. Examples include alkaloid biosynthesis in Uncaria rhynchophylla (Guo et al., 2014), ginsenoside biosynthesis in Panax ginseng (Jayakodi et al., 2015), glucosinolate biosynthesis in Raphanus sativus (Wang et al., 2013), capsaicinoid biosynthesis in Capsicum frutescens (Liu et al., 2013), flavonoid biosynthesis in safflower (Li et al., 2012), and caffeine biosynthesis in Camellia sinensis (Shi et al., 2011). To date, only two plants from the Polygonum genus have RNA-seq data deposited in the public SRA database, namely Polygonum cuspidatum (Hao et al., 2012) and Polygonum tinctorium (Minami, Sarangi & Thul, 2015). The limited genetic information from this important family of medicinal plants hinders further study of the molecular mechanisms underlying the production of bioactive compounds.
To profile transcripts related to the biosynthesis of secondary metabolites in Polygonum minus, RNA-seq was performed on the leaf and root tissues. Sequence data generated from the 454 and Illumina platforms were assembled, both independently and together, for comparison. This new dataset was compared to all Polygonum minus EST transcripts previously deposited in the NCBI database (dated September 2014) to validate the assembly quality. The combined de novo assembly from two different sequencing platforms allowed us to overcome the limitations of each technology. We also performed KEGG pathway annotation to identify transcripts related to the biosynthesis of secondary metabolites. This study reveals candidate genes involved in the biosynthesis of secondary metabolites, especially phenylpropanoids and flavonoids, in Polygonum minus and serves as an invaluable genetic resource for its development as a commercial herbal crop.
Materials and Methods
Sample preparation and transcriptome sequencing
Root and leaf tissues of cultivated Polygonum minus grown in compost soil without fertilizer were sampled independently from the experimental plot (3°16′14.63″N, 101°41′11.32″E) at Universiti Kebangsaan Malaysia. For the leaf tissue, five expanded young leaves from the apical parts of the plants were collected and pooled as one biological replicate. Samples acquired from 45-day-old plants were rinsed with distilled water and flash frozen in liquid nitrogen before being stored at −80 °C. Total RNA was isolated using the López-Gómez method (López-Gómez & Gómez-Lim, 1992), modified by adding 50% PVP-40 because of the high levels of polysaccharides and phenolic compounds in Polygonum minus. RNA quality and quantity were assessed using gel electrophoresis, an ND-1000 NanoDrop spectrophotometer (Thermo Scientific), and an Agilent 2100 Bioanalyzer, with a minimum RNA integrity number of 7.
For the root sample, 250 ng of poly(A) RNA was prepared from 800 ng of total RNA using the PolyATtract mRNA isolation kit (Promega, Madison, WI, USA) and used as the starting material for Roche GS FLX sequencing at the Malaysia Genome Institute. The cDNA was prepared according to the Roche cDNA Rapid Library Preparation Method Manual. The emulsion polymerase chain reaction (emPCR) was performed using the long-fragment Lib-A emPCR amplification conditions for amplicons of 550 bp or greater. The conditions were as follows: 94 °C for 4 min, then 50 cycles of 94 °C for 30 s and 60 °C for 10 min. For the leaf sample, total RNA from two biological replicates was used for Illumina HiSeq™ 2000 sequencing with an average read length of 90 bp, following the standard library (200 bp) preparation and paired-end sequencing workflow established at BGI-Shenzhen, China.
Transcriptome de novo assembly
Raw reads from the 454 pyrosequencing platform and the Illumina HiSeq™ 2000 were filtered to remove adapter sequences with the sequence pre-processing tools Cutadapt (Martin, 2011) and Trimmomatic (Bolger, Lohse & Usadel, 2014), respectively. High-quality Illumina reads with a Phred score ≥ 25 were kept for assembly. For the root transcriptome, the iAssembler pipeline (Zheng et al., 2011), which includes the MIRA (Chevreux et al., 2004) and CAP3 (Huang & Madan, 1999) assemblers, was run on the filtered dataset. The analysis pipeline consists of three consecutive runs of MIRA with default parameters followed by a CAP3 assembly to obtain the final assembled transcripts. The Trinity pipeline (Grabherr et al., 2011) was used to assemble the leaf transcriptome from two leaf libraries, whereas the combined transcriptome was assembled from two leaf libraries and one simulated root library.
Assembled reads from the root transcriptome were clipped to 90 bp pseudo reads with a 5 bp overlap using an in-house PHP script (http://gitlab.inbiosis.ws/open-source/rnaseq-utils) to simulate Illumina sequencing output and so accommodate the short-read requirement of the Trinity assembler. For this assembly, the leaf raw read datasets were normalized by digital normalization following the Khmer 1.0 mRNASeq protocol (Brown et al., 2014). This project was registered at NCBI’s BioProject with the accession number PRJNA208436. All raw read datasets were deposited in the NCBI SRA database (http://www.ncbi.nlm.nih.gov/sra) with the accession numbers SRX669305 (leaf) and SRX313492 (root) (Loke et al., 2016). Assembled transcripts were deposited in the NCBI TSA database (http://www.ncbi.nlm.nih.gov/genbank/tsa) with the accession number GCJZ00000000. The assembled transcripts with annotation can also be accessed at http://prims.researchfrontier.org/index.php/dataset/transcriptome.
Transcript functional annotation
The Trinotate annotation pipeline was used to annotate the assembled transcripts (Grabherr et al., 2011). The pipeline includes several software packages, such as BLASTX, BLASTP, PFAM search, SignalP, and RNAmmer, that are essential in transcriptome functional annotation. All analyses were performed in parallel using the assembled FASTA sequences. Functional annotation of all transcripts was performed by running a BLASTX similarity search against the Trinotate Swiss-Prot protein database (September 2015), with hits at E-value < 1e−5 considered significant. For the leaf and combined transcriptomes, Trinotate annotation reports were generated using the standard annotation pipeline (http://trinotate.github.io). Gene Ontology (GO) and the Conserved Domain Database (CDD) were used to annotate the transcripts based on similarity.
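The pseudo-read clipping step described in the assembly section above can be illustrated with a minimal Python sketch. This is not the authors' in-house PHP script: the function name `clip_to_pseudo_reads`, the sliding-window interpretation (each 90 bp window starts 85 bp after the previous one, so that neighbors share 5 bp), and the decision to keep a shorter final fragment are all assumptions made for illustration.

```python
def clip_to_pseudo_reads(sequence, read_length=90, overlap=5):
    """Split one assembled sequence into overlapping fixed-length pseudo reads.

    Assumption: consecutive windows share `overlap` bases, and the last
    window is kept even if it is shorter than `read_length`.
    """
    if not 0 <= overlap < read_length:
        raise ValueError("overlap must be >= 0 and smaller than read_length")
    step = read_length - overlap
    reads = []
    start = 0
    while start < len(sequence):
        reads.append(sequence[start:start + read_length])
        # Stop once the current window has reached the end of the sequence.
        if start + read_length >= len(sequence):
            break
        start += step
    return reads


def pseudo_reads_to_fasta(name, sequence, read_length=90, overlap=5):
    """Format the pseudo reads of one sequence as FASTA text (hypothetical naming scheme)."""
    reads = clip_to_pseudo_reads(sequence, read_length, overlap)
    lines = []
    for index, read in enumerate(reads, start=1):
        lines.append(f">{name}_pseudo{index}")
        lines.append(read)
    return "\n".join(lines)
```

Under these assumptions, a 200 bp contig yields three pseudo reads covering positions 0 to 90, 85 to 175, and 170 to 200. Whether the original script discards, pads, or keeps the short trailing fragment is not stated in the text.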
Translated peptides were generated using the TransDecoder program embedded in the Trinity assembly pipeline for protein-based analysis using Eukaryotic Orthologous Group (KOG) classification. All results were loaded into the Trinotate-provided SQLite database template, and a spreadsheet summary report was generated from Trinotate using a BLASTX E-value cutoff of 1e−5.
KEGG pathway mapping and enrichment analysis
KEGG pathway mapping was performed by associating Enzyme Commission (EC) numbers from the BLAST search results with the UniProt database (http://www.uniprot.org/mapping, September 2015). Metabolite pathway maps with custom color codes were generated using the online KEGG Mapper API (http://www.genome.jp/kegg/tool/map_pathway2.html, updated April 1, 2015) with all associated EC numbers. Pathway enrichment analysis with a hypergeometric test was performed using the “Annotate” and “Identify” subprograms in KOBAS version 2.0 with Benjamini–Hochberg correction (Xie et al., 2011). The leaf and root transcriptomes were each compared against the combined transcriptome as the background for enrichment analysis.
Transcript abundance estimation
To estimate the relative abundance of transcripts in the leaf and root transcriptomes, filtered 454 (original unclipped) and Illumina raw reads were aligned to the combined transcriptome assembly using RSEM (Li & Dewey, 2011). The RSEM statistical model uses the Expectation–Maximization (EM) algorithm to compute maximum likelihood abundance estimates. Transcripts per million (TPM), which normalizes first for transcript length and then for sequencing depth, was used as an estimate of the relative expression level (based on the proportion of mapped reads) of each transcript in the leaf and root tissues. The same approach was used to identify the transcripts present in leaf, root or both tissue types for KEGG pathway mapping, using the RSEM-estimated count values to determine the presence (TPM > 0) or absence (TPM = 0) of a transcript.
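To make the TPM normalization and the presence/absence rule concrete, the following Python sketch shows the arithmetic only. It does not reproduce RSEM, which estimates expected counts and effective lengths with its EM model; here the counts and lengths are simply supplied as inputs. The function names `tpm` and `classify_presence`, and the example transcript identifiers, are hypothetical.

```python
def tpm(counts, lengths):
    """Compute transcripts per million from per-transcript counts and lengths.

    Step 1: divide each count by the transcript length (length normalization).
    Step 2: scale the resulting rates so that they sum to one million
            (sequencing-depth normalization).
    """
    rates = {t: counts[t] / lengths[t] for t in counts}
    total = sum(rates.values())
    if total == 0:
        return {t: 0.0 for t in counts}
    return {t: rate / total * 1e6 for t, rate in rates.items()}


def classify_presence(leaf_tpm, root_tpm):
    """Label each transcript as present in 'leaf', 'root', 'both' or 'neither'.

    A transcript counts as present in a tissue when its TPM is above zero.
    """
    labels = {}
    for t in set(leaf_tpm) | set(root_tpm):
        in_leaf = leaf_tpm.get(t, 0.0) > 0
        in_root = root_tpm.get(t, 0.0) > 0
        if in_leaf and in_root:
            labels[t] = "both"
        elif in_leaf:
            labels[t] = "leaf"
        elif in_root:
            labels[t] = "root"
        else:
            labels[t] = "neither"
    return labels


if __name__ == "__main__":
    # Toy values for illustration only; these are not data from the study.
    lengths = {"t1": 1000, "t2": 4000, "t3": 500}
    leaf = tpm({"t1": 100, "t2": 200, "t3": 0}, lengths)
    root = tpm({"t1": 0, "t2": 10, "t3": 5}, lengths)
    print(classify_presence(leaf, root))
```

Because the length normalization is applied before scaling, TPM values within one sample always sum to one million, which makes the proportions comparable between the leaf and root libraries despite their different sequencing depths.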
RT-qPCR of selected transcripts
DNase-treated (Ambion, Huntingdon, UK) RNA (1 μg) was reverse transcribed using the iScript™ cDNA Synthesis Kit (Bio-Rad, Hercules, CA, USA) per the manufacturer’s protocol. Seven transcripts related to the phenylpropanoid and flavonoid biosynthetic pathways were selected from the transcriptome dataset, and specific primer pairs (Table S1) were designed using Primer-BLAST. For RT-qPCR analysis, a 1:20 dilution of cDNA was used as the template in a 20 μL reaction volume, and reactions were performed in the iQ™5 Real-Time PCR Detection System (Bio-Rad, Hercules, CA, USA) using the iTaq Universal SYBR® Green SuperMix kit (Bio-Rad, Hercules, CA, USA). The amplification was run with the following cycling program: 3 min at 95 °C; 40 cycles of 10 s at 95 °C, 30 s at 60 °C, and 30 s at 72 °C; and 0.06 s for plate reading at 65 °C, followed by a melting curve analysis. Primer efficiencies were determined from standard curves of five cDNA dilution factors in triplicate. Calcium-Dependent Protein Kinase (CDPK) and Polyubiquitin (UBQ) were selected as references to normalize