Abstract

Background

   Polygonum minus is an herbal plant in the Polygonaceae family, which is
   rich in ethnomedicinal plants. The chemical composition and
   characteristic pungent fragrance of Polygonum minus have been
   extensively studied due to its culinary and medicinal properties. There
   are only a few transcriptome sequences available for species from this
   important family of medicinal plants. The limited genetic information
   from the public expressed sequence tag (EST) library hinders further
   study on molecular mechanisms underlying secondary metabolite
   production.

Methods

   In this study, we performed a hybrid assembly of 454 and Illumina
   sequencing reads from Polygonum minus root and leaf tissues,
   respectively, to generate a combined transcriptome library as a
   reference.

Results

   A total of 34.37 million filtered and normalized reads were assembled
   into 188,735 transcripts with a total length of 136.67 Mbp. We
   performed a similarity search against all the publicly available genome
   sequences and found similarity matches for 163,200 (86.5%) of Polygonum
   minus transcripts, largely from Arabidopsis thaliana (58.9%).
   Transcript abundance in the leaf and root tissues was estimated and
   validated through RT-qPCR of seven selected transcripts involved in the
   biosynthesis of phenylpropanoids and flavonoids. All the transcripts
   were annotated against KEGG pathways to profile transcripts related to
   the biosynthesis of secondary metabolites.

Discussion

   This comprehensive transcriptome profile will serve as a useful
   sequence resource for molecular genetics and evolutionary research on
   secondary metabolite biosynthesis in the Polygonaceae family. The
   transcriptome assembly of Polygonum minus can be accessed at
   [38]http://prims.researchfrontier.org/index.php/dataset/transcriptome.

   Keywords: De novo assembly, Hybrid assembly, Illumina sequencing,
   RNA-seq, Persicaria minor, 454 sequencing

Introduction

   Secondary metabolites are organic compounds that are non-vital but
   indirectly influence plant survival, development and growth. Three
   major groups of plant secondary metabolites identified by chemical
   groups are flavonoids and phenolic compounds, terpenoids, and
   nitrogen/sulfur-containing compounds. Plant secondary metabolites are
   important natural sources for the development of medicines and natural
   products. The myriad plant secondary metabolites reflect the
   diverse species of plants and their ecological roles, such as
   adaptation to different environments and defense against biotic
   stresses ([39]Moore et al., 2014). Polygonum is a genus in the
   Polygonaceae family with up to 300 species, many of which are important
   as traditional medicinal plants ([40]Narasimhulu, Reddy & Mohamed,
   2014).

   Polygonum minus Huds. (syn. Persicaria minor) is a culinary flavoring
   ingredient common in South East Asia and is also used as a remedy for
   different maladies ranging from indigestion to poor eyesight
   ([41]Christapher et al., 2015; [42]George et al., 2014). The leaves of
   Polygonum minus contain high levels of essential oils (72.54%), mainly
   comprised of aliphatic aldehydes, namely dodecanal (48.18%) and decanal
   (24.36%) ([43]Yaacob, 1990). Less abundant constituents include 1-decanol,
   1-dodecanol, undecanal, tetradecanal, 1-undecanol, nonanal, and
   1-nonanol ([44]Baharum et al., 2010). Furthermore, the metabolite
   profiling of Polygonum minus leaf revealed many terpenoids and
   flavonoids with antioxidant activities ([45]Baharum et al., 2010;
   [46]Goh et al., 2016). The abundance of secondary metabolites in
   Polygonum minus has led to the establishment of a hairy root system for
   the production of plant secondary metabolites ([47]Ashraf et al.,
   2014). β-caryophyllene was found to be the main sesquiterpene secreted
   into the hairy root culture media. These studies showed the potential
   of developing Polygonum minus as a resource to produce natural
   products.

   While many secondary metabolites have been identified in Polygonum
   minus, their biosynthetic pathways remain unclear due to the limited
   genomic information that is available for the plant. Previously, a
   total of 3,352 expressed sequence tags (ESTs) were generated from
   standard cDNA libraries of Polygonum minus leaf, root and stem tissues
   ([48]Roslan et al., 2012). That study indicated an abundance of
   flavonoid biosynthesis-related genes in the root tissue. The emergence
   of next-generation sequencing has made transcriptomic analysis of
   plants possible with increasing speed and affordability. RNA-sequencing
   (RNA-seq) allows novel gene discovery and identification of transcripts
   of interest in various biological processes. This is especially
   suitable for many non-model organisms with limited genomic information
   ([49]Varshney et al., 2009; [50]Ward, Ponnala & Weber, 2012). In the
   past few years, this platform has been repeatedly utilized to discover
   and identify genes involved in the biosynthesis of secondary
   metabolites. Examples include alkaloid biosynthesis in Uncaria
   rhynchophylla ([51]Guo et al., 2014), ginsenoside biosynthesis in Panax
   ginseng ([52]Jayakodi et al., 2015), glucosinolate biosynthesis in
   Raphanus sativus ([53]Wang et al., 2013), biosynthesis of capsaicinoids
   in Capsicum frutescens ([54]Liu et al., 2013), biosynthesis of
   flavonoids in safflower ([55]Li et al., 2012), and caffeine
   biosynthesis in Camellia sinensis ([56]Shi et al., 2011). To date,
   there are only two plants from the Polygonum genus with RNA-seq data
   deposited to the public SRA database, namely Polygonum cuspidatum
   ([57]Hao et al., 2012) and Polygonum tinctorium ([58]Minami, Sarangi &
   Thul, 2015). The limited genetic information from this important family
   of medicinal plants hinders further study on molecular mechanisms
   underlying the production of bioactive compounds.

   To profile transcripts related to the biosynthesis of secondary
   metabolites in Polygonum minus, RNA-seq was performed on the leaf and
   root tissues. Sequence data generated from 454 and Illumina platforms
   were assembled, both independently and together, for comparison. This
   new dataset was compared to all Polygonum minus EST transcripts
   previously deposited to the NCBI database (dated Sep 2014) for
   validation of the assembly quality. The combined de novo assembly from
   two different sequencing platforms allowed us to overcome limitations
   of each technology. We also performed KEGG pathway annotation to
   identify transcripts related to the biosynthesis of secondary
   metabolites. This study reveals candidate genes involved in the
   biosynthesis of secondary metabolites, especially phenylpropanoids
   and flavonoids, in Polygonum minus and serves as an
   invaluable genetic resource for its development as a commercial herbal
   crop.

Materials and Methods

Sample preparation and transcriptome sequencing

   Root and leaf tissues of cultivated Polygonum minus grown in compost
   soil without fertilizer were sampled independently from the
   experimental plot (3°16′14.63″N, 101°41′11.32″E) at Universiti
   Kebangsaan Malaysia. For the leaf tissue, five expanded young leaves
   from the apical parts of the plants were collected and pooled as one
   biological replicate. Samples acquired from 45-day-old plants were
   rinsed with distilled water and flash frozen in liquid nitrogen before
   being stored at −80 °C. Total RNA was isolated using the López-Gómez
   method with modifications ([59]López-Gómez & Gómez-Lim, 1992), namely
   the addition of 50% PVP-40 because of the high polysaccharide and
   phenolic content of Polygonum minus. RNA quality and quantity were
   assessed using gel
   electrophoresis, ND-1000 Nanodrop spectrophotometer (Thermo Scientific)
   and Agilent 2100 Bioanalyzer with a minimum RNA integrity number of 7.

   For the root sample, 250 ng of poly(A) RNA was prepared from 800 ng of
   total RNA using PolyATtract mRNA isolation kit (Promega, Madison, WI,
   USA) and used as a starting material for the Roche GS FLX sequencing at
   Malaysia Genome Institute. The cDNA preparation was done according to
   the cDNA Rapid Library Preparation Method Manual of Roche. The emulsion
   polymerase chain reaction (PCR) was performed using the long-fragment
   Lib-A emPCR amplification conditions for amplicons of 550 bp or
   greater: 94 °C for 4 min, followed by 50 cycles of 94 °C for 30 s and
   60 °C for 10 min.

   For the leaf sample, total RNA from two biological replicates was used
   for the Illumina HiSeq™ 2000 sequencing with an average read length of
   90 bp through the standard library (200 bp) preparation and paired-end
   sequencing workflow established at BGI-Shenzhen, China.

Transcriptome de novo assembly

   Raw reads from both 454 pyrosequencing platform and Illumina HiSeq™
   2000 were filtered to remove adapter sequences with sequence
   pre-processing tools, Cutadapt ([60]Martin, 2011) and Trimmomatic
   ([61]Bolger, Lohse & Usadel, 2014), respectively. High-quality Illumina
   reads with Phred score ≥ 25 were kept for assembly. For the root
   transcriptome, the iAssembler pipeline ([62]Zheng et al., 2011), which
   includes MIRA ([63]Chevreux et al., 2004) and CAP3 ([64]Huang & Madan,
   1999) assemblers, was executed with the filtered dataset. The analysis
   pipeline includes three consecutive runs of MIRA with default
   parameters followed by CAP3 assembly to obtain the final assembled
   transcripts. The Trinity pipeline ([65]Grabherr et al., 2011) was used to
   assemble the leaf transcriptome from two leaf libraries, whereas the
   combined transcriptome was assembled from two leaf libraries and one
   simulated root library. Assembled sequences from the root transcriptome
   were clipped into 90 bp pseudo reads with 5 bp overlap using an in-house
   PHP script ([66]http://gitlab.inbiosis.ws/open-source/rnaseq-utils) to
   simulate Illumina sequencing output and accommodate the short-read
   requirement of the Trinity assembler. For this assembly, leaf raw read
   datasets were normalized by digital normalization following the Khmer
   1.0 mRNASeq
   protocol ([67]Brown et al., 2014).
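
   The clipping step described above can be illustrated with a minimal
   Python sketch. It is not the authors' PHP script: the function name
   clip_to_pseudo_reads, the reading of "5 bp overlap" as consecutive
   windows sharing 5 bases, and the handling of the contig tail are all
   assumptions made for illustration.

```python
def clip_to_pseudo_reads(contig, read_len=90, overlap=5):
    """Split one assembled sequence into fixed-length, overlapping pseudo reads.

    Assumption: consecutive windows share `overlap` bases, so the window
    start advances by read_len - overlap each step.
    """
    if not 0 <= overlap < read_len:
        raise ValueError("overlap must be non-negative and shorter than read_len")
    # Sequences no longer than one read are kept whole (assumption).
    if len(contig) <= read_len:
        return [contig]
    step = read_len - overlap
    reads = []
    start = 0
    while start + read_len < len(contig):
        reads.append(contig[start:start + read_len])
        start += step
    # Assumption: the last window is anchored at the sequence end so the
    # tail is not lost; it may overlap its neighbour by more than `overlap`.
    reads.append(contig[-read_len:])
    return reads


if __name__ == "__main__":
    toy_contig = "ACGT" * 50  # 200 bp toy sequence
    for read in clip_to_pseudo_reads(toy_contig):
        print(len(read), read[:10])
```

   Anchoring the final window at the sequence end keeps every pseudo read
   at the same length, which is one simple way to satisfy a short-read
   assembler that expects uniform input.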

   This project was registered at NCBI’s BioProject with the accession
   number [68]PRJNA208436. All raw read datasets were deposited to NCBI
   SRA database ([69]http://www.ncbi.nlm.nih.gov/sra) with the accession
   numbers [70]SRX669305 (leaf) and [71]SRX313492 (root) ([72]Loke et al.,
   2016). Assembled transcripts were deposited to NCBI TSA database
   ([73]http://www.ncbi.nlm.nih.gov/genbank/tsa) with the accession number
   [74]GCJZ00000000. The assembled transcripts with annotation can also be
   accessed at
   [75]http://prims.researchfrontier.org/index.php/dataset/transcriptome.

Transcript functional annotation

   The Trinotate annotation pipeline was used to annotate the
   assembled transcripts ([76]Grabherr et al., 2011). The pipeline
   includes several software packages, such as BLASTX,
   BLASTP, PFAM search, SignalP, and RNAmmer that are essential in
   transcriptome functional annotation. All analyses were performed in
   parallel using assembled FASTA sequences.

   Functional annotation for all transcripts was performed by running a
   BLASTX similarity search against the Trinotate Swiss-Prot protein
   database (September 2015); hits with E-value <1e−5 were considered
   significant. For
   the leaf and combined transcriptomes, Trinotate annotation reports were
   generated using the standard annotation pipeline
   ([77]http://trinotate.github.io). Gene Ontology (GO) and Conserved
   Domain Database (CDD) were used to annotate the transcripts based on
   similarity. Translated peptides were generated using the Transdecoder
   program embedded in the Trinity assembly pipeline for protein-based
   analysis using Eukaryotic Orthologous Group (KOG) classification. All
   results were deposited into the Trinotate-provided SQLite database
   template, and a spreadsheet summary report was generated from Trinotate
   using a
   BLASTX E-value cutoff of 1e−5.
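
   As an illustration of the E-value cutoff, the sketch below keeps the
   best significant hit per transcript from BLAST tabular output. It
   assumes the standard 12-column tabular layout (query ID in column 1,
   subject ID in column 2, E-value in column 11, bit score in column 12);
   the function name best_significant_hits and the choice of bit score to
   rank hits are assumptions, and this is not how Trinotate itself stores
   or reports hits.

```python
def best_significant_hits(lines, max_evalue=1e-5):
    """Return {query: (subject, evalue, bitscore)} for hits with E-value < max_evalue.

    Assumes 12-column tab-separated BLAST output; lines starting with '#'
    and empty lines are skipped.
    """
    best = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        query, subject = fields[0], fields[1]
        evalue, bitscore = float(fields[10]), float(fields[11])
        # Discard hits that do not pass the significance cutoff.
        if evalue >= max_evalue:
            continue
        # Keep the highest-scoring hit seen so far for each query.
        if query not in best or bitscore > best[query][2]:
            best[query] = (subject, evalue, bitscore)
    return best


if __name__ == "__main__":
    example = [
        "t1\tsp|A\t90.0\t100\t5\t0\t1\t300\t1\t100\t1e-30\t200",
        "t1\tsp|B\t95.0\t100\t5\t0\t1\t300\t1\t100\t1e-40\t250",
        "t2\tsp|C\t40.0\t50\t30\t0\t1\t150\t1\t50\t0.5\t20",
    ]
    print(best_significant_hits(example))
```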

KEGG pathway mapping and enrichment analysis

   KEGG pathway mapping was performed by associating Enzyme Commission
   (EC) numbers from BLAST search results with the UniProt database
   ([78]http://www.uniprot.org/mapping, September, 2015). Metabolite
   pathway maps with custom color codes were generated using the KEGG
   online Mapper API ([79]http://www.genome.jp/kegg/tool/map_pathway2.html,
   updated April 1, 2015), with all associated EC numbers. Pathway
   enrichment analysis with the hypergeometric test was performed using the
   “Annotate” and “Identify” subprograms in KOBAS version 2.0 with
   Benjamini–Hochberg correction ([80]Xie et al., 2011). The leaf and root
   transcriptomes were compared against the combined transcriptome as a
   background for enrichment analysis.
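
   The statistics behind this enrichment step can be sketched in a few
   lines of Python. This is a generic illustration of a one-sided
   hypergeometric test followed by Benjamini–Hochberg adjustment, not the
   KOBAS implementation; the function names and the argument convention
   (N background transcripts, K of them in the pathway, n transcripts in
   the tissue set, k of those in the pathway) are assumptions.

```python
from math import comb


def hypergeom_sf(k, N, K, n):
    """P(X >= k): chance of seeing at least k pathway members when n
    transcripts are drawn from a background of N containing K members."""
    upper = min(K, n)
    total = comb(N, n)
    # Exact integer arithmetic; fine for a sketch, slow for very large N.
    favourable = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return favourable / total


def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


if __name__ == "__main__":
    # Toy numbers only: (k, N, K, n) per pathway.
    pathways = [(3, 10, 3, 4), (1, 10, 3, 4), (2, 10, 5, 4)]
    raw = [hypergeom_sf(*p) for p in pathways]
    print(raw)
    print(benjamini_hochberg(raw))
```

   For transcriptome-scale backgrounds, a library routine working in log
   space would be a more practical choice than exact binomial
   coefficients.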

Transcript abundance estimation

   To estimate the relative abundance of transcripts in the leaf and root
   transcriptomes, filtered 454 (original unclipped) and Illumina raw
   reads were aligned to the combined transcriptome assembly using RSEM
   ([81]Li & Dewey, 2011). The RSEM statistical model uses the
   Expectation–Maximization (EM) algorithm to compute maximum likelihood
   abundance estimates. Transcripts per million (TPM), which normalizes
   for transcript length first and then for sequencing depth, was used
   as an estimate of the relative expression level (based on the
   proportion of mapped reads) of each transcript in the leaf and root
   tissues. The same approach was used to identify the transcripts present
   in leaf, root or both tissue types for KEGG pathway mapping, using the
   RSEM-estimated abundance to determine the presence (TPM > 0) or absence
   (TPM = 0) of a transcript.
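
   The order of normalization in TPM, and the presence/absence call, can
   be made concrete with a short Python sketch. It is a simplification
   under stated assumptions: RSEM works with EM-derived expected counts
   and effective transcript lengths, whereas this example takes plain
   counts and full lengths as given, and the function names are
   hypothetical.

```python
def tpm(counts, lengths_bp):
    """Compute TPM from per-transcript read counts and lengths in bp.

    Step 1: divide each count by transcript length in kilobases.
    Step 2: scale the resulting rates so that they sum to one million.
    """
    rates = [c / (length / 1000.0) for c, length in zip(counts, lengths_bp)]
    total = sum(rates)
    if total == 0:
        return [0.0] * len(rates)
    return [r / total * 1e6 for r in rates]


def is_present(tpm_values):
    """Presence call used for pathway mapping: TPM > 0 means present."""
    return [value > 0 for value in tpm_values]


if __name__ == "__main__":
    toy_counts = [10, 20, 0]          # toy counts, not data from the study
    toy_lengths = [1000, 2000, 500]   # transcript lengths in bp
    values = tpm(toy_counts, toy_lengths)
    print(values)              # [500000.0, 500000.0, 0.0]
    print(is_present(values))  # [True, True, False]
```

   Because the length correction is applied before the depth scaling, TPM
   values sum to one million within each sample, which makes proportions
   comparable between the leaf and root libraries.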

RT-qPCR of selected transcripts

   DNase-treated (Ambion, Huntingdon, UK) RNA (1 μg) was reverse
   transcribed using iScript™ cDNA Synthesis Kit (Bio-Rad, Hercules, CA,
   USA) per the manufacturer’s protocol. Seven transcripts related to the
   phenylpropanoid and flavonoid biosynthetic pathways were selected from
   the transcriptome dataset, and specific primer pairs ([82]Table S1)
   were designed using Primer-BLAST. For RT-qPCR analysis, a 1:20 dilution
   of cDNA was used as template in a 20 μL volume,
   and reactions were performed in the iQ™5 Real-Time PCR detection System
   (Bio-Rad, Hercules, CA, USA) using the iTaq Universal SYBR® Green
   SuperMix kit (Bio-Rad, Hercules, CA, USA). The amplification was
   executed with the following cycling program: 3 min at 95 °C, 40 cycles
   of 10 s at 95 °C, 30 s at 60 °C, and 30 s at 72 °C; and 0.06 s for
   plate reading at 65 °C followed by a melting curve analysis. Primer
   efficiencies were determined through standard curves of five cDNA
   dilution factors in triplicate. Calcium-Dependent Protein Kinase (CDPK)
   and Polyubiquitin (UBQ) were selected as references to normalize