Abstract Background Illumina second generation sequencing is now an efficient route for generating enormous sequence collections that represent expressed genes and quantitate expression level. Taxus is a world-wide endangered gymnosperm genus and forms an important anti-cancer medicinal resource, but the large and complex genomes of Taxus have hindered the development of genomic resources. The research of its tissue-specific transcriptome is absent. There is also no study concerning the association between the plant transcriptome and metabolome with respect to the plant tissue type. Methodology/Principal Findings We performed the de novo assembly of Taxus mairei transcriptome using Illumina paired-end sequencing technology. In a single run, we produced 13,737,528 sequencing reads corresponding to 2.03 Gb total nucleotides. These reads were assembled into 36,493 unique sequences. Based on similarity search with known proteins, 23,515 Unigenes were identified to have the Blast hit with a cut-off E-value above 10^−5. Furthermore, we investigated the transcriptome difference of three Taxus tissues using a tag-based digital gene expression system. We obtained a sequencing depth of over 3.15 million tags per sample and identified a large number of genes associated with tissue specific functions and taxane biosynthetic pathway. The expression of the taxane biosynthetic genes is significantly higher in the root than in the leaf and the stem, while high activity of taxane-producing pathway in the root was also revealed via metabolomic analyses. Moreover, many antisense transcripts and novel transcripts were found; clusters with similar differential expression patterns, enriched GO terms and enriched metabolic pathways with regard to the differentially expressed genes were revealed for the first time. Conclusions/Significance Our data provides the most comprehensive sequence resource available for Taxus study and will help define mechanisms of tissue specific functions and secondary metabolism in non-model plant organisms. Introduction Taxus is a genus of yews, small coniferous trees or shrubs in the gymnosperm family Taxaceae. There are at least 14 species in Taxus [31][1], [32][2], most of which are the sources of biological active compounds such as paclitaxel or Taxol, a chemotherapeutic drug used in the treatment of many types of cancer. The increasing demands have created a supply crisis and raised serious environmental concerns [33][3]. Presently the mainstay solution is semisynthesis from several precursors which have the same core skeleton and can be isolated from renewable yew resources [34][4], [35][5]. Paclitaxel and its precursors belong to a group of typical secondary metabolites named the taxane diterpenoids or taxoids, but the distribution and content are highly varied with species and tissues. For that reason, choosing suitable Taxus species and screening the constituents in each tissue are essential to cost-effective production of taxane drugs. Previous studies mainly focused on Taxus needles from various species origin which displayed distinct chemical distribution [36][5], [37][6]. In recent years, we systematically investigate the root constituents from various Taxus species and found that roots have relatively simple chemical profiles and possess high yields of valuable taxanes such as paclitaxel (P), cephalomannine (C), 10-deacetylpaclitaxel (10-DAT) and 7-xylosyltaxanes. Rational exploitation of the taxanes that the root contains is of great help for alleviating the taxane supply crisis. However, the biosynthesis pathway of paclitaxel and other taxanes is not fully elucidated and the underlying molecular mechanism of the metabolic difference between different Taxus tissues has not been studied, which hamper improvements in taxane drug production. During the past few years, the high demand for low-cost sequencing has driven the development of high-throughput second generation sequencing technologies that parallelize the sequencing process, producing thousands or millions of sequences at once [38][7], [39][8]. Solexa, now part of Illumina developed a sequencing technology based on reversible dye-terminators [40][9]. The inexpensive production of large volumes of sequence data via second generation sequencing is the primary advantage over conventional methods [41][10]. Collins et al. [42][11] used a combination of mapping and de novo assembly tools in the analysis of data from Solexa sequencing of the polyploid plant Pachycladon enysii and demonstrated that transcriptome analysis using high-throughput short-read sequencing need not be restricted to the genomes of model organisms. The de novo transcriptome sequencing and characterization based on Illumina second generation sequencing technology has been performed successfully for sweet potato [43][12], Eucalyptus tree [44][13], chickpea [45][14], and orchid [46][15]. Illumina produces orders of magnitude more sequence at a fraction of the cost of 454 platform. Despite its obvious potential, Illumina second generation sequencing has not been applied to the gymnosperm research. Besides transcriptome sequencing, another approach to gene expression analysis, digital gene expression (DGE) profiling, can be performed on the Illumina Genome Analyzer sequencing platform. DGE tag sequencing is an implementation of the LongSAGE (serial analysis of gene expression) protocol on the Illumina sequencing platform that increases utility while reducing both the cost and time required to generate gene expression profiles. The ultra-high-throughput sequencing capability of the Illumina platform allows the cost-effective generation of libraries containing an average of 20 million tags, a 200-fold improvement over classical LongSAGE [47][16]. Illumina DGE has less sequence composition bias, leading to a better representation of AT-rich tag sequences, and allows a more accurate profiling of a subset of the transcriptome characterized by AT-rich genes expressed at levels below the threshold of detection of LongSAGE [48][17]. Illumina DGE has been used in a sweet orange red-flesh mutant to study the molecular process regulating lycopene, a secondary metabolite, accumulation [49][18]. Lee et al. [50][19] compared the expression profiles of cultured dedifferentiated Taxus cuspidata cells and undifferentiated cambial meristematic cells and identified marker genes and transcriptional programs consistent with a stem cell identity. However, the expression pattern of the genes involved in the taxane biosynthesis was not reported. The reference transcriptome for the DGE tag mapping is obtained with 454 pyrosequencing and thus is relatively small (301 Mb of sequence). The main goal of this study is, first, to create a backbone for the Taxus and gymnosperm research community, including large part of the transcriptome as well as expression patterns in different tissues; second, to look at phenotype and try to look at correlation between gene expression and taxane metabolites. In this study, we generated over two billion bases of high-quality DNA sequence with Illumina technology and demonstrated the suitability of short-read sequencing for de novo assembly and annotation of genes expressed in a gymnosperm with little prior genome information [51][20]. In a single run, we identified 36,493 distinct Unigenes including hundreds of taxane biosynthetic genes. Furthermore, we compared the gene expression profiles of three taxane-producing Taxus tissues using a digital gene expression (DGE) system. The assembled, annotated transcriptome sequences and gene expression profiles provide an invaluable resource for the identification of Taxus genes involved in taxane biosynthesis and tissue specific functions. Results Transcriptome sequencing (mRNA-seq) output, assembly, expression annotation By Illumina high-throughput second generation sequencing, 13,737,528 reads were obtained, with total nucleotides 2,033,154,144 (2.03 Gb). The average read size, Q20 percentage (sequencing error rate <1%), and GC percentage are 148 bp, 96.48%, and 45.54%, respectively. From these short reads 112,769 contigs were assembled, with median length of 384 bp ([52]Fig. 1). From contigs 56,393 scaffolds were made by SOAPdenovo, with median length of 928 bp ([53]Fig. 1). From scaffolds 36,493 Unigenes were obtained with median length of 1,077 bp ([54]Fig. 1, [55]Table S1). There is no complete gymnosperm genome available, therefore we use the four sequence datasets, i.e., 201,405 clone sequences and 27,720 cluster sequences (Unigenes) of Picea glauca from [56]http://www.arborea.ulaval.ca/research/sequencing/gene_catalogue/ind ex.html, 36,906 Unigene sequences of a Korea Taxus cuspidata [57][19], and 20,557 Unigene sequences (12,975 singletons and 7,582 contigs) of a China T. cuspidata [58][21], to add annotation against gymnosperm species ([59]Table S2). With the alignment parameters E value ≤1e-5 and the sequence identity ≥80%, 44.6%, 43.8%, 75.3%, and 49.8% of T. mairei Unigenes were annotated with clone sequences and Unigenes of P. glauca, Unigenes of Korea T. cuspidata, and Unigenes of China T. cuspidata, respectively. The RPKM method is able to eliminate the influence of different gene length and sequencing level on the calculation of gene expression [60][22]. The median RPKM value of all Unigenes is 15.73, while the minimal and maximal values are 0 (nine Unigenes) and 5,464.39 (Unigene 15,622), respectively. Twenty six Unigenes has RPKM values of more than 1,000 ([61]Table S1), many of which may be involved in plant defense. Seventy six Unigenes has RPKM values of less than 0.4, implying the potential of mRNA-seq in detecting genes of extremely low expression. Figure 1. Statistics of Illumina short read assembly quality. [62]Figure 1 [63]Open in a new tab The length distribution of the de novo assembly for contigs, scaffolds and Unigenes is shown. 1, 200; 2, 300; 3, 400; 4, 500; 5, 600; 6, 700; 7, 800; 8, 900; 9, 1,000; 10, 1,100; 11, 1,200; 12, 1,300; 13, 1,400; 14, 1,500; 15, 1,600; 16, 1,700; 17, 1,800; 18, 1,900; 19, 2,000; 20, 2,100; 21, 2,200; 22, 2,300; 23, 2,400; 24, 2,500; 25, 2,600; 26, 2,700; 27, 2,800; 28, 2,900; 29, 3,000; 30, >3,000. Functional annotation Distinct gene sequences were first searched using BLASTx against the NCBI non-redundant (nr) database using a cut-off E-value of 10^−5. Using this approach, 23,515 Unigenes (64.4% of all Unigenes) returned an above cut-off BLAST result ([64]Table S1). Due to lack of genome and EST information in Taxus, 35.6% of Unigenes could not be matched to known genes. Similarly, up to 19,869 Unigenes (54.4% of all) had no Swissprot annotation. Gene ontology (GO) assignments were used to classify the functions of the predicted Taxus mairei genes. Based on sequence homology, 10,263 sequences can be categorized into 54 functional groups ([65]Fig. 2). In each of the three main categories (biological process, cellular component and molecular function) of the GO classification, “metabolic process”, “cell part” and “binding” terms are dominant respectively. We also noticed a high-percentage of genes from categories of “cellular process”, “organelle” and “catalytic” and only a few genes from terms of “nutrient reservoir”, “extracellular region part” and “death” ([66]Fig. 2). To further evaluate the completeness of our transcriptome library and the effectiveness of our annotation process, we searched the annotated sequences for the genes involved in COG classifications. In total, out of 23,515 nr hits, 14,381 sequences have a COG classification ([67]Fig. 3). Among the 25 COG categories, the cluster for “general function prediction” represents the largest group (2,458, 17.1%) followed by “transcription” (1,173, 8.2%) and “posttranslational modification, protein turnover, chaperones” (1,146, 8.0%). The following categories: extracellular structures (2, 0.014%); nuclear structure (3, 0.02%) and cell motility (78, 0.5%), represent the smallest groups ([68]Fig. 3). To identify the biological pathways that are active in Taxus mairei, we mapped the 23,515 annotated sequences to the reference canonical pathways in Kyoto Encyclopedia of Genes and Genomes (KEGG) [69][23]. In total, we assigned 11,550 sequences to 124 KEGG pathways. The pathways with most representation by the unique sequences were “metabolic pathways” (2,867 members); “biosynthesis of plant hormones” (730 members) and “spliceosome” (675 members). In secondary metabolism subclass, most transcripts were mapped to “biosynthesis of phenylpropanoids” (611), “biosynthesis of alkaloids derived from shikimate pathway” (309), “flavonoid biosynthesis” (254), “carotenoid biosynthesis” (156), “diterpenoid biosynthesis” (94), “metabolism of xenobiotics by cytochrome P450” (90), and “terpenoid backbone biosynthesis” (69). These annotations provide a valuable resource for investigating specific processes, functions and pathways and allow for the identification of novel genes involved in the pathways of secondary metabolite synthesis. Figure 2. Histogram presentation of Gene Ontology classification. [70]Figure 2 [71]Open in a new tab The results are summarized in three main categories: biological process, cellular component and molecular function. The right y-axis indicates the number of genes in a category. The left y-axis indicates the percentage of a specific category of genes in that main category. Figure 3. Histogram presentation of clusters of orthologous groups (COG) classification. [72]Figure 3 [73]Open in a new tab Out of 23515 nr hits, 14,381 sequences have a COG classification among the 25 categories. A, RNA processing and modification; B, Chromatin structure and dynamics; C, Energy production and conversion; D, Cell cycle control, cell division, chromosome partitioning; E, Amino acid transport and metabolism; F, Nucleotide transport and metabolism; G, Carbohydrate transport and metabolism; H, Coenzyme transport and metabolism; I, Lipid transport and metabolism; J, Translation, ribosomal structure and biogenesis; K, Transcription; L, Replication, recombination and repair; M, Cell wall/membrane/envelope biogenesis; N, Cell motility; O, Posttranslational modification, protein turnover, chaperones; P, Inorganic ion transport and metabolism; Q, Secondary metabolites biosynthesis, transport and catabolism; R, General function prediction only; S, Function unknown; T, Signal transduction mechanisms; U, Intracellular trafficking, secretion, and vesicular transport; V, Defense mechanisms; W, Extracellular structures; Y, Nuclear structure; Z, Cytoskeleton. Detection of sequences related to the taxane biosynthetic pathway and metabolism In plants, the plastid-localized 2-C-methyl-d-erythritol 4-phosphate (MEP) pathway provides the precursors for the synthesis of terpenes. All genes encoding the seven enzymes in the plastidial MEP pathway [74][24] were identified in the Illumina datasets ([75]Table S3). The sequences encoding 1-deoxy-D-xylulose-5-phosphate synthase (DXS), the first enzyme of the MEP pathway, were substantially abundant with RPKM 910.6. The two enzymes catalyzing the slow steps of the mevalonate-independent pathway, 4-hydroxy-3-methylbut-2-en-1-yl diphosphate synthase (HDS) and 4-hydroxy-3-methylbut-2-enyl diphosphate reductase (HDR), held RPKM 496.3 and 365.7, respectively. The downstream IPP isomerase (IDI) and geranylgeranyl diphosphate synthase (GGPPS) held RPKM 173.2 and 787.7, respectively. The reaction catalyzed by GGPPS transforms the IPP and the isomer DMAPP into the GGPP, which plays a pivotal role in the isoprenoid metabolism and the taxane biosynthetic pathway. The downstream enzyme of DXS, DXP reductoisomerase (DXR; RPKM 168.3), and GGPPS were quite upregulated in methyl jasmonate-induced Taxus cells [76][25]. For the first time, all 13 known genes involved in the paclitaxel biosynthesis were identified in the Illumina datasets. As an enzyme of the early step, taxane 10β-hydroxylase (T10OH), with RPKM 2611.2, was well represented ([77]Table S3). Taxadiene synthase (TS), with RPKM 2,106.5, catalyzes the cyclization of the linear isoprenoid substrate GGPP to form taxa-4(5),11(12)diene. Taxane 2α-O-benzoyltransferase (TBT), involved in the production of 10-deacetylbaccatin III (10-DAB) in the middle step, held RPKM 473.0. Taxoid 14β-hydroxylase (T14OH), which is responsible for diversion of the pathway to 14-hydroxy taxoids that are prominent metabolites of Taxus cell cultures [78][26], held RPKM 623.2 in the present study. Although the biosynthesis of paclitaxel and related taxoids are beginning to be understood, many steps still remain to be addressed. The C1β-hydroxylation and the C4,C20β-epoxidation leading to oxetane ring formation, oxidation of the C9-hydroxyl group to the corresponding carbonyl in the taxane core formation, CoA-dependent ligation to β-phenylalanine, 2′-hydroxylation of N-debenzoyl-2′-deoxytaxol, and N-benzoylation of N-debenzoyltaxol in the side-chain assembly are the remaining steps [79][25], [80][27]. Available evidence suggests that C4, C20β-epoxidation occurs at the level of a taxadien-hexaol and is followed by ring expansion toward oxetane construction [81][28]. These steps could be mediated by cytochrome P450 (CYP). In the side-chain assembly, the catalyzed transformation of 2′-hydroxylation of β-phenylalanoyl baccatin III (N-debenzoyl-2′-deoxytaxol) into phenylisoserinoyl baccatin III (N-debenzoyltaxol) implies the involvement of CYP [82][29]. Meanwhile, paclitaxel biosynthesis competes with side routes produced by numerous acyl/aroyl transferases or the oxidation of the taxane nucleus derived from common precursors. Identification of these side-route genes could have an important implication in eventually increasing paclitaxel yields [83][30]. For this purpose, we screened our Illumina datasets and discovered 211 candidate genes of CYP, seven epoxidases, 34 CoA ligases, 32 N-benzoyltransferases, 102 glycosyltransferases (GTs), and six xylosyltransferases (XYLTs) ([84]Table S1). These sequences provide useful information for elucidating the remaining steps and exploring the side-route enzymes in paclitaxel biosynthetic pathway. Digital gene expression (DGE) library sequencing Three DGE libraries were made from the Taxus root, stem, and leaf, respectively, and sequenced using Illumina technology. We first evaluate the sequencing quality. The total tags of root, stem, and leaf are 3,520,076, 3,527,059, and 3,665,630, respectively ([85]Table 1, [86]Fig. 4). Clean tags are those used in the analysis after filtering the dirty tags. 94.17% and 95.17% of the total tags in the leaf and the stem are clean tags, respectively, which is higher than that in the root (89.59%) ([87]Fig. 4A–C). More tags with copy number <2 are in the root than in the leaf and the stem (9.95% vs. 4.62%, 4.72%). The leaf library has more tags containing N than the root and the stem libraries (1.20% vs. 0.45%, 0.12%). The distinct tags of root, stem, and leaf are 512,507, 263,301, and 285,590, respectively ([88]Fig. 4D–F). The distribution of the distinct tags in the three tissues is significantly different (p<0.0001). 34.66% and 35.65% of the distinct tags in the leaf and the stem are clean tags, respectively, which is higher than that in the root (29.58%). The leaf library has more distinct tags containing N than the root and the stem libraries (5.95% vs. 2.02%, 1.13%). Despite the above differences, results suggest that the sequencing quality is high enough to allow for the following analyses. Table 1. Mapping DGE library sequences to the reference transcriptome database. Root Stem Leaf Raw data Total 3,520,076 3,527,059 3,665,630 Distinct tag 512,507 263,301 285,590 Clean tag Total number 3,153,516 3,356,529 3,452,074 Distinct tag number 151,577 93,874 98,972 All tag mappingto gene Total number 1,400,670 1,420,764 1,575,977 Total % of clean tag 44.42% 42.33% 45.65% Distinct tag number 50,958 40,579 41,072 Distinct tag % of clean tag 33.62% 43.23% 41.50% Unambiguous tagmapping to gene Total number 1,388,905 1,412,459 1,564,537 Total % of clean tag 44.04% 42.08% 45.32% Distinct tag number 50,568 40,313 40,785 Distinct tag % of clean tag 33.36% 42.94% 41.21% All tag-mappedgenes Number 17,477 16,142 16,314 % of ref genes 47.89% 44.23% 44.70% Unambiguous tag-mapped genes Number 17,285 15,966 16,163 % of ref genes 47.37% 43.75% 44.29% [89]Open in a new tab Figure 4. Evaluation of DGE library sequencing quality. [90]Figure 4 [91]Open in a new tab A, distribution of root total tags; B, distribution of leaf total tags; C, distribution of stem total tags; D, distribution of root distinct tags; E, distribution of leaf distinct tags; F, distribution of stem distinct tags. The saturation analysis was performed to check whether the number of detected genes keep increasing when sequencing amount (total tag number) increases ([92]Fig. S1). When sequencing amount of three DGE libraries reaches 2 M or higher, the number of detected genes almost ceases to increase. Heterogeneity and redundancy are two significant characteristics of mRNA expression. Small number categories of mRNA have very high abundance, while the rest majority stays at a very low level of expression. The distribution of clean tag expression can be used to evaluate the normality of the whole DGE data ([93]Fig. 5). The clean tags of the root, the stem, and the leaf are 3,153,516, 3,356,529, and 3,452,074, respectively ([94]Table 1, [95]Fig. 5A–C). The distribution of the clean tags in the three libraries is significantly different (p<0.0001). 70.82% and 70.91% of the clean tags in the leaf and the stem possess more than 100 copies, respectively, which is higher than that in the root (58.36%). 8.47% of the clean tags in the root possess copies between two and five, while only 4.63% and 4.47% of the clean tags in the leaf and the stem have such a low abundance, respectively. There are 151,577, 93,874, and 98,972 distinct clean tags in the root, stem, and leaf, respectively ([96]Fig. 4 and [97]Fig. 5D–F). The distribution of the distinct clean tags in the three tissues is significantly different (p<0.0001). 64.75% of the distinct clean tags in the root have low copy numbers between two and five, which is higher than those in the leaf (57.52%) and the stem (56.89%). DGE method generates absolute rather than relative gene expression measurements and avoids many inherent limitations of microarray analysis. Figure 5. Distribution of total tags and distinct tags over different tag abundance categories. [98]Figure 5 [99]Open in a new tab Distribution of the total clean tags in the root (A), leaf (B), and stem (C). Numbers in the square brackets indicate the range of copy numbers for a specific category of tags. For example, [100][2], [101][5] means all the tags in this category has 2 to 5 copies. Numbers in the parentheses show the total tag copy number for all the tags in that category. Distribution of the distinct clean tags in the root (D), leaf (E), and stem (F). Numbers in the square brackets indicate the range of copy numbers for a specific category of tags. Numbers in the parentheses show the total types of tags in that category. Mapping sequences to the reference transcriptome database To reveal the molecular events behind DGE profiles, we mapped the tag sequences of the three DGE libraries to our transcriptome reference database that contains 36,493 Unigene sequences. There are 117,847 unambiguous reference tags, i.e., 99.54% of all reference tags. 81.54% (29,757) of Unigenes possess CATG restriction site of NlaIII enzyme used in the tag library construction. Among the 93,874 (stem) to 151,577 (root) distinct tags generated from the Illumina sequencing of the three libraries, 40,313 (stem) to 50,568 (root) distinct tags were mapped to a gene in the reference database ([102]Table 1, [103]Fig. S2). Tags mapped to a unique sequence are the most critical subset of the DGE libraries as they can explicitly identify a transcript. Up to 50.7% (18,512) of the sequences in our transcriptome reference database could be unequivocally identified by unique tag ([104]Table S4). Next, the level of gene expression was determined by calculating the number of unambiguous tags for each gene and then normalizing this to the number of transcripts per million tags (TPM) ([105]Fig. S3). The dynamic range of DGE spanned five orders of magnitude, i.e., from zero to 36,696 tag counts. However, the tag counts for the majority of genes were low in the three libraries. The median tag counts are only 18, 14, and 14 in the root, stem, and leaf, respectively ([106]Table S4), suggesting the ability of DGE in detecting rare transcripts. [107]Fig. S3 shows that the mRNA transcribed from the major kinds of genes is represented in fewer than 315 tags (about TPM 100) and only a small proportion of genes are highly expressed. The frequency distribution of gene expression in the leaf is more similar to that in the stem than that in the root. Differentially expressed genes (DEGs), antisense transcripts and novel transcripts To identify genes showing a significant expression difference in different Taxus tissues, the differentially expressed tags between two samples were identified by an algorithm modified from Audic and Claverie [108][31]. A total of 15,660 significantly changed tag entities were detected between the root and leaf libraries. Those tags were mapped to 6,740 genes with 1,854 genes up-regulated (higher expression in the leaf) and 2,830 genes down-regulated ([109]Fig. 6, [110]Table S5). Between the root and stem libraries, a total of 3,584 DEGs were detected with less up-regulated genes (1,339) and more down-regulated genes (2,245) ([111]Table S6). Between the stem and leaf libraries, 1,332 DEGs were detected with 726 up-regulated genes and 606 down-regulated genes ([112]Table S7). This suggests that there are more DEGs between root and leaf and between root and stem than those between stem and leaf. Next we analyzed the expression of antisense transcripts in the three tissues. Sense-antisense plays an important role on gene expression regulation. Sequencing tags mapped to the complementary strand of the sense gene suggest that its antisense strand also has transcripts, and this gene may use the sense-antisense regulation. The antisense transcripts were detected in 8,954, 8,597, and 9,563 Unigenes of the root, stem, and leaf libraries, respectively ([113]Table 2). 21.3%–22.6% of these genes are orphan sequences - no homologues found in the NCBI database. 25.6%–26.6% of Unigenes with TPM(+)/TPM(−) ≥1 are orphan sequences, which might be consistent with the expectation that Taxus mairei expresses genes with no homologues in other species. 3.99%–5.49% of Unigenes with TPM(+)/TPM(-) <1 are orphan sequences, which might be functional in the post-transcriptional gene regulation. Figure 6. Changes in gene expression profile among the different Taxus tissues. [114]Figure 6 [115]Open in a new tab The number of up-regulated and down-regulated genes between root and leaf, root and stem, and stem and leaf are summarized. For example, “1,854” means that the expression of 1,854 genes is statistically significantly higher in the leaf than in the root; “2,830” means that the expression of 2,830 genes is statistically significantly lower in the leaf than in the root. Table 2. Antisense regulation of genes in the three Taxus tissues. Root Stem Leaf No. of Unigenes with antisense(−) transcript 8,954 8,597 9,563 TPM(−) min 0.63 0.6 0.58 max 1,908.35 3,867.09 5,143.86 median 1.59 1.49 1.45 Percentage of sequenceswithout Blast hit 22.6% (2029) 22.1% (1907) 21.3% (2042) TPM(+)/TPM(−) ratio min 0 0 0 max 1,158.30 990.60 1,241.13 median 7.23 7.34 6.35 No. of Unigenes with TPM(+)/TPM(−)<1 1,310 1,719 2,230 Percentage of sequences without Blast hit 5.49% (72) 4.36% (75) 3.99% (89) No. of Unigenes with TPM(+)/TPM(−)≥1 7,644 6,878 7,333 Percentage of sequences without Blast hit 25.6% (1957) 26.6% (1832) 26.6% (1953) [116]Open in a new tab TPM(+), Normalized expression level of sense genes; TPM(−), Normalized expression level of antisense genes. It is unknown whether the expression profiles of paclitaxel biosynthetic genes differ in the three Taxus tissues, therefore we analyzed the DEGs involved in this secondary metabolism pathway. Comparing the root and the leaf, we found the expression of 23 Unigenes was significantly higher in the root than in the leaf (FDR ≤0.001 and ∣log[2]Ratio∣≥1, [117]Tables 3 and [118]S8), while only six Unigenes had significantly lower expression in the root than in the leaf. Similarly, the expression of 29 Unigenes was much higher in the root than in the stem, while only one Unigene had much lower expression in the root than in the stem. Sixteen Unigenes were expressed less abundantly in both the leaf and the stem than in the root, including Unigene 35278 (GGPPS), Unigenes 11020 and 28660 (taxadiene 5α hydroxylase, T5OH), Unigene 32910 (taxadien-5α-ol O-acetyltransferase, TAT), Unigenes 13330, 1373, 8115, and 26700 (T10OH), Unigene 14478 (taxoid 2α hydroxylase, T2OH), Unigene 17688 (taxoid 7β hydroxylase, T7OH), Unigene 2582 (TBT), Unigene 11534 (phenylpropanoyl transferase, BAPT), Unigenes 15942, 24051, and 5777 (3′-N-debenzoyl-2′-deoxytaxol N-benzoyltransferase, DBTNBT), and Unigene 21200 (phenylalanine aminomutase, PAM). In contrast, the expression of 61 Unigenes was comparable between the stem and the leaf, while only eight Unigenes had much lower expression in the stem than in the leaf. These results imply that the expression level of genes involved in the paclitaxel biosynthetic pathway is higher in the root than in the leaf and the stem, which might result in more enzyme products and higher enzyme activities in the root. Since the antisense regulation is a ubiquitous mechanism, we also analyzed whether there is potential antisense regulation involved in the paclitaxel biosynthetic pathway. Forty two, 25, and 37 Unigenes in the root, stem, and leaf, respectively, were found to have antisense expression, ranging from TPM 0.58 to TPM 10.15 ([119]Table S9). Nine, eight, and 14 Unigenes in the root, stem, and leaf, respectively, were found to have sense/antisense expression ratio of not more than one. For example, Unigene 35249, representing the DXS of the MEP pathway, had antisense expression but no sense expression in all three tissues; Unigene 14905, the T10OH of the paclitaxel biosynthetic pathway, had antisense expression but no sense expression in the leaf and the stem. The role of antisense expression in the regulation of gene expression is not well known especially not outside model species. It should be noted that what is detected as antisense transcripts could belong to the transcript belonging to another gene transcribed on the other strand as well as misplaced short reads. Table 3. Number of DEGs involved in the taxane biosynthetic pathway. Root vs. leaf Root vs. stem Stem vs. leaf TPM root>leaf TPM rootstem TPM rootleaf TPM stem98%) including 10-DAB, baccatin III, and paclitaxel were purchased from Sigma. Other taxane standards (purity >96%) used in this study including DAXT, 10-DAT, and cephalomannine were purchased from Shanghai Jinhe Biotechnology Co. Ltd. Plant samples were collected from several yew plantations in China ([176]Tables 5 and 6). All plant samples were collected from trees whose age was known. Yew needles were separated from stem and yew hair roots (i. d. ≤2 mm) and washed with water to remove soil before drying. Yew materials were air-dried (≤50°C), ground to fine particles and sieved through a 40-mesh screen. The extraction and sample preparation were performed as previously described [177][4], [178][5]. Instruments and analytical conditions All assays were performed on a Shimadzu Prominence UFLC system equipped with a CBM-20A communications bus module, an SIL-20ACHT autosampler, two LC-20AD pumps, a DGU-20A3 vacuum degasser, a CTO-20AC column oven, and an SPD-M 20A diode array detector. A Shim-pack XR-ODS (100 mm×2.0 mm, 2.2 µm; Shimadzu) analytical column with an ODS guard column (5 mm×2.0 mm, 2.2 µm; Shimadzu) was used and kept at 50°C. The mobile phase consisted of water (A) and CH[3]CN (B). The following gradient condition was used: 0–11 min, 69–52.5% A; 11–12 min, 52.5–5% A; 12–15 min, 5% A; and 15–18 min, 69% A. The flow rate was set at 0.4 ml/min, and the injection volume was 1 µl. DAD detection was achieved in the range of 190–370 nm, and the wavelength was set at 230 nm for quantitative analysis. Mass detection and validation of the quantitative analysis were performed as previously described [179][4], [180][5]. Supporting Information Table S1 Annotation of 36,493 Unigenes assembled from 13,737,528 Illumina sequence reads. (XLS) [181]Click here for additional data file.^ (9.6MB, xls) Table S2 Taxus mairei Unigene sequence alignment with Picea glauca , Korea T. cuspidata , and China T. cuspidate. All clone sequences (Closeq) and cluster sequences (Cluseq) are retrieved from [182]http://www.arborea.ulaval.ca/research/sequencing/gene_catalogue/in dex.html. Clone sequences are all in the direct orientation (5′->3′) and correspond to either ESTs, or assemblies of ESTs when clones produced several ESTs. There is a single clone sequence per clone, which contains its completion status in the FASTA header line. For each cluster, a representative clone is selected (usually the longest and most complete cDNA) to represent a gene. Thus, cluster sequences are nothing else than the clone sequence of their representative clones. QueryX, start position of Taxus mairei Unigene sequence that can be aligned with Picea glauca sequence (alignment parameters E value ≤1e-5 and the sequence identity ≥80%); QueryY, end position of T. mairei Unigene sequence; SbjctX, start position of P. glauca sequence that is aligned with T. mairei Unigene sequence; SbjctY, end position of P. glauca sequence; 5′ read (F), Unigene of T. mairei that can be aligned with 5′ CDS sequence of P. glauca in the forward orientation; 5′ read (R), Unigene of T. mairei that can be aligned with 5′ CDS sequence of P. glauca in the reverse orientation; 3′ read (F), Unigene of T. mairei that can be aligned with 3′ CDS sequence of P. glauca in the forward orientation; 3′ read (R), Unigene of T. mairei that can be aligned with 3′ CDS sequence of P. glauca in the reverse orientation; is CDS (F), Unigene of T. mairei that can be aligned with full length CDS sequence of P. glauca in the forward orientation; is CDS (R), Unigene of T. mairei that can be aligned with full length CDS sequence of P. glauca in the reverse orientation. Unigene of T. mairei that can be aligned with the internal (not 5′, 3′, or full length) CDS sequence of P. glauca is not annotated with the above six terms. (XLS) [183]Click here for additional data file.^ (10.6MB, xls) Table S3 Genes related to the taxane biosynthetic pathway and metabolism. (DOC) [184]Click here for additional data file.^ (55KB, doc) Table S4 Number of unambiguous clean tags for each gene and normalized TPM (number of transcripts per million clean tags) values in the three DGE libraries. (XLS) [185]Click here for additional data file.^ (2.7MB, xls) Table S5 Differentially expressed genes between root and leaf. TPM: transcript copies per million tags. Raw intensity: the total number of tags sequenced for each gene. FDR: false discovery rate. We used FDR ≤0.001 and the absolute value of log[2]Ratio ≥1 as the threshold to judge the significance of gene expression difference. In order to calculate the log[2]Ratio and FDR, we used TPM value of 0.001 instead of 0 for genes that do not express in one sample. (XLS) [186]Click here for additional data file.^ (1.5MB, xls) Table S6 Differentially expressed genes between root and stem. (XLS) [187]Click here for additional data file.^ (1.1MB, xls) Table S7 Differentially expressed genes between stem and leaf. (XLS) [188]Click here for additional data file.^ (426.5KB, xls) Table S8 DEGs involved in the taxane biosynthetic pathway. (XLS) [189]Click here for additional data file.^ (64KB, xls) Table S9 Antisense regulation of genes involved in the paclitaxel biosynthetic pathway. (DOC) [190]Click here for additional data file.^ (39KB, doc) Table S10 Novel transcripts detected by mapping DGE tags to the Taxus fosmid end sequences (FES) and the Vitis genome. (DOC) [191]Click here for additional data file.^ (49KB, doc) Table S11 Clusters containing the taxane biosynthetic genes detected in the clustering analyses of DEGs. (XLS) [192]Click here for additional data file.^ (46.5KB, xls) Table S12 Pathway enrichment analysis for DEGs (root vs. leaf). (DOC) [193]Click here for additional data file.^ (49KB, doc) Table S13 Quantitative differences in average concentration of six valuable taxanes in the needles of Taxus mairei (different age). (DOC) [194]Click here for additional data file.^ (38KB, doc) Table S14 Quantitative differences in average concentration of six valuable taxanes in the roots of Taxus mairei (different age). (DOC) [195]Click here for additional data file.^ (38.5KB, doc) Figure S1 The saturation analysis of the DGE library sequencing. A, root; B, leaf. The results revealed that with the increase of total sequence number (sequencing depth), the number of new distinct tag decreased markedly. (TIF) [196]Click here for additional data file.^ (231.8KB, tif) Figure S2 Statistics of DGE clean tag alignment. PM(Sense), perfect match to gene (sense); 1 tag->1 gene, match to one gene; 1 tag->n gene, match to more than one gene; 1 MM(Sense), match to gene (sense) with 1 bp mismatch; PM(AntiSense), perfect match to anti-sense gene; 1 MM(AntiSense), match to anti-sense gene with 1 bp mismatch; PM Genome 1 tag->1 position, perfect match to the Vitis genome sequence with one best hit; PM Genome 1 tag->n position, perfect match to the Vitis genome sequence with multiple best hits; 1 MM Genome, match to the Vitis genome sequence with 1 bp mismatch; Unkown Tag, not match to gene (sense and anti-Sense) and the Vitis genome sequence. (TIF) [197]Click here for additional data file.^ (1.3MB, tif) Figure S3 The level of gene expression for each gene. Gene expression level was determined by calculating the number of unambiguous tags for each gene and then normalizing to TPM (transcript copies per million tags). A, root; B, stem; C, leaf. (TIF) [198]Click here for additional data file.^ (222.2KB, tif) Figure S4 Prediction of protein coding sequence (CDS) from the assembled Unigenes. Unigenes are firstly aligned by Blastx (E value <0.00001) to protein databases in the priority order of nr, Swiss-Prot, KEGG and COG. Unigenes aligned to databases with higher priority will not enter the next circle. The alignments end when all circles are finished. Proteins with highest ranks in Blast results are taken to decide the coding region sequences of Unigenes, then the coding region sequences are translated into amino acid sequences with the standard codon table. Thus both the nucleotide sequences (5′-3′) and amino acid sequences of the Unigene coding region are acquired. Unigenes that cannot be aligned to any database are scanned by ESTScan ([199]http://www.ch.embnet.org/software/ESTScan.html) to get the nucleotide sequence (5′-3′) and amino acid sequence of the coding regions. A, length distribution of CDS predicted from Blast results and by ESTScan. 1, 200; 2, 300; 3, 400; 4, 500; 5, 600; 6, 700; 7, 800; 8, 900; 9, 1,000; 10, 1,100; 11, 1,200; 12, 1,300; 13, 1,400; 14, 1,500; 15, 1,600; 16, 1,700; 17, 1,800; 18, 1,900; 19, 2,000; 20, 2,100; 21, 2,200; 22, 2,300; 23, 2,400; 24, 2,500; 25, 2,600; 26, 2,700; 27, 2,800; 28, 2,900; 29, 3,000; 30, >3,000. B, gap (N) distribution of CDS predicted from Blast results and by ESTScan. 1, 0; 2, 0.01; 3, 0.02; 4, 0.03; 5, 0.04; 6, 0.05; 7, 0.06; 8, 0.07; 9, 0.08; 10, 0.09; 11, 0.1; 12, 0.11; 13, 0.12; 14, 0.13; 15, 0.14; 16, 0.15; 17, 0.16; 18, 0.17; 19, 0.18; 20, 0.19; 21, 0.2; 22, 0.21; 23, 0.22; 24, 0.23; 25, 0.24; 26, 0.25; 27, 0.26; 28, 0.27; 29, 0.28; 30, 0.29; 31, 0.3; 32, >0.3. (TIF) [200]Click here for additional data file.^ (295.8KB, tif) Figure S5 Pipeline of the transcriptome (mRNA-seq) bioinformatic analysis. (TIF) [201]Click here for additional data file.^ (110.8KB, tif) Figure S6 Pipeline of the DGE bioinformatic analysis. (TIF) [202]Click here for additional data file.^ (141.8KB, tif) Acknowledgments