Abstract Gene discovery in the Malaysian giant freshwater prawn (Macrobrachium rosenbergii) has been limited to small scale data collection, despite great interest in various research fields related to the commercial significance of this species. Next generation sequencing technologies that have been developed recently and enabled whole transcriptome sequencing (RNA-seq), have allowed generation of large scale functional genomics data sets in a shorter time than was previously possible. Using this technology, transcriptome sequencing of three tissue types: hepatopancreas, gill and muscle, has been undertaken to generate functional genomics data for M. rosenbergii at a massive scale. De novo assembly of 75-bp paired end Ilumina reads has generated 102,230 unigenes. Sequence homology search and in silico prediction have identified known and novel protein coding candidate genes (∼24%), non-coding RNA, and repetitive elements in the transcriptome. Potential markers consisting of simple sequence repeats associated with known protein coding genes have been successfully identified. Using KEGG pathway enrichment, differentially expressed genes in different tissues were systematically represented. The functions of gill and hepatopancreas in the context of neuroactive regulation, metabolism, reproduction, environmental stress and disease responses are described and support relevant experimental studies conducted previously in M. rosenbergii and other crustaceans. This large scale gene discovery represents the most extensive transcriptome data for freshwater prawn. Comparison with model organisms has paved the path to address the possible conserved biological entities shared between vertebrates and crustaceans. The functional genomics resources generated from this study provide the basis for constructing hypotheses for future molecular research in the freshwater shrimp. Background The Malaysian giant freshwater prawns, scientifically known as Macrobrachium rosenbergii (de Man, 1879), is the largest Palaemonid shrimp of one of the most diverse genera for freshwater Crustacea [39][1], [40][2]. It is commonly found along coastal river systems: it lives in the freshwater as an adult but the larvae require a brackish environment for survival and development [41][3], [42][4]. This discovery had led to the development of mass rearing techniques for commercial production [43][5]. Since then, this species has been widely cultured both in and outside its native range and for human consumption. Annual revenue generated from its commercial production has been estimated at approximately one billon USD per annum (FAO, 2002). Its commercial significance has thus spawned interest on research addressing various aspects that are directly or indirectly affecting commercial production: fisheries, nutrition, reproduction and bacterial and viral disease as well as environmental stress response. Despite great interest in this organism, gene discovery in M. rosenbergii has been conducted at a relatively small scale compared to another freshwater prawns of the same genus, and also compared with other economically important shrimp including the oriental river prawn (M. nipponense), the marine Penaeid shrimps (Litopenaeus vannamei, Penaeus monodon, Marsupenaeus japonicus and Fenneropenaeus chinensis) and other marine Malacostracas (Petrolisthes cinctipes, Homarus americanus). As compared to the currently available 11 expressed sequence tags (ESTs) of M. rosenbergii in the public EST database (accessed on June 2, 2010), the EST sequencing of the oriental river prawn, M. nipponense, ovary cDNA library has already generated 3,256 ESTs, thus representing the largest addition to the freshwater shrimp EST data [44][6], far exceeding EST information available for M. rosenbergii. Gene discovery through EST sequencing has also been conducted on more than nine tissue types and various physiological conditions for marine Penaeid shrimps to identify immune-related, reproduction-related, sex-related and differentially expressed genes between normal and diseased individuals [45][7]. In addition, the third largest EST sequencing project for crustaceans is currently represented by transcriptome characterization of the porcelain crab, which generated approximately 100,000 high quality ESTs [46][8], providing an even greater amount of sequence information for genome level comparative study among the Malacostracas in the crustacean taxa. Next generation sequencing (NGS) technologies are revolutionizing genomic research with their ability to provide enormous amounts of sequence data with a greater breadth and depth of information in shorter times and at a significantly lower cost than the traditional Sanger sequencing technology. Recently, NGS technology has been used to sequence and characterize the whole transcriptome of organisms in various organs and cell lines [47][9], [48][10], [49][11]. RNA-seq's ability to provide both qualitative and quantitative information of gene expression at higher sensitivity and accuracy has given this technology a primary advantage over precedent transcriptomic methods such as EST sequencing, serial analysis of gene expression (SAGE), massively parallel signature sequencing (MPSS) and microarrays. Furthermore, RNA-seq does not require any in-depth understanding of the genome where the transcriptome is derived from, which makes it the method of choice for rapid characterization of the entire transcriptome of non-model organisms [50][11]. Taking advantage of the RNA-seq technology, the transcriptome of three main organs in M. rosenbergii were sequenced: gills, hepatopancreas and muscle to (i) generate large amounts of EST sequence data for gene discovery, (ii) mine simple sequence repeat markers that represent a resource for trait mapping and (iii) generate differential gene expression profiles of these main organs to better understand their functions in the giant freshwater prawn. The data obtained from this study can be further utilized to develop specific DNA/RNA markers that can be used in future molecular studies to address a wide range of specific research questions. Materials and Methods Total RNA extraction Adult prawns were obtained from a hatchery at Jelebu, Negeri Sembilan, Malaysia. Gill and hepatopancreas tissues were collected from a pool of individuals that are full sibs from a breeding program which are stocked in the same pond reducing environment effectswhile muscle tissues were obtained from two individuals (one normal and one infected individual) which had been screened for infectious hypodermal and hematopoietic necrosis virus (IHHNV) infection using 389F/389R OIE primers sets [51][12]. The tissues were immediately frozen in liquid nitrogen after dissection before storage at −80^ oC until RNA extraction. Total RNA (∼20 µg) was extracted from the tissues using Trizol reagent (Invitrogen, Carlsband, CA, USA) following the manufacturers' protocols before dissolving it in diethylpyrocarbonate (DEPC) treated water to give a final concentration of >750ng/µL. The dissolved total RNA was stored at −80^ oC prior to RNA sequencing. Illumina sequencing The cDNA libraries preparation and sequencing were conducted using Illumina sequencing technology (Beijing Genome Institute, Shenzhen, China). Poly A+ mRNA was isolated from the total RNA using Sera-mag Magnetic Oligo (dT) Beads (Illumina). The extracted RNA was hydrolysed to smaller fragments (∼150–200 bp) to avoid priming bias. Double stranded cDNA was synthesized from the fractionated RNA pool using reverse transcriptase, RNase H (Invitrogen) and DNA polymerase I (New England BioLabs) primed with random hexamers. Four sequencing cDNA libraries of gills, hepatopancrease, muscle (Infected and non- infected) were prepared according to the manufacturer's protocols (Illumina) before being subjected to 75 bp paired end sequencing using an Illumina Genome Analyzer platform. De novo assembly and assessment After filtering out any contamination and low quality reads, the clean reads were clipped into 25-bp kmers, assembled into contigs and joined into scaffolds based on the reads' paired-end information using the SOAPdenovo software [52][13]. To avoid redundancy, the contigs and scaffolds from each library were subjected to clustering using TGI Clustering tool [53][14] to generate library-specific unigenes. One set of standard unigene sequences was generated by clustering library-specific unigenes from all four libraries. For assembly quality assessment, the resultant unigenes were BLASTed [54][15] against M. rosenbergii assembled EST and mRNA sequences obtained from the NCBI EST and nucleotide databases. Protein coding gene identification and classification The unigenes sequences were annotated by using homology search (BLASTX) [55][15] with E-value cutoff of 10^−5 against NCBI non-redundant (Nr) database, Swissprot, Cluster of Orthologous Groups database (COG) [56][16], [57][17] and Kyoto Encyclopedia of Genes and Genome (KEGG) database [58][18]. Gene Ontology [59][19] assignment was conducted using BLAST2GO software [60][20] with default parameters. The coding sequence and the gene direction of the annotated unigenes were determined based on the BLAST results from four databases aforementioned. For those sequences that had no annotation, the coding sequence and direction were predicted using ESTscan [61][21] using BLAST predicted coding sequence data as the training set. Non-coding RNA identification and characterization To identify the non-coding RNA in the unigenes, the remaining unigenes that had no BLAST protein annotation and no coding sequence predicted using both EMBOSS getORF and ESTscan were subjected to BLAST (E-value < = 0.01) sequence homology search against the Rfam 10.0 database [62][22], [63][23]. RNA structure and sequence similarity search (INFERNAL program) using covariance model (CMsearch) [64][24] was conducted on the significant hits from the BLAST search results. Repetitive elements and simple sequence repeat (SSR) markers mining Repetitive elements in the whole transcriptome were identified using RepeatMasker 3.2.9 [65][25] with the species option set as Arthropoda. Simple sequence repeats (SSR) or microsatellites identified were further characterized for their distribution across the transcriptome. SSR loci selection for primer design was carried out using the Window based iQDD software [66][26]. Microsatellite was set as minimum five repetitions of 2–6 bp motifs. ‘All against all’ BLAST (E-value cutoff: 10^−40) [67][15] was performed on sequences with soft-masked microsatellites. Only the unique sequences which had no BLAST hit to other unigenes and no short repeats in the flanking regions were selected for the PCR primer design (default parameters, minimum PCR product size  = 150). Thirty five primer sets amplifying SSR loci associated with known protein coding genes were selected for further testing. They were tested for successful SSR amplification and size verification of expected PCR products in 16 individuals. Each PCR amplification was performed in a total volume of 10 µl of PCR mixture consisting of 1.2 µl of distilled water, 3.0 µl of 5X Green GoTaq PCR Buffer (Promega), 1.5 µl of MgCl[2] (25 mM), 0.25 µl of each dATP, dGTP, dCTP and dTTP, 0.5µl of each forward primer and reverse primer (10 µM), 2.0 µl (20ng) of DNA from each sample and 0.3 µl Promega GoTaq® Polymerase (5 U/µl). The Bio-Rad C1000 Thermal Cycling profile was as follows: 3 min of initial denaturation at 96°C, 39 cycles of denaturation at 94°C for 10 s, annealing for 45 sec (shown in [68]Table 1 for primers with polymorphic loci only) and extension at 72°C for 30 sec. The final extension at 72°C for 7 min was performed before it was maintained at 4°C. PCR products were run on 1.0% agarose gels stained with ethidium bromide to test for successful amplification and on 4.0% Metaphor® agarose gel for the first polymorphism screening. Subsequent polymorphism genotyping using ABI 3130 genetic analyzer (ABI, Inc.) will be performed on four populations of cultured prawns, and the results described in a separate publication. Table 1. Primers and potential polymorphic status for 23 successfully amplified SSR. LocusID GeneID Motif Primer_sequence F: forward; R: reverse Tm PCR ProductSize Potential Polymorphic MR5 Unigene7776_All (TC)[6] F: TTCCCCAATGCTTCTTCATC R: ACGCACCTCCTTGTATCCAC 55.0 150–177 YES MR7 Unigene4546_All (CT)[6] F: GTTTCCGATGCCAAAGACAT R: ATATCGACGGCCTCTTGGTA 63.3 150–170 NULL MR8 Unigene9496_All (TC)[6] F: ACTTCTTGGCTTCAAGGGCT R: TCCAGTCAAAAGAATTCGCA 55.0 150–200 YES MR13 Unigene26276_All (TC)[7] F: TGGACATCTTTGCATAGCCA R: CACATCGGGGTTATTTTGGT 61.4 150–160 YES MR14 Unigene12899_All (CT)[8] F: CTCTGCTTCGTAAAATCGCC R: GAACACTTTTGGCATGGGAG 61.4 150–165 YES MR18 Unigene54607_All (TC)[10] F: GTCCTCACAGCTGGTTCGAT R: ATTTAACCCCTCGCCATTCT 65.0 250–286 YES MR20 Unigene9630_All (TG)[13] F: GAACCGCATACATTTTCCCC R: GAGAAATTTGGACATTGTCGG 63.3 150–150 NULL MR23 Unigene51169_All (TC)[23] F: CAAAGTGAGATTCATTACGGAGG R: GCCTTCATTTGGCATTGAAA 55.0 150–152 NO MR24 Unigene2731_All (AGC)[5] F: ACTCCACCAAACATTGAGCC R: CCCCTCAAGTGGTCAGTCAT 61.4 150–180 NULL MR31 Unigene16282_All (TTG)[5] F: CGCCCAAGATCTGATCCAC R: ATCTCAACAGTAACATGGACTCAAAC 64.5 150–150 NO MR32 Unigene10567_All (ACC)[5] F: AATCGATCATCACCAGCCTC R: TTGTTCCAACAGAACCCTCC 64.5 150–184 YES MR39 Unigene7685_All (ATT)[6] F: GGTGGACTGAGACTGGCTTC R: TGACAATGCAGATTCCCAAA 64.5 250–294 NO MR40 Unigene18380_All (TCC)[6] F: TCAGAGATGTATTCCCCACAGA R: TCCCCTGATCTTTAAATCCTCC 64.5 150–150 NO MR41 Unigene11347_All (TCA)[7] F: TCTCGTGTGACATAGGCAGC R: TTGGAAGCAGAGAACAAGATTTC 63.3 150–199 YES MR47 Unigene9912_All (GGA)[7] F: AAGTGGAGGTGGAACAGGTG R: CTGAGACGGTCTTCTCCCTG 60.0 300–302 YES MR51 Unigene826_All (CTT)[7] F: AGCTGTACACCTCTGGCTCG R: CTACGAAACGCATGGTTGG 63.3 150–150 YES MR52 Unigene9209_All (TCC)[7] F: CCACTCCATTCACTCCCACT R: CTACACCACCACAGACACCG 63.3 200–228 YES MR53 Unigene33994_All (CAT)[7] F: TGGATGTATAACTACGACGACAGC R: CCGACTCTTCTTCGTCTTCC 61.4 150–159 YES MR55 Unigene35088_All (CCT)[6] F: GTCCGAGTGGCCTAGGGT R: TTGGAATCCAGCTCTGAAGG 65.0 150–163 YES MR56 Unigene9209_All (CCT)[9] F: GAGACAAGCCGTGAAGGAAG R: AGTGAATGGAGTGGGTGGAG 61.4 150–183 YES MR57 Unigene33873_All (AGG)[13] F: CGCTGTGCTGTACATGACCT R: TGGTGTTAGGAACAATGTCG 65.0 150–150 NULL MR58 Unigene51169_All (TCT)[24] F: AGTCTCCTAAGACCCCGGAA R: TATCGTCGCCATCACTAGCA 65.0 300–301 YES MR59 Unigene1832_All (ATTT)[5] F: TTTCATGGAAATTTGGCACA R: TGGTCTGAGAAGTGTTTTATTTCA 65.0 200–204 NULL [69]Open in a new tab Expression analysis and normalization Clean reads from each library were mapped to the standard unigene set using SOAP [70][27] with a maximum of two mismatches allowed. For paired end reads mapping, an insert size of range less than ∼60 bp was allowed, depending on the size of the fragments during the RNA fragmentation for each sample. The expression level based on the number of reads mapped to the standard unigenes was normalized by the number of reads per kilobase per million reads (RPKM) [71][9]. graphic file with name pone.0060839.e001.jpg (1) Where C is the number of mappable reads that fell onto the unigene, N is the total number of mappable reads in the experiment and L is the total length of the unigene in base pair. Differential gene expression analysis and functional enrichment The significance of differentially expressed genes (DGE) was evaluated by using following statistical tests: 1. Statistical algorithm developed by Audic and Claverie (1997) [72][28] graphic file with name pone.0060839.e002.jpg (2) Where N1 are total clean reads from sample 1, N2 are total clean reads from sample 2, x are number of reads from sample 1 mapped to unigene and y are number of reads from sample 2 mapped to unigene. 2. False discovery rate analysis for multiple testing (FDR; cutoff value: 0.001) [73][29] 3. The absolute value of the unigene RPKM ratio in two samples (cutoff: 1): graphic file with name pone.0060839.e003.jpg (3) GO functional and KEGG pathway enrichment of the DGE were performed and statistically tested with a hypergeometric test: graphic file with name pone.0060839.e004.jpg (4) Whereby N is the number of all unigenes with GO annotation, n is the number of DGE in N, M is number of all unigenes that are annotated to a certain GO term and m is the number of DGE in M. P-values of GO functional enrichment were re-evaluated using Bonferroni correction while FDR was calculated for KEGG pathway enrichment. The significant cutoff value for the corrected p-value and FDR was set as 0.05 for both enrichments. Results and Discussion Sequencing, assembly, and clustering output In total, 86 million clean 75 bp-paired-end reads from four M. rosenbergii tissue libraries, generated a total of 12.9 gigabases (Gb) of sequences. High quality sequencing data was reflected by the average Q[20] of 92.50% and a percentage of unknown nucleotide (N percentage) of 0.05% for the four libraries. De novo assembly of each library's clean raw reads generated a total of 595,195 contigs with size range of 75–500 bp. (Gills: 183,382; Hepatopancreas: 258,050; Normal Muscle: 84,530; Infected Muscle: 68,783). The contigs were assembled into 367,251 scaffolds of mean size 323 bp and average N[50] of 425 bp. Clustering the scaffolds of each library resulted in a total of 177,275 library-specific unigenes. Clustering the library-specific unigenes from all four libraries resulted in 102,230 standard unigenes, with mean size of 566 bp and N[50] of 746 bp. The length distribution of the unigenes obtained is illustrated in [74]Figure 1. The reads data can be obtained from the NCBI Short Read Archive (SRA) under the accession number: SRA 045433.1, while the assembled sequence can be accessed in the NCBI Transcriptome Shotgun Assembly (TSA) Sequence archive under the accession number: TSA [75]JP351514-[76]JP355722. Assembly assessment of the unigenes had been shown to match M. rosenbergii publicly available EST and full mRNA sequence with E-value <10^−25 (data not shown). Multiple unigenes matched to the same or different region of one gene can be attributed mainly to occurrence of alternative splicing, isoforms and incomplete coverage of the gene by one unigene. Figure 1. Size distribution of the assembled 102,230 unigenes. [77]Figure 1 [78]Open in a new tab The abundance of unigenes assembled from four libraries: gill, hepatopancreas, healthy muscle and infected muscle based on nucleotide length (nt) in the size range of 100–500, 500–1,000, 1,000–1,500, 1500–2,000 and more than 2,000 nt. Identification of known and novel protein coding genes BLASTX search of the unigenes against the NCBI nr, Swissprot, KEGG and COG databases returned 24,476 (23.94%), 20,261 (19.81%), 16,462 (16.10%) and 8,837 (8.64%) strong matches respectively, giving a final total of 24,739 unigenes (24.19%) annotated (Table S1 in [79]File S1). The high number of non-annotated genes (∼75%) could be attributed to the lack of well annotated protein-coding genes nor any well characterized genome of M. rosenbergii or other Crustaceans in the public databases, as well as the large number of sequences (∼70%) with short lengths of 100–500 bp ([80]Figure 1). From the BLAST annotated unigenes, only 24,500 unigenes had coding sequence longer than 300 bp. The resultant coding sequence data can be used as the training set for the prediction of novel protein coding genes. Based on the ESTScan result on the unigenes with no matches from the BLAST search, 7,595 unigenes (7.43%) were predicted to contain coding sequence longer than 300 bp and thus classified as novel protein coding genes (Table S1 in [81]File S2). All together, 32,334 unigenes (31.63%) were classified as protein coding genes in the transcriptome. The remaining unigenes were expected to consist of either messenger like non-coding RNA or fragments of untranslated region of protein coding genes. Classification of known protein coding genes Based on the protein annotation results from the Nr database homology search, 7,533 unigenes (Biological process: 5,231 unigenes; Cellular component: 4,689 unigenes; Molecular function: 6198 unigenes) could be assigned to 46 GO groups where the distribution of unigenes across all groups is presented in [82]Figure 2 (Table S1 in [83]File S1). Highly represented biological process ontology were biological regulation, cellular, metabolic, multicellular organismal and developmental processes, where each ontology was represented by more than 1,500 unigenes. GO assigned unigenes were mainly compartmentalized in the cell, organelle and macromolecular complex components. From the molecular function ontology analysis, a significantly large number of unigenes belonged to binding (4,490 unigenes) and catalytic activity (3,094 unigenes) categories. Together, the GO assignments for all three main categories were designated as representation of the general gene expression profile for the three tissues: hepatopancreas, gills and muscle of M. rosenbergii for the subsequent expression analysis. Figure 2. Gene Ontology classification of the 7,533 protein annotated unigenes. [84]Figure 2 [85]Open in a new tab Unigene sequences were systematically classified into GO sub-categories under the Biological Process, Cellular Component and Molecular Function Gene Ontology Catalogue system. Each bar represents the relative abundance of unigenes classified under each sub-category. As shown in [86]Figure 3, COG classification based on the BLAST search against the COG databases resulted in 8,837 unigenes categorized into 26 categories with the highest number of unigenes falling under general function prediction (3,947, 44.6%), followed by genetic information processing categories (transcription: 1,593, 17.9%; translation, ribosomal structure and biogenesis: 1,583, 17.7%; replication, recombination and repair: 1,569, 17.8%) and carbohydrate transport and metabolism (1,440, 16.3%). The functional categories ‘RNA processing and modification’, ‘extracellular’ and ‘nuclear’ structures had the least representation (<1% for each category) in the whole transcriptome data. Figure 3. Histogram presentation of Clusters of Orthologus Groups classification of 8,837 known protein annotated unigenes. [87]Figure 3 [88]Open in a new tab Each bar represents the number of unigenes classified into each of the 26 COG functional categories. Systematic classification of the unigenes based on KEGG biological pathway showed 16,403 unigenes being mapped to 218 KEGG pathways. At the highest categorical level, the transcriptome data covered 127 metabolic pathways, 87 human diseases pathways, 40 organismal system pathways, 19 environmental processing pathways, 17 genetic information processing pathways, and 13 cellular process pathways. The distribution of unigenes across the second categorical level of KEGG pathways is as shown in [89]Figure 4. At the most specific categorical level, 2,342 unigenes (14.3%) were mapped to the global metabolic pathways and as high as 776 (4.73%) and 717 (4.37%) unigenes were mapped to spliceosome' pathway under ‘transcription’ category and ‘regulation of actin cytoskeleton' pathway under ‘cell motility’ category respectively. In addition, ‘pathways in cancer’, ‘focal adhesion’, ‘amoebiasis’ and ‘endocytosis’ are among the highest represented pathways (>3% of unigenes for each pathway) in the transcriptome data. The least represented pathways with less than 10 unigenes categorized in each pathways were ‘asthma’, ‘phenylalanine, tyrosine and tryptophan biosynthesis’, ‘D-glutamine and D-glutamate metabolism’, ‘biotin metabolism’, ‘vitamin B6 metabolism’, ‘lipoic acid metabolism’, ‘butirosin and neomycin biosynthesis’ and ‘polyketide sugar unit biosynthesis'. Figure 4. KEGG biological pathway classification histograms for 16, 403 protein annotated unigenes. [90]Figure 4 [91]Open in a new tab Each bar represents the number of unigenes that are systematically categorized into sub-classes under Metabolism, Genetic Information Processing, Cellular Processes, Organismal Systems and Human Diseases. Massive amounts of transciptome data have allowed for the revival of systems biology that aims to elucidate the function of a cell or more complex biological organization by conceptualizing interacting genes as systems or functional groups. Emergence of databases such as COG, GO and KEGG that aim to identify correlated genes for specific physiological processes and allow comparison and description of functional features using common terminologies have facilitated the extraction of biologically meaningful information from high throughput functional genomics data. However, the comprehensiveness of the databases poses a limitation or bias to the representation of overall functional information derived from each database. From the transcriptome data in this study, differences were observed in the distribution of molecular data representing a specific biological entity in the GO, COG and KEGG classifications ([92]Figure 2–[93]4) which can be attributed to differences in the coverage of protein coding genes by each classification system (GO: ∼23%; COG: ∼27%; KEGG: ∼50%). These differences may return different outcomes for the downstream analysis (e.g. functional and pathway enrichment) which utilize the information from these functional classifications. Identification of non-coding RNA Although the total RNA was enriched for poly (A) mRNA prior to sequencing and sequences less than 100 bp removed prior to downstream analysis, small sized non-coding RNA had been observed in the pool of unigenes that had neither protein annotation nor potential protein coding sequence. Twenty four unigenes had matched small non-coding RNA and were classified into 6 categories: 5S rRNA (15), 5.8S rRNA (1), RNaseP (3), tRNA (3), U1 (1) and miRNA mir598 (1) (Table S2 in [94]File S2). The unigene predicted to be miRNA mir598 however, did not include the sequence of mature miRNA mir598 and thus has been reclassified as unknown messenger like non-coding RNA. The remaining unigenes that were not classified as any of the non-coding RNA aforementioned will be further filtered and used to mine precursors for novel miRNA genes and described in a separate paper. Repetitive element identification and SSR characterization Repetitive elements Out of 57,891,795 bp of non-ambiguous sequence, 756,405 bp (1.31%) of the transcriptome were identified to contain repetitive elements. Simple repeats were the most abundant repetitive element followed by retroelements and DNA transposons. Specific characterization of each repetitive element is summarized in [95]Table 2. Table 2. The distribution of repetitive elements in M. rosenbergii transcriptome. Number of elements Length occupied (bp) Percentage of sequence (%) Retroelements 728 76,814 0.13 Penelope 3 165 0.00 LINEs 359 32, 845 0.06 LTR elements 369 43, 969 0.08 DNA transposon 124 13,313 0.02 Unclassified 11 554 0.00 Total interspersed repeats 90,681 0.16 Satellites 6 675 0.00 Simple repeats 8,332 31,8339 0.55 [96]Open in a new tab EST-microsatellite (EST-SSR): mining and validation It was observed that 7,159 unigenes (2,373 BLAST annotated protein coding genes; 203 predicted protein coding genes; 3 predicted small ncRNA genes; 4,580 others) contained one or more simple sequence repeats (SSR). The repeat motifs in the overall transcriptome can be categorized as follows: 6 dinucleotide repeats, 20 trinucleotide repeats, 47 tetranucleotide-repeats, 80 pentanucleotide repeats and 28 hexanucleotide repeats. The most prevalent SSR observed in the transcriptome were trinucleotides (39.8%) and dinucleotides (33.4%) while the pentanucleotides (2.60%) were the least common. EST associated microsatellites, also known as EST-SSR or genic microsatellites are most often associated with the coding genes that are conserved across species over large evolutionary time scales. This allows the study of association of these SSR with the function of the genes and the identification of conserved regions in the genomes of different species and genera by using EST-SSR as the anchor marker [97][30]. Genomic SSR has been identified and characterized in M. rosembergii [98][31], [99][32], [100][33], [101][34], [102][35]. Up to this date, however, no EST-SSR has been isolated in this organism. Thus, this study represents the first attempt to isolate EST-SSR in M. rosenbergii. Out of 8,332 SSR loci identified, 1,730 primer sets were designed to amplify SSR loci located in unique unigene sequences. 736 of them were derived from known protein coding genes (Table S3 in [103]File S2). The repeat motifs included in this pool of unigenes were dinucleotides (379 loci), trinucleotides (347 loci), tetranucleotides (7 loci) and hexanucleotides (3 loci). Out of 35 loci selected from 736 loci identified above, PCR validation showed that 23 SSR loci could be amplified successfully, while 12 loci failed to amplify or amplified with larger than expected PCR product size. Failed amplification can be attributed to insertion of an intron in the primer binding region, while larger PCR product size can be caused by insertion of one or more introns in the region close to or within the microsatellite loci. Out of 23 successfully amplified loci, polymorphic screening showed 14 potential polymorphic loci, 4 monomorphic loci and 5 exhibiting null alleles possibly caused by mutation at the primer binding site. Details of the experimental results for successfully amplified loci were shown in [104]Table 1. Differential gene expression profiles of different tissues The overall differential gene expression profiles of the three tissues (G: gill, H: hepatopancreas; M: normal muscle) were shown in [105]Figure 5 and the number of significant differentially regulated genes in respective tissues was indicated in [106]Table 3 (Table S4 in [107]File S2). The differential gene expression analysis between IHHNV infected and non-infected muscle tissues will be discussed in detail in a separate paper. Figure 5. Digital gene expression in two tissues with the following labels: G – Gill, H – Hepatopancreas, M – Healthy Muscle. [108]Figure 5 [109]Open in a new tab Each point represents a unigene. The x and y-axis are the log[10] of the normalized expression level (RPKM) of unigene in the indicated tissue. Red and green points indicate significant change at the absolute value of log[2] (RPKM ratio in two tissues) ≥1 and FDR  = 0.001. Red points indicate up-regulated unigenes and green points indicate down-regulated unigenes in the tissues in which its expression level is represented by the y-axis. Blue points indicate insignificant differentially expressed unigenes. (a) Expression level of unigenes in G versus H, (b) Expression level of unigenes in G versus M and (c) Expression level of unigenes in H versus M. Table 3. The number of significant differentially regulated unigenes. Tissue1(t1) Tissue2(t2) Differentially regulated unigenes up-regulated in t1 up-regulated in t2 Hepatopancreas Gill 50,757 33,157 17,600 Gill Muscle 38,511 33,297 5,214 Hepatopancreas Muscle 53,753 48,362 5,391 [110]Open in a new tab Hepatopancreas had the most number of up-regulated genes followed by gill and lastly muscle. Hepatopancreas up-regulated genes are almost double of up-regulated genes in gill and nine times the number of up-regulated genes in muscle, while gill had approximately six times the number of up-regulated genes in comparison with muscle. This observation is parallel to the status of hepatopancreas as the most active organ in crustaceans and its multi-functional role analogous to vertebrate's liver and insect's fat body. GO functional enrichment analysis The results of significant GO enrichment when comparing differentially expressed genes in three tissues are shown in [111]Table 4. Cellular component GO enrichment analysis revealed significant enriched gene expressed in different cellular compartments in all three sets of differential expression analysis ([112]Table 4). GO enrichment in the biological process category for gill and muscle showed significant enrichment in categories related to development. Table 4. The list of Gene Ontology (GO) terms enriched in three tissues and the Bonferroni corrected p-value associated to each enrichment. Tissue 1 Tissue 2 Enriched GO Categories Category Corrected p-value Gill Hepatopancreas Cellular component myosin II complex 4.22E-02 Gill Muscle Biological process Developmental process 1.62E-06 Anatomical structure development 6.69E-05 Multicellular organismal development 1.00E-04 Multicellular organismal process 1.06E-03 Cellular process 1.70E-03 Cellular component organization 4.81E-03 System development 8.37E-03 Vessicle mediated transport 1.87E-02 Cellular component Cytoplasm 4.10E-03 Actin skeleton 5.74E-03 Cytoplasmic part 2.00E-02 Anchoring junction 4.52E-02 Hepatopancreas Muscle Cellular component Intracellular 1.26E-05 Intracellular organelle 1.00E-04 Organelle 1.20E-04 Intracellular part 2.00E-04 Cell 3.20E-04 Cell part 3.20E-04 Intracellular membrane-bounded organelle 1.09E-03 Membrane-bounded organelle 1.59E-03 [113]Open in a new tab Based on the result, the functional specialization of gill, hepatopancreas and muscle tissues was not captured effectively by the GO annotation system. Since only ∼23% of the protein coding genes in the transcriptome had been annotated using GO system, it is very likely that the shrimp specific protein coding genes are underrepresented. These results were expected since crustacean species have relatively less functional molecular information compared to mammals and insects. However, with the increasing number of high throughput gene expression studies conducted in many crustacean species, the utility of the GO annotation system in functional genomics of non-model animals will definitely improve in future as increasing amounts of crustacean transcriptome data are incorporated into the system. KEGG pathway enrichment analysis Out of 16,403 unigenes mapped to the KEGG pathways, 9,603 (hepatopancreas and gill), 11,385 (hepatopancreas and muscle) and 9,370 (gill and muscle) unigenes were differentially expressed in two respective tissues. Significance tests showed that 40 pathways were highly enriched when comparing hepatopancreas to gills ([114]File S3), 33 pathways when comparing muscle to gill, and 22 pathways when comparing muscle to hepatopancreas (Table S5 in [115]File S2). Reconciling pathway enrichment data with functions With almost ∼50% (16,403 unigenes) of the protein coding genes represented by the KEGG annotation system, the pathway enrichment was able to capture the functional specialization of the prawns' gill and hepatopancreas tissues compared to GO functional enrichment. This observation can be validated based on known functional roles obtained through extensive mechanistic studies covering various biological aspects in prawns as well as other closely related crustaceans. The functional diversities of the gills and hepatopancreas were much higher in comparison to the muscle. This is comparable to the proposed multi-functionalities (metabolism, immune and environmental stress response) of hepatopancreas and gill in crustaceans [116][37], [117][38]. The number of genes up-regulated in the muscles in comparison to the other two tissues was very small so much so that most of the significantly enriched pathways showed down-regulation of majority of the genes in the muscles. Hepatopancreas and gill on the other hand might share similar function, complement or interact with each other to perform metabolic process in the organism, as active regulations of genes by both tissues can be seen in most of the significantly enriched pathways listed in [118]File S3. Thus, this present study focused on the organ functions of gill and hepatopancreas based on the results obtained from differential gene expression and pathway enrichment analysis of the tissues from these two organs. Metabolism: nutrition Knowledge on the nutritional requirements of the omnivorous M. rosenbergii are essential to determine appropriate feed formulation for optimal survival and growth of the organism that lives in the clear water systems where natural food is not available. M. rosenbergii in general requires high carbohydrate and protein for energy and growth, and smaller portions of lipids, vitamins and minerals for proper physiological processes in the organism. Hepatopancreas is the major organ for metabolism in crustaceans. It plays vital role in the synthesis of digestive enzymes, and in secretion, digestion, nutrient absorption, excretion, reserve (lipid and glycogen) storage and mobilization [119][36], [120][37]. These metabolic activities are strongly influenced by physiological demands (molting, reproduction, digestion process, disease) and environmental change (hypoxia) [121][39]. In the context of nutritional metabolism in hepatopancreas and gill, pathway enrichment showed significant regulation of amino acids, carbohydrates, lipid, glycan, ‘vitamins and cofactors', and ‘terpenoids and polyketides’ metabolism in both tissues ([122]File S3). The metabolism profiles presented in this study have addressed some uncertainties in terms of metabolic capability, nutritional benefits and diversifying roles of digestive enzymes in prawns, as raised by various nutritional studies in M. rosenbergii as well as in other crustaceans. Freshwater prawns require the same ten essential amino acids, EAA (Arginine, Histidine, Isoleucine, Leucine, Lysine, Methionine, Phenylalanine, Threonine, Trytophan and Valine) as other crustaceans and fish species for growth. Evaluation index based on A/E indices (A: proportion of individual EAA; E: proportion of total EAA) has been calculated to assess the amino acid requirement in M. rosenbergii. From the ten amino acids, Arginine, Leucine, Lysine and Phenylalanine constitute the highest composition in amino acid profile of M. rosenbergii, Penaeus monodon and Marsupenaeus japonicus [123][40]. In this present study, amino acids Phenylalanine, Tyrosine and Histidine metabolisms were significantly enriched in gill and hepatopancreas. Even though Tyrosine cannot be synthesized endogenously by freshwater prawns, it can be synthesized from Phenylalanine when Phenylalanine is sufficiently supplied [124][41]. It was observed that Phenylalanine conversion to Tyrosine is up-regulated in the gill. Tyrosine is mainly metabolized for the synthesis of acetoacetate in gill; thyroxine and L-DOPA derivatives (e.g. dopamine, noradrenaline) in both hepatopancreas and gill. In gill, Histidine is mainly metabolized into histamine which is an important neuromodulator or neurotransmitter in crustaceans. Prawns utilize carbohydrate more efficiently compared to protein and lipid for energy especially during starvation [125][42]. Similar to other marine shrimps and fish, freshwater prawns utilize complex polysaccharides such as starch and dextrin more effectively compared to mono- and di- saccharides [126][43], [127][44]. In the present study, starch, dextrin, sucrose, maltose and galactose were highly metabolized to monosaccharide/monosaccharide monophosphate for active energy production and glucuronoside/glucuronate metabolites for excretion in gill and hepatopancreas. Active carbohydrate metabolism observed suggests that both tissues, especially hepatopancreas, play an active role in energy supply in M. rosenbergii. Active regulation of glucosamine (an amino sugar and intermediates of glucose and chitin) metabolism and N-acetylglucosamine/chitin interconversion observed in the two tissues could be attributed to (1) the exogenous chitin digestion which has been shown to exhibit nutritional benefit in shrimps [128][45] and/or (2) molt cycle which involves chitin degradation of the old procuticle layer and chitin biosynthesis to form new procuticle layer using the building blocks obtained from the degraded endogenous chitin and/or the diet. The chitinase gene exists in multiple isoforms in marine shrimps such as Fenneropenaeus chinensis, Litopenaeus vannamei and Penaeus monodon. In P. monodon, the chitinase isoforms PmChi-1 and PmChi-3 were shown to express constitutively in the hepatopancreas tissues and thus they are suggested to be involved in chitin exogenous digestion. However, PmChi-1 also exhibits a significantly higher expression during premolt in P. monodon [129][46], [130][47]. In addition, PmChi-2, similar to Pjchi-2, a molt-related chitinase isoform in M. japonicus, is highly expressed in gill especially during intermolt and down-regulated during post-molt. This fluctuating expression pattern during the molt cycle suggests PmChi-2′s direct role in molting [131][47], [132][48]. Thus, the repertoire of differentially regulated chitinase isoforms in gill and hepatopancreas are expected to be involved in chitin digestion for nutritional purpose, molting or other diversified roles of chitinase in M. rosenbergii. Polyunsaturated fatty acids (PUFA) are suggested to facilitate lipid storage and energy metabolism in the hepatopancreas in M. rosenbergii [133][49]. Linoleic acid (18:2 n-6 PUFA, LNA) has been shown to improve fecundity in female prawns and ammonia stress tolerance in the larvae [134][50]. In the present study, LNA was extracted from lecithin, a major constituent contributing phospholipids to the freshwater prawn diet, and metabolised into epoxyoctadecenoic acid by cytochrome p450. The freshwater prawn's metabolic ability to convert LNA to arachidonic acid (20:4 n-6 PUFA, ARA) similar to marine shrimps has been proposed in the nutritional study [135][51]. On the other hand, another study had shown that ARA in freshwater prawn Macrobrachium borelli was not synthesized endogenously [136][52]. The present study showed absence of enzymatic reaction to convert LNA to ARA although ARA (obtained from lecithin) metabolism was present in both tissues. Thus, it is more likely that M. rosenbergii ARA body composition is a product of exogenous ARA rather than through endogenous synthesis from shorter chain PUFA. Carotenoid, the precursor for retinoids in crustaceans, has been established to play roles in pigmentation, as the source of dietary provitamin A, as an antioxidant, as well asbeing involved in sexual maturation, early development and growth. Retinoids, which can be converted to various retinoic acid isomers, is suggested to be involved in embryonic development and cell differentiation [137][53]. In M. rosenbergii, carotenoid supplementation has been documented to result in the reduction of the incidence of Appendage Deformity Syndrome (ADS) physically characterized by poor growth, deformed rostrum and appendages, irregular carapace, cut antennae and low survival [138][54]. All-trans retinoate treated M. rosenbergii embryos and larvae show primordial germ cells increase in the embryo and earlier gonad development in the larvae [139][55]. The present study showed that β-carotene was the main source of various isomers of retinol, retinal and retinoate metabolites in the tissues. Up-regulation of enzymes in producing all-trans-retinol derivatives was observed exclusively in hepatopancreas while enzymes for retinoic acid synthesis were up-regulated in the gill. Enzymatic system for synthesis of all-trans-retinoate derivatives were actively regulated in both tissues. These observations show that freshwater prawns might have the enzymatic mechanism to metabolize β-carotene into retinoid metabolites. Although the specific function of retinoid metabolites which are differentially synthesized in gill and hepatopancreas is not clear, the physiological function in general might involve activation of hormonal nuclear receptors (Retinoid X receptors, Retinoic Acid Receptor) [140][53]. Steroid hormone biosynthesis and “progesterone-mediated” oocyte maturation Vertebrate-like steroid hormone synthesis enzymes have been detected in the ovary and hepatopancreas of Marsupenaeus japonicus [141][56] as well as in M. rosenbergii [142][57] Since the hepatopancreas in crustaceans is the equivalent of the liver in vertebrates, it was suggested that this tissue might play a role in catabolism of steroid hormones similar to the vertebrate liver [143][56]. Using the vertebrate steroid hormone biosynthesis pathway as the reference, the present study showed the possible presence of an enzymatic system involved in steroid hormone biosynthesis for progesterone (3β-hydroxysteroid hydrogenase, 3β-HSD) and androgen (17β-HSD) both of which were mainly up-regulated in hepatopancreas tissue. However, there was no possible presence of endogenous 17α-hydroxylase (conversion of progesterone to 17α-hydroxyprogesterone), C[17]–C[20] lyase (conversion of 17α-hydroxyprogesterone to androgens) and aromatase activity (conversion of androgen to estrogen) in both gill and hepatopancreas tissues. These results were largely in agreement with the study conducted by Swevers et al., (1999) [144][58] where substantial enzyme activities of HSD but not aromatase were observed in other crustaceans [145][58]. Summavielle et al., (2003) [146][56] in their study suggested that progesterone conversion in the shrimp hepatopancreas might be activated by a different bioconversion pathway [147][56]. Effects of the vertebrate-like steroid hormones on reproductive process such as oocyte maturation in crustaceans still remain unresolved. In M. rosenbergii, hepatopancreas plays a vital role in reproduction, as it has been reported to be the principal extra-ovarian synthesis site for vitellogenin (Vg), which is the yolk precursor of an important nutrient for developing embryos – vitellin [148][59], [149][60], [150][61]. The extraovarian Vg, which has been cleaved into two subunits in the hepatopancreas, is transported through the haemolymph to the developing oocyte before it is sequestered and processed into vitellin by the ovary to induce oocyte maturation [151][62]. In M. rosenbergii, two previous studies have shown fluctuations of 17β-estradiol and progesterone during reproductive molt cycle but 17β-estradiol remained basal and progesterone was absent during non-reproductive molt cycles respectively [152][63], [153][64]. In the present study, KEGG's progesterone-mediated oocyte maturation pathways were significantly enriched in the differential gene expression analysis between hepatopancreas and two other tissues. In the pathway enrichment, most genes downstream of the translation of maternal mRNAs were up-regulated in the hepatopancreas, indicating a major role of this tissue in oocyte maturation. In vertebrates, the regulation of oocyte maturation is mediated by innumerable intracellular signals. cAMP maintains meiotic arrest while translation of the MOS gene, which is regulated by Hsp90, activates the MAPK cascade, cyclin-dependent kinase 1 and the positive feedback loop to trigger maturation [154][65]. Several factors, namely cytoplasmic polyadenylation element binding protein (CPEB), Hsp90 and suppression of cAMP/PKA signaling were shown to initiate maturation. High Hsp90 relative gene expression in hepatopancreas was consistent with the observation achieved by Wu and Chu (2008) [155][66] in which the Hsp90 expression in hepatopanceas was found to be higher than in gill especially during ovarian maturation. In addition, estrogen level and heat shock protein 90 (Hsp90) gene expressions in the ovary and hepatopancreas strongly correlates during ovarian development in Metapenaeus ensis [156][66]. These observations suggest a similarity, to a certain extent, in the mechanism of vitellogenesis regulation between the vertebrates and crustaceans. However, since the presence of steroid hormone receptors in crustaceans is still in dispute, the vertebrate-like steroid hormone mediated oocyte maturation in M. rosenbergii remains as an open question. Infectious disease from Vibrio and innate immune system Gills in crustaceans are vital organs that play important roles in respiration, regulation of osmotic and ionic balance, detoxification, excretion and immune defense against microbial pathogens which is demonstrated by a marked increase of haemocytes in the gill during infection [157][67], [158][68]. A previous study also showed that the gene expression in haemocytes and gills are quite similar, suggesting that the circulating haemocytes are infiltrating the shrimp gill tissue [159][69]. Aside from hepatopancreas and lymphoid organs, gills were also shown to be the main accumulation site for intact and/or degraded products of bacteria such as Vibrio. It was proposed that gills trap the bacterial-haemocyte nodules when they reach a certain size before phagocytosis occurs [160][68]. Larger accumulation of Vibrio in gills was observed in gills under the oxygen stress (hypoxia) possibly because hypoxia can cause the suppression of the immune defense mechanism of the haemocytes [161][70]. In contrast, a study conducted in crab showed that immune response in gill would decrease the efficiency of gill in gas exchange that may lead to suffocation and death in the organism [162][71]. Pathways enriched for the differential study involving gills were mainly categorized in infectious disease, signaling transduction, as well as in the immune system (Table S5 in [163]File S2). This can be attributed to the gills' direct contact with biotic and abiotic factors in the external environment that requires constant detection (or recognition) and signaling transduction to mediate the response of the epithelial cells lining the gills against these external factors. Biotic factors or opportunistic pathogens and parasites (e.g. Gram negative bacteria within the genera Vibrio, filamentous bacteria, protozoan and fungi) are commonly found in the freshwater culture system in the hatcheries. With the high exposure to these opportunistic pathogens, Crustaceans in general respond by releasing pattern recognition proteins, antimicrobial peptides, clotting proteins, and prophenoloxidase as part of its immune defense against pathogens and parasites. Selected genes such as antimicrobial peptide, propenoloxidase nad recognition proteins were analyzed functionally using real-time PCR expression. (Arockriaj et al., 2011 and 2012). Xenobiotics metabolism, oxidative stress and ‘replication and repair Gills and hepatopancreas have been the target organs for the study of environmental stress, that has mainly focused on heavy metals and xenobiotics toxicity in aquatic crustaceans, because gill is in direct contact with and accumulates pollutants in the water in the mitochondria and because hepatopancreas is the main organ for accumulation and detoxification of these compound in the lysosomes [164][71], [165][72], [166][73], [167][74], [168][75]. The R cells in hepatopancreas in particular have the ability to perform biotransformation using enzymes such as cytochrome p450, to sequester and detoxify xenobiotics [169][76]. Heavy metal and xenobiotic toxicity in aquatic organisms is characterized by mitochondrial dysfunction, oxidative stress, and DNA repair impairment which are also the hallmark of neurodegenerative diseases such as amyotrophic lateral sclerosis, Parkinson and Huntington disease in the vertebrates [170][77], [171][78]. Oxidative stress induced by either high exposure to reactive oxygen species (ROS) or decreased antioxidant defense by antioxidant enzymes such as metallothioneins, sodium dismutase and glutathione-S-transferase [172][79] can cause lipid peroxidation, cell injuries or death, disruption of calcium homeostasis, and genotoxicity (damage to the genetic materials) [173][80], [174][81], [175][82]. Since genotoxicants can cause irreversible effects, namely transcriptional errors, mutagenesis and cell death, the exposed organisms need efficient mechanisms to repair the DNA molecules to reverse the damage [176][83], [177][84]. The responses of gill and hepatopancreas tissues to environmental toxicity were very well represented in the biological pathway enrichment in our study where pathways involving ‘xenobiotics biodegradation and metabolism’, ‘neurodegenerative diseases’, ‘transport and catabolism’, as well as ‘replication and repair’ were significantly enriched in the differential gene study involving these two tissues. High gene expression of hepatopancreas in the context of ‘replication and repair’ is consistent with the experimental observation of the effects of cadmium to the repair systems in gills and hepatopancreas in marine crabs (Charybdis japonica), where significant drops in the ratio of intact double stranded DNA to total DNA in gill were observed in comparison to insignificant changes in the hepatopancreas [178][82]. Both the experimental data and the molecular study here showed that hepatopancreas had higher capability to perform DNA repair to cope with the damage caused by environmental genotoxicants. Even though it was not designed to specifically address toxicological aspects of M. rosenbergii, this study has demonstrated the possible utility of large scale gene expression studies using RNA-seq to address research questions by taking advantage of the shared features between vertebrates and crustaceans. Although several transcriptome profiling studies using DNA microarray technology for ecotoxicogenomics have been attempted in Daphia magna [179][85], [180][86], the novel discoveries were limited to the availability of the sequence data of the organism of interest. Therefore the results, especially the molecular data, from this study provide the opportunity to pursue further research in M. rosenbergii ecotoxicogenomics [181][87]–[182][92]. Conclusions The present study represents a step forward in identifying a number of possible conserved genes that are likely to be involved in various important biological activities by providing an extensive catalogue of 102,323 expressed genes from freshwater shrimp. Analyzing the prawn transcriptome data through homology search and modeling based methods GO and KEGG pathway has led to the recapitulation of various experimental studies done on prawns or crustaceans and revelation of possible shared function with model organisms that have been well studied. With the rapid advance of experimental annotation of functional genomic data for organisms with available genome data, the utility of this method is still expanding. However, careful selection of the model based representations of the transcriptome utilizing current knowledge from model vertebrates is needed. Thus validation of any novel discovery experimentally need to be done to avoid misinterpretation of the functional genomics data. This is because of the extensive divergence of some genes and biological pathways between the model vertebrates and invertebrates since their most recent common ancestor. The functional genomics resource generated from this study provides the basis for generating hypotheses to guide future molecular research in freshwater shrimp as well as for the assembly and annotation of the Macrobrachium rosenbergii genome. Supporting Information File S1 Table S1, NCBI Nr, Swissprot, KEGG, COG and GO annotation. The annotation from the BLAST results of unigenes with significant hits (E-value ≤ 10^−5) to known protein coding genes in Nr, Swissprot, KEGG and COG databases are provided. Unigenes with positive BLAST results were further classified according to its associated GO terms. Table S2, Raw and normalized expression value of unigenes in three libraries (Gill, Hepatopancreas and Healthy Muscle). Raw expression value of each unigene represented by number of clean reads from each library successfully mapped to the corresponding unigene. Normalized expression value was represented by the RPKM value. The length of each unigene was also included. (XLS) [183]Click here for additional data file.^ (19MB, xls) File S2 Table S1, Novel protein coding genes predicted by ESTScan. The coding sequence start site, end site, length and strands are provided. Table S2, Non-coding RNA annotation. Non-coding RNA annotation of the unigenes is performed using INFERNAL cmsearch. The matched start and end site, E-value, score, GC content, Rfam accession number and ID from the results are provided. Table S3, Primer sets designed to flank EST-SSR that is associated with annotated (known) protein coding genes. Table S4, Significant differentially expressed genes between two libraries. Log[2] of the ratio of normalized expression value of unigene, up or down regulation, p-value and FDR value are provided. Table S5, KEGG pathway enrichment results. Number of differentially regulated unigenes and total number of unigenes in the transcriptome mapped to each pathway, hypergeometic p-value, FDR and the pathway ID are provided. (XLS) [184]Click here for additional data file.^ (15.6MB, xls) File S3 Results of significantly enriched KEGG pathways for differential expression study comparing gill and hepatopancreas. (DOC) [185]Click here for additional data file.^ (65.5KB, doc) Acknowledgments