Abstract Recently, earthworms have become a useful model for research into the immune system, and it is expected that results obtained using this model will shed light on the sophisticated vertebrate immune system and the evolution of the immune response, and additionally help identify new biomolecules with therapeutic applications. However, for earthworms to be used as a genetic model of the invertebrate immune system, basic molecular and genetic resources, such as an expressed sequence tag (EST) database, must be developed for this organism. Next-generation sequencing technologies have generated EST libraries by RNA-seq in many model species. In this study, we used Illumina RNA-sequence technology to perform a comprehensive transcriptome analysis using an RNA sample pooled from sterile-cultured Eisenia andrei. All clean reads were assembled de novo into 41,423 unigenes using the Trinity program. Using this transcriptome data, we performed BLAST analysis against the GenBank non-redundant (NR) database and obtained a total of 12,285 significant BLAST hits. Furthermore, gene ontology (GO) analysis assigned 78 unigenes to 24 immune class GO terms. In addition, we detected a unigene with high similarity to beta-1,3-glucuronyltransferase 1 (GlcAT-P), which mediates a glucuronyl transfer reaction during the biosynthesis of the carbohydrate epitope HNK-1 (human natural killer-1, also known as CD57), a marker of NK cells. The identified transcripts will be used to facilitate future research into the immune system using E. andrei. Introduction Earthworms are well known as a key organism in the study of environmental toxicology [[37]1–[38]3]. Because they are easy to maintain in a laboratory setting [[39]4,[40]5], a large number of studies have been conducted using earthworms. Two earthworm species, Eisenia fetida and Eisenia andrei, have been used in European pesticide marketing authorization experiments since the early 1980s [[41]6]. In these studies, several bioactive proteins were identified in the coelomic fluid of earthworms [[42]7–[43]9]. These proteins exhibit a variety of antibacterial, hemolytic, cytotoxic, hemagglutinating, and proteolytic activities [[44]10]. Some reports have suggested that coelomocytes in the coelomic fluid play a role in immune defense [[45]11–[46]13]. For example, it was demonstrated that some coelomocytes possess cytotoxic activity similar to natural killer (NK) cells by co-culture with K592 cells, a human NK-sensitive cell line [[47]14]. In that study, when the coelomocytes were co-cultured with K592 cells, they extended numerous small pseudopodia that bound to and killed K592 cells. In addition, cross-reactivity of coelomocytes with monoclonal antibodies against several human CD markers of immune-related cells was also reported [[48]15]. Thus, earthworms are ideal models for conducting immune system research. Knowledge of the apparently less complex invertebrate immune system can contribute to our understanding of the more sophisticated vertebrate immune system, the evolution of the immune response and potentially lead to the identification of new biomolecules with therapeutic use [[49]10,[50]16]. However, as earthworms have only recently gained attention as a genetic model for immune system research, many basic molecular and genetic resources, such as expressed sequence tag (EST) databases, have yet to be established. EST information is an important resource for genetic and genomic studies, such as gene identification, verification of gene prediction, and gene sequence determination [[51]17]. Traditional EST libraries are usually generated through the construction of an EST library and sequencing, which is time-consuming and expensive. Recently, with the development of the next-generation sequencing technologies, EST libraries have been successfully constructed from total RNAs in many model species [[52]18–[53]20]. Gui [[54]21] analyzed the whole transcriptome of the Indo-Pacific humpback dolphin leukocyte, which resulted in the identification of putative genes involved in aquatic adaptation and the immune response. In addition, Xu [[55]22] performed de novo assembly of the transcriptome of Setaria viridis, an emerging model species for genetic studies of C4 photosynthesis, and identified 60,751 transcripts. Their study is expected to provide the basis for future genetic studies of C4 photosynthesis. Thus, to support immune system research in earthworms, we used Illumina RNA-seq technology to perform a comprehensive transcriptome analysis using RNA samples pooled from sterile-cultured E. andrei. This report presents a description of the identified genes expressed in E. andrei and their functional annotation. All clean reads were assembled de novo using the Trinity program into 41,423 unigenes. Using this transcriptome data, we performed BLAST analysis against the GenBank non-redundant (NR) database and obtained a total of 12,285 significant BLAST hits. Furthermore, gene ontology (GO) analysis assigned 78 unigenes to 24 immune class GO terms. These identified transcripts can be used to facilitate future immune system research using E. andrei. Materials and Methods E. andrei materials, RNA isolation, and Illumina sequencing The egg capsules (cocoons) of E. andrei were purchased from KIRYU Int. Co. (Miyazaki, Japan). After rinsing with PBS containing 50 μg/ml ampicillin, the cocoons were seeded onto 100-mm disposable petri dishes (Falcon), each coated with 100 ml of 1.0% agar supplemented with 0.5 mM NaCl, 0.05 mM KCl, 0.4 mM CaCl[2], 0.2 mM NaHCO[3] and 50 μg/ml ampicillin. The cultures were kept at 20°C. After the eggs hatched out, the worms were fed powdered skimmed milk every 2 weeks, and grew to a length of 3–5 cm in approximately 3 month-cultures. The adult worms were immediately frozen in liquid nitrogen and stored at −80°C until use. The frozen whole worms were ground in liquid nitrogen and total RNA was prepared using the Fastpure RNA Kit (Takara Bio Inc., Otsu, Japan) according to the manufacturer’s instructions. RNA integrity was assessed by agarose gel electrophoresis and the concentration was quantified using a Nanodrop ND-1000 spectrophotometer (Thermo Fisher Scientific, Wilmington, DE, USA). The qualified RNA samples were analyzed by the Illumina Sequencing Services of Hokkaido System Science Co., Ltd. The cDNA library for sequencing was constructed as described previously [[56]21]. Sequencing was performed on an Illumina HiSeq 2000 [Q30 = 97.55%]. Data pre-processing, filtering, and de novo assembly For transcriptome assembly, raw reads were filtered and adapter sequences, non-coding RNA (e.g., rRNA), low-quality read sequences with ambiguous bases ‘N’, and raw reads with average lengths of less than 20 bases were removed. The Trinity program [[57]23] was utilized for de novo transcriptome assembly, combining read sequences of a certain length of overlap to form longer fragments without N gaps, called contigs. These contigs were further processed for read alignment and abundance estimation with Bowtie [[58]24] and RSEM [[59]25], and these sequences were defined as unigenes. Calculation of unigene expression was performed using the Fragments Per kilo base of exon per Million mapped fragments (FPKM) method, which was able to exclude sequencing discrepancies in the calculation of gene expression and the influence of different gene lengths. The number of unigenes was 41,423 with a threshold of more than FPKM = 2. All raw read sequences are available at the DDBJ Sequence Read Archive [[60]26] with the accession number: DRA002587. Functional annotation of unigenes and simple sequence repeats (SSR) identification A homology search against the NCBI non-redundant (NR) protein database ([61]http://www.ncbi.nlm.nih.gov; formatted on April 7, 2014) based on the BLASTx program [[62]27] was performed for all unigenes using a cutoff E-value < 1E^−6, and the best aligning results were selected to annotate the unigenes. To further annotate the unigenes, the Blast2GO program v. 2.7.1 [[63]28] was used to obtain gene ontology (GO) [[64]29] terms and the Kyoto Encyclopedia of Genes and Genomes (KEGG) [[65]30] information according to the NR results. The BGI WEGO program [[66]31] was used to perform GO functional classification of all unigenes to visualize the distribution of gene functions. The identification of immune-related genes was performed as described by Pereiro et al. [[67]32]. Furthermore, to detect more genes belonging to the relevant immune pathways in the transcriptome sequences, KEGG reference pathways identified in a previous report were utilized [[68]21]. Microsatellite identification tool (MISA) ([69]http://pgrc.ipk-gatersleben.de/misa/) with default parameters was used to detect microsatellites in the unigenes. Prediction of the tertiary structure of E. andrei GlcAT-P The tertiary structure of E. andrei GlcAT-P was predicted by homology modeling with SWISS-MODEL server [[70]33]. In the homology modeling, the GlcAT-P structure from human (Protein Data Bank (PDB) entry 1v83) [[71]34] was used as a template. Structural Figs. were drawn using PyMOL [[72]35]. Results and Discussion Illumina sequencing and sequence assembly In total, Illumina sequencing yielded 47,555,164 reads comprising 4,803 M bases from the mRNA pool isolated from sterile-cultured E. andrei. After the removal of adaptor sequences, ambiguous reads and low-quality reads, we obtained 47,070,010 clean reads. The Q30 percentage (sequencing error rate, 0.01%) was 97.55%. As shown in [73]Table 1, 47,070,010 clean reads were assembled de novo using the Trinity program. The clean reads were further assembled into 151,929 contigs with an average length of 968 bp and an N50 of 1,855 bp. The length and GC content distribution for all contigs are shown in [74]Fig. 1A and 1B. In total, 72,539 contigs were longer than 500 bp and the GC of 17,320 contigs was greater than 50%. From the contigs, 41,423 unigenes were obtained with an average length of 1,633 bp and an N50 of 1,567 bp ([75]Table 1). The length and GC distributions of all assembled unigenes are shown in [76]Fig. 1C and 1D. In total, 7,219 unigenes were less than 500 bp and 5,488 unigenes were longer than 3,000 bp. The GC content of 4,722 unigenes was greater than 50%. Table 1. Summary of the sequence assembly after Illumina sequencing. Total Number Total length (bp) Average length (bp) Average GC (%) N50 (bp) Q30 (%) Raw sequencing reads 47,555,164 4,803,000,000 97.55 Total clean reads 47,070,010 Total contigs 151,929 147,023,550 968 42.7 1,855 Total unigenes 41423 76,015,064 1,633 42.8 1,567 [77]Open in a new tab Fig 1. Overview of E. andrei transcriptome assembly. [78]Fig 1 [79]Open in a new tab (A and B) The lengths and GC distributions of the contigs obtained from de novo assembly of high-quality clean reads. (C and D) The lengths and GC distributions of the unigenes produced from further contig assembly. Functional annotation To validate and annotate the assembled unigenes, all assembled unigenes were searched against the GenBank non-redundant (NR) database using the BLASTx program (E-value < 1E^-6). The results showed that 12,285 unigene sequences had BLAST hits to annotated proteins in NR. Analysis of the distributions of E-values indicated that 82.7% of the aligned sequences showed significant homology to the entries in the NR database (E-value < 1E^-6) ([80]Fig. 2A). Further analysis of the similarity distributions indicated that 73.3% of matched sequences had alignment identities greater than 80% ([81]Fig. 2B). A large number of the hits matched the sequences of Capitella teleta (24.8%) and Helobdella robusta (18.1%); other hits were identified within the reference protein databases of Crassostrea gigas (7.3%), Lottia gigantea (5.7%), Aplysia californica (5.4%), Saccoglossus kowalevskii (4.8%), and Strongylocentrotus purpuratus (4.7%) ([82]Fig. 2C). There were also many unigenes with no BLAST hits; these might represent additional genes that were not represented in the annotated protein databases or sequences that were too short to produce hits. Fig 2. Characterization of the assembled unigenes based on an NR protein database search. [83]Fig 2 [84]Open in a new tab (A) E-value distribution of BLAST hits for the assembled unigenes with a cutoff of E-value < 1E^-6. (B) Similarity distribution of the top BLAST hits for the assembled unigenes with a cutoff of E-value < 1E^-6. (C) Species distribution of the top BLAST hits for the assembled unigenes with a cutoff of E-value < 1E^-6. GO (gene ontology) is an international system for the classification of standardized gene functions and is used to annotate and analyze gene functions and gene products in any organism. GO contains three main independent ontologies: biological process, molecular function, and cellular component. To predict their possible functions, we used the Blast2GO program to analyze the GO annotations of the assembled unigenes, and then used WEGO software to perform GO functional classifications. Based on NR annotation, 12,285 unigenes were assigned to 99 GO terms belonging to the three main GO ontologies ([85]Fig. 3). Further analysis of the 99 GO terms showed that the dominant terms were “cell”, “cell part”, “binding”, “catalytic”, “cellular process”, and “metabolic process”. Within the biological process group, the great majority of unigenes were related to the terms “cellular process” and “metabolic process”. Within the cellular component, most unigenes were assigned to “cells” and “cell parts”, followed by “binding” and “catalytic activity”. Fig 3. GO assignments for assembled unigenes. [86]Fig 3 [87]Open in a new tab The results are summarized in three main categories: biological process, cellular component, and molecular function. In total, 12,285 unigenes were categorized by GO term. Classified gene objects are depicted as absolute numbers of the total number of gene objects with GO assignments. KEGG, a pathway-based categorization of orthologous genes, provides useful information for predicting the functional profiles of genes. Therefore, to better understand the function of the E. andrei transcriptome, the unigenes were assigned against the KEGG pathway. In total, 1,322 unigenes were grouped into 100 KEGG pathways ([88]S1 Table). The top three ranking pathways were purine metabolism (155 unigenes), thiamine metabolism (53 unigenes), and glycolysis/gluconeogenesis (49 unigenes) ([89]Fig. 4). Interestingly, 47 unigenes mapped to the T cell receptor signaling pathway, which is involved in immune response ([90]Fig. 4). Fig 4. Pathway enrichment analysis of assembled unigenes. [91]Fig 4 [92]Open in a new tab In total, 1,322 unigenes were grouped into 100 KEGG pathways. The top 30 pathways containing unigenes are shown. SSR markers identification Microsatellites, also known as SSRs, are classes of repetitive DNA sequences ubiquitous in eukaryotic genomes [[93]36]. SSRs are ideal markers for use in paternity determination, population genetics studies, genetic diversity assessment and genetic map development [[94]37]. The results of microsatellite analysis are shown in [95]Table 2 and [96]S2 Table. From the unigene sequences, 14,717 SSRs (10,764 with simple repeats and 722 with compound formation) were identified in 12,539 unique sequences, of which, 1,775 sequences contained more than one SSR. All SSRs were further classified by the number of repeat units. The results showed that the number of most potential SSRs were composed of three repeat units (54.1%), followed by two (31.2%), four (7.7%), and four (6.8%) repeat units. The SSRs of E. andrei could provide potential genetic markers for studies of population genetics, comparative genomics, as well as gene-based association studies aimed at understanding the genetic control of adaptive traits. Table 2. Number of SSRs identified in the transcriptome of E. andrei. Total number of identified SSRs: 14717 Number of SSR containing sequence: 12539 Number of sequences containing more than one SSR: 1775 [97]Open in a new tab Identification of sequences related to the immune response A keyword list and GO immune-related terms were used to search for genes putatively involved in the immune system of E. andrei ([98]S3 Table). The results of this search identified a number of immune-related genes that were involved in well-recognized immune pathways, such as toll-like receptor (TLR) and interferon-gamma-mediated signaling. The toll receptor, as the signal transducer of the Toll pathway, plays a crucial role in the innate immune response. In this study, we identified a gene encoding TLR6 in the transcriptome datasets, as well as other genes belonging to the TLR signaling pathway, such as myeloid differentiation primary response protein (MyD) 88 and mitogen-activated protein kinases (MAPKs) [[99]Fig. 5]. Members of the JAK (Janus kinase) family are intracellular, non-receptor tyrosine kinases that transduce cytokine-mediated signals via the JAK-STAT pathway [[100]38]. Many studies have shown that signal transducers and activators of transcription (STAT) proteins are involved in the development and function of the immune system and play a role in maintaining immune tolerance and tumor surveillance [[101]39]. In our study, a unigene was identified with high similarity to mammalian JAK 2 but not to other members of the STAT family (STAT1, STAT2, STAT3, STAT4, STAT5A/5B, and STAT6). Identification of additional JAK-STAT pathway-related genes will be useful for learning more about the complexities of immune responses in E. andrei. In addition, signaling and interaction molecules, such as cytokines and cytokine receptors, were also identified in the transcriptome, as were proteases, protease inhibitors, and stress proteins such as heat shock proteins. Fig 5. TLR signaling pathway. [102]Fig 5 [103]Open in a new tab The red boxes indicate BLAST hits against the NR protein database. Identification of markers of NK cell NK cells are a type of cytotoxic lymphocyte that plays an important role in the innate immune system. The development of NK cells was a critical event in the evolution of the immune system [[104]40]. Despite their importance, the evolutionary origin of NK cells has remained unknown [[105]41]. Previous reports have suggested that earthworms have NK-like cells [[106]12–[107]14]. Therefore, these animals may occupy a key position in the evolution of the NK cell-dependent immune system. Thus, we next searched our unigene database for markers of NK cells, such as CD16 and CD56, and NK-related genes, using the BLASTx program (E-value < 1E^-6). As shown in [108]Fig. 6A, we detected a unigene with high similarity to beta-1,3-glucuronyltransferase 1 (GlcAT-P), which mediates a glucuronyl transfer reaction during biosynthesis of the carbohydrate epitope HNK-1 (human natural killer-1, also known as CD57, an NK cell marker). The similarity was much higher than that of other unigenes to human NK cell markers. Thus, we further investigated this unigene. We found that the unigene has 44% amino acid sequence identity and 55% amino acid sequence similarity with human GlcAT-P. The structure of human GlcAT-P and the structural basis for acceptor substrate recognition of human GlcAT-P in the biosynthesis of HNK-1 have been reported [[109]30]. However, because newly synthesized polypeptide chains must undergo folding into tertiary conformations and post-translational modifications before becoming a functional protein, a high level of identity in primary amino acid sequence between species does not automatically confer identical tertiary structure. Thus, as a preliminary analysis of the tertiary structure, we estimated the structure of earthworm GlcAT-P using homology modeling with the human GlcAT-P, and obtained the tertiary structure of amino acid residues 101–355 of E. andrei GlcAT-P, including the uridine diphosphate (UDP) binding region which plays an important role in enzyme activity of GlcAT-P. As shown in [110]Fig. 6B, [111]S1 and [112]S2 Movies, a high level of conservation of the UDP binding region was predicted between these two species in the tertiary structures of GlcAT-P. This suggests that the carbohydrate epitope HNK-1 is synthesized in E. andrei and that an anti-CD57 antibody could be used to identify NK-like cells in E. andrei. Fig 6. Identification of E. andrei GlcAT-P. [113]Fig 6 [114]Open in a new tab (A) Comparison of the amino acid sequences of human and E. andrei GlcAT-P proteins. Amino acid sequences are shown as a single-letter code. Amino acids that are conserved between human and A. andrei are shaded. (B) The predicted tertiary structure of E. andrei GlcAT-P. The tertiary structure of E. andrei GlcAT-P was modeled using the structure of human GlcAT-P as a template. In each GlcAT-P structure, UDP binding regions and amino acid residues in the active site are shown as a ball and stick models, respectively. The position of the UDP binding region in E. andrei GlcAT-P is based on the human GlcAT-P structure. Epigeic species of earthworms, including E. andrei, live in top layer of the soil, which is rich in decaying organic matter and characterized by a wide variety of microbiota. These environmental contaminants and microbiota, including viruses, bacteria, and protozoans, could interfere with the abundance of epigeic earthworms by inducing a high mortality rate, decreasing reproductive success or synergistically increasing the virulence of other diseases. Therefore, to be successful as an epigeic earthworm, E. andrei needs to be resistant to various microorganisms present in the top layer of the soil. However, the mechanisms responsible for immune responses to these environmental contaminants and microbiota in E. andrei are not fully understood. Innate immunity is the first line of host defense against pathogens. Many immune cells, such as monocytes, macrophages, leukocytes (PMN), and NK cells, are involved in the detection and removal of microbial pathogens [[115]42,[116]43]. Compared to other model organisms such as mice and fruit flies, few immune-related genes have been identified in E. andrei. Our results revealed a large number of innate immune-related genes, covering almost all known innate immune pathways, including pathogen recognition, modulation, and signaling. These findings will facilitate our comprehensive understanding of the mechanisms involved in the immune response to environmental contaminants and microbiota in E. andrei. In conclusion, we characterized the transcriptome of E. andrei and identified many SSRs and abundant specific gene families involved in the immune response. To our knowledge, this is the first report to analyze the whole transcriptome of this epigeic species of earthworm. The dataset generated in this study will provide a substantial transcriptome-level resource for the study of earthworm biology and help further our understanding of the molecular mechanisms of various pathways, including those of the immune response. Supporting Information S1 Movie. Tertiary structure of human GlcAT-P. The human GlcAT-P structure was derived from Protein Data Bank (PDB) entry 1v83. (MPG) [117]Click here for additional data file.^ (3.6MB, mpg) S2 Movie. Tertiary structure of E. andrei GlcAT-P. The tertiary structure of E. andrei GlcAT-P was predicted by homology modeling with SWISS-MODEL server. In the homology modeling, the GlcAT-P structure from human was used as a template. (MPG) [118]Click here for additional data file.^ (3.7MB, mpg) S1 Table. Pathway enrichment analysis of assembled unigenes of E. andrei. The unigenes were assigned against the KEGG pathway. (XLSX) [119]Click here for additional data file.^ (51.9KB, xlsx) S2 Table. List of SSRs detected in the unigenes of E. andrei. Microsatellite identification tool (MISA) with default parameters was used to detect SSRs in the unigenes. (XLSX) [120]Click here for additional data file.^ (1.7MB, xlsx) S3 Table. GO terms used for identifying immune-related genes. (XLSX) [121]Click here for additional data file.^ (963KB, xlsx) Acknowledgments