Abstract

   Recently, earthworms have become a useful model for research into the
   immune system, and it is expected that results obtained using this
   model will shed light on the sophisticated vertebrate immune system and
   the evolution of the immune response, and additionally help identify
   new biomolecules with therapeutic applications. However, for earthworms
   to be used as a genetic model of the invertebrate immune system, basic
   molecular and genetic resources, such as an expressed sequence tag
   (EST) database, must be developed for this organism. Next-generation
   sequencing technologies have generated EST libraries by RNA-seq in many
   model species. In this study, we used Illumina RNA-sequence technology
   to perform a comprehensive transcriptome analysis using an RNA sample
   pooled from sterile-cultured Eisenia andrei. All clean reads were
   assembled de novo into 41,423 unigenes using the Trinity program. Using
   this transcriptome data, we performed BLAST analysis against the
   GenBank non-redundant (NR) database and obtained a total of 12,285
   significant BLAST hits. Furthermore, gene ontology (GO) analysis
   assigned 78 unigenes to 24 immune class GO terms. In addition, we
   detected a unigene with high similarity to
   beta-1,3-glucuronyltransferase 1 (GlcAT-P), which mediates a glucuronyl
   transfer reaction during the biosynthesis of the carbohydrate epitope
   HNK-1 (human natural killer-1, also known as CD57), a marker of NK
   cells. The identified transcripts will be used to facilitate future
   research into the immune system using E. andrei.

Introduction

   Earthworms are well known as a key organism in the study of
   environmental toxicology [[37]1–[38]3]. Because they are easy to
   maintain in a laboratory setting [[39]4,[40]5], a large number of
   studies have been conducted using earthworms. Two earthworm species,
   Eisenia fetida and Eisenia andrei, have been used in European pesticide
   marketing authorization experiments since the early 1980s [[41]6]. In
   these studies, several bioactive proteins were identified in the
   coelomic fluid of earthworms [[42]7–[43]9]. These proteins exhibit a
   variety of antibacterial, hemolytic, cytotoxic, hemagglutinating, and
   proteolytic activities [[44]10]. Some reports have suggested that
   coelomocytes in the coelomic fluid play a role in immune defense
   [[45]11–[46]13]. For example, it was demonstrated that some
   coelomocytes possess cytotoxic activity similar to natural killer (NK)
   cells by co-culture with K592 cells, a human NK-sensitive cell line
   [[47]14]. In that study, when the coelomocytes were co-cultured with
   K592 cells, they extended numerous small pseudopodia that bound to and
   killed K592 cells. In addition, cross-reactivity of coelomocytes with
   monoclonal antibodies against several human CD markers of
   immune-related cells was also reported [[48]15]. Thus, earthworms are
   ideal models for conducting immune system research. Knowledge of the
   apparently less complex invertebrate immune system can contribute to
   our understanding of the more sophisticated vertebrate immune system,
   the evolution of the immune response and potentially lead to the
   identification of new biomolecules with therapeutic use
   [[49]10,[50]16]. However, as earthworms have only recently gained
   attention as a genetic model for immune system research, many basic
   molecular and genetic resources, such as expressed sequence tag (EST)
   databases, have yet to be established. EST information is an important
   resource for genetic and genomic studies, such as gene identification,
   verification of gene prediction, and gene sequence determination
   [[51]17]. Traditional EST libraries are usually generated through the
   construction of an EST library and sequencing, which is time-consuming
   and expensive. Recently, with the development of the next-generation
   sequencing technologies, EST libraries have been successfully
   constructed from total RNAs in many model species [[52]18–[53]20]. Gui
   [[54]21] analyzed the whole transcriptome of the Indo-Pacific humpback
   dolphin leukocyte, which resulted in the identification of putative
   genes involved in aquatic adaptation and the immune response. In
   addition, Xu [[55]22] performed de novo assembly of the transcriptome
   of Setaria viridis, an emerging model species for genetic studies of C4
   photosynthesis, and identified 60,751 transcripts. Their study is
   expected to provide the basis for future genetic studies of C4
   photosynthesis. Thus, to support immune system research in earthworms,
   we used Illumina RNA-seq technology to perform a comprehensive
   transcriptome analysis using RNA samples pooled from sterile-cultured
   E. andrei. This report presents a description of the identified genes
   expressed in E. andrei and their functional annotation. All clean reads
   were assembled de novo using the Trinity program into 41,423 unigenes.
   Using this transcriptome data, we performed BLAST analysis against the
   GenBank non-redundant (NR) database and obtained a total of 12,285
   significant BLAST hits. Furthermore, gene ontology (GO) analysis
   assigned 78 unigenes to 24 immune class GO terms. These identified
   transcripts can be used to facilitate future immune system research
   using E. andrei.

Materials and Methods

E. andrei materials, RNA isolation, and Illumina sequencing

   The egg capsules (cocoons) of E. andrei were purchased from KIRYU Int.
   Co. (Miyazaki, Japan). After rinsing with PBS containing 50 μg/ml
   ampicillin, the cocoons were seeded onto 100-mm disposable petri dishes
   (Falcon), each coated with 100 ml of 1.0% agar supplemented with 0.5 mM
   NaCl, 0.05 mM KCl, 0.4 mM CaCl[2], 0.2 mM NaHCO[3] and 50 μg/ml
   ampicillin. The cultures were kept at 20°C. After the eggs hatched out,
   the worms were fed powdered skimmed milk every 2 weeks, and grew to a
   length of 3–5 cm in approximately 3 month-cultures. The adult worms
   were immediately frozen in liquid nitrogen and stored at −80°C until
   use. The frozen whole worms were ground in liquid nitrogen and total
   RNA was prepared using the Fastpure RNA Kit (Takara Bio Inc., Otsu,
   Japan) according to the manufacturer’s instructions. RNA integrity was
   assessed by agarose gel electrophoresis and the concentration was
   quantified using a Nanodrop ND-1000 spectrophotometer (Thermo Fisher
   Scientific, Wilmington, DE, USA). The qualified RNA samples were
   analyzed by the Illumina Sequencing Services of Hokkaido System Science
   Co., Ltd. The cDNA library for sequencing was constructed as described
   previously [[56]21]. Sequencing was performed on an Illumina HiSeq 2000
   [Q30 = 97.55%].

Data pre-processing, filtering, and de novo assembly

   For transcriptome assembly, raw reads were filtered and adapter
   sequences, non-coding RNA (e.g., rRNA), low-quality read sequences with
   ambiguous bases ‘N’, and raw reads with average lengths of less than 20
   bases were removed. The Trinity program [[57]23] was utilized for de
   novo transcriptome assembly, combining read sequences of a certain
   length of overlap to form longer fragments without N gaps, called
   contigs. These contigs were further processed for read alignment and
   abundance estimation with Bowtie [[58]24] and RSEM [[59]25], and these
   sequences were defined as unigenes. Calculation of unigene expression
   was performed using the Fragments Per kilo base of exon per Million
   mapped fragments (FPKM) method, which was able to exclude sequencing
   discrepancies in the calculation of gene expression and the influence
   of different gene lengths. The number of unigenes was 41,423 with a
   threshold of more than FPKM = 2. All raw read sequences are available
   at the DDBJ Sequence Read Archive [[60]26] with the accession number:
   DRA002587.

Functional annotation of unigenes and simple sequence repeats (SSR)
identification

   A homology search against the NCBI non-redundant (NR) protein database
   ([61]http://www.ncbi.nlm.nih.gov; formatted on April 7, 2014) based on
   the BLASTx program [[62]27] was performed for all unigenes using a
   cutoff E-value < 1E^−6, and the best aligning results were selected to
   annotate the unigenes. To further annotate the unigenes, the Blast2GO
   program v. 2.7.1 [[63]28] was used to obtain gene ontology (GO)
   [[64]29] terms and the Kyoto Encyclopedia of Genes and Genomes (KEGG)
   [[65]30] information according to the NR results. The BGI WEGO program
   [[66]31] was used to perform GO functional classification of all
   unigenes to visualize the distribution of gene functions. The
   identification of immune-related genes was performed as described by
   Pereiro et al. [[67]32]. Furthermore, to detect more genes belonging to
   the relevant immune pathways in the transcriptome sequences, KEGG
   reference pathways identified in a previous report were utilized
   [[68]21]. Microsatellite identification tool (MISA)
   ([69]http://pgrc.ipk-gatersleben.de/misa/) with default parameters was
   used to detect microsatellites in the unigenes.

Prediction of the tertiary structure of E. andrei GlcAT-P

   The tertiary structure of E. andrei GlcAT-P was predicted by homology
   modeling with SWISS-MODEL server [[70]33]. In the homology modeling,
   the GlcAT-P structure from human (Protein Data Bank (PDB) entry 1v83)
   [[71]34] was used as a template. Structural Figs. were drawn using
   PyMOL [[72]35].

Results and Discussion

Illumina sequencing and sequence assembly

   In total, Illumina sequencing yielded 47,555,164 reads comprising 4,803
   M bases from the mRNA pool isolated from sterile-cultured E. andrei.
   After the removal of adaptor sequences, ambiguous reads and low-quality
   reads, we obtained 47,070,010 clean reads. The Q30 percentage
   (sequencing error rate, 0.01%) was 97.55%. As shown in [73]Table 1,
   47,070,010 clean reads were assembled de novo using the Trinity
   program. The clean reads were further assembled into 151,929 contigs
   with an average length of 968 bp and an N50 of 1,855 bp. The length and
   GC content distribution for all contigs are shown in [74]Fig. 1A and
   1B. In total, 72,539 contigs were longer than 500 bp and the GC of
   17,320 contigs was greater than 50%. From the contigs, 41,423 unigenes
   were obtained with an average length of 1,633 bp and an N50 of 1,567 bp
   ([75]Table 1). The length and GC distributions of all assembled
   unigenes are shown in [76]Fig. 1C and 1D. In total, 7,219 unigenes were
   less than 500 bp and 5,488 unigenes were longer than 3,000 bp. The GC
   content of 4,722 unigenes was greater than 50%.

Table 1. Summary of the sequence assembly after Illumina sequencing.

   Total Number Total length (bp) Average length (bp) Average GC (%) N50
   (bp) Q30 (%)
   Raw sequencing reads 47,555,164 4,803,000,000 97.55
   Total clean reads 47,070,010
   Total contigs 151,929 147,023,550 968 42.7 1,855
   Total unigenes 41423 76,015,064 1,633 42.8 1,567
   [77]Open in a new tab

Fig 1. Overview of E. andrei transcriptome assembly.

   [78]Fig 1
   [79]Open in a new tab

   (A and B) The lengths and GC distributions of the contigs obtained from
   de novo assembly of high-quality clean reads. (C and D) The lengths and
   GC distributions of the unigenes produced from further contig assembly.

Functional annotation

   To validate and annotate the assembled unigenes, all assembled unigenes
   were searched against the GenBank non-redundant (NR) database using the
   BLASTx program (E-value < 1E^-6). The results showed that 12,285
   unigene sequences had BLAST hits to annotated proteins in NR. Analysis
   of the distributions of E-values indicated that 82.7% of the aligned
   sequences showed significant homology to the entries in the NR database
   (E-value < 1E^-6) ([80]Fig. 2A). Further analysis of the similarity
   distributions indicated that 73.3% of matched sequences had alignment
   identities greater than 80% ([81]Fig. 2B). A large number of the hits
   matched the sequences of Capitella teleta (24.8%) and Helobdella
   robusta (18.1%); other hits were identified within the reference
   protein databases of Crassostrea gigas (7.3%), Lottia gigantea (5.7%),
   Aplysia californica (5.4%), Saccoglossus kowalevskii (4.8%), and
   Strongylocentrotus purpuratus (4.7%) ([82]Fig. 2C). There were also
   many unigenes with no BLAST hits; these might represent additional
   genes that were not represented in the annotated protein databases or
   sequences that were too short to produce hits.

Fig 2. Characterization of the assembled unigenes based on an NR protein
database search.

   [83]Fig 2
   [84]Open in a new tab

   (A) E-value distribution of BLAST hits for the assembled unigenes with
   a cutoff of E-value < 1E^-6. (B) Similarity distribution of the top
   BLAST hits for the assembled unigenes with a cutoff of E-value < 1E^-6.
   (C) Species distribution of the top BLAST hits for the assembled
   unigenes with a cutoff of E-value < 1E^-6.

   GO (gene ontology) is an international system for the classification of
   standardized gene functions and is used to annotate and analyze gene
   functions and gene products in any organism. GO contains three main
   independent ontologies: biological process, molecular function, and
   cellular component. To predict their possible functions, we used the
   Blast2GO program to analyze the GO annotations of the assembled
   unigenes, and then used WEGO software to perform GO functional
   classifications. Based on NR annotation, 12,285 unigenes were assigned
   to 99 GO terms belonging to the three main GO ontologies ([85]Fig. 3).
   Further analysis of the 99 GO terms showed that the dominant terms were
   “cell”, “cell part”, “binding”, “catalytic”, “cellular process”, and
   “metabolic process”. Within the biological process group, the great
   majority of unigenes were related to the terms “cellular process” and
   “metabolic process”. Within the cellular component, most unigenes were
   assigned to “cells” and “cell parts”, followed by “binding” and
   “catalytic activity”.

Fig 3. GO assignments for assembled unigenes.

   [86]Fig 3
   [87]Open in a new tab

   The results are summarized in three main categories: biological
   process, cellular component, and molecular function. In total, 12,285
   unigenes were categorized by GO term. Classified gene objects are
   depicted as absolute numbers of the total number of gene objects with
   GO assignments.

   KEGG, a pathway-based categorization of orthologous genes, provides
   useful information for predicting the functional profiles of genes.
   Therefore, to better understand the function of the E. andrei
   transcriptome, the unigenes were assigned against the KEGG pathway. In
   total, 1,322 unigenes were grouped into 100 KEGG pathways ([88]S1
   Table). The top three ranking pathways were purine metabolism (155
   unigenes), thiamine metabolism (53 unigenes), and
   glycolysis/gluconeogenesis (49 unigenes) ([89]Fig. 4). Interestingly,
   47 unigenes mapped to the T cell receptor signaling pathway, which is
   involved in immune response ([90]Fig. 4).

Fig 4. Pathway enrichment analysis of assembled unigenes.

   [91]Fig 4
   [92]Open in a new tab

   In total, 1,322 unigenes were grouped into 100 KEGG pathways. The top
   30 pathways containing unigenes are shown.

SSR markers identification

   Microsatellites, also known as SSRs, are classes of repetitive DNA
   sequences ubiquitous in eukaryotic genomes [[93]36]. SSRs are ideal
   markers for use in paternity determination, population genetics
   studies, genetic diversity assessment and genetic map development
   [[94]37]. The results of microsatellite analysis are shown in [95]Table
   2 and [96]S2 Table. From the unigene sequences, 14,717 SSRs (10,764
   with simple repeats and 722 with compound formation) were identified in
   12,539 unique sequences, of which, 1,775 sequences contained more than
   one SSR. All SSRs were further classified by the number of repeat
   units. The results showed that the number of most potential SSRs were
   composed of three repeat units (54.1%), followed by two (31.2%), four
   (7.7%), and four (6.8%) repeat units. The SSRs of E. andrei could
   provide potential genetic markers for studies of population genetics,
   comparative genomics, as well as gene-based association studies aimed
   at understanding the genetic control of adaptive traits.

Table 2. Number of SSRs identified in the transcriptome of E. andrei.

   Total number of identified SSRs:                  14717
   Number of SSR containing sequence:                12539
   Number of sequences containing more than one SSR: 1775
   [97]Open in a new tab

Identification of sequences related to the immune response

   A keyword list and GO immune-related terms were used to search for
   genes putatively involved in the immune system of E. andrei ([98]S3
   Table). The results of this search identified a number of
   immune-related genes that were involved in well-recognized immune
   pathways, such as toll-like receptor (TLR) and
   interferon-gamma-mediated signaling. The toll receptor, as the signal
   transducer of the Toll pathway, plays a crucial role in the innate
   immune response. In this study, we identified a gene encoding TLR6 in
   the transcriptome datasets, as well as other genes belonging to the TLR
   signaling pathway, such as myeloid differentiation primary response
   protein (MyD) 88 and mitogen-activated protein kinases (MAPKs)
   [[99]Fig. 5]. Members of the JAK (Janus kinase) family are
   intracellular, non-receptor tyrosine kinases that transduce
   cytokine-mediated signals via the JAK-STAT pathway [[100]38]. Many
   studies have shown that signal transducers and activators of
   transcription (STAT) proteins are involved in the development and
   function of the immune system and play a role in maintaining immune
   tolerance and tumor surveillance [[101]39]. In our study, a unigene was
   identified with high similarity to mammalian JAK 2 but not to other
   members of the STAT family (STAT1, STAT2, STAT3, STAT4, STAT5A/5B, and
   STAT6). Identification of additional JAK-STAT pathway-related genes
   will be useful for learning more about the complexities of immune
   responses in E. andrei. In addition, signaling and interaction
   molecules, such as cytokines and cytokine receptors, were also
   identified in the transcriptome, as were proteases, protease
   inhibitors, and stress proteins such as heat shock proteins.

Fig 5. TLR signaling pathway.

   [102]Fig 5
   [103]Open in a new tab

   The red boxes indicate BLAST hits against the NR protein database.

Identification of markers of NK cell

   NK cells are a type of cytotoxic lymphocyte that plays an important
   role in the innate immune system. The development of NK cells was a
   critical event in the evolution of the immune system [[104]40]. Despite
   their importance, the evolutionary origin of NK cells has remained
   unknown [[105]41]. Previous reports have suggested that earthworms have
   NK-like cells [[106]12–[107]14]. Therefore, these animals may occupy a
   key position in the evolution of the NK cell-dependent immune system.
   Thus, we next searched our unigene database for markers of NK cells,
   such as CD16 and CD56, and NK-related genes, using the BLASTx program
   (E-value < 1E^-6). As shown in [108]Fig. 6A, we detected a unigene with
   high similarity to beta-1,3-glucuronyltransferase 1 (GlcAT-P), which
   mediates a glucuronyl transfer reaction during biosynthesis of the
   carbohydrate epitope HNK-1 (human natural killer-1, also known as CD57,
   an NK cell marker). The similarity was much higher than that of other
   unigenes to human NK cell markers. Thus, we further investigated this
   unigene. We found that the unigene has 44% amino acid sequence identity
   and 55% amino acid sequence similarity with human GlcAT-P. The
   structure of human GlcAT-P and the structural basis for acceptor
   substrate recognition of human GlcAT-P in the biosynthesis of HNK-1
   have been reported [[109]30]. However, because newly synthesized
   polypeptide chains must undergo folding into tertiary conformations and
   post-translational modifications before becoming a functional protein,
   a high level of identity in primary amino acid sequence between species
   does not automatically confer identical tertiary structure. Thus, as a
   preliminary analysis of the tertiary structure, we estimated the
   structure of earthworm GlcAT-P using homology modeling with the human
   GlcAT-P, and obtained the tertiary structure of amino acid residues
   101–355 of E. andrei GlcAT-P, including the uridine diphosphate (UDP)
   binding region which plays an important role in enzyme activity of
   GlcAT-P. As shown in [110]Fig. 6B, [111]S1 and [112]S2 Movies, a high
   level of conservation of the UDP binding region was predicted between
   these two species in the tertiary structures of GlcAT-P. This suggests
   that the carbohydrate epitope HNK-1 is synthesized in E. andrei and
   that an anti-CD57 antibody could be used to identify NK-like cells in
   E. andrei.

Fig 6. Identification of E. andrei GlcAT-P.

   [113]Fig 6
   [114]Open in a new tab

   (A) Comparison of the amino acid sequences of human and E. andrei
   GlcAT-P proteins. Amino acid sequences are shown as a single-letter
   code. Amino acids that are conserved between human and A. andrei are
   shaded. (B) The predicted tertiary structure of E. andrei GlcAT-P. The
   tertiary structure of E. andrei GlcAT-P was modeled using the structure
   of human GlcAT-P as a template. In each GlcAT-P structure, UDP binding
   regions and amino acid residues in the active site are shown as a ball
   and stick models, respectively. The position of the UDP binding region
   in E. andrei GlcAT-P is based on the human GlcAT-P structure.

   Epigeic species of earthworms, including E. andrei, live in top layer
   of the soil, which is rich in decaying organic matter and characterized
   by a wide variety of microbiota. These environmental contaminants and
   microbiota, including viruses, bacteria, and protozoans, could
   interfere with the abundance of epigeic earthworms by inducing a high
   mortality rate, decreasing reproductive success or synergistically
   increasing the virulence of other diseases. Therefore, to be successful
   as an epigeic earthworm, E. andrei needs to be resistant to various
   microorganisms present in the top layer of the soil. However, the
   mechanisms responsible for immune responses to these environmental
   contaminants and microbiota in E. andrei are not fully understood.
   Innate immunity is the first line of host defense against pathogens.
   Many immune cells, such as monocytes, macrophages, leukocytes (PMN),
   and NK cells, are involved in the detection and removal of microbial
   pathogens [[115]42,[116]43]. Compared to other model organisms such as
   mice and fruit flies, few immune-related genes have been identified in
   E. andrei. Our results revealed a large number of innate immune-related
   genes, covering almost all known innate immune pathways, including
   pathogen recognition, modulation, and signaling. These findings will
   facilitate our comprehensive understanding of the mechanisms involved
   in the immune response to environmental contaminants and microbiota in
   E. andrei.

   In conclusion, we characterized the transcriptome of E. andrei and
   identified many SSRs and abundant specific gene families involved in
   the immune response. To our knowledge, this is the first report to
   analyze the whole transcriptome of this epigeic species of earthworm.
   The dataset generated in this study will provide a substantial
   transcriptome-level resource for the study of earthworm biology and
   help further our understanding of the molecular mechanisms of various
   pathways, including those of the immune response.

Supporting Information

   S1 Movie. Tertiary structure of human GlcAT-P.

   The human GlcAT-P structure was derived from Protein Data Bank (PDB)
   entry 1v83.

   (MPG)
   [117]Click here for additional data file.^ (3.6MB, mpg)
   S2 Movie. Tertiary structure of E. andrei GlcAT-P.

   The tertiary structure of E. andrei GlcAT-P was predicted by homology
   modeling with SWISS-MODEL server. In the homology modeling, the GlcAT-P
   structure from human was used as a template.

   (MPG)
   [118]Click here for additional data file.^ (3.7MB, mpg)
   S1 Table. Pathway enrichment analysis of assembled unigenes of E.
   andrei.

   The unigenes were assigned against the KEGG pathway.

   (XLSX)
   [119]Click here for additional data file.^ (51.9KB, xlsx)
   S2 Table. List of SSRs detected in the unigenes of E. andrei.

   Microsatellite identification tool (MISA) with default parameters was
   used to detect SSRs in the unigenes.

   (XLSX)
   [120]Click here for additional data file.^ (1.7MB, xlsx)
   S3 Table. GO terms used for identifying immune-related genes.

   (XLSX)
   [121]Click here for additional data file.^ (963KB, xlsx)

Acknowledgments