Abstract

Background

   Transcriptomic sequencing (RNA-seq) related applications allow for
   rapid explorations due to their high-throughput and relatively fast
   experimental capabilities, providing unprecedented progress in gene
   functional annotation, gene regulation analysis, and environmental
   factor verification. However, with increasing amounts of sequenced
   reads and reference model species, the selection of appropriate
   reference species for gene annotation has become a new challenge.

Methods

   We proposed a novel approach for finding the most effective reference
   model species through taxonomic associations and ultra-conserved
   orthologous (UCO) gene comparisons among species. An online system for
   multiple species selection (MSS) for RNA-seq differential expression
   analysis was developed, and comprehensive genomic annotations from 291
   reference model eukaryotic species were retrieved from the RefSeq,
   KEGG, and UniProt databases.

Results

   Using the proposed MSS pipeline, gene ontology and biological pathway
   enrichment analysis can be efficiently achieved, especially in the case
   of transcriptomic analysis of non-model organisms. The results showed
   that the proposed method solved problems related to limitations in
   annotation information and provided a roughly twenty-fold reduction in
   computational time, resulting in more accurate results than those of
   traditional approaches of using a single model reference species or the
   large non-redundant reference database.

Conclusions

   Selection of appropriate reference model species helps to reduce
   missing annotation information, allowing for more comprehensive results
   than those obtained with a single model reference species. In addition,
   adequate model species selection reduces the computational time
   significantly while retaining the same order of accuracy. The proposed
   system indeed provides superior performance by selecting appropriate
   multiple species for transcriptomic analysis compared to traditional
   approaches.

Electronic supplementary material

   The online version of this article (10.1186/s12859-018-2278-z) contains
   supplementary material, which is available to authorized users.

   Keywords: RNA-seq, Reference model species, Differential expression
   analysis, Ultra-conserved orthologous gene, Gene ontology, Biological
   pathway

Background

   High-throughput RNA sequencing (RNA-seq) involves the sequencing of
   cDNA samples using next-generation sequencing instruments, thus
   permitting the analysis of complete transcriptomes from samples
   subjected to different experimental conditions. In addition to
   providing a snapshot of an organism’s transcriptional profile, RNA-seq
   can also be used to observe changes in gene expression levels by
   comparing transcriptomes obtained at various time points or from
   different tissues under different environmental settings [[37]1]. In
   recent years, owing to advances allowing for high-throughput,
   relatively fast, and low-cost experiments, RNA-seq has led to
   unprecedented development of studies focusing on gene functions and
   regulation, as well as correlations between genes and environmental
   factors. Compared to traditional microarray hybridization approaches
   for gene expression analysis, a major advantage of RNA-seq is that it
   allows for analysis of non-model target species without comprehensive
   gene annotations in advance [[38]2]. This is because RNA-seq can still
   accurately detect novel gene transcripts and expression levels in
   different tissues or under different conditions through de novo
   assembly of the sequenced reads. In addition, RNA-seq provides high
   dynamic expression ranges, and costs are relatively low for obtaining
   whole-transcriptome data [[39]3]. Based on these advantages, non-model
   organisms now have been frequently selected by the research community
   for extensive study. It is also due to the facts that non-model
   organisms possess many interesting and unique features compared to
   model organisms, and comprehensive transcriptome analysis reveals novel
   genes and helps to address genetic evolution and ecological issues from
   different aspects.

   Although transcriptional data can be obtained using RNA-seq without any
   prior knowledge of gene sequences, for functional genomic analysis of
   non-model species, a sequence similarity alignment must be performed
   with reference sequence datasets containing known gene functions in
   order to accurately annotate the gene functions of assembled sequences.
   The most popular gene sequence database is the International Nucleotide
   Sequence Database Collaboration (INSDC), which regularly updates
   sequence information from three databases: the National Center for
   Biotechnology Information (NCBI), the European Bioinformatics Institute
   (EMBL-EBI), and the DNA Data Bank of Japan (DDBJ). The non-redundant
   (nr) sequence database GenBank, maintained by NCBI, contains a popular
   reference dataset for sequence annotation; thus, several RNA-seq
   studies performed on novel species have thus far aligned sequences to
   the nr sequence database for gene function annotation [[40]4–[41]6].
   However, the volume of this database is enormous. As of June 2016, a
   total of 37,382,402 sequences had been indexed. Using this dataset as a
   reference for assembled sequence alignment and gene function annotation
   not only requires a large amount of computational resources, but the
   alignment results may also contain false positive annotations due to
   contaminated samples and false assembled contigs. An alternative method
   commonly used by biologists is to select the single reference model
   species that has the greatest evolutionary relevance to the target
   species. While this method is rapid and simple, the selected model
   species may not necessarily have complete gene annotations, and thus it
   is possible that certain information may be missed. Therefore, in this
   study, we aimed to develop a tool that can use multiple model species
   simultaneously as references for annotating genes from new RNA-seq