Abstract

Background

   Next-generation sequencing technologies have substantially increased
   the throughput of microbial genome sequencing. To functionally annotate
   newly sequenced microbial genomes, a variety of experimental and
   computational methods are used. Integration of information from
   different sources is a powerful approach to enhance such annotation.
   Functional analysis of microbial genomes, necessary for downstream
   experiments, depends crucially on this annotation, but it is hampered
   by the current lack of suitable information integration and
   exploration systems for microbial genomes.

Results

   We developed a data warehouse system (INDIGO) that enables the
   integration of annotations for exploration and analysis of newly
   sequenced microbial genomes. INDIGO offers an opportunity to construct
   complex queries and combine annotations from multiple sources starting
   from genomic sequence to protein domain, gene ontology and pathway
   levels. This data warehouse is intended to be populated with
   information from genomes of pure cultures and uncultured single cells
   of Red Sea bacteria and archaea. Currently, INDIGO contains information
   from Salinisphaera shabanensis, Haloplasma contractile, and Halorhabdus
   tiamatea - extremophiles isolated from deep-sea anoxic brine lakes of
   the Red Sea. We provide examples of using the system to gain new
   insights into specific aspects of the unique lifestyle and adaptations
   of these organisms to extreme environments.

Conclusions

   We developed a data warehouse system, INDIGO, which enables
   comprehensive integration of information from various resources to be
   used for annotation, exploration and analysis of microbial genomes. It
   will be regularly updated and extended with new genomes, and is
   intended to serve as a resource dedicated to Red Sea microbes. In
   addition,
   through INDIGO, we provide our Automatic Annotation of Microbial
   Genomes (AAMG) pipeline. The INDIGO web server is freely available at
   [35]http://www.cbrc.kaust.edu.sa/indigo.

Introduction

   Next Generation Sequencing (NGS) technologies have substantially
   increased the throughput of genome sequencing [[36]1-[37]3]. Annotation
   of newly sequenced genomes requires a variety of experimental and
   computational methods [[38]4,[39]5], as well as integration of diverse
   biological information from multiple sources. Annotations derived from
   integrated information can be a powerful resource in functional
   genomics and can facilitate downstream experiments
   [[40]6,[41]7]. Data warehouses based on integrated information
   [[42]8,[43]9] are particularly useful because they make it possible to
   explore content through queries over diverse annotation attributes
   (e.g. genes, proteins, families, protein domains, ontologies,
   pathways). InterMine [[44]10] is one of the frameworks that allow
   construction of such data warehouses. It has previously been used to
   develop data warehouses for model organism genomes, resulting in
   resources such as FlyMine, modMine, RatMine and YeastMine. For more
   details on
   InterMine features and comparison to similar systems, see reference
   [[45]10] and its supplementary materials.

   Here, we introduce INDIGO (Integrated Data Warehouse of Microbial
   Genomes), a data warehouse for microbial genomes we developed, which
   allows integration of annotations for exploration and analysis of
   microbial genomes. Currently, INDIGO contains information from three
   species: two bacterial species, Salinisphaera shabanensis [[46]11] and
   Haloplasma contractile [[47]12], and one archaeal species, Halorhabdus
   tiamatea [[48]13], all isolated from deep-sea anoxic brine lakes of the
   Red Sea. INDIGO will be regularly updated and expanded by the addition
   of new microbial genomes from Red Sea species.

   Our contributions in this study can be summarized as follows:
     * Introduction of our Automatic Annotation of Microbial Genomes
       (AAMG).
     * Automation of data warehouse development in a high throughput
       manner that minimizes the intermediate steps for processing of
       annotation results.
     * Public provision of annotations of microbial genomes being
       sequenced at KAUST as part of studies of the Red Sea environment.
       The number of genomes will gradually increase.

INDIGO data warehouse

   Generally, newly sequenced microbial genomes are submitted to archival
   databases such as GenBank [[49]14] or EMBL [[50]15] and later they
   become part of curated resources such as NCBI’s RefSeq database
   [[51]16,[52]17]. To support research on microbial genomes, a number of
   microbial data warehouses have been developed. A few examples are
   Integrated Microbial Genomes (IMG) [[53]18], MicrobesOnline
   [[54]19], Ensembl Genomes ([55]www.ensemblgenomes.org) and MicroScope
   [[56]20]. These publicly available data warehouses of microbial genome
   information allow data browsing and comparison of genomes based on
   different sequence and functional features. However, they offer
   limited capacity for query building and for generating customized
   feature/attribute/entity lists, which restricts more specific
   interrogation of the information they contain.

   We developed INDIGO, a data warehouse for microbial genomes, using the
   InterMine framework (Smith et al. [[57]10]), which allows extensive
   query building, customized feature/attribute/entity list creation and
   enrichment analysis for Gene Ontology (GO) concepts, protein domains
   and various pathways. To populate INDIGO with information from a newly
   sequenced genome, one needs a draft or complete genome assembly and a
   functional annotation of the assembled genome. The INDIGO deployment
   requires the following five functions: (1) definition of a genomic
   data model of entities to be stored, (2) data validation and
   population of the Postgres database, (3) data integration, (4) data
   post-processing, and (5) web-application development. These five
   functions are synchronized through a project XML file that stores the
   location of different datasets, the type of data sources and the
   standard InterMine post-processing steps.
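
   To make the role of the project XML file concrete, the following is a
   minimal sketch of how such a file could be read to list its data
   sources and post-processing steps. The element layout follows the
   general shape of an InterMine project file, but every source name,
   type, path and step name below is a hypothetical placeholder, not
   INDIGO's actual configuration.

```python
import xml.etree.ElementTree as ET

# Hypothetical project file: all names, types and paths are illustrative only.
PROJECT_XML = """
<project type="bio">
  <sources>
    <source name="example-genome-gff" type="example-gff">
      <property name="src.data.dir" location="/data/example/gff"/>
    </source>
    <source name="example-go-annotation" type="example-go">
      <property name="src.data.dir" location="/data/example/go"/>
    </source>
  </sources>
  <post-processing>
    <post-process name="create-references"/>
    <post-process name="create-search-index"/>
  </post-processing>
</project>
"""


def summarize_project(xml_text):
    """Return (sources, post_processing_steps) described by a project file."""
    root = ET.fromstring(xml_text)
    sources = []
    for src in root.findall("./sources/source"):
        # Collect every dataset location declared for this source.
        locations = [p.get("location") for p in src.findall("property")
                     if p.get("location")]
        sources.append({"name": src.get("name"),
                        "type": src.get("type"),
                        "locations": locations})
    # Post-processing steps are kept in the order they are declared.
    steps = [pp.get("name")
             for pp in root.findall("./post-processing/post-process")]
    return sources, steps


sources, steps = summarize_project(PROJECT_XML)
for source in sources:
    print(source["name"], source["type"], source["locations"])
print("post-processing:", steps)
```

   Keeping dataset locations, source types and post-processing steps in
   one file is what lets the five functions be run as a single,
   repeatable build.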

Results and Discussion

Genome assembly

   In our case, we reassembled the three previously reported genomes
   [[58]11-[59]13] from the NGS data generated on Roche and Illumina
   sequencers, using the Roche 454 Newbler assembler ([60]www.454.com)
   with the scaffolding option turned on, in addition to SOAPdenovo
   [[61]21] and Velvet [[62]22]. Furthermore, we used CISA [[63]23] to
   obtain consensus assemblies. We improved the resulting scaffolds using
   SSPACE [[64]24], GapFiller [[65]25] and GapCloser [[66]21]. Applying
   this procedure significantly improved the assemblies by reducing the
   number of contigs and improving the N50 of all three genomes.
   Consequently, the redundancy in the contigs observed previously using
   minimus [[67]26] is now resolved. The re-assembled contigs and
   associated annotations have been deposited at NCBI under accession
   numbers [68]AFNU00000000, [69]AFNV00000000 and [70]AFNT00000000 for
   the HLPCO, SSPSH and HLRTI strains, respectively.
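
   For readers unfamiliar with the N50 statistic mentioned above, the
   sketch below computes it using the standard definition: the length of
   the contig at which the running total of contig lengths, sorted from
   longest to shortest, first reaches half of the assembly size. The
   contig lengths are invented toy values, not figures from our
   assemblies.

```python
def n50(contig_lengths):
    """Return the N50 of an assembly given its contig lengths."""
    lengths = sorted(contig_lengths, reverse=True)
    half_total = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        # The first contig that takes the running total to half of the
        # assembly size defines the N50.
        if running >= half_total:
            return length
    raise ValueError("contig_lengths must not be empty")


# Toy example: the same 1500 bases split into five contigs, then into two.
fragmented = [100, 200, 300, 400, 500]
merged = [800, 700]
print(n50(fragmented), n50(merged))
```

   In this toy case, fewer and longer contigs give a higher N50 for the
   same total assembly size, which is the sense in which a reduced contig
   count and an improved N50 go together.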

Genome annotation

   In our study, we performed genome annotation through the series of
   steps shown in the workflow in [71]Figure 1. First, genomic sequences
   are passed through fastaclean (Exonerate package) [[72]27]. Before the
   prediction of coding regions, the genome is masked for RNA genes using
   RNAmmer [[73]28] and tRNAscan-SE [[74]29]. Predicted 16S rRNA genes
   are searched against the NCBI prokaryotic 16S rRNA gene database to
   retrieve related taxonomic information that is later used in selecting
   the best BLAST hits. Open Reading Frame (ORF) prediction is performed
   using Prodigal [[75]30], GeneMark [[76]31] and MetaGeneAnnotator
   [[77]32]. A series of BLAST [[78]33,[79]34] searches is then performed
   against the GenBank non-redundant (nr) [[80]14], UniProt [[81]35] and
   Kyoto Encyclopedia of Genes and Genomes (KEGG) [[82]36] databases,
   including Reverse Position Specific (RPS) BLAST [[83]37] searches
   against Conserved Domain Databases (especially COG and Prokaryotic
   Protein Clusters (PRK)). KEGG ortholog IDs are used to map relevant
   pathways and to display their presence on KEGG pathway maps.
   InterProScan analysis is carried out to assign GO terms and protein
   signature domains [[84]38,[85]39]. Annotation results are checked
   using NCBI’s tbl2asn, and errors are corrected manually. To verify the
   origin of each contig/genomic sequence, a global scan of the BLAST
   results of all genes is carried out and Globally Best Taxonomies (GBT)
   are assigned based on species ranked from high to low by their top
   hits. Ties are broken by the total alignment length, from higher to
   lower, reported in the BLAST results for each of the top-scoring
   species.
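
   The GBT assignment and its tie-breaking rule can be sketched as
   follows. This is one possible reading of the procedure, assuming that
   each gene contributes a single top hit, that species are ranked by the
   number of top hits they receive, and that ties are broken by summed
   alignment length. The function name, input layout and example hits are
   hypothetical.

```python
from collections import defaultdict


def rank_taxonomies(top_hits):
    """Rank species by top-hit count, breaking ties by total alignment length.

    top_hits: iterable of (gene_id, species, alignment_length), one per gene.
    """
    hit_count = defaultdict(int)
    aligned_total = defaultdict(int)
    for _gene_id, species, alignment_length in top_hits:
        hit_count[species] += 1
        aligned_total[species] += alignment_length
    # Sort by descending hit count, then by descending total alignment length.
    return sorted(hit_count,
                  key=lambda s: (-hit_count[s], -aligned_total[s]))


# Invented hits for one contig: species A and B tie on two top hits each,
# and B wins the tie with 600 aligned positions against 500 for A.
hits = [
    ("gene1", "Species A", 300),
    ("gene2", "Species B", 500),
    ("gene3", "Species A", 200),
    ("gene4", "Species B", 100),
    ("gene5", "Species C", 900),
]
ranking = rank_taxonomies(hits)
print(ranking[0])
```

   Note that Species C has the single longest alignment but ranks last,
   because alignment length is consulted only to separate species with
   equal top-hit counts.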

Figure 1. Workflow of annotation process and data warehousing.

   Section (A) shows the steps in the annotation process. Section (B)
   shows the Perl-based conversion of annotations into an XML schema,
   validated against the class attributes and data types defined in the
   genomic model. Section (C) shows the data warehouse development steps.
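
   The conversion in section (B) is implemented in Perl. Purely as an
   illustration of the idea, the Python sketch below checks one
   annotation record against a model of class attributes and data types
   before emitting it as XML. The model contents, the element and
   attribute names, and the record are assumptions made for this example
   and do not reproduce the actual INDIGO scripts or genomic model.

```python
import xml.etree.ElementTree as ET

# Hypothetical model: class name -> {attribute name: expected Python type}.
MODEL = {"Gene": {"primaryIdentifier": str, "length": int}}


def to_item(class_name, record, model=MODEL):
    """Validate a record against the model and return it as an XML element."""
    fields = model[class_name]
    item = ET.Element("item", {"class": class_name})
    for name, value in record.items():
        # Reject attributes the model does not define for this class.
        if name not in fields:
            raise ValueError(f"{class_name} has no attribute {name!r}")
        # Reject values whose type does not match the model.
        if not isinstance(value, fields[name]):
            raise TypeError(f"{name} should be {fields[name].__name__}")
        ET.SubElement(item, "attribute", {"name": name, "value": str(value)})
    return item


item = to_item("Gene", {"primaryIdentifier": "gene_0001", "length": 1203})
print(ET.tostring(item, encoding="unicode"))
```

   Validating at conversion time means that a malformed annotation is
   rejected before it reaches the database loading step.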

Benchmarking

   Recently, Triplet et al. [[88]40] thoroughly compared and benchmarked
   four data warehousing systems, namely BioMart [[89]41], BioXRT
   (mentioned in [[90]40]), InterMine [[91]10] and Pathway Tools
   [[92]42], on a number of aspects covering accuracy, computational
   requirements and development effort. In that study, InterMine and
   Pathway Tools outperformed the other systems. InterMine obtained the
   highest score across the five aspects of data retrieval for genomics
   research that were considered: aggregation, algebra, graph, data
   integration and sequence handling. We developed the INDIGO system
   using the InterMine framework and extended it with the following
   features, which are not available in InterMine:
     1. Development of an automatic high throughput data warehousing
        pipeline to process customized annotations and validate them for
        newly sequenced microbial genomes. As an example, we annotated
        three extremophile genomes from the Red Sea, processed the
        annotations and added them to INDIGO for public data mining.
     2. Addition of Genome Browser functionality.
     3. Addition of a BLAST interface that allows comparison of external
        user-specified sequence data with the INDIGO dataset, and
        integration of BLAST results so that the annotations of hit genes
        can be explored either in the INDIGO data warehouse or in the
        auxiliary genome browser.
     4. Provision of special hyperlinks that allow KEGG-assigned INDIGO
        pathway gene sets to be shown on publication-quality pathway
        diagrams at the KEGG website.
     5. More importantly, provision of the Automatic Annotation of
        Microbial Genomes (AAMG) pipeline for public use through the
        INDIGO server.

   We compared the INDIGO system with InterMine and a few other microbial
   genome data warehouses, such as Integrated Microbial Genomes (IMG)
   [[93]18], MicrobesOnline [[94]19] and MicroScope [[95]20]. [96]Table 1
   lists the features compared and whether each is present in a given
   data warehouse. InterMine is included in the comparison to show the
   differences between its basic framework and our INDIGO system. The
   comparison shows that INDIGO complements InterMine and gives the user
   more control over integrating annotation information, which is lacking
   in other microbial data warehouses. The MicroScope data warehouse of
   microbial genome annotations differs from INDIGO in providing scope
   for manual annotation of each individual gene. However, this requires
   considerable expert manpower to deal with the increasing amount of
   newly sequenced microbial genome data. MicroScope also shares a number
   of features with INDIGO, but the InterMine-based INDIGO system leads
   in providing several automated and powerful routes for user-defined
   data integration, particularly user-controlled gene lists built by
   keyword, query builder or BLAST, which lead to statistically robust
   GO, pathway or protein domain enrichment analyses.
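
   As an illustration of what an enrichment analysis on a user-defined
   gene list computes, the sketch below evaluates a one-sided
   hypergeometric test. The text above does not specify the statistical
   test, so the hypergeometric model is an assumption (it is a common
   choice for GO, pathway and domain enrichment), and multiple-testing
   correction is omitted. All counts are invented.

```python
from math import comb


def enrichment_p_value(k, n, K, N):
    """P(X >= k) under a hypergeometric model.

    N: genes in the background (e.g. the whole genome)
    K: background genes annotated with the term
    n: genes in the user-defined list
    k: genes in the list annotated with the term
    """
    upper = min(n, K)
    total = comb(N, n)
    # Sum the probabilities of seeing k, k+1, ... annotated genes in the list.
    favourable = sum(comb(K, i) * comb(N - K, n - i)
                     for i in range(k, upper + 1))
    return favourable / total


# Toy example: 10 background genes, 5 carry the term, and a list of 5 genes
# in which all 5 carry it. Only one of the 252 possible lists does so.
p = enrichment_p_value(k=5, n=5, K=5, N=10)
print(p)
```

   A small p-value indicates that the term appears in the list more often
   than expected from its frequency in the background. In practice a
   correction for testing many terms at once would be applied on top of
   this.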

Table 1. A comparison of features from different microbial data warehouses.

   Feature             INDIGO  InterMine  Integrated Microbial Genomes  Microbes Online  MicroScope
   Basic Data
   Chromosome/Contigs  Yes     Yes        Yes                           Yes              Yes
   Genes               Yes     Yes        Yes                           Yes              Yes
   Proteins            Yes     Yes        Yes                           Yes              Yes
   Expression data     No      Yes        Yes                           Yes              Yes
   Functional genomics
   Gene Ontology       Yes     Yes        Yes                           Yes              Yes
   KEGG Pathways       Yes     Yes        Yes                           Yes              Yes
   InterPro Domains    Yes     Yes        Yes                           Yes              Yes
   Cross references    Yes     Yes        Yes                           Yes              Yes