Abstract
Background
Next generation sequencing technologies have substantially increased the throughput of microbial genome sequencing. To functionally annotate newly sequenced microbial genomes, a variety of experimental and computational methods are used. Integration of information from different sources is a powerful approach to enhance such annotation. Functional analysis of microbial genomes, necessary for downstream experiments, crucially depends on this annotation, but it is hampered by the current lack of suitable information integration and exploration systems for microbial genomes.
Results
We developed a data warehouse system (INDIGO) that enables the integration of annotations for exploration and analysis of newly sequenced microbial genomes. INDIGO makes it possible to construct complex queries and combine annotations from multiple sources, from the genomic sequence level to the protein domain, gene ontology and pathway levels. The data warehouse is intended to be populated with information from genomes of pure cultures and uncultured single cells of Red Sea bacteria and Archaea. Currently, INDIGO contains information from Salinisphaera shabanensis, Haloplasma contractile, and Halorhabdus tiamatea, extremophiles isolated from deep-sea anoxic brine lakes of the Red Sea. We provide examples of using the system to gain new insights into specific aspects of the unique lifestyle of these organisms and their adaptations to extreme environments.
Conclusions
We developed a data warehouse system, INDIGO, which enables comprehensive integration of information from various resources to be used for annotation, exploration and analysis of microbial genomes. It will be regularly updated and extended with new genomes, and it is intended to serve as a resource dedicated to Red Sea microbes. In addition, through INDIGO, we provide our Automatic Annotation of Microbial Genomes (AAMG) pipeline. The INDIGO web server is freely available at http://www.cbrc.kaust.edu.sa/indigo.
Introduction
Next Generation Sequencing (NGS) technologies have substantially increased the throughput of genome sequencing [1-3]. Annotation of newly sequenced genomes requires a variety of experimental and computational methods [4,5], as well as integration of diverse biological information from multiple sources. Annotations stemming from information integration can potentially be used as a powerful approach in functional genomics that facilitates downstream experiments [6,7]. Data warehouses based on integrated information [8,9] are particularly useful, as they make it possible to explore content through queries on diverse annotation attributes (e.g. genes, proteins, families, protein domains, ontologies, pathways). InterMine [10] is one of the frameworks that allow construction of such data warehouses. It has previously been applied to developing data warehouses of model genomes, resulting in resources such as FlyMine, modMine, RatMine and YeastMine. For more details on InterMine features and a comparison to similar systems, see reference [10] and its supplementary materials.
Here, we introduce INDIGO (Integrated Data Warehouse of Microbial Genomes), a data warehouse we developed that allows integration of annotations for exploration and analysis of microbial genomes. Currently, INDIGO contains information from three species: two bacterial species, Salinisphaera shabanensis [11] and Haloplasma contractile [12], and one archaeal species, Halorhabdus tiamatea [13], all isolated from deep-sea anoxic brine lakes of the Red Sea. INDIGO will be regularly updated and expanded by the addition of new microbial genomes from Red Sea species.
Our contributions in this study can be summarized as follows:
* Introduction of our Automatic Annotation of Microbial Genomes (AAMG) pipeline.
* Automation of data warehouse development in a high throughput manner that minimizes the intermediate steps for processing of annotation results.
* Public provision of annotations of microbial genomes being sequenced at KAUST from studies of the Red Sea environment. The number of genomes will gradually increase.
INDIGO data warehouse
Generally, newly sequenced microbial genomes are submitted to archival databases such as GenBank [14] or EMBL [15], and later they become part of curated resources such as NCBI’s RefSeq database [16,17]. To support research on microbial genomes, a number of microbial data warehouses have been developed. A few examples are Integrated Microbial Genomes (IMG) [18], MicrobesOnline [19], Ensembl Genomes (www.ensemblgenomes.org) and MicroScope [20]. These publicly available data warehouses allow data browsing and comparison of genomes based on different sequence and functional features. However, they are quite limited in their capacity for query building and for generating customized feature/attribute/entity lists for more specific interrogation of the information they contain.
We developed INDIGO, a data warehouse for microbial genomes, using the InterMine framework [10], which allows extensive query building, customized feature/attribute/entity list creation and enrichment analysis for Gene Ontology (GO) concepts, protein domains and various pathways. To populate INDIGO with information from a newly sequenced genome, one needs a draft or complete genome assembly and a functional annotation of the assembled genome. The INDIGO deployment requires the following five functions: (1) definition of a genomic data model of the entities to be stored, (2) data validation and population of the Postgres database, (3) data integration, (4) data post-processing, and (5) web-application development.
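To make the first two functions more concrete, the following is a minimal sketch of checking annotation records against a data model before they are loaded. It is only an illustration under stated assumptions: the class names, attribute names and the `validate_record` helper are hypothetical and do not reproduce INDIGO's or InterMine's actual schema or code.

```python
# Hypothetical, simplified genomic data model: class name -> {attribute: type}.
# These class and attribute names are illustrative assumptions, not INDIGO's schema.
GENOMIC_MODEL = {
    "Gene": {"primaryIdentifier": str, "length": int, "strand": int},
    "Protein": {"primaryAccession": str, "length": int},
}


def validate_record(class_name, record, model=GENOMIC_MODEL):
    """Return a list of problems found in one record; an empty list means valid."""
    if class_name not in model:
        return [f"unknown class: {class_name}"]
    problems = []
    attributes = model[class_name]
    for name, value in record.items():
        if name not in attributes:
            # The record carries an attribute the model does not define.
            problems.append(f"{class_name}.{name}: attribute not in model")
        elif not isinstance(value, attributes[name]):
            # The attribute exists but the value has the wrong data type.
            expected = attributes[name].__name__
            actual = type(value).__name__
            problems.append(f"{class_name}.{name}: expected {expected}, got {actual}")
    return problems
```

Rejecting records with unknown attributes or mismatched types before loading keeps malformed annotation output from reaching the database, which is the general purpose of a validation step of this kind.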
These five functions are synchronized through a project XML file that stores the locations of the different datasets, the types of data sources and the standard InterMine post-processing steps.
Results and Discussion
Genome assembly
In our case, we reassembled three previously reported genomes [11-13] from NGS data generated on Roche and Illumina sequencers, using the Roche 454 Newbler assembler (www.454.com) with the scaffolding option turned on, in addition to SOAPdenovo [21] and Velvet [22]. Furthermore, we used CISA [23] to obtain consensus assemblies. We improved the resulting scaffolds using SSPACE [24], GapFiller [25] and GapCloser [21]. This procedure significantly improved the assemblies of all three genomes by reducing the number of contigs and improving the N50 value. Consequently, the contig redundancy previously observed when using minimus [26] is now resolved. The re-assembled contigs and associated annotations have been deposited at NCBI under accession numbers AFNU00000000, AFNV00000000 and AFNT00000000 for the HLPCO, SSPSH and HLRTI strains, respectively.
Genome annotation
In our study, we performed genome annotation through a series of steps described in the workflow depicted in Figure 1. First, genomic sequences are passed through fastaclean (Exonerate package) [27]. Before the prediction of coding regions, the genome is masked for RNA using RNAmmer [28] and tRNAscan-SE [29]. Predicted 16S rRNA genes are searched against the NCBI prokaryotic 16S rRNA gene database to retrieve related taxonomic information that is later used in selecting the best BLAST hits. Open Reading Frame (ORF) prediction is performed using Prodigal [30], GeneMark [31] and MetaGeneAnnotator [32].
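As an aside on the assembly statistics mentioned in the genome assembly subsection, N50 is commonly defined as the length of the shortest contig in the smallest set of longest contigs that together cover at least half of the total assembly length. The sketch below shows that standard definition; the function name `n50` is our own, and the code is not taken from the assembly tools cited above.

```python
def n50(contig_lengths):
    """Return the N50 of an assembly given its contig (or scaffold) lengths."""
    if not contig_lengths:
        raise ValueError("at least one contig length is required")
    total = sum(contig_lengths)
    running = 0
    # Walk from the longest contig down until half of the assembly is covered.
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
```

Fewer, longer contigs push this value up, which is why a reduced contig count and an improved N50 are usually reported together.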
A series of BLAST [33,34] searches are then performed against the GenBank non-redundant (nr) [14], UniProt [35] and Kyoto Encyclopedia of Genes and Genomes (KEGG) [36] databases, including Reverse Position Specific (RPS) [37] searches against Conserved Domain Databases (especially COG and Prokaryotic Protein Clusters (PRK)). KEGG ortholog IDs are used to map relevant pathways and to display their presence on KEGG pathway maps. InterProScan analysis is carried out for GO terms and protein signature domains [38,39]. Annotation results are checked using NCBI’s tbl2asn, and errors are corrected manually. To verify the origin of each contig/genomic sequence, a global scan of the BLAST results of all genes is carried out, and Globally Best Taxonomies (GBT) are assigned based on species ranked from high to low by their top hits. Ties are broken in favor of the species with the greater total alignment length reported in the BLAST results for each of the top scoring species.
Figure 1. Workflow of annotation process and data warehousing. The section marked (A) shows the steps in the annotation process. Section (B) shows a Perl-based conversion of annotations into an XML schema, validated using the class attributes and data types defined in the genomic model. Section (C) shows the data warehouse development steps.
Benchmarking
Recently, Triplet et al. [40] thoroughly compared and benchmarked four data warehousing systems, namely BioMart [41], BioXRT (mentioned in [40]), InterMine [10] and Pathway Tools [42], on a number of aspects covering accuracy, computational requirements and development effort. In that study, InterMine and Pathway Tools outperformed the other systems. InterMine obtained the highest score across the five aspects of data retrieval for genomics research that were considered: aggregation, algebra, graph, data integration and sequence handling.
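Returning briefly to the Globally Best Taxonomies (GBT) step of the annotation workflow, one plausible reading of the ranking rule is sketched below: species are ordered by how many genes they supply the top BLAST hit for, and ties are broken by total alignment length. The input layout, the function name `rank_taxonomies` and the final alphabetical tie-break are assumptions made for illustration; the AAMG pipeline's actual implementation is not described at this level of detail.

```python
from collections import defaultdict


def rank_taxonomies(top_hits):
    """Rank species by number of top hits, breaking ties by total alignment length.

    top_hits: iterable of (gene_id, species, alignment_length) tuples, one per
    gene, holding the best BLAST hit for that gene (a hypothetical input layout).
    """
    hit_count = defaultdict(int)
    aligned_length = defaultdict(int)
    for _gene_id, species, alignment_length in top_hits:
        hit_count[species] += 1
        aligned_length[species] += alignment_length
    # More top hits first; among equals, the greater total alignment length first.
    # The species name is a last, purely deterministic tie-break (an assumption).
    return sorted(hit_count, key=lambda s: (-hit_count[s], -aligned_length[s], s))
```

For example, two species with the same number of top hits would be ordered by whichever accumulated the longer total alignment across its hits.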
We developed the INDIGO system using the InterMine framework, but we extended it with the following features not available in InterMine:
1. Development of an automatic high throughput data warehousing pipeline to process and validate customized annotations from newly sequenced microbial genomes. As an example, we annotated three extremophile genomes from the Red Sea, processed the annotations and added them to INDIGO for public data mining.
2. Addition of Genome Browser functionality.
3. Addition of a BLAST interface that allows comparison of external user-specified sequence data to the INDIGO dataset, and integration of BLAST results to explore the annotations of hit genes either in the INDIGO data warehouse or in the auxiliary genome browser.
4. Special hyperlinks for KEGG-assigned INDIGO pathway gene sets, so that they can be shown on publication quality pathway diagrams at the KEGG website.
5. More importantly, availability of the Automatic Annotation of Microbial Genomes (AAMG) pipeline for public use through the INDIGO server.
We compared the INDIGO system to InterMine and a few other microbial genome data warehouses, namely Integrated Microbial Genomes (IMG) [18], MicrobesOnline [19] and MicroScope [20]. Table 1 lists the compared features and whether each is present in a given data warehouse. InterMine is included in the comparison to show the differences between its basic framework and our INDIGO system. This comparison shows the advantages of the INDIGO system, which complements InterMine and gives the user more control over integrating annotation information than other microbial data warehouses offer. The MicroScope microbial genome annotation data warehouse differs from INDIGO in providing scope for manual annotation of each gene individually. However, this requires a lot of expert manpower to deal with the increasing amount of newly sequenced microbial genome data.
MicroScope also shares a number of features with INDIGO, but the InterMine-based INDIGO system takes the lead in providing several automated and powerful routes for user-defined data integration, particularly the creation of user-controlled gene lists via keyword search, the query builder or BLAST, which lead to statistically robust GO, pathway or protein domain enrichment analyses.
Table 1. A comparison of features from different microbial data warehouses.
Feature | INDIGO | InterMine | Integrated Microbial Genomes | Microbes Online | MicroScope
Basic Data
Chromosome/Contigs | Yes | Yes | Yes | Yes | Yes
Genes | Yes | Yes | Yes | Yes | Yes
Proteins | Yes | Yes | Yes | Yes | Yes
Expression data | No | Yes | Yes | Yes | Yes
Functional genomics
Gene Ontology | Yes | Yes | Yes | Yes | Yes
KEGG Pathways | Yes | Yes | Yes | Yes | Yes
Interpro Domains | Yes | Yes | Yes | Yes | Yes
Cross references | Yes | Yes | Yes | Yes | Yes