Abstract

Background

Cotton (Gossypium spp.) is the most important fiber and oil crop in the world. With the emergence of huge -omics data sets, it is essential to have an integrated functional genomics database that allows users worldwide to quickly and easily fetch and visualize genomic information. Currently available cotton-related databases are weak at integrating multiple kinds of -omics data from multiple Gossypium species. Therefore, it is necessary to establish an integrated functional genomics database for cotton.

Description

We developed CottonFGD (Cotton Functional Genomic Database, https://cottonfgd.org), an integrated database that includes genomic sequences, gene structural and functional annotations, genetic marker data, transcriptome data, and population genome resequencing data for all four of the sequenced Gossypium species. It consists of three interconnected modules: search, profile, and analysis. Together, these modules support both single-gene review and batch analysis across multiple kinds of -omics data and multiple species. CottonFGD also includes additional pages for data statistics, bulk data download, and a detailed user manual.

Conclusion

Equipped with specialized functional modules and modernized visualization tools, and populated with multiple kinds of -omics data, CottonFGD provides a quick and easy-to-use data analysis platform for cotton researchers worldwide.

Electronic supplementary material

The online version of this article (doi:10.1186/s12870-017-1039-x) contains supplementary material, which is available to authorized users.

Keywords: Cotton, Database, RNA-seq, Functional annotation, Variation, Genetic marker

Background

As a natural fiber and oilseed crop, cotton (Gossypium spp.) plays an important role in daily life and as an industrial raw material.
In addition, the polyploidy of currently cultivated cottons and their close relationship with the ancestral diploid donor species make cotton an excellent model organism for studies of polyploidization. These two aspects have resulted in demand for an integrated genomics database that provides gene information resources for researchers engaged in molecular breeding and in evolutionary studies.

Compared with other model organisms such as Arabidopsis thaliana, rice (Oryza sativa), and maize (Zea mays), the genome sequences of cotton species were released much later. The first cotton genome assembly, for G. raimondii, a diploid species that donated the D-subgenome of cultivated polyploid cotton, was released in 2012 by two independent groups [1, 2]. Genomes of three other important cotton species, G. arboreum (diploid), G. hirsutum and G. barbadense (both polyploid), were released only in the last two years [3–7] (see review [8] for details). Likely due to this rather late start, information about cotton genomics is not readily available in popular general plant sequence databases. Among the 58 general plant databases included in the Nucleic Acids Research Molecular Biology Database Collection [9], only seven include information on cotton genes. Moreover, six of these seven include data for only a single diploid species, G. raimondii.

In addition to the general plant databases, there are also three databases specifically designed for cotton. CottonGen [10] collects cotton genome sequences, genetic markers, and breeding germplasm accessions. GraP [11] is a G. raimondii-specific database for gene functional annotation and expression data. ccNet [12] displays co-expression networks from diploid G. arboreum and polyploid G. hirsutum.
While these databases have filled many gaps in cotton genome and -omics data analysis, the data remain scattered across separate resources, which makes it difficult to access this information in the course of practical research work. Researchers need ready access to a variety of data types from multiple Gossypium species, including genetics, genomics, functional annotation, transcriptomics, and sequence variation data. Thus, an integrated functional genomics database similar to the IC4R rice database [13] is needed to systematically gather current cotton genomics data in one place for easy use.

Here, we developed CottonFGD, an integrated functional genomics database for cotton. CottonFGD features three notable attributes: comprehensiveness, integration, and user-friendliness. First, it covers all of the available cotton genomes and a variety of genetics and -omics data, including genetic marker annotations, structural annotations, functional annotations, RNA-seq expression data sets, and population resequencing data. Second, CottonFGD integrates gene searching, cross-database referencing, and gene list analysis in an easy and natural way. Last, but not least, CottonFGD employs modern visualization tools that make its user interface accessible via any type of device. We hope that CottonFGD will emerge as the fundamental database for the cotton functional genomics and breeding research community.

Construction and content

Data sources and processing

Genome assemblies and gene annotations

Seven cotton genome assemblies representing four Gossypium species, together with their respective gene annotations, were downloaded from the relevant database websites (Additional file 1). After checking the annotation consistency between the GFF files and the provided CDS or protein sequences, we found that the HAU assembly (v1.0) and annotation (v1.0) of G. barbadense [6] contain systemic errors; this assembly was therefore not included in CottonFGD (Additional file 1).
In total, six assemblies were used in CottonFGD (Table 1). In order to make the annotation data from different species more consistent, several subtle changes were implemented (Additional file 1). All the patched annotation files are available for download from CottonFGD.

Table 1. Cotton genome assemblies included in CottonFGD

Species^a                    Data provider                                Assembly size (Mb)   Chromosome number^b   Annotated genes
Diploid
G. raimondii (Ulbr.)         Joint Genome Institute (JGI) [1]             761.4                13 (+1020)            37,505
G. raimondii (D5-3)          Beijing Genome Institute (BGI) [2]           775.2                13 (+4434)            40,976
G. arboreum (Shixiya1)       Beijing Genome Institute (BGI) [3]           1694.6               13 (+75,581)          41,331
Tetraploid
G. hirsutum (Tm-1)           Nanjing Agricultural University (NAU) [7]    2447.0               26 (+38,951)          70,478
G. hirsutum (Tm-1)           Beijing Genome Institute (BGI) [4]           2150.9               26 (+9128)            76,943
G. barbadense (Xinhai-21)    Nanjing Agricultural University (NAU) [5]    2263.5               26 (+2013)            77,358

^a Sequenced strains are listed in brackets.
^b Unplaced scaffold numbers are listed in brackets.

Gene functional annotations

Each gene name and description was defined by its best protein homolog, identified by an NCBI BLAST+ [14] (v2.2.31) search against the UniProtKB/Swiss-Prot database [15] (last accessed December 2015) with an e-value cutoff of 1e-05. Predicted protein properties such as molecular weight, isoelectric point, and hydropathy were calculated using EMBOSS [16] (v6.5.7.0) and BioPerl [17] (v1.6.924). Protein motif/domain regions and the associated Gene Ontology [18] (GO) and InterPro [19] items were annotated using InterProScan [20] (v5.16-55.0) with the default parameters. Related pathways were annotated using the KEGG Automatic Annotation Server [21] (KAAS) with the bi-directional best hit method against all of the available plant species.
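As an illustration of the best-homolog naming step, the following minimal Python sketch picks one hit per query from BLAST+ tabular output (`-outfmt 6`, 12 standard columns) under an e-value cutoff of 1e-05. It is not the pipeline used for CottonFGD: ranking hits by bit score is an assumption, the function name `best_hits` is hypothetical, and the sample identifiers are made up.

```python
import csv
import io


def best_hits(tabular_text, max_evalue=1e-5):
    """Pick one hit per query from BLAST+ tabular output (-outfmt 6).

    Assumption: "best" means the highest bit score among hits whose
    e-value does not exceed max_evalue.
    Returns {query_id: (subject_id, evalue, bitscore)}.
    """
    best = {}
    reader = csv.reader(io.StringIO(tabular_text), delimiter="\t")
    for row in reader:
        # Skip blank, truncated, or comment lines.
        if len(row) < 12 or row[0].startswith("#"):
            continue
        query, subject = row[0], row[1]
        evalue, bitscore = float(row[10]), float(row[11])
        if evalue > max_evalue:
            continue
        if query not in best or bitscore > best[query][2]:
            best[query] = (subject, evalue, bitscore)
    return best


# Made-up rows for demonstration only (not real cotton or Swiss-Prot IDs).
SAMPLE = "\n".join("\t".join(r) for r in [
    ["gene_demo_1", "PROT_A", "62.0", "200", "76", "0", "1", "200", "1", "200", "1e-40", "150"],
    ["gene_demo_1", "PROT_B", "88.0", "200", "24", "0", "1", "200", "1", "200", "1e-90", "310"],
    ["gene_demo_2", "PROT_C", "30.0", "50", "35", "0", "1", "50", "1", "50", "0.5", "25"],
])

if __name__ == "__main__":
    for query, (subject, evalue, bitscore) in best_hits(SAMPLE).items():
        print(query, subject, evalue, bitscore)
```

In this sketch, a query whose hits all exceed the cutoff (here `gene_demo_2`) simply receives no name, which is one plausible way to treat genes without a confident homolog.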
Homologs within Gossypium and across other representative plant species were defined by BLAST+ with e-value cutoffs of 1e-10 and 1e-5, respectively. In addition, we collected functional annotation data from the original sequencing projects and the CottonGen [10] database. Detailed data sources are listed in the help document for CottonFGD (https://cottonfgd.org/about/help/).

Genetic marker annotations

Genetic marker sequences of 279 insertion/deletion sites (INDELs), 3451 restriction fragment length polymorphisms (RFLPs), and 65,412 simple sequence repeats (SSRs) were downloaded from CottonGen [10]. Each marker was mapped to every Gossypium genome assembly to define its physical location using BLAT [22] (v36). By default, only BLAT hits with ≥95% query coverage and ≥90% identity are shown in the final user interface.

Expression data

By searching the Sequence Read Archive [23] (SRA) database of NCBI, we collected and downloaded 168 RNA-seq analyses, the majority of which had more than 20× transcriptome sequencing depth and read lengths longer than 75 bp. These RNA-seq analyses constitute 20 experiment groups (Additional file 2) that span all four of the Gossypium species in CottonFGD. They cover a variety of biological processes, including stress responses and developmental series such as seed germination and fiber development, as well as multiple tissue expression atlases. Raw RNA-seq reads were filtered using the NGS QC Toolkit [24] (v2.3.3) and then trimmed with Trimmomatic [25] (v0.3.3) to generate clean reads for further analysis. The resulting clean RNA-seq reads were mapped to their respective reference genomes using TopHat [26] (v2.1.1). The transcript abundance of annotated genes was quantified with Cufflinks [27] (v2.2.1), and differentially expressed genes (DEGs) were then defined within each experiment group.
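To make the marker-mapping display filter described above concrete (BLAT hits with ≥95% query coverage and ≥90% identity), here is a minimal Python sketch that applies both thresholds to a single line of BLAT's PSL output. The source does not state how coverage and identity were computed, so the formulas below are assumptions: coverage is taken as the aligned query span divided by query length, and identity as matching bases divided by all aligned bases. The function names and the demo records are hypothetical.

```python
def psl_hit_passes(psl_line, min_coverage=0.95, min_identity=0.90):
    """Return True if one PSL alignment line meets both thresholds.

    PSL columns used (0-based): 0 matches, 1 misMatches, 2 repMatches,
    10 qSize, 11 qStart, 12 qEnd.
    Assumed definitions (not taken from the paper):
      coverage = (qEnd - qStart) / qSize
      identity = (matches + repMatches) / (matches + misMatches + repMatches)
    """
    fields = psl_line.rstrip("\n").split("\t")
    matches, mismatches, rep_matches = (int(x) for x in fields[0:3])
    q_size, q_start, q_end = (int(x) for x in fields[10:13])
    aligned = matches + mismatches + rep_matches
    if aligned == 0 or q_size == 0:
        return False
    coverage = (q_end - q_start) / q_size
    identity = (matches + rep_matches) / aligned
    return coverage >= min_coverage and identity >= min_identity


def make_psl_line(matches, mismatches, q_size, q_start, q_end):
    """Build a made-up 21-column PSL line for demonstration only."""
    span = q_end - q_start
    fields = [matches, mismatches, 0, 0, 0, 0, 0, 0, "+",
              "marker_demo", q_size, q_start, q_end,
              "chr_demo", 1000000, 5000, 5000 + span,
              1, f"{span},", f"{q_start},", "5000,"]
    return "\t".join(str(f) for f in fields)


if __name__ == "__main__":
    good = make_psl_line(96, 2, 100, 0, 98)       # high coverage and identity
    short = make_psl_line(80, 0, 100, 0, 80)      # coverage too low
    diverged = make_psl_line(85, 15, 100, 0, 100)  # identity too low
    for name, line in [("good", good), ("short", short), ("diverged", diverged)]:
        print(name, psl_hit_passes(line))
```

Keeping the thresholds as keyword arguments mirrors the "by default" wording in the text, since a display filter of this kind would normally be adjustable.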
Detailed parameters for the software used here are listed in the help document for CottonFGD (https://cottonfgd.org/about/help/).

Variation data

Whole-genome shotgun (WGS) resequencing data were also searched and downloaded from the NCBI SRA database. We selected 122 WGS analyses covering 85 G. hirsutum strains and 103 analyses covering 57 G. barbadense strains (both datasets were from study SRP047301). Raw WGS reads were filtered using the same methods as for the RNA-seq reads. The filtered reads were mapped to the relevant reference genomes using BWA [28] (v0.7.12). In order to reduce false positive variant calls, we only used WGS analyses in which more than 50% of reads remained after quality filtering and more than 80% of reads were properly mapped. These criteria yielded 96 analyses covering 79 G. hirsutum strains and 83 analyses covering 52 G. barbadense strains (Additional file 3). SNPs and INDELs were called using Samtools [29] (v1.3) and Bcftools [29] (v1.3). The possible effects of SNPs were annotated using SnpEff [30] (v4.3). Detailed parameters for this analysis pipeline are listed in the help document for CottonFGD (https://cottonfgd.org/about/help/).

Development of database and webserver

The processed sequence, annotation, expression, and variation data were stored in our MySQL (v5.6.26) server. A user-friendly web interface was constructed to enable end users to conveniently access CottonFGD data. The web interface was developed using the Twitter Bootstrap framework, which is based on modern HTML5 and JavaScript. This enables users to access CottonFGD through any modern browser on any kind of device. Multiple JavaScript tools were used to visualize the searched data (see the Utility and discussion section for details). PHP (v5.6.6) was used to submit users’ query searches and to dynamically generate report pages.
Both the database and the website are hosted on our Supermicro® server running CentOS 6.8.

Website structure

The main structure of CottonFGD is shown in Fig. 1. It consists of three main modules: search, profile, and analysis. The search module gives users three methods to search for cotton genes: browsing by genomic regions (the “Browse” page), searching by sequence similarity (the “BLAST” page), and searching by gene properties such as names, associated domains, or expression patterns (the “Search” page). After receiving a user’s query, the search module returns a list of cotton genes. Users can then either click the link attached to each gene to view its profile page, or select multiple gene IDs from the list and launch the analysis module. In the analysis module, users can fetch information for every selected gene or analyze the selected gene set as a whole. Such analyses include enrichment analysis, multiple sequence alignment (MSA) and phylogenetic tree construction, and gene list comparison. All three of the modules are integrated by hyperlinks and action buttons. Therefore, it is also feasible to use CottonFGD on hand-held devices such as mobile phones, where copying and pasting is not as easy as it is on personal computers.

Fig. 1. The website structure of CottonFGD. CottonFGD consists of three main modules: search, profile, and analysis. The search module accepts users’ queries and searches for cotton genes by genomic region, sequence similarity, or gene properties. The profile module displays an information page for a specified gene or transcript, including multiple properties such as gene structure, homology, gene function, and expression and sequence variation data.
The analysis module can accept a list of gene IDs and generate relevant information lists; it can also conduct analyses of entire gene sets.

Utility and discussion

The search module: browse, BLAST, or search cotton genes

CottonFGD provides three methods to search for cotton genes: by genomic regions, by sequence similarity, or by gene properties. The “Browse page” (Fig. 2a and Additional file 4) displays annotated cotton genes in a specified genomic region. On a user’s first visit, the Browse page automatically displays all the annotated genes located in A01: 1,000,000–3,000,000 of the NAU assembly for G. hirsutum. Users can change the target species and the genomic region as needed and then update the displayed gene list. Regions can be defined by either genomic coordinates (physical position) or genetic markers (map position). User-altered parameters are stored in the user’s web browser and are automatically applied at the next visit. In addition to the gene list table, CottonFGD also displays a snapshot of the gene distribution pattern in the currently specified region, rendered by JBrowse [31], a modern genome browser.

Fig. 2. Structure of the search module. (a) The Browse page: search by genomic region (position or marker). (b) The BLAST page: search by sequence similarity through an embedded SequenceServer app [32]. (c) The Search page: search by names, function, or expression. (d) A snapshot of an interactive result table. Users can either click the hyperlink in each gene ID to view the relevant profile page or select multiple gene IDs to import into the analysis module.

The “BLAST page” (Fig. 2b and Additional file 4) conducts sequence similarity searches against cotton gene sets or whole genome sequences.
CottonFGD uses the latest stable version of NCBI BLAST+ [14] (currently v2.5.0) as the backend BLAST executable and the SequenceServer app [32] (v1.0.8) as the frontend interface. This makes BLAST searching fast, stable, and visually appealing.

The “Search page” (Fig. 2c and Additional file 4) conducts gene searches using a variety of methods: by gene ID or name, by associated domains, by gene function items (GO, InterPro, or pathway), or by selected expression experiments. Users can switch among the different search methods using the navigation tabs. When searching by domains or gene function names, CottonFGD implements a two-step search (Fig. 2c and Additional file 4). In the first step, CottonFGD lists all the function items that match the user’s input. In the second step, the user selects the sub-items of interest, and CottonFGD then returns the final list of associated genes. This two-step search greatly reduces the number of redundant results that can arise from fuzzy matching of users’ search terms.

In all three of the search methods, CottonFGD renders search results in an interactive gene list table (Fig. 2d). Users can view each gene or transcript profile by clicking the hyperlink in the gene ID, download the table to their local devices in one of several formats, or select genes of interest and analyze them further by clicking the relevant buttons above the result table.

The profile module: view gene/transcript profiles

Each annotated gene and its main transcript has a profile page in CottonFGD where a variety of related information is displayed. A profile page can be reached through the hyperlinks in the search result tables or directly by entering its URL. For example, the profile page of gene Gh_A01G0139 in G. hirsutum can be accessed via https://cottonfgd.org/profiles/gene/Gh_A01G0139/, and its main transcript Gh_A01G0139.1 can be accessed via https://cottonfgd.org/profiles/transcript/Gh_A01G0139.1/.
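The two example addresses above suggest a simple URL pattern for profile pages. The following Python sketch builds such URLs from an identifier; the pattern is inferred only from those two examples, so whether it holds for every gene and transcript ID is an assumption, and the helper name `profile_url` is hypothetical rather than part of any CottonFGD API.

```python
from urllib.parse import quote

BASE_URL = "https://cottonfgd.org"


def profile_url(identifier, kind="gene"):
    """Build a profile-page URL following the pattern of the two examples.

    kind must be "gene" or "transcript". The identifier is percent-encoded
    defensively; typical IDs such as Gh_A01G0139.1 pass through unchanged.
    """
    if kind not in ("gene", "transcript"):
        raise ValueError("kind must be 'gene' or 'transcript'")
    return f"{BASE_URL}/profiles/{kind}/{quote(identifier, safe='')}/"


if __name__ == "__main__":
    print(profile_url("Gh_A01G0139"))
    print(profile_url("Gh_A01G0139.1", kind="transcript"))
```

A helper like this could be used to turn a downloaded gene list into clickable links, for example when annotating a spreadsheet of candidate genes.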
The profile page for a given gene displays basic information (name, description, location, and genomic DNA sequence), associated transcripts, genomic context, and cross-database references (Fig.