Abstract

Background

   Cotton (Gossypium spp.) is the most important fiber and oil crop in the
   world. With the emergence of huge -omics data sets, it is essential to
   have an integrated functional genomics database that allows worldwide
   users to quickly and easily fetch and visualize genomic information.
   Currently available cotton-related databases have some weaknesses in
   integrating multiple kinds of -omics data from multiple Gossypium
   species. Therefore, it is necessary to establish an integrated
   functional genomics database for cotton.

Description

   We developed CottonFGD (Cotton Functional Genomic Database,
   [37]https://cottonfgd.org), an integrated database that includes
   genomic sequences, gene structural and functional annotations, genetic
   marker data, transcriptome data, and population genome resequencing
   data for all four of the sequenced Gossypium species. It consists of
   three interconnected modules: search, profile, and analysis. These
   modules enable both single-gene review and batch analysis across
   multiple kinds of -omics data and multiple species.
   CottonFGD also includes additional pages for data statistics, bulk data
   download, and a detailed user manual.

Conclusion

   Equipped with specialized functional modules and modernized
   visualization tools, and populated with multiple kinds of -omics data,
   CottonFGD provides a quick and easy-to-use data analysis platform for
   cotton researchers worldwide.

Electronic supplementary material

   The online version of this article (doi:10.1186/s12870-017-1039-x)
   contains supplementary material, which is available to authorized
   users.

   Keywords: Cotton, Database, RNA-seq, Functional annotation, Variation,
   Genetic marker

Background

   As a natural fiber and oilseed crop, cotton (Gossypium spp.) plays an
   important role in daily life and as an industrial raw material. In
   addition, the polyploidy of currently cultivated cottons and their
   close relationship with ancestral diploid donor species make cotton an
   excellent model organism for studies of polyploidization. These two
   aspects have
   resulted in demand for an integrated genomics database that provides
   gene information resources for researchers engaged in molecular
   breeding and in evolutionary studies.

   Compared with other model organisms such as Arabidopsis thaliana, rice
   (Oryza sativa), and maize (Zea mays), the genome sequences of cotton
   species were released much later. The first cotton genome assembly for
   G. raimondii, a diploid species that donated the D-subgenome of
   cultivated polyploid cotton, was released in 2012 by two independent
   groups [[38]1, [39]2]. Genomes of three other important cotton species,
   G. arboreum (diploid), G. hirsutum and G. barbadense (both polyploid),
   were just released in the last two years [[40]3–[41]7] (See review
   [[42]8] for details). Likely due to this rather late start, the
   information about cotton genomics is not readily available in popular
   general plant sequence databases. Among the 58 general plant databases
   included in the Nucleic Acids Research Molecular Biology Database
   Collection [[43]9], only seven include cotton gene information.
   Moreover, among these, six include data for only a single diploid
   species, G. raimondii.

   In addition to the general plant databases, there are also three
   databases specifically designed for cotton. CottonGen [[44]10] collects
   cotton genome sequences, genetic markers, and breeding germplasm
   accessions. GraP [[45]11] is a G. raimondii-specific database for gene
   functional annotation and expression data. ccNet [[46]12] displays
   co-expression networks from diploid G. arboreum and polyploid G.
   hirsutum. While these databases have filled many gaps in cotton genome
   and -omics data analysis, their decentralized distribution makes it a
   complex task to access this information in the course of practical
   research work. Researchers need ready access to a variety of data
   types from multiple Gossypium species, including
   information relating to genetics, genomics, functional annotations,
   transcriptomics and sequence variation data. Thus, an integrated
   functional genomics database similar to the IC4R rice database [[47]13]
   is necessary to systematically gather current cotton genomics data
   together for easy use.

   Here, we developed CottonFGD, an integrated functional genomics
   database for cotton. CottonFGD features three notable attributes:
   comprehensiveness, integrity, and user-friendliness. First, it covers
   all of the available cotton genomes and a variety of genetics and
   -omics data, including genetic marker annotations, structural
   annotations, functional annotations, RNA-seq expression data sets, and
   population resequencing data. Second, CottonFGD integrates gene
   searching, cross-database referencing, and gene list analysis in an
   easy and natural way. Last, but not least, CottonFGD employs modern
   visualization tools that make its user interface accessible via any
   type of device. We hope that CottonFGD will emerge as the fundamental
   database for the cotton functional genomics and breeding research
   community.

Construction and content

Data sources and processing

Genome assemblies and gene annotations

   Seven cotton genome assemblies representing four Gossypium species and
   their respective gene annotations were downloaded from relevant
   database websites (Additional file [48]1). After checking the
   annotation consistency between the GFF files and the provided CDS or
   protein sequences, we found that the HAU assembly (v1.0) and annotation
   (v1.0) of G. barbadense [[49]6] contain systemic errors; they were
   therefore not included in CottonFGD (Additional file [50]1). In total,
   six assemblies were used in CottonFGD (Table [51]1). In order to make
   the annotation data from different species more consistent, several
   subtle changes were implemented (Additional file [52]1). All the
   patched annotation files are available for download from CottonFGD.

Table 1.

   Cotton genome assemblies included in CottonFGD
   Species^a                  Data Provider                                  Assembly Size (Mb)  Chromosome Number^b  Annotated Genes
   Diploid
   G. raimondii (Ulbr.)       Joint Genome Institute (JGI) [[53]1]           761.4               13 (+1020)           37,505
   G. raimondii (D5–3)        Beijing Genome Institute (BGI) [[54]2]         775.2               13 (+4434)           40,976
   G. arboreum (Shixiya1)     Beijing Genome Institute (BGI) [[55]3]         1694.6              13 (+75,581)         41,331
   Tetraploid
   G. hirsutum (Tm-1)         Nanjing Agricultural University (NAU) [[56]7]  2447.0              26 (+38,951)         70,478
   G. hirsutum (Tm-1)         Beijing Genome Institute (BGI) [[57]4]         2150.9              26 (+9128)           76,943
   G. barbadense (Xinhai-21)  Nanjing Agricultural University (NAU) [[58]5]  2263.5              26 (+2013)           77,358

   ^aSequenced strains are listed in brackets.

   ^bUnplaced scaffold numbers are listed in brackets

Gene functional annotations

   Each gene name and description was defined by its best protein homolog
   from NCBI BLAST+ [[60]14] (v2.2.31) searching against the
   UniProtKB/SwissProt database [[61]15] (last accessed December, 2015)
   with an e-value cutoff of 1e-05. Predicted protein properties such as
   molecular weight, isoelectric point, and hydropathy were calculated
   using EMBOSS [[62]16] (v6.5.7.0) and BioPerl [[63]17] (v1.6.924).
   Included protein motif/domain regions and associated Gene Ontology
   [[64]18] (GO) and InterPro [[65]19] items were annotated using
   InterProScan [[66]20] (v5.16–55.0) with the default parameters. Related
   pathways were annotated using the KEGG Automatic Annotation Server
   [[67]21] (KAAS) with the bi-directional best hit method against all
   the available plant species. Homologs within Gossypium and across other
   representative plant species were defined by BLAST+ with e-value
   cutoffs of 1e-10 and 1e-5, respectively. In addition, we collected
   functional annotation data from the original sequencing projects and
   the CottonGen [[68]10] database. Detailed data sources are listed in
   the help document for CottonFGD ([69]https://cottonfgd.org/about/help/).
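
   The best-homolog naming step described above can be sketched as
   follows. This is a minimal illustration, not the pipeline used for
   CottonFGD: it assumes the BLAST+ results were saved in the default
   12-column tabular format (-outfmt 6), and it assumes that "best" means
   the highest bit score among hits passing the e-value cutoff. The
   function name best_hits is hypothetical.

```python
def best_hits(blast_tabular, max_evalue=1e-5):
    """Keep the best-scoring hit per query from BLAST+ tabular output.

    Assumes the default 12 columns of -outfmt 6:
    qseqid sseqid pident length mismatch gapopen qstart qend
    sstart send evalue bitscore
    Returns {query_id: (subject_id, evalue, bitscore)}.
    """
    best = {}
    for line in blast_tabular.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        query, subject = fields[0], fields[1]
        evalue, bitscore = float(fields[10]), float(fields[11])
        if evalue > max_evalue:
            continue  # hit does not pass the e-value cutoff
        current = best.get(query)
        # Assumption: the "best" homolog is the one with the highest bit score.
        if current is None or bitscore > current[2]:
            best[query] = (subject, evalue, bitscore)
    return best
```

   Queries without any hit below the cutoff are simply absent from the
   returned dictionary, which corresponds to genes left without a
   homolog-derived name.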

Genetic Marker Annotations

   Genetic marker sequences of 279 insertion/deletion sites (INDELs), 3451
   restriction fragment length polymorphisms (RFLPs), and 65,412 simple
   sequence repeats (SSRs) were downloaded from CottonGen [[70]10]. Each
   marker was mapped to every Gossypium genome assembly to define its
   physical location using BLAT [[71]22] (v36). By default, only BLAT hits
   with ≥95% query coverage and ≥90% identity are shown in the final user
   interface.
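
   The display filter for marker hits can be illustrated with a short
   sketch. It assumes BLAT output in the standard tab-separated PSL
   format, and it assumes particular definitions that the text does not
   spell out: query coverage is taken as the aligned query span divided by
   the query length, and identity as matching bases divided by all aligned
   bases. The function name passes_marker_filter is hypothetical.

```python
def passes_marker_filter(psl_line, min_coverage=0.95, min_identity=0.90):
    """Return True if one PSL alignment line meets both thresholds.

    PSL columns used (0-based): 0 matches, 1 misMatches, 2 repMatches,
    10 qSize, 11 qStart, 12 qEnd.
    """
    f = psl_line.rstrip("\n").split("\t")
    matches, mismatches, rep_matches = int(f[0]), int(f[1]), int(f[2])
    q_size, q_start, q_end = int(f[10]), int(f[11]), int(f[12])

    aligned = matches + mismatches + rep_matches
    if q_size == 0 or aligned == 0:
        return False

    # Assumed definitions; the source does not state how these are computed.
    coverage = (q_end - q_start) / q_size
    identity = (matches + rep_matches) / aligned
    return coverage >= min_coverage and identity >= min_identity
```

   Because the thresholds are parameters, the same function could serve a
   stricter or looser view of the marker map without re-running BLAT.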

Expression data

   By searching the Sequence Read Archive [[72]23] (SRA) database of NCBI,
   we collected and downloaded 168 RNA-seq analyses, the majority of which
   had more than 20× transcriptome sequencing depth and read lengths
   longer than 75 bp. These RNA-seq analyses constitute 20 experiment
   groups (Additional file [73]2) covering all four of the Gossypium
   species in CottonFGD, and cover a variety of biological processes like
   stress responses and developmental series such as seed germination and
   fiber development, as well as multiple tissue expression atlases. Raw
   RNA-seq reads were filtered using the NGS QC Toolkit [[74]24] (v2.3.3)
   and were then trimmed by Trimmomatic [[75]25] (v0.3.3) to generate
   clean reads for further analysis. The resulting clean RNA-seq reads
   were mapped to their respective reference genomes using TopHat [[76]26]
   (v2.1.1). The transcript abundance of annotated genes was quantified by
   Cufflinks [[77]27] (v2.2.1) and then the differentially-expressed genes
   (DEGs) were defined within each experiment group. Detailed parameters
   for the software used here are listed in the help document for
   CottonFGD ([78]https://cottonfgd.org/about/help/).

Variation data

   Whole-genome shotgun (WGS) resequencing data were also searched and
   downloaded from the NCBI SRA database. In total, 122 WGS analyses
   containing 85 G. hirsutum strains and 103 analyses containing 57 G.
   barbadense strains were selected (both datasets were from study
   SRP047301). Raw
   WGS reads were filtered using the same methods used for our filtering
   of RNA-seq reads. The filtered reads were mapped to the relevant
   reference genomes using BWA [[79]28] (v0.7.12). In order to reduce
   false positive variant calling, we only used WGS analyses with more
   than 50% clean reads remaining after quality filtering and for which
   more than 80% of reads were properly mapped. These criteria yielded 96
   analyses containing 79 G. hirsutum strains and 83 analyses containing
   52 G. barbadense strains (Additional file [80]3). SNPs and INDELs were
   called using Samtools [[81]29] (v1.3) and Bcftools [[82]29] (v1.3). The
   possible effects of SNPs were annotated using SnpEff [[83]30] (v4.3).
   Detailed parameters for this analysis pipeline are listed in the help
   document for CottonFGD ([84]https://cottonfgd.org/about/help/).
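
   The two quality criteria used to select WGS analyses can be expressed
   as a simple filter. The sketch below is illustrative only: the WgsRun
   record and its field names are hypothetical, and it assumes that the
   properly mapped fraction is computed relative to the clean reads, which
   the text does not state explicitly.

```python
from dataclasses import dataclass


@dataclass
class WgsRun:
    """Read counts for one WGS analysis (hypothetical record layout)."""
    run_id: str
    raw_reads: int
    clean_reads: int
    properly_mapped_reads: int


def keep_run(run, min_clean_fraction=0.5, min_mapped_fraction=0.8):
    """Apply the >50% clean reads and >80% properly mapped criteria."""
    if run.raw_reads == 0 or run.clean_reads == 0:
        return False
    clean_fraction = run.clean_reads / run.raw_reads
    # Assumption: mapping rate is measured against clean (filtered) reads.
    mapped_fraction = run.properly_mapped_reads / run.clean_reads
    return (clean_fraction > min_clean_fraction
            and mapped_fraction > min_mapped_fraction)


def filter_runs(runs):
    """Return the IDs of runs that pass both criteria, in input order."""
    return [run.run_id for run in runs if keep_run(run)]
```

   Both comparisons are strict, matching the "more than" wording of the
   criteria.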

Development of database and webserver

   The processed sequence, annotation, expression, and variation data were
   stored in our MySQL (v5.6.26) server. A user-friendly web interface was
   constructed to enable end users to conveniently access CottonFGD data.
   The web interface was developed using the Twitter Bootstrap framework
   based on modern HTML5 and JavaScript. This enables users to access
   CottonFGD through any modern browser on any kind of device. Multiple
   JavaScript tools were used to visualize the searched data (See the
   Utility and discussion section for details). PHP (v5.6.6) was used to
   submit users’ query searches and to dynamically generate report pages.
   Both the database and the website are hosted on our Supermicro® server
   running CentOS 6.8.

Website structure

   The main structure of CottonFGD is shown in Fig. [85]1. It consists of
   three main modules: search, profile, and analysis. The search module
   gives users three methods to search for cotton genes: browsing by
   genomic regions (the “Browse” page), searching by sequence similarity
   (the “BLAST” page), and searching by gene properties such as names,
   associated domains, or expression patterns (the “Search” page). After
   receiving users’ queries, the search module generates a list of cotton
   genes as results. Users can then either click the attached link in each
   gene to view the relevant profile page one-by-one, or they can choose
   and select multiple gene IDs from the lists and launch the analysis
   module. In the analysis module, users can fetch information for every
   selected gene or conduct analyses of selected gene sets. Such analyses
   include enrichment analysis, multiple sequence alignment (MSA) and
   phylogenetic tree construction, and gene list comparison. All three of
   the modules are integrated by hyperlinks and action buttons. Therefore,
   it is also feasible to use CottonFGD on hand-held devices such as
   mobile phones, where copying and pasting is less convenient than on
   personal computers.

Fig. 1.

   The website structure of CottonFGD. CottonFGD consists of three main
   modules: search, profile, and analysis. The search module accepts
   users’ queries and searches for cotton genes by genomic region,
   sequence similarity, or gene properties. The profile module displays an
   information page for a specified gene or transcript, including multiple
   properties such as gene structure, homology, gene function, and
   expression and sequence variation data. The analysis module can accept
   a list of gene IDs and generate relevant information lists; it can also
   conduct analyses of entire gene sets

Utility and discussion

The search module: browse, BLAST, or search cotton genes

   CottonFGD provides three methods to search for cotton genes: by genomic
   regions, by sequence similarity, or by gene properties.

   The “Browse” page (Fig. [87]2a and Additional file [88]4) displays
   annotated cotton genes in a specified genomic region. On a user’s first
   visit, the Browse page automatically displays all the annotated genes
   located in A01: 1,000,000–3,000,000 of the NAU assembly for G.
   hirsutum. Users can change the target species and genomic region as
   needed and then update the displayed gene list. Regions
   can be defined by either genomic coordinates (physical position) or
   genetic markers (map position). User-altered parameters are stored in
   the users’ web browsers, and are automatically applied at the time of
   the next visit. In addition to the gene list table, CottonFGD also
   displays a snapshot of the gene distribution pattern in the current
   specified region rendered by JBrowse [[89]31], a modern genome browser.

Fig. 2.

   Structure of the search module. (a) The Browse page: search by genomic
   region (position or marker). (b) The BLAST page: search by sequence
   similarity through an embedded SequenceServer App [[91]32]. (c) The
   Search page: search by names, function, or expression. (d) A snapshot
   of an interactive result table. Users can either click the hyperlink in
   each gene ID to view the relevant profile page or can choose and select
   multiple gene IDs to import into the analysis module

   The “BLAST” page (Fig. [92]2b and Additional file [93]4) conducts
   sequence similarity searches against cotton gene sets or whole genome
   sequences. CottonFGD uses the latest stable version of NCBI BLAST+
   [[94]14] (currently v2.5.0) as the backend BLAST executable program and
   the SequenceServer app [[95]32] (v1.0.8) as the frontend interface.
   This makes BLAST searching fast, stable, and appealing.

   The “Search” page (Fig. [96]2c and Additional file [97]4) conducts gene
   searches using a variety of methods, including: by gene ID or name, by
   associated domains, by gene function items (GO, InterPro, or pathway),
   or by selected expression experiments. Users can switch among different
   search methods using the navigation tabs. When searching by domains or
   gene function names, CottonFGD implements a two-step search (Fig.
   [98]2c and Additional file [99]4): in the first step, CottonFGD lists
   all the function items that matched a user’s input. In the second step,
   users select the sub-items they want, and CottonFGD then returns a
   final associated gene list. This type of two-step searching method
   greatly reduces the number of redundant results that can arise from
   fuzzy matching of users’ search terms.
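
   The two-step search described above can be sketched in a few lines.
   The example is a simplified model rather than the CottonFGD
   implementation: it assumes the fuzzy match is a case-insensitive
   substring match on item names, and the function names and the two
   lookup dictionaries are hypothetical.

```python
def find_function_items(items, query):
    """Step 1: list the IDs of function items whose name matches the query.

    `items` maps an item ID (e.g. a GO or InterPro accession) to its name.
    Assumption: matching is a case-insensitive substring test.
    """
    q = query.strip().lower()
    if not q:
        return []
    return sorted(item_id for item_id, name in items.items()
                  if q in name.lower())


def genes_for_items(item_to_genes, selected_item_ids):
    """Step 2: return the genes associated with the items the user selected."""
    genes = set()
    for item_id in selected_item_ids:
        genes.update(item_to_genes.get(item_id, ()))
    return sorted(genes)
```

   Separating the steps means that genes are retrieved only for the items
   a user explicitly confirms, so loosely matching items from step 1 never
   contribute to the final gene list.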

   In all three of the search methods, CottonFGD renders search results in
   an interactive gene list table (Fig. [100]2d). Users can view each gene
   or transcript profile by clicking the relevant hyperlink in the gene
   ID, can download the table to their local devices in one of several
   formats, or can select the genes they want and do further analysis by
   clicking on relevant buttons located above the result table.

The profile module: view gene/transcript profiles

   Each annotated gene, and its main transcript, has a profile page in
   CottonFGD where a variety of related information is displayed. A
   profile page can be reached through the hyperlinks in the search result
   tables or directly by URL. For example, the profile page of gene
   Gh_A01G0139 in G.
   hirsutum can be accessed via
   [101]https://cottonfgd.org/profiles/gene/Gh_A01G0139/, and its main
   transcript Gh_A01G0139.1 can be accessed via
   [102]https://cottonfgd.org/profiles/transcript/Gh_A01G0139.1/.
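
   Because profile URLs follow a regular pattern, links to them can be
   generated programmatically. The helper below is a small sketch that
   reproduces the two example URLs given above; the function name
   profile_url is hypothetical, and the assumption that every gene and
   transcript ID follows the same pattern is inferred from those two
   examples only.

```python
from urllib.parse import quote

BASE_URL = "https://cottonfgd.org/profiles"


def profile_url(identifier, kind="gene"):
    """Build a profile page URL for a gene or transcript ID."""
    if kind not in ("gene", "transcript"):
        raise ValueError("kind must be 'gene' or 'transcript'")
    # Percent-encode the ID so unexpected characters cannot break the path.
    return f"{BASE_URL}/{kind}/{quote(identifier, safe='')}/"


print(profile_url("Gh_A01G0139"))
print(profile_url("Gh_A01G0139.1", kind="transcript"))
```

   The two printed URLs are the gene and transcript addresses quoted in
   the paragraph above.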

   The profile page for a given gene displays basic information (name,
   description, location, and genomic DNA sequence), associated
   transcripts, genomic context, and cross-database references (Fig.