Abstract

   The pigment gland is a morphological characteristic of Gossypium and
   its related genera. Gossypium bickii (G[1]) is characterized by delayed
   pigment gland morphogenesis in the cotyledons. In this study, a
   reference-grade genome of G[1] was generated, and comparative genomics
   analysis showed that G[1] was closest to Gossypium australe (G[2]),
   followed by A- and D-genome species. Two large fragment translocations
   in chromosomes 5 and 13 were detected between the G genome and other
   Gossypium genomes and were unique to the G[1] and G[2] genomes.
   Compared with the G[2] genome, two large fragment inversions in
   chromosomes 12 and 13 were detected in G[1]. According to the
   phylogeny, divergence time, and similarity analysis of nuclear and
   chloroplast genomes, G[1] was formed by hybridization between Gossypium
   sturtianum (C[1]) and a common ancestor of G[2] and Gossypium nelsonii
   (G[3]). The coordinated expression patterns of pigment gland formation
   (GoPGF) and gossypol biosynthesis genes in G[1] were verified to be
   consistent with its phenotype, and nine genes that were related to the
   process of pigment gland formation were identified. A novel gene,
   GbiCYP76B6, regulated by GoPGF, was found to affect gossypol
   biosynthesis. These findings offer insights into the origin and
   evolution of G[1] and its mechanism of pigment gland formation and
   gossypol biosynthesis.

   Key words: Gossypium bickii, genome, evolution, gossypol, pigment
   glands
     __________________________________________________________________

   This study presents a reference-grade genome assembly and annotation
   for Gossypium bickii and suggests that G. bickii evolved from an
   interspecific hybridization between Gossypium sturtianum and a common
   ancestor of Gossypium australe and Gossypium nelsonii and that
   GbiCYP76B6 is regulated by GoPGF to affect gossypol biosynthesis in
   G. bickii.

Introduction

   Cotton is one of the most critical cash crops worldwide, providing
   natural fibers as well as high-quality seed oil and proteins.
   Cottonseeds contain approximately 21% oil and 23% protein ([51]Tian
   et al., 2018), which can meet the protein demands of 500 million people
   in the world annually ([52]Lusas and Jividen, 1987). However, gossypol,
   the main component of pigment glands, is toxic to humans and
   monogastric animals, limiting the consumption of cottonseeds. On the
   other hand, gossypol and its derivatives are important phytoalexins
   that play a significant role in the defense system of cotton plants.

   Gossypol, in its pure form, consists of dimeric sesquiterpenes, yellow
   in color, and is biosynthesized in cotton roots and then transported to
   stems, followed by leaves and seeds ([53]Smith, 1961). According to the
   literature, gossypol biosynthesis belongs to the mevalonic
   acid pathway, and genes encoding key enzymes that participate in this
   pathway include CDNC, CYP706B1, DH1, CYP82D113, CYP36A196, 2-ODD-1, and
   SPG ([54]Tian et al., 2018; [55]Huang et al., 2020b). Although gossypol
   in cottonseeds can be inhibited by genetic engineering, this inhibition
   has not yet reached the standard for edibility or feeding (≤0.04%)
   ([56]Hu et al., 2019a).

   Pigment glands are symbolically presented as brown or dark brown dotted
   organs that store gossypol and its derivatives in glanded cotton. A
   positive correlation has been found between the number of pigment
   glands and the gossypol content in cotton ([57]Ma et al., 2016). By
   contrast, glandless cotton has a low gossypol content because there are
   no pigment glands for gossypol storage in any part of the plant,
   including seeds. To effectively utilize cottonseeds, low-gossypol
   cotton was developed several decades ago. By transferring the glandless
   genes gl[2] and gl[3] from Hopi cotton ([58]Mcmichael, 1959, [59]1960)
   and Gle 2 from Bahtim 110 ([60]Afifi et al., 1966) into glanded upland
   cotton by conventional methods, many glandless cotton cultivars have
   been developed and used in several countries for cotton production.
   However, with the reduction in seed gossypol content, the natural
   defense systems of the cotton plants were also lost. Therefore, the
   study of cotton germplasms with glandless seeds or low seed gossypol
   content will be very important for fully utilizing cottonseeds.

   Several wild diploid cotton species from Australia, such as Gossypium
   bickii (G[1]), exhibit delayed pigment-gland morphogenesis in
   cotyledons; their dormant cottonseeds are glandless but glanded during
   the seed germination stages ([61]Figure 1A) ([62]Fryxell, 1965).
   Therefore, this delayed pigment gland morphogenesis trait has great
   significance for both cottonseed utilization and plant stress
   tolerance. After hybridization of G[1] with Gossypium arboreum cv.
   Shixiya1 (A[2]) and chromosome duplication, the allotetraploid retained
   its target trait ([63]Zhu, 1999). However, because of the genetic
   distance between artificial allotetraploid and upland cotton (Gossypium
   hirsutum, (AD)[1]), it is difficult to obtain stable upland cotton
   germplasms with the target traits through traditional breeding methods.

Figure 1.

   [64]Figure 1
   [65]Open in a new tab

   Morphology and chromosomal and chloroplast features of G. bickii.

   (A) The character of pigment glands in G[1]. (i) The dormant seed, (ii)
   the partially enlarged dormant seed, (iii) the seed after germination
   for 36 h, (iv) the partially enlarged cotyledon after germination for
   36 h, (v) the true leaf of the seedling, (vi) the partially enlarged
   true leaf of the seedling, (vii) the hypocotyls of the seedling, and
   (viii) the partially enlarged hypocotyls of the seedling. Scale bars,
   0.5 mm.

   (B) Chromosomal features of G[1]. (i) 13 pseudochromosomes of the G[1]
   genome, (ii) gene density, (iii) non-coding RNA (ncRNA) density, (iv)
   transposable element (TE) density, (v) Gypsy density, (vi) Copia
   density, and (vii) GC content. The density was calculated in 500-kb
   intervals across the chromosomes.

   (C) Chloroplast genome features of G[1]. The inner circle line
   indicates the range of the inverted repeats (IRa and IRb), which
   separate the genome into small single-copy and large single-copy
   regions. The inside of the inner circle represents the GC content. The
   outer circle line represents the gene location information for each
   class, classified by different colors. The components on the outside of
   the outer circle are transcribed in the clockwise direction, and on the
   inside, the components are transcribed in the counterclockwise
   direction.

   With the development of sequencing technologies such as PacBio,
   high-throughput chromosome conformation capture (Hi-C), and BioNano,
   the genomes of five Gossypium allotetraploid species, including G.
   hirsutum, and several Gossypium diploid species, including A-, D-, G-,
   and F-genome species, have been well sequenced, assembled, and
   annotated ([66]Udall et al., 2019a; [67]Grover et al., 2019; [68]Huang
   et al., 2020a; [69]Cai et al., 2020), revealing the genome variations
   among different cotton species and their evolutionary relationships.
   Here, a high-quality genome sequence of G[1] was generated (total
   length 1.77 Gb) and successfully oriented to 13 pseudochromosomes based
   on PacBio long reads, Illumina paired-end reads, and Hi-C technologies.
   To reveal the evolution and origin of G[1], comparative genomics
   analysis was performed based on nuclear and chloroplast genomes. In
   addition, to reveal the unique traits of gossypol and the pigment
   gland, several functional genes were identified and validated using
   transcriptome data and virus-induced gene silencing (VIGS), providing
   insights into the mechanism of gossypol biosynthesis and pigment gland
   formation.

Results

Sequencing and assembly of a high-quality G. bickii genome

   In this study, the G[1] genome was sequenced and assembled for the
   first time. Through K-mer distribution analysis, the genome size was
   estimated to be 1706.79 Mb with 0.18% heterozygosity and 74.19%
   repetitive sequences ([70]Supplemental Figures 1 and [71]2,
   [72]Supplemental Table 1), indicating that the G[1] genome was highly
   repetitive. To obtain a high-quality reference genome, we integrated
   different sequencing technologies, including PacBio single-molecule
   real-time sequencing, Illumina paired-end sequencing, and Hi-C
   technologies. For the initial assembly, a total of 248.64 Gb of PacBio
   long reads were generated, representing approximately 142.08-fold
   genome coverage ([73]Supplemental Table 2), and used to assemble a
   genome of 1765.96 Mb with 1574 contigs and a contig N50 of 4.62 Mb
   ([74]Table 1, [75]Supplemental Tables 2 and [76]3). The assembled
   contigs were then polished by aligning PacBio long reads to the initial
   assembly. To increase the consensus accuracy of the assembly, the
   initial assembly was also corrected using 299.22 Gb of high-quality
   Illumina reads, representing about 176.01-fold genome coverage
   ([77]Supplemental Table 2).

Figure 2.

   [78]Figure 2
   [79]Open in a new tab

   Genome evolutionary and structural variation of Gossypium.

   (A) Phylogenetic tree and divergence times among Gossypium and other
   species in Malvaceae.

   (B) The divergence between G[1] and other species by K[s] value for
   orthologous genes between G[1] and other cotton species.

   (C) Characterization of genomic structural variation among the cotton
   species. Genomic synteny blocks are connected by gray lines. The large
   translocations between G groups (G[1] and G[2]) and other cotton
   species are highlighted by red links, the large inversion in chromosome
   12 between G[1] and G[2] is highlighted by blue links, and the
   inversions in chromosome 13 between G[1] and other Gossypium species
   are highlighted by cyan links.

   (D) Hi-C heatmap verification of chromosome inversion and translocation
   fragments. The Hi-C-assisted assembly data for the G[1] genome were
   used to scan chromosomes 5 and 13 of G[1] and A[1] and chromosomes 12
   and 13 of G[1] and G[2], respectively, to verify inversion and
   translocation fragments.

   (E) The top 10 enriched KEGG pathways of the translocation and
   inversion genes in the G[1] genome. All translocation genes in
   chromosomes 5 and 13 and inversion genes in chromosomes 12 and 13 were
   selected for KEGG pathway enrichment analysis.

Table 1.

   Genomic features of Gossypium bickii.
             Genomic feature            G. bickii
   Total length of contigs (Mb)         1765.96
   Total length of scaffolds (Mb)       1766.07
   Proportion of anchored sequences (%) 96.5
   Number of contigs                    1574
   Contig N50 (Mb)                      4.62
   Number of scaffolds                  445
   Scaffold N50 (Mb)                    133.9
   Number of genes                      43 790
   TE repeats (%)                       70.22
   GC content (%)                       36.01
   [80]Open in a new tab

   To construct the chromosome-level genome, a total of 277.57 Gb of
   high-quality Hi-C fragments, representing about 163.27-fold genome
   coverage with 915.85 million read pairs, were used to categorize and
   order the assembled contigs ([81]Supplemental Tables 2 and [82]4). The
   Hi-C heatmap shows 13 distinct chromosomal groups ([83]Supplemental
   Figure 3). Finally, approximately 1704.25 Mb of the assembly (96.51% of
   the total assembled genome size) was anchored into 13 pseudochromosomes
   with 445 scaffolds and a scaffold N50 of 133.90 Mb, which captured
   1766.07 Mb of the G[1] genome ([84]Table 1, [85]Supplemental Tables 3
   and [86]5). As the results show, the size of the G[1] genome was
   similar to that of the Gossypium australe (G[2]) genome (1752.74 Mb)
   ([87]Cai et al., 2020), slightly higher than that of the A genome
   ([88]Huang et al., 2020a), and more than twice that of the D genome
   ([89]Paterson et al., 2012; [90]Wang et al., 2019) ([91]Supplemental
   Figure 4).

   The integrity and accuracy of the assembly were evaluated by alignment
   of the Illumina short-insert reads, resulting in a 98.88% mapping rate
   and 96.07% genome coverage ([92]Supplemental Table 6). The GC content
   of the G[1] genome was 36.01%, which was comparable to that of other
   cotton genomes ([93]Table 1, [94]Supplemental Table 7). Core Eukaryotic
   Genes Mapping Approach (CEGMA) ([95]Parra et al., 2007) evaluation
   indicated that 233 of 248 (93.95%) core eukaryotic genes were captured
   ([96]Supplemental Table 8), and the Benchmarking Universal Single-Copy
   Orthologs (BUSCOs) ([97]Kriventseva et al., 2019) method revealed that
   1594 of 1614 (98.7%) Embryophyta single-copy orthologous genes were
   captured in the genome ([98]Supplemental Table 9).

Annotation of the G. bickii genome

   A total of 43,790 genes were annotated in the G[1] genome by combining
   de novo prediction, homolog-based prediction, and transcription data,
   with an average gene length of 2979.93 bp and an average of 4.63 exons
   per gene ([99]Table 1, [100]Supplemental Table 10, [101]Supplemental
   Figure 5). Among the predicted protein-coding genes, approximately
   41 776 (95.4%) genes were annotated by at least one of the NR,
   SwissProt, KEGG, and InterPro protein databases ([102]Supplemental
   Table 11), and 41 893 genes were anchored to 13 pseudochromosomes
   ([103]Supplemental Table 12). In addition, 1450 tRNAs, 324 microRNAs,
   1928 rRNAs, and 10 521 small nuclear RNAs were identified in the G[1]
   genome ([104]Supplemental Table 13). The repetitive sequences in the
   G[1] genome were annotated using de novo and homology-based
   predictions. As much as 70.22% of the G[1] genome was made up of
   transposable elements (TEs) ([105]Table 1, [106]Supplemental Table 14),
   similar to that of the G[2] genome (72.58%) ([107]Cai et al., 2020),
   lower than that of the A genome ([108]Wang et al., 2019; [109]Huang
   et al., 2020a), and much higher than that of the D genome ([110]Wang
   et al., 2012, [111]2019) ([112]Supplemental Table 15). Long terminal
   repeat (LTR) retrotransposons accounted for the highest proportion of
   TEs (67.08%) in the G[1] genome; the Gypsy (59.29%) and Copia (4.37%)
   superfamilies were dominant ([113]Supplemental Table 15) and were
   mainly concentrated in the middle regions of chromosomes, as opposed to
   the gene-rich regions ([114]Figure 1B).

Resequencing and assembly of four chloroplast genomes

   The genomes of G[1] and its related wild diploid species G[2],
   Gossypium nelsonii (G[3]), and Gossypium sturtianum (C[1]) were
   sequenced using the Illumina HiSeq platform. Totals of 24.04, 34.12,
   35.43, and 35.23 Gb of Illumina short reads were generated with average
   depths of 13.67, 19.57, 22.19, and 25.77, respectively
   ([115]Supplemental Table 16). The resequencing data of these four
   Gossypium species were then mapped onto the G[1] genome to further
   construct a phylogenetic tree. Illumina reads from these four species,
   G[1], G[2], G[3], and C[1], were also used to assemble their complete
   chloroplast genomes, which were 159.46, 159.58, 159.58, and 159.54 kb
   in size, respectively. These four chloroplast genomes consisted of four
   regions: the large single copy, small single copy, inverted repeat a,
   and inverted repeat b regions ([116]Figure 1C). The large single copy
   region was the main region responsible for differences in genome size,
   accounting for more than 55% of the chloroplast genome ([117]Table 1,
   [118]Supplemental Figure 6). All genes in the four species were
   classified into four categories: 50 photosynthesis-related genes, 30
   transcription-related genes, 44 RNA genes, and 8 other genes
   ([119]Supplemental Table 17).

Evolution of the G. bickii genome

   Ten species of Malvaceae, including G[1], were selected for gene family
   analysis ([120]Supplemental Table 18), and 30 329 gene family members
   were identified ([121]Supplemental Figure 7, [122]Supplemental
   Table 19). A total of 3532 common single-copy orthologs present in all
   species were used to construct a phylogenetic tree ([123]Supplemental
   Table 20). The phylogenetic tree showed that all Gossypium species were
   clustered in one clade, which diverged from the sister genus
   Gossypioides kirkii and Theobroma cacao ([124]Figure 2A). The
   divergence time of T. cacao and the other species was estimated to be
   68.5 mya (65.2–75.0 mya), and Gs. kirkii diverged from the A, G, and D
   genomes around 9.0 mya (7.3–14.2 mya) ([125]Figure 2A). Within
   Gossypium, the divergence time for G[1] and G[2] was estimated to be
   4.4 mya (3.4–5.0 mya), following the divergence time of the common
   ancestor of the G genomes and the common ancestor of the A genomes
   around 5.9 mya (5.2–7.0 mya) ([126]Figure 2A). The common ancestor of
   the A and G genomes diverged from the D genomes around 6.6 mya (5.8–7.7
   mya). In comparison with the common ancestor of the G genome, there
   were 1689 expanded and 605 contracted gene families in G[1]
   ([127]Figure 2A). These expanded gene families in the G[1] genome were
   significantly enriched in terms of photosynthesis by GO and KEGG
   enrichment. Other terms related to respiration, such as electron
   transport chain and oxidative phosphorylation, were also significantly
   enriched ([128]Supplemental Figure 8) ([129]Laisk et al., 2010).

   TEs, particularly LTRs, have been reported to be related to genome
   expansion, which also contributes to the evolution of Gossypium
   ([130]Huang et al., 2020a; [131]Wang et al., 2021). The TE and LTR
   contents of each Gossypium species, including the G genome, were
   compared with their respective genome sizes, and the correlation
   coefficient between the number of TEs and the genome size was 0.9693
   ([132]Supplemental Figure 9). LTRs, the main class of TEs, showed a
   higher correlation coefficient with genome size (R^2 = 0.9715)
   ([133]Supplemental Figure 9). It is worth noting that the D genome was
   less than half the size of the other genomes and had the lowest
   proportion of TEs ([134]Supplemental Table 15). LTR expansion occurred
   in all the investigated species, including A, D, and G. This expansion
   time was estimated to be 0–5 mya ([135]Supplemental Figure 10), after
   the divergence time of the D and A/G genomes (5.8–7.7 mya) and that of
   the A and G genomes (5.2–7.0 mya). The divergence times estimated by Ks
   values of homologous gene pairs across syntenic blocks between G[1] and
   other species were similar to those estimated from the phylogenetic
   tree ([136]Figure 2A and 2B). Whole-genome duplication (WGD) events are
   known to have occurred during the evolution of cotton species
   ([137]Paterson et al., 2012; [138]Li et al., 2014; [139]Grover et al.,
   2019), as was also found here for G[1] (13–20 mya) ([140]Supplemental
   Figure 11).

Specific structural variation in the G. bickii genome

   Chromosome 13 of G[1] had a large fragment translocation (47.21 Mb)
   with chromosome 5 of other cotton species; however, no such
   translocation was found between G[1] and G[2] ([141]Figure 2C). A total
   of 2728 functional genes were found in this translocation region
   ([142]Supplemental Table 21), accounting for 50.83% of all genes on
   chromosome 13. Another translocation was detected between chromosome 5
   in G[1] (5.12 Mb) and chromosome 13 in other genomes and was not found
   between G[1] and G[2] ([143]Figure 2C); 409 functional genes were found
   in this region, accounting for 11.68% of all genes on chromosome 5
   ([144]Supplemental Table 22). These two translocations may be unique
   events in G[1] and G[2].

   A large inversion was detected on chromosome 13 (65.48 Mb)
   of G[1] ([145]Figure 2C); it was found only in G[1] compared with other
   analyzed Gossypium species. A total of 634 genes were found in this
   region ([146]Supplemental Table 23), such as fiber elongation-related
   transcription factor MYB25 ([147]Machado et al., 2009), and
   it accounted for 23.24% of all genes on chromosome 13. Another
   inversion, which included 1588 genes, was detected on chromosome 12
   (95.34 Mb) of G[1] ([148]Figure 2C and [149]Supplemental Table 24) and
   was detected only between G[1] and G[2]. Gene collinearity analysis
   further confirmed the presence of these translocation and inversion
   genes ([150]Supplemental Figure 12), and the variation fragments were
   confirmed by a Hi-C heatmap and PCR amplification ([151]Figure 2D and
   [152]Supplemental Figure 13). These translocation and inversion genes
   were used for KEGG enrichment analysis. The most common enriched term
   was photosynthesis, which appeared in the chromosome 5 translocation
   and the chromosome 12 and 13 inversions in G[1] ([153]Figure 2E). In
   addition, glycerophospholipid metabolism appeared in the chromosome 5
   translocation and the chromosome 12 inversion, and plant hormone signal
   transduction appeared in the chromosome 12 translocation and the
   chromosome 13 inversion ([154]Figure 2E).

Speciation of the G. bickii genome

   To confirm the biphyletic ancestry and incongruence between the nuclear
   and chloroplast phylogenies of G[1] ([155]Wendel et al., 1991), we used
   G[1] and three other closely related Gossypium species, G[2], G[3], and
   C[1] ([156]Figure 3A). Phylogenetic trees of nuclear and chloroplast
   genomes of these Gossypium species were constructed based on SNPs in
   the nuclear genomes and single-copy genes in the chloroplast genomes.
   The results showed that the G[1] nuclear genome was closer to those of
   G[2]/G[3] than that of C[1], which diverged from G[2]/G[3] around 3.9
   mya (3.3–4.4 mya) ([157]Figure 3B). However, the chloroplast genome of
   G[1] was similar to that of C[1], and they diverged from each other 1.2
   mya (1.2–1.4 mya), later than the divergence time of G[1] and
   G[2]/G[3], 5.1 mya (4.7–5.2 mya) ([158]Figure 3B), similar to a
   previous study ([159]Chen et al., 2017). The nuclear genome of G[1] was
   similar to those of G[2] and G[3], but the chloroplast genome of G[1]
   was more similar to that of C[1] ([160]Figure 3C). Therefore, we
   hypothesized that there was a common ancestor of G[2] and G[3],
   referred to here as G[0], and that G[1] might have formed initially by
   interspecific crossing of the ancestor of C[1] as the female parent and
   G[0] as the male parent 3.9 mya (3.3–4.4 mya) ([161]Figure 3D). The
   chloroplast genome of C[1] was maternally inherited in the hybrid, and
   the hybrid at later generations did not evolve into G[1] until
   approximately 1.2 mya ([162]Figure 3D), perhaps as a result of
   backcrossing with G[0] ([163]Wendel et al., 1991).

Figure 3.

   [164]Figure 3
   [165]Open in a new tab

   Evolutionary analysis of nuclear and chloroplast genomes of C[1], G[1],
   G[2], and G[3].

   (A) Fiber, leaf, sepal, bract, and petal phenotypes of four species in
   Gossypium.

   (B) Nuclear and chloroplast genome phylogenetic trees and divergence
   time analysis of four species in Gossypium. The phylogenetic tree for
   the nuclear genome was constructed based on SNPs, and the chloroplast
   genome was based on 72 single-copy genes from chloroplast genomes.

   (C) Identity scores of nuclear and chloroplast genomes for four species
   in Gossypium. The calculation of nuclear genome similarity was based on
   SNPs, and the chloroplast genome was based on whole-genome pairwise
   comparison.

   (D) Model for the evolution of four species in Gossypium. Green and
   yellow represent nuclear and chloroplast genomes, respectively. The
   blue numbers represent the divergence times.

   In addition, the fiber length of G[1] was similar to those of G[2] and
   G[3], which were longer than that of C[1]. The fiber morphology of G[1]
   was similar to that of C[1], which differed from the straight and spiky
   fibers of G[2] and G[3] ([166]Figure 3A). The shapes and sizes of the
   leaves, sepals, and bracts in G[1] were similar to those in G[2] and
   G[3] but different from those in C[1]. By contrast, the petal size and
   color were similar to those of C[1] but differed from those of G[2] and
   G[3] ([167]Figure 3A). These findings show that some phenotypes of G[1]
   were intermediate between those of C[1] and G[2]/G[3].

Genes related to gossypol and pigment glands in G. bickii

   Most of the cotton species formed pigment glands during ovule
   development, except for the wild diploid Gossypium species from
   Australia, which had delayed pigment gland morphogenesis in the
   cotyledon. Cotyledons at the ovule development (10, 20, and 30 days
   post-anthesis (dpa)) and seed germination (0, 12, 24, and 36 h) stages
   of Gossypium herbaceum cv. Hongxing (A[1]), A[2], and G[1] were used
   for observation of gland formation processes ([168]Figure 4A). The
   ovules of A[1] and A[2] had visible pigment glands, although there were
   no pigment glands before 20 dpa ([169]Figure 4A and 4B). However, there
   were no pigment glands during ovule development or at the initial stage
   of germination (0, 12, and 24 h) in G[1], and visible pigment glands
   appeared at 36 h after seed germination. Similarly, gossypol was not
   detected before 10 dpa in A[1] and A[2], although it accumulated at 20
   (5.91 mg/g) and 30 dpa (7.85 mg/g); gossypol was not detected in G[1
   seeds] until the 12-h seed germination time point (0.054 mg/g)
   ([170]Figure 4C and [171]Supplemental Table 25).

Figure 4.

   [172]Figure 4
   [173]Open in a new tab

   Morphology, gossypol contents, and differential expression analysis of
   A[1], A[2], and G[1].

   (A) Morphology of ovules at 10, 20, and 30 dpa; dormant mature seed (0
   h); and seeds at 12, 24, 36, and 48 h after germination. Scale bars,
   0.5 mm.

   (B) Glanded and glandless cotyledons in A[1], A[2], and G[1]. The
   glandless cotyledons at 10 dpa and glanded cotyledons at 20 dpa in A[1]
   and A[2] and the glandless cotyledon at 24 h after germination and
   glanded cotyledon at 36 h after germination in G[1] are shown.

   (C) Gossypol contents in ovules at 10, 20, and 30 dpa and in seeds at
   0, 12, 24, 36, and 48 h after germination in A[1], A[2], and G[1] (mean
   ± SD, n = 3).

   (D) Expression of differentially expressed genes (DEGs) in the gossypol
   biosynthesis pathway, represented by a heatmap and estimated by
   computing fragments per kilobase of transcript per million mapped
   fragments (FPKM) values.

   (E) Common and specific genes in five DEG lists: four upregulated
   groups (A[1]_30 dpa versus A[1]_10 dpa, A[2]_30 dpa versus A[2]_10 dpa,
   G[1]_12 h versus G[1]_0 h, and G[1]_24 h versus G[1]_0 h) and one not
   differentially expressed group (G[1]_30 dpa versus G[1]_10 dpa).

   (F) Common and specific genes in the five common DEGs from (D) and DEGs
   of CCRI12 versus CCRI12gl.

   (G) Correlations between GoPGF and 18 other DEGs. These genes were
   selected by computing the Pearson correlation coefficient (r ≥ 0.5, P ≤
   0.05) from 29 common genes in (E).

   (H) Expression of nine functional genes selected in (G) represented by
   a heatmap and estimated by computing FPKM values.

   A total of 48 gossypol genes and 1 pigment gland formation gene were
   identified in G[1] by a BLAST search using reported gossypol and
   pigment gland-related genes ([174]Supplemental Table 26) ([175]Ma
   et al., 2016; [176]Tian et al., 2018; [177]Huang et al., 2020a); they
   were distributed across the seven chromosomes ([178]Supplemental
   Figure 14). Transcriptome analysis revealed that the expression
   patterns of genes related to pigment gland formation and gossypol
   biosynthesis differed significantly between A[1]/A[2] and G[1]. Most
   genes were upregulated during the process of ovule development in A[1]
   and A[2], whereas these genes were not upregulated in G[1] until seed
   germination ([179]Supplemental Figure 15, [180]Supplemental Tables 27
   and [181]28). A heatmap also showed that some genes in the gossypol
   biosynthesis pathway differed in expression pattern between G[1] and
   A[1]/A[2] ([182]Figure 4D). The expression patterns of these genes were
   further verified by qPCR and were consistent with the phenotypes of
   pigment gland formation and gossypol biosynthesis ([183]Supplemental
   Figure 16). The other two Australian Gossypium species, G[2] and G[3],
   were similar to G[1] in gene expression pattern. Gene expression in the
   glandless near-isogenic line (NIL) CCRI12gl was inhibited at all
   periods compared with the corresponding glanded NIL CCRI12
   ([184]Supplemental Figure 17).

   Through transcriptome sequencing and comparative analysis, 89 common
   differentially expressed genes (DEGs) were identified and shared by
   five groups of DEGs ([185]Figure 4E). These genes were upregulated in
   ovules of A[1] and A[2] at 30 dpa and in germinated seeds of G[1] at 12
   and 24 h after germination; they included GoPGF and 12 gossypol
   biosynthesis genes ([186]Supplemental Figure 18). Among them, 29 DEGs
   were also found in the glandless NILs group (CCRI12 versus CCRI12gl)
   ([187]Figure 4F). Among them, nine gossypol biosynthesis genes and nine
   other functional genes showed significant positive correlations in
   expression pattern with GoPGF based on Pearson’s correlation
   coefficient (r ≥ 0.5, P ≤ 0.05) ([188]Figure 4G and 4H). These nine
   functional genes were significantly positively correlated with GoPGF
   and may be related to pigment gland formation and gossypol
   biosynthesis.

Function and mechanism of CYP76B6

   VIGS assays were used to verify the functions of the nine genes that
   were positively correlated with GoPGF ([189]Figure 5A). Among them, the
   novel gene Gbi08G2110, identified as a homolog of CYP76B6 in
   G. hirsutum ([190]Smitha et al., 2019), had a significant effect on
   gossypol biosynthesis in G[1]. The gossypol content in leaves and stems
   of GbiCYP76B6-silenced seedlings was reduced to 0.347 and 0.049 mg/g,
   respectively, significantly lower than that in the negative control
   tobacco rattle virus (TRV):00 seedlings (0.748 mg/g in stems and
   0.113 mg/g in leaves) ([191]Figure 5A). The VIGS system verified that
   gossypol content was completely inhibited in stems and leaves of
   GoPGF-silenced seedlings, whose gossypol contents were significantly
   lower than those of TRV:00 ([192]Supplemental Figure 19).

Figure 5.

   [193]Figure 5
   [194]Open in a new tab

   Function and mechanism of GbiCYP76B6 in gossypol biosynthesis.

   (A) Relative GbiCYP76B6 expression levels and gossypol contents in
   leaves and stems of seedlings after the silencing of GbiCYP76B6 in
   G[1]. TRV:00 was used as a negative control. Independent experiments
   with mean ± SD, n = 3; ∗P < 0.05.

   (B) Relative expression levels of GbiCYP76B6 in GoPGF-silenced
   plants and FPKM expression values of GhCYP76B6 in CCRI12 (glanded) and
   CCRI12gl (glandless). ∗P < 0.05; ∗∗P < 0.01.

   (C) Relationship between CYP76B6 and gossypol biosynthesis in the
   terpene biosynthesis pathway.

   (D)cis-acting elements in the 2000-bp promoter upstream of the
   GbiCYP76B6 coding region.

   (E) Mechanistic model of GoPGF and CYP76B6 in gossypol biosynthesis.

   Providing further evidence of the relationship between GoPGF and
   CYP76B6, the expression of GbiCYP76B6 was significantly downregulated
   in GoPGF-silenced plants compared with TRV:00 ([195]Figure 5B). In
   addition, the homologous genes GhCYP76B6_A12 and GhCYP76B6_D12 were
   significantly downregulated in the glandless NIL CCRI12gl compared with
   the glanded NIL CCRI12 ([196]Figure 5B). These findings suggest that
   the expression of GbiCYP76B6 may be regulated by GoPGF.

   CYP76B6, a geraniol 8-hydroxylase, belongs to the CYP76 family of
   cytochrome P450 enzymes, which play a role in catalyzing the single or
   double oxidation of monoterpenols ([197]Ilc et al., 2016)
   ([198]Figure 5C). By predicting the function of cis-regulatory elements
   in the promoter sequence (2000 bp) of GbiCYP76B6 with PlantCARE
   ([199]Lescot et al., 2002), we found multiple sites related to abiotic
   stress, including elements that function in responses to light,
   gibberellin, and MeJA ([200]Figure 5D). Among them, a G-box
   light-responsive motif was found 1467 bp upstream of the coding region
   ([201]Figure 5D) and may interact with GoPGF ([202]Ma et al., 2016).
   GoPGF, from a class of basic helix–loop–helix transcription factors,
   may regulate the expression of GbiCYP76B6 by combining with its
   promoter element, thus affecting gossypol biosynthesis
   ([203]Figure 5E).

Discussion

   The whole genome of G[1] was successfully sequenced and assembled. A
   total of 1704.25 Mb (96.51%) of sequences were mapped onto 13
   pseudochromosomes with a contig N50 of 4.62 Mb and a scaffold N50 of
   133.90 Mb. The genome size of G[1] was 1776.06 Mb, close to that of
   G[2]but larger than that of diploid A and D species ([204]Udall et al.,
   2019a; [205]Wang et al., 2019; [206]Huang et al., 2020a; [207]Cai
   et al., 2020).

   A phylogenetic tree was constructed, and the divergence time of the A
   and G genomes was estimated to be 5.9 mya, after the divergence of the
   D and A/G genomes (6.6 mya), showing that the G genome had a closer
   relationship with the A genome than the D genome. In addition, the
   content and proportion of TEs and LTRs in the G genome were similar to
   those in the A genome and were more than twice those in the D genome.
   These findings show a potential relationship between genome
   similarities and TEs/LTRs in Gossypium, and previous studies have
   indicated that LTR insertion events occurred during the evolution of
   Gossypium, contributing to the divergence and evolution of the genus
   ([208]Huang et al., 2020a; [209]Cai et al., 2020). In addition, in
   accordance with other Gossypium species, the formation of the G genome
   was also accompanied by a specific WGD event within 13–20 mya ([210]Li
   et al., 2014; [211]Cai et al., 2020) that did not occur in the
   evolution of T. cacao. This WGD event may have shaped the genomes of
   Gossypium after their divergence from Theobroma.

   Structural variations play a key role in the divergence between
   G. hirsutum and G. barbadense, contributing to the stronger
   adaptability and higher yield of G. hirsutum and the better fiber
   quality and disease resistance of G. barbadense ([212]Hu et al., 2019b;
   [213]Wang et al., 2019). Our findings revealed the presence of unique
   translocations in the G genome that were formed before the divergence
   between G[1] and G[2]. These two translocations contained genes related
   to gossypol biosynthesis, including CDNC and CYP82D113 ([214]Tian
   et al., 2018). In addition, genes related to fiber elongation and
   secondary wall thickening, including ADF1, PAG1, Sus1, RDL1, and CFE1A,
   were also located in this translocation region ([215]Ji et al., 2003;
   [216]Lee et al., 2007), which might be related to fiber development in
   G[1]. Moreover, two inversions were found between G[1] and G[2] and
   contained genes such as MYB25 and FSN1 ([217]Machado et al., 2009;
   [218]Zhang et al., 2018), which are involved in fiber formation and
   secondary wall thickening.

   By comparing the nuclear and chloroplast genomic and phenotypic
   characteristics among G[1], G[2], G[3], and C[1], we hypothesized that
   G[1] evolved from interspecific hybridization between the ancestor of
   C[1] as the female parent and the common ancestor of G[2] and G[3] as
   the male parent, consistent with the previous view that G[1] has a
   biphyletic ancestry ([219]Wendel et al., 1991). The Australian
   Gossypium of C and G genomes had closer genetic relationships and no
   geographic isolation, which provided them with more opportunities for
   crossing. This could explain why the phenotypes of G[1] were
   intermediate between those of C[1] and G[2]/G[3], further confirming
   that G[1] was the product of hybridization between G and C ancestors.
   Compared with G[2] and G[3], G[1] was more attractive for potential
   breeding, not only because of its delayed pigment gland morphogenesis
   in cotyledons but also because of some other useful traits such as high
   boll setting, biotic and abiotic stress tolerance, fiber quality, and
   insensitivity of maturity and flowering to seasonal temperature and
   light ([220]Zhu et al., 1998).

   Several genes related to pigment glands have been identified in
   previous studies ([221]Ma et al., 2016; [222]Janga et al., 2018;
   [223]Cai et al., 2020; [224]Gao et al., 2020). However, knocking out
   only GoPGF completely blocks the process of gland formation ([225]Ma
   et al., 2016; [226]Janga et al., 2018). Eight functional genes (CDNC,
   CYP706B1, DH1, CYP82D113, CYP71BE79, 2-ODD-1, SPG, and CYP736A196) have
   been reported to participate in the gossypol biosynthesis pathway
   ([227]Sunilkumar et al., 2006; [228]Tian et al., 2018; [229]Huang
   et al., 2020a). The expression patterns of these genes in G[1] were
   different from those in A[1] and A[2] and were related to its traits of
   pigment gland formation and gossypol accumulation. Furthermore, the
   novel gene GbiCYP76B6 was identified and shown to reduce the gossypol
   content after silencing. It has been identified as a geraniol
   hydroxylase and is involved in the biosynthesis of monoterpenes
   ([230]Ilc et al., 2016). GbiCYP76B6 expression was significantly
   positively correlated with that of GoPGF. GoPGF, in a class of
   transcription factors, interacts with the upstream cis-acting G-box
   element ([231]Ma et al., 2016). These results show that GoPGF may
   affect the biosynthesis of gossypol by regulating the expression of
   GbiCYP76B6 in G[1]. In addition, GoPGF can inhibit gossypol content by
   regulating other pathways, such as interacting with TPS and WRKY
   ([232]Ma et al., 2016). These findings provide insights into the
   mechanisms by which GoPGF and GbiCYP76B6 function in gossypol
   biosynthesis.

Methods

Plant material sampling

   The experimental materials, including G[1], G[2], G[3], C[1], A[1], and
   A[2], as well as a pair of glanded and glandless NILs of G. hirsutum,
   CCRI12 (glanded) and CCRI12gl (glandless), were used in the study.
   G[1], G[2], G[3], C[1], A[1], and A[2] were provided by the Wild Cotton
   Species Garden in Sanya, Hainan. CCRI12 and CCRI12gl were developed by
   crossing Hai-1 (G. barbadense, glandless) with CCRI12, an upland cotton
   cultivar, and backcrossing to CCRI12, with selection for the glandless
   trait, for more than 10 generations. All plant materials were planted
   in the greenhouse of the Agricultural Station, Zhejiang University,
   Hangzhou, China.

Sample preparation and sequencing

   High-quality genomic DNA was extracted from fresh leaves of G[1] using
   the cetyl trimethyl ammonium bromide method ([233]Porebski et al.,
   1997). Two paired-end libraries (250 and 450 bp) were constructed and
   sequenced on an Illumina HiSeq X Ten platform (Illumina, San Diego, CA,
   USA) ([234]Nair et al., 2018), and a PCR-free library (450 bp) was
   constructed and sequenced on an Illumina HiSeq 2500 system
   ([235]Minoche et al., 2011). For PacBio sequencing, genomic DNA was
   sheared into approximately 20-kb targeted size, and long-read libraries
   were constructed with the SMRTbell Template Prep Kit (PacBio) and
   sequenced on the PacBio RS II platform ([236]Chin et al., 2016). For
   Hi-C sequencing, genomic DNA was cross-linked, digested with the
   restriction enzyme MboI, and labeled with biotinylated residues
   ([237]Belton et al., 2012). Biotinylated circles were then enriched,
   sheared, and sequenced on the Illumina HiSeq X Ten platform ([238]Nair
   et al., 2018).

Genome survey and assembly

   Jellyfish ([239]https://genome.umd.edu/jellyfish.html) was used to
   estimate genome size based on K-mer distribution using the high-quality
   reads from the short-insert (250 and 450 bp) paired-end libraries.
   Contig assembly of G[1] was performed with the PacBio subreads using
   FALCON (v.0.7) with the following parameters: length_cutoff_pr=5,000,
   max_diff=150, max_cov=150 ([240]Chin et al., 2013). The assembled
   contigs were then polished using Quiver software (smrtlink 7.0.1)
   ([241]Chin et al., 2013) with default parameters. To increase the
   accuracy of the assembly, Illumina short reads were used to perform
   error correction with default parameters of Pilon (v.1.22) ([242]Walker
   et al., 2014). To obtain the chromosome-level genome assembly, all Hi-C
   reads were aligned onto the contigs with HiC-Pro (v.2.10.0) software
   ([243]Servant et al., 2015). LACHESIS (v.2.0) ([244]Burton et al.,
   2013) was used to cluster the contigs into chromosomes with the
   following parameters: CLUSTER_N=13, CLUSTER_MIN_RE_SITES=160,
   CLUSTER_MAX_LINK_DENSITY=2, CLUSTER_NONINFORMATIVE_RATIO=2,
   ORDER_MIN_N_RES_IN_TRUNK=50, ORDER_MIN_N_RES_IN_SHREDS=10. Juicebox
   (v.1.18) ([245]Robinson et al., 2018) was used for manual fine-tuning
   with a graphical and interactive approach to obtain the final
   chromosome-level assembly.

Genome quality evaluation

   To evaluate the accuracy of the genome assembly, high-quality short
   paired-end reads were aligned to the G[1] genome using the
   Burrows–Wheeler aligner ([246]http://bio-bwa.sourceforge.net/) with
   parameters -k 32 -w 10 -B 3 -O 11 -E 4. The completeness of the genome
   assembly was then evaluated using CEGMA (v.2.5) ([247]Parra et al.,
   2007) and BUSCOs (v.4.1.2) ([248]Kriventseva et al., 2019).

Genome annotation

   To annotate repetitive sequences in the G[1] genome, RepeatModeler
   (v.1.0.5) ([249]Price et al., 2005), RepeatScout (v.1.0.5) ([250]Lian
   et al., 2016), Piler (v.1.0) ([251]Edgar and Myers, 2005), and
   LTR_FINDER (v.1.07) ([252]Xu and Wang, 2007) were used with default
   settings to build a non-redundant de novo repeat library. RepeatMasker
   (v.4.0.7) and RepeatProteinMask (v.4.0.7) ([253]Xu and Wang, 2007) were
   used for homologous repeat identification by running against the
   Repbase library. Tandem repeats were identified using TRF (v.4.07b)
   ([254]Benson, 1999). To estimate insertion times, LTR retrotransposon
   ends were aligned with MUSCLE (v.3.8.31) ([255]Edgar, 2004). The
   transition–transversion ratio was used to calculate nucleotide distance
   (D) in EMBOSS ([256]http://emboss.sourceforge.net/), and the insertion
   time of LTR retrotransposons was estimated by T = D/2r. The r value was
   obtained from a previous report ([257]Hu et al., 2019b).

   De novo predictions, homology-based predictions, and
   transcriptome-based predictions were used for gene structure annotation
   in the G[1] genome. Five ab initio gene prediction programs, Augustus
   (v.3.0.2) ([258]Stanke and Waack, 2003), Genescan (v.1.0)
   ([259]Salamov, 2000), geneid (v.1.4.4) ([260]Parra et al., 2000),
   GlimmerHMM (v.3.0.2) ([261]Majoros et al., 2004), and SNAP (v.11.29)
   ([262]Korf, 2004), were used to predict coding regions in the
   repeat-masked genome. Homologous proteins from sequenced species used
   for homology-based prediction referred to in [263]Supplemental Table 10
   were downloaded from the National Center for Biotechnology Information
   (NCBI; [264]www.ncbi.nlm.nih.gov) for homology-based predictions and
   aligned to the G[1] genome using TBLASTN (v.2.2.26) ([265]Altschul
   et al., 1997) (E value ≤ 1E−05). RNA-sequencing data from different
   tissues were mapped to the G[1] genome using Cufflinks (v.2.1.1)
   ([266]Trapnell et al., 2009), and PASA (v.2.1.1) ([267]Trapnell et al.,
   2010) was used to assemble transcripts and predict gene structures. All
   predicted genes from the above three methods were integrated into a
   comprehensive non-redundant gene set with EVidenceModeler (v.1.1.1)
   ([268]Haas et al., 2008).

   Functional annotation of the protein-coding genes in the G[1] genome
   was performed using BLASTP (E value ≤ 1E−05) against the SwissProt
   ([269]Apweiler et al., 2004), TrEMBL ([270]Bairoch and Apweiler, 1999),
   and NR databases ([271]ftp://ftp.ncbi.nih.gov/blast/db) and KEGG
   ([272]Kanehisa and Goto, 2000). The best hits were used to assign
   homology-based gene functions with the parameter "blastp
   -max_target_seqs". The software tRNAscan-SE ([273]Lowe and Eddy, 1997)
   was used with default parameters to predict tRNA genes in the G[1]
   genome. The rRNA, microRNA, and small nuclear RNA fragments were
   identified using INFERNAL software ([274]Nawrocki et al., 2009) with
   searches against the Rfam database ([275]Griffiths-Jones et al., 2005).

Genome resequencing

   High-quality genomic DNA was extracted from fresh leaves of G[2], G[3],
   and C[1] and sequenced on the Illumina HiSeq X Ten platform (Illumina,
   San Diego, CA, USA). The raw reads were trimmed using Trimmomatic
   (v.0.36) ([276]Bolger et al., 2014) with default parameters, and the
   clean reads were aligned to the G[1] genome using the Burrows–Wheeler
   aligner (v.0.7.8) with the parameters -t 4 -k 32 -M
   ([277]http://bio-bwa.sourceforge.net/). SAMtools (v.1.3) ([278]Li and
   Durbin, 2009) was used with default parameters for sorting and removing
   duplicates. The alignment results were then used to call variants with
   the HaplotypeCaller module of GATK (v.4.0) ([279]McKenna et al., 2010).
   VCFtools (v.0.1.14) ([280]Danecek et al., 2011) was used for further
   filtering of SNPs with the parameters --max-missing 1 --maf 0.05.
   Phylogenetic analysis was performed with the filtered SNPs using the
   software TreeBeST (v.1.9.2) with 1000
   bootstraps ([281]https://github.com/Ensembl/treebest). Identity scores
   were calculated to compare the genome similarity between cotton species
   according to the SNPs within 100-kb windows and 50-kb step sizes.
   Divergence time was estimated using the MCMCTree program in PAML
   (v.4.6) ([282]Yang, 2007) with the main parameters burn-in = 10,000,
   sample-number = 100,000, and sample-frequency = 2.

Chloroplast genome assembly

   To assemble the chloroplast genomes of G[1], G[2], G[3], and C[1],
   Illumina short paired-end reads generated from genome resequencing were
   trimmed using Trimmomatic (v.0.36) ([283]Bolger et al., 2014) with
   default parameters. ABySS software (v.2.3.4)
   ([284]http://www.bcgsc.ca/platform/bioinfo/software/abyss) was used for
   initial assembly. GapCloser (v.1.12) ([285]Luo et al., 2012) was then
   used for gap filling with default parameters to obtain the final
   assembled sequences. Subsequently, CHLOROBOX
   ([286]http://chlorobox.mpimp-golm.mpg.de/index.html) was used with
   default parameters to annotate the chloroplast genomes, and OGDRAW
   (v.1.3.1) ([287]Lohse et al., 2007) was used to draw the genome maps.
   MUSCLE (v.3.8.31) ([288]Edgar, 2004) was used to perform multiple
   sequence alignment with default parameters for each single-copy gene,
   and the alignment results were concatenated into a super alignment
   matrix. The concatenated sequences were then used to infer maximum
   likelihood trees using RAxML (v.8.0.19) ([289]Stamatakis, 2014) with
   1000 bootstraps. Divergence time was estimated using the MCMCTree
   program in PAML (v.4.6) with the main parameters burn-in = 10,000,
   sample-number = 100,000, and sample-frequency = 2.

Chromosomal collinearity analysis

   Genome syntenic blocks were detected using NUCmer software
   (v.4.0.0) ([290]Marcais et al., 2018) with parameters --mum,
   --maxgap=1000, --mincluster=90, --minmatch=40. Delta-filter was then
   used with default parameters to obtain one-to-one alignments, and
   syntenic regions were plotted using MUMmerplot. Inversions and
   translocations were further confirmed using Hi-C data from G[1] with
   3D-DNA software (v.180114) ([291]Dudchenko et al., 2017), which was
   used to calculate links among different bins with a resolution of 500
   kb. Gene synteny blocks were detected using MCScanX software
   ([292]https://github.com/wyp1125/MCScanX) with parameters --cscore 0.8,
   --minspan=30. PCR amplification was then used to verify the chromosome
   structural variations between Gossypium species by designing a pair of
   PCR primers at both ends of each inverted and translocated fragment
   ([293]Supplemental Table 29). PCR reactions were performed according to
   the protocol of 2× Phanta Max Master Mix (Vazyme, Nanjing, China), and
   the products were examined by 1.0% agarose gel electrophoresis.

Gene family clustering and phylogenetic tree construction

   Whole protein-coding genes from G[1] and eight other species,
   G[2] (G2_CRI) ([294]Cai et al., 2020), A[1] (A1_WHU) ([295]Huang
   et al., 2020a), A[2] (A2_WHU) ([296]Huang et al., 2020a), G. raimondii
   (D5_NSF) ([297]Udall et al., 2019a), G. thurberi (D1-35) ([298]Grover
   et al., 2019), G. hirsutum (HAU v1) ([299]Wang et al., 2019),
   G. barbadense (3-79_HAU.2) ([300]Wang et al., 2019), and Gs. kirkii
   (kirkii_ISU) ([301]Udall et al., 2019b), were downloaded from CottonGen
   ([302]Yu et al., 2021) ([303]https://www.cottongen.org), and those of
   T. cacao were downloaded from NCBI ([304]https://www.ncbi.nlm.nih.gov)
   for gene family analysis. All-against-all BLASTP (E-value ≤ 1E−07) was
   used to determine the similarities between the retained protein
   sequences. OrthoFinder (v.2.3.3) ([305]Emms and Kelly, 2019) was used
   to cluster genes from the above species into gene families with default
   parameters. Single-copy orthologs from the gene family analysis were
   then used to construct a phylogenetic tree with default parameters of
   MUSCLE software (v.3.8.31) ([306]Edgar, 2004). The protein alignments
   were converted to CDS alignments and then concatenated into a super
   alignment matrix. RAxML (v.8.0.19) ([307]Stamatakis, 2014) was used for
   phylogenetic tree construction by maximum likelihood methods with
   default parameters. The MCMCTree program in PAML (v.4.6) ([308]Yang,
   1997) was used for divergence time estimation with the following
   parameters: burn-in=100,000, sample-number=100,000, and
   sample-frequency=2. The calibration points used in the study were
   obtained from previous studies ([309]Huang et al., 2020a; [310]Cai
   et al., 2020).

Expansion and contraction of gene families

   The expansion and contraction of the gene families were determined
   using CAFÉ (v.4.2) with default parameters ([311]Bie et al., 2006). A
   random birth and death model was used to model gene gain and loss along
   each lineage of the phylogenetic tree. A probabilistic graphical model
   was introduced to calculate the probability of transitions in gene
   family size from parent to child nodes in the phylogeny ([312]Bie
   et al., 2006). Using conditional likelihoods as the test statistics,
   the corresponding P values in each lineage were calculated, and a P
   value of 0.05 was used to identify families that were significantly
   expanded and contracted.

Whole-genome duplication analysis and substitution rate estimation

   To study the evolution of the G[1] genome, we searched for WGD in the
   assembled genome. The protein sequences from G[1], G[2], A[1], A[2],
   and D[5] were compared within and between species using BLASTP with an
   E-value threshold of 1E−07. Syntenic blocks within a genome or between
   genomes were then detected based on detected gene pairs using MCScanX
   ([313]Tang et al., 2008).

   The synonymous substitution rate (Ks) values of gene pairs across
   syntenic blocks were calculated with the codeml program in PAML (v.4.6)
   ([314]Yang, 2007). The divergence times between species were calculated
   with the formula T = Ks/2r (r = 2.6 × 10^−9). The r value was obtained
   from a previous report ([315]Hu et al., 2019b).

qRT-PCR analysis

   Total RNA was extracted with the RNAprep Pure Plant Kit according to
   the manufacturer’s instructions (TianGen, Beijing, China) from ovules
   at 10, 20, and 30 dpa and seeds at 0, 12, 24, and 36 h after
   germination in G[1], G[2], G[3], A[1], A[2], CCRI12, and CCRI12gl;
   three biological replicates were obtained for each sample. Total RNA
   was then reverse transcribed using the PrimeScript First Strand cDNA
   Synthesis Kit (with gDNA remover) (TaKaRa, Dalian, China). All primers
   used for qRT-PCR were designed with Primer 6.0 software and are listed
   in [316]Supplemental Table 29, and UBQ7 ([317]Tan et al., 2013) was
   used as the internal control gene. qPCR was performed with the protocol
   of SYBR Premix Ex Taq (TaKaRa) on a LightCycler 96 real-time PCR
   detection system (Roche).

Transcriptome sequencing and analysis

   Total RNA samples were extracted from ovules at 10, 20, and 30 dpa and
   seeds at 0, 12, 24, and 36 h after germination in G[1], A[1], and A[2].
   Three biological replicates were obtained for each sample. Isolated RNA
   was used for cDNA library construction with the NEBNext Ultra RNA
   Library Prep Kit following the manufacturer’s instructions (Illumina,
   San Diego, CA, USA). All libraries were subjected to paired-end 150-bp
   sequencing on the Illumina HiSeq 4000 platform. Illumina short
   paired-end reads produced previously were trimmed using Trimmomatic
   (v.0.36) with default parameters. The high-quality reads were aligned
   to the G[1] genome using HISAT2 (v.2.0.4) ([318]Tang et al., 2008). The
   counts of reads aligned to each transcript were calculated with HTSeq
   (v.0.6.0) software ([319]Anders and Huber, 2010), and fragments per
   kilobase of transcript per million mapped fragments were used to
   estimate the expression level of each gene ([320]Trapnell et al.,
   2009). The DEGseq program ([321]Wang et al., 2010) was used to identify
   DEGs, and P values were adjusted with the Benjamini–Hochberg method for
   multiple testing ([322]Trapnell et al., 2009). Genes with adjusted P <
   0.05 and greater than or equal to two-fold changes in expression were
   considered to be differentially expressed.

Gossypol content analysis

   To measure the gossypol content in ovules and seeds, we collected
   ovules at 10, 20, and 30 dpa and seeds at 0, 12, 24, and 36 h after
   germination in G[1], A[1], and A[2]. The protocols for sample treatment
   method and gossypol content determination are described in a previous
   study ([323]Tian et al., 2018).

Virus-induced gene silencing system

   A VIGS assay was used to knock down the target genes CYP76B6 and GoPGF
   amplified from G[1] cDNA. The primers used are listed in
   [324]Supplemental Table 29. TRV-based vectors (pTRV1 and pTRV2) were
   used. The pTRV1 and recombinant pTRV2 vectors were introduced into
   Agrobacterium strain GV3101 and injected into 7-day-old cotyledons of
   G[1] seedlings. The instructions and protocols for vector construction
   and virus transformation were described in a previous study ([325]Li
   et al., 2019).

Funding

   This study was supported by the National Key Technology R&D Program of
   China (2016YFD0101404), the Chinese Agricultural Research System
   (CARS-18-25), and the Jiangsu Collaborative Innovation Center for
   Modern Crop Production.

Author contributions

   Z.S., C.J., and Z.T. designed the project. S.K., S.Y., C.Y., H.Y.,
   W.W., L.H., and H.Y. performed the experiments. S.K., L.M., M.U., L.C.,
   D.M., and Z.T. analyzed the results. S.K., Z.S., and Z.T. wrote or
   edited the manuscript. All authors read and approved the final
   manuscript.

   Published: August 10, 2022

Footnotes

   Published by the Plant Communications Shanghai Editorial Office in
   association with Cell Press, an imprint of Elsevier Inc., on behalf of
   CSPB and CEMPS, CAS.

   Supplemental information is available at Plant Communications Online.

Contributor Information

   Shuijin Zhu, Email: tlzhao@zju.edu.cn.

   Jinhong Chen, Email: jinhongchen@zju.edu.cn.

   Tianlun Zhao, Email: shjzhu@zju.edu.cn.

Accession numbers

   The G. bickii genome assembly and annotation data are available at the
   CottonGen database ([326]http://www.cottongen.org/). The raw genome and
   transcriptome reads have been deposited in the NCBI BioProject
   database. The genome sequencing data for G. bickii have been deposited
   under BioProject accession no. [327]PRJNA809539; the transcriptome
   sequencing data for G. bickii, G. herbaceum, and G. arboreum have been
   deposited under BioProject accession no. [328]PRJNA813940; and the
   resequencing data for G. australe, G. nelsonii, and G. sturtianum have
   been deposited under BioProject accession no. [329]PRJNA811802.

Supplemental information

   Document S1. Supplemental Figures 1–19
   [330]mmc1.pdf^ (8.3MB, pdf)
   Document S2. Supplemental Tables 1–29
   [331]mmc2.xlsx^ (2.6MB, xlsx)
   Document S3. Article plus supplemental information
   [332]mmc3.pdf^ (12.4MB, pdf)

References