Abstract The pigment gland is a morphological characteristic of Gossypium and its related genera. Gossypium bickii (G[1]) is characterized by delayed pigment gland morphogenesis in the cotyledons. In this study, a reference-grade genome of G[1] was generated, and comparative genomics analysis showed that G[1] was closest to Gossypium australe (G[2]), followed by A- and D-genome species. Two large fragment translocations in chromosomes 5 and 13 were detected between the G genome and other Gossypium genomes and were unique to the G[1] and G[2] genomes. Compared with the G[2] genome, two large fragment inversions in chromosomes 12 and 13 were detected in G[1]. According to the phylogeny, divergence time, and similarity analysis of nuclear and chloroplast genomes, G[1] was formed by hybridization between Gossypium sturtianum (C[1]) and a common ancestor of G[2] and Gossypium nelsonii (G[3]). The coordinated expression patterns of pigment gland formation (GoPGF) and gossypol biosynthesis genes in G[1] were verified to be consistent with its phenotype, and nine genes that were related to the process of pigment gland formation were identified. A novel gene, GbiCYP76B6, regulated by GoPGF, was found to affect gossypol biosynthesis. These findings offer insights into the origin and evolution of G[1] and its mechanism of pigment gland formation and gossypol biosynthesis. Key words: Gossypium bickii, genome, evolution, gossypol, pigment glands __________________________________________________________________ This study presents a reference-grade genome assembly and annotation for Gossypium bickii and suggests that G. bickii evolved from an interspecific hybridization between Gossypium sturtianum and a common ancestor of Gossypium australe and Gossypium nelsonii and that GbiCYP76B6 is regulated by GoPGF to affect gossypol biosynthesis in G. bickii. Introduction Cotton is one of the most critical cash crops worldwide, providing natural fibers as well as high-quality seed oil and proteins. Cottonseeds contain approximately 21% oil and 23% protein ([51]Tian et al., 2018), which can meet the protein demands of 500 million people in the world annually ([52]Lusas and Jividen, 1987). However, gossypol, the main component of pigment glands, is toxic to humans and monogastric animals, limiting the consumption of cottonseeds. On the other hand, gossypol and its derivatives are important phytoalexins that play a significant role in the defense system of cotton plants. Gossypol, in its pure form, consists of dimeric sesquiterpenes, yellow in color, and is biosynthesized in cotton roots and then transported to stems, followed by leaves and seeds ([53]Smith, 1961). According to the literature, gossypol biosynthesis belongs to the mevalonic acid pathway, and genes encoding key enzymes that participate in this pathway include CDNC, CYP706B1, DH1, CYP82D113, CYP36A196, 2-ODD-1, and SPG ([54]Tian et al., 2018; [55]Huang et al., 2020b). Although gossypol in cottonseeds can be inhibited by genetic engineering, this inhibition has not yet reached the standard for edibility or feeding (≤0.04%) ([56]Hu et al., 2019a). Pigment glands are symbolically presented as brown or dark brown dotted organs that store gossypol and its derivatives in glanded cotton. A positive correlation has been found between the number of pigment glands and the gossypol content in cotton ([57]Ma et al., 2016). By contrast, glandless cotton has a low gossypol content because there are no pigment glands for gossypol storage in any part of the plant, including seeds. To effectively utilize cottonseeds, low-gossypol cotton was developed several decades ago. By transferring the glandless genes gl[2] and gl[3] from Hopi cotton ([58]Mcmichael, 1959, [59]1960) and Gle 2 from Bahtim 110 ([60]Afifi et al., 1966) into glanded upland cotton by conventional methods, many glandless cotton cultivars have been developed and used in several countries for cotton production. However, with the reduction in seed gossypol content, the natural defense systems of the cotton plants were also lost. Therefore, the study of cotton germplasms with glandless seeds or low seed gossypol content will be very important for fully utilizing cottonseeds. Several wild diploid cotton species from Australia, such as Gossypium bickii (G[1]), exhibit delayed pigment-gland morphogenesis in cotyledons; their dormant cottonseeds are glandless but glanded during the seed germination stages ([61]Figure 1A) ([62]Fryxell, 1965). Therefore, this delayed pigment gland morphogenesis trait has great significance for both cottonseed utilization and plant stress tolerance. After hybridization of G[1] with Gossypium arboreum cv. Shixiya1 (A[2]) and chromosome duplication, the allotetraploid retained its target trait ([63]Zhu, 1999). However, because of the genetic distance between artificial allotetraploid and upland cotton (Gossypium hirsutum, (AD)[1]), it is difficult to obtain stable upland cotton germplasms with the target traits through traditional breeding methods. Figure 1. [64]Figure 1 [65]Open in a new tab Morphology and chromosomal and chloroplast features of G. bickii. (A) The character of pigment glands in G[1]. (i) The dormant seed, (ii) the partially enlarged dormant seed, (iii) the seed after germination for 36 h, (iv) the partially enlarged cotyledon after germination for 36 h, (v) the true leaf of the seedling, (vi) the partially enlarged true leaf of the seedling, (vii) the hypocotyls of the seedling, and (viii) the partially enlarged hypocotyls of the seedling. Scale bars, 0.5 mm. (B) Chromosomal features of G[1]. (i) 13 pseudochromosomes of the G[1] genome, (ii) gene density, (iii) non-coding RNA (ncRNA) density, (iv) transposable element (TE) density, (v) Gypsy density, (vi) Copia density, and (vii) GC content. The density was calculated in 500-kb intervals across the chromosomes. (C) Chloroplast genome features of G[1]. The inner circle line indicates the range of the inverted repeats (IRa and IRb), which separate the genome into small single-copy and large single-copy regions. The inside of the inner circle represents the GC content. The outer circle line represents the gene location information for each class, classified by different colors. The components on the outside of the outer circle are transcribed in the clockwise direction, and on the inside, the components are transcribed in the counterclockwise direction. With the development of sequencing technologies such as PacBio, high-throughput chromosome conformation capture (Hi-C), and BioNano, the genomes of five Gossypium allotetraploid species, including G. hirsutum, and several Gossypium diploid species, including A-, D-, G-, and F-genome species, have been well sequenced, assembled, and annotated ([66]Udall et al., 2019a; [67]Grover et al., 2019; [68]Huang et al., 2020a; [69]Cai et al., 2020), revealing the genome variations among different cotton species and their evolutionary relationships. Here, a high-quality genome sequence of G[1] was generated (total length 1.77 Gb) and successfully oriented to 13 pseudochromosomes based on PacBio long reads, Illumina paired-end reads, and Hi-C technologies. To reveal the evolution and origin of G[1], comparative genomics analysis was performed based on nuclear and chloroplast genomes. In addition, to reveal the unique traits of gossypol and the pigment gland, several functional genes were identified and validated using transcriptome data and virus-induced gene silencing (VIGS), providing insights into the mechanism of gossypol biosynthesis and pigment gland formation. Results Sequencing and assembly of a high-quality G. bickii genome In this study, the G[1] genome was sequenced and assembled for the first time. Through K-mer distribution analysis, the genome size was estimated to be 1706.79 Mb with 0.18% heterozygosity and 74.19% repetitive sequences ([70]Supplemental Figures 1 and [71]2, [72]Supplemental Table 1), indicating that the G[1] genome was highly repetitive. To obtain a high-quality reference genome, we integrated different sequencing technologies, including PacBio single-molecule real-time sequencing, Illumina paired-end sequencing, and Hi-C technologies. For the initial assembly, a total of 248.64 Gb of PacBio long reads were generated, representing approximately 142.08-fold genome coverage ([73]Supplemental Table 2), and used to assemble a genome of 1765.96 Mb with 1574 contigs and a contig N50 of 4.62 Mb ([74]Table 1, [75]Supplemental Tables 2 and [76]3). The assembled contigs were then polished by aligning PacBio long reads to the initial assembly. To increase the consensus accuracy of the assembly, the initial assembly was also corrected using 299.22 Gb of high-quality Illumina reads, representing about 176.01-fold genome coverage ([77]Supplemental Table 2). Figure 2. [78]Figure 2 [79]Open in a new tab Genome evolutionary and structural variation of Gossypium. (A) Phylogenetic tree and divergence times among Gossypium and other species in Malvaceae. (B) The divergence between G[1] and other species by K[s] value for orthologous genes between G[1] and other cotton species. (C) Characterization of genomic structural variation among the cotton species. Genomic synteny blocks are connected by gray lines. The large translocations between G groups (G[1] and G[2]) and other cotton species are highlighted by red links, the large inversion in chromosome 12 between G[1] and G[2] is highlighted by blue links, and the inversions in chromosome 13 between G[1] and other Gossypium species are highlighted by cyan links. (D) Hi-C heatmap verification of chromosome inversion and translocation fragments. The Hi-C-assisted assembly data for the G[1] genome were used to scan chromosomes 5 and 13 of G[1] and A[1] and chromosomes 12 and 13 of G[1] and G[2], respectively, to verify inversion and translocation fragments. (E) The top 10 enriched KEGG pathways of the translocation and inversion genes in the G[1] genome. All translocation genes in chromosomes 5 and 13 and inversion genes in chromosomes 12 and 13 were selected for KEGG pathway enrichment analysis. Table 1. Genomic features of Gossypium bickii. Genomic feature G. bickii Total length of contigs (Mb) 1765.96 Total length of scaffolds (Mb) 1766.07 Proportion of anchored sequences (%) 96.5 Number of contigs 1574 Contig N50 (Mb) 4.62 Number of scaffolds 445 Scaffold N50 (Mb) 133.9 Number of genes 43 790 TE repeats (%) 70.22 GC content (%) 36.01 [80]Open in a new tab To construct the chromosome-level genome, a total of 277.57 Gb of high-quality Hi-C fragments, representing about 163.27-fold genome coverage with 915.85 million read pairs, were used to categorize and order the assembled contigs ([81]Supplemental Tables 2 and [82]4). The Hi-C heatmap shows 13 distinct chromosomal groups ([83]Supplemental Figure 3). Finally, approximately 1704.25 Mb of the assembly (96.51% of the total assembled genome size) was anchored into 13 pseudochromosomes with 445 scaffolds and a scaffold N50 of 133.90 Mb, which captured 1766.07 Mb of the G[1] genome ([84]Table 1, [85]Supplemental Tables 3 and [86]5). As the results show, the size of the G[1] genome was similar to that of the Gossypium australe (G[2]) genome (1752.74 Mb) ([87]Cai et al., 2020), slightly higher than that of the A genome ([88]Huang et al., 2020a), and more than twice that of the D genome ([89]Paterson et al., 2012; [90]Wang et al., 2019) ([91]Supplemental Figure 4). The integrity and accuracy of the assembly were evaluated by alignment of the Illumina short-insert reads, resulting in a 98.88% mapping rate and 96.07% genome coverage ([92]Supplemental Table 6). The GC content of the G[1] genome was 36.01%, which was comparable to that of other cotton genomes ([93]Table 1, [94]Supplemental Table 7). Core Eukaryotic Genes Mapping Approach (CEGMA) ([95]Parra et al., 2007) evaluation indicated that 233 of 248 (93.95%) core eukaryotic genes were captured ([96]Supplemental Table 8), and the Benchmarking Universal Single-Copy Orthologs (BUSCOs) ([97]Kriventseva et al., 2019) method revealed that 1594 of 1614 (98.7%) Embryophyta single-copy orthologous genes were captured in the genome ([98]Supplemental Table 9). Annotation of the G. bickii genome A total of 43,790 genes were annotated in the G[1] genome by combining de novo prediction, homolog-based prediction, and transcription data, with an average gene length of 2979.93 bp and an average of 4.63 exons per gene ([99]Table 1, [100]Supplemental Table 10, [101]Supplemental Figure 5). Among the predicted protein-coding genes, approximately 41 776 (95.4%) genes were annotated by at least one of the NR, SwissProt, KEGG, and InterPro protein databases ([102]Supplemental Table 11), and 41 893 genes were anchored to 13 pseudochromosomes ([103]Supplemental Table 12). In addition, 1450 tRNAs, 324 microRNAs, 1928 rRNAs, and 10 521 small nuclear RNAs were identified in the G[1] genome ([104]Supplemental Table 13). The repetitive sequences in the G[1] genome were annotated using de novo and homology-based predictions. As much as 70.22% of the G[1] genome was made up of transposable elements (TEs) ([105]Table 1, [106]Supplemental Table 14), similar to that of the G[2] genome (72.58%) ([107]Cai et al., 2020), lower than that of the A genome ([108]Wang et al., 2019; [109]Huang et al., 2020a), and much higher than that of the D genome ([110]Wang et al., 2012, [111]2019) ([112]Supplemental Table 15). Long terminal repeat (LTR) retrotransposons accounted for the highest proportion of TEs (67.08%) in the G[1] genome; the Gypsy (59.29%) and Copia (4.37%) superfamilies were dominant ([113]Supplemental Table 15) and were mainly concentrated in the middle regions of chromosomes, as opposed to the gene-rich regions ([114]Figure 1B). Resequencing and assembly of four chloroplast genomes The genomes of G[1] and its related wild diploid species G[2], Gossypium nelsonii (G[3]), and Gossypium sturtianum (C[1]) were sequenced using the Illumina HiSeq platform. Totals of 24.04, 34.12, 35.43, and 35.23 Gb of Illumina short reads were generated with average depths of 13.67, 19.57, 22.19, and 25.77, respectively ([115]Supplemental Table 16). The resequencing data of these four Gossypium species were then mapped onto the G[1] genome to further construct a phylogenetic tree. Illumina reads from these four species, G[1], G[2], G[3], and C[1], were also used to assemble their complete chloroplast genomes, which were 159.46, 159.58, 159.58, and 159.54 kb in size, respectively. These four chloroplast genomes consisted of four regions: the large single copy, small single copy, inverted repeat a, and inverted repeat b regions ([116]Figure 1C). The large single copy region was the main region responsible for differences in genome size, accounting for more than 55% of the chloroplast genome ([117]Table 1, [118]Supplemental Figure 6). All genes in the four species were classified into four categories: 50 photosynthesis-related genes, 30 transcription-related genes, 44 RNA genes, and 8 other genes ([119]Supplemental Table 17). Evolution of the G. bickii genome Ten species of Malvaceae, including G[1], were selected for gene family analysis ([120]Supplemental Table 18), and 30 329 gene family members were identified ([121]Supplemental Figure 7, [122]Supplemental Table 19). A total of 3532 common single-copy orthologs present in all species were used to construct a phylogenetic tree ([123]Supplemental Table 20). The phylogenetic tree showed that all Gossypium species were clustered in one clade, which diverged from the sister genus Gossypioides kirkii and Theobroma cacao ([124]Figure 2A). The divergence time of T. cacao and the other species was estimated to be 68.5 mya (65.2–75.0 mya), and Gs. kirkii diverged from the A, G, and D genomes around 9.0 mya (7.3–14.2 mya) ([125]Figure 2A). Within Gossypium, the divergence time for G[1] and G[2] was estimated to be 4.4 mya (3.4–5.0 mya), following the divergence time of the common ancestor of the G genomes and the common ancestor of the A genomes around 5.9 mya (5.2–7.0 mya) ([126]Figure 2A). The common ancestor of the A and G genomes diverged from the D genomes around 6.6 mya (5.8–7.7 mya). In comparison with the common ancestor of the G genome, there were 1689 expanded and 605 contracted gene families in G[1] ([127]Figure 2A). These expanded gene families in the G[1] genome were significantly enriched in terms of photosynthesis by GO and KEGG enrichment. Other terms related to respiration, such as electron transport chain and oxidative phosphorylation, were also significantly enriched ([128]Supplemental Figure 8) ([129]Laisk et al., 2010). TEs, particularly LTRs, have been reported to be related to genome expansion, which also contributes to the evolution of Gossypium ([130]Huang et al., 2020a; [131]Wang et al., 2021). The TE and LTR contents of each Gossypium species, including the G genome, were compared with their respective genome sizes, and the correlation coefficient between the number of TEs and the genome size was 0.9693 ([132]Supplemental Figure 9). LTRs, the main class of TEs, showed a higher correlation coefficient with genome size (R^2 = 0.9715) ([133]Supplemental Figure 9). It is worth noting that the D genome was less than half the size of the other genomes and had the lowest proportion of TEs ([134]Supplemental Table 15). LTR expansion occurred in all the investigated species, including A, D, and G. This expansion time was estimated to be 0–5 mya ([135]Supplemental Figure 10), after the divergence time of the D and A/G genomes (5.8–7.7 mya) and that of the A and G genomes (5.2–7.0 mya). The divergence times estimated by Ks values of homologous gene pairs across syntenic blocks between G[1] and other species were similar to those estimated from the phylogenetic tree ([136]Figure 2A and 2B). Whole-genome duplication (WGD) events are known to have occurred during the evolution of cotton species ([137]Paterson et al., 2012; [138]Li et al., 2014; [139]Grover et al., 2019), as was also found here for G[1] (13–20 mya) ([140]Supplemental Figure 11). Specific structural variation in the G. bickii genome Chromosome 13 of G[1] had a large fragment translocation (47.21 Mb) with chromosome 5 of other cotton species; however, no such translocation was found between G[1] and G[2] ([141]Figure 2C). A total of 2728 functional genes were found in this translocation region ([142]Supplemental Table 21), accounting for 50.83% of all genes on chromosome 13. Another translocation was detected between chromosome 5 in G[1] (5.12 Mb) and chromosome 13 in other genomes and was not found between G[1] and G[2] ([143]Figure 2C); 409 functional genes were found in this region, accounting for 11.68% of all genes on chromosome 5 ([144]Supplemental Table 22). These two translocations may be unique events in G[1] and G[2]. A large inversion was detected on chromosome 13 (65.48 Mb) of G[1] ([145]Figure 2C); it was found only in G[1] compared with other analyzed Gossypium species. A total of 634 genes were found in this region ([146]Supplemental Table 23), such as fiber elongation-related transcription factor MYB25 ([147]Machado et al., 2009), and it accounted for 23.24% of all genes on chromosome 13. Another inversion, which included 1588 genes, was detected on chromosome 12 (95.34 Mb) of G[1] ([148]Figure 2C and [149]Supplemental Table 24) and was detected only between G[1] and G[2]. Gene collinearity analysis further confirmed the presence of these translocation and inversion genes ([150]Supplemental Figure 12), and the variation fragments were confirmed by a Hi-C heatmap and PCR amplification ([151]Figure 2D and [152]Supplemental Figure 13). These translocation and inversion genes were used for KEGG enrichment analysis. The most common enriched term was photosynthesis, which appeared in the chromosome 5 translocation and the chromosome 12 and 13 inversions in G[1] ([153]Figure 2E). In addition, glycerophospholipid metabolism appeared in the chromosome 5 translocation and the chromosome 12 inversion, and plant hormone signal transduction appeared in the chromosome 12 translocation and the chromosome 13 inversion ([154]Figure 2E). Speciation of the G. bickii genome To confirm the biphyletic ancestry and incongruence between the nuclear and chloroplast phylogenies of G[1] ([155]Wendel et al., 1991), we used G[1] and three other closely related Gossypium species, G[2], G[3], and C[1] ([156]Figure 3A). Phylogenetic trees of nuclear and chloroplast genomes of these Gossypium species were constructed based on SNPs in the nuclear genomes and single-copy genes in the chloroplast genomes. The results showed that the G[1] nuclear genome was closer to those of G[2]/G[3] than that of C[1], which diverged from G[2]/G[3] around 3.9 mya (3.3–4.4 mya) ([157]Figure 3B). However, the chloroplast genome of G[1] was similar to that of C[1], and they diverged from each other 1.2 mya (1.2–1.4 mya), later than the divergence time of G[1] and G[2]/G[3], 5.1 mya (4.7–5.2 mya) ([158]Figure 3B), similar to a previous study ([159]Chen et al., 2017). The nuclear genome of G[1] was similar to those of G[2] and G[3], but the chloroplast genome of G[1] was more similar to that of C[1] ([160]Figure 3C). Therefore, we hypothesized that there was a common ancestor of G[2] and G[3], referred to here as G[0], and that G[1] might have formed initially by interspecific crossing of the ancestor of C[1] as the female parent and G[0] as the male parent 3.9 mya (3.3–4.4 mya) ([161]Figure 3D). The chloroplast genome of C[1] was maternally inherited in the hybrid, and the hybrid at later generations did not evolve into G[1] until approximately 1.2 mya ([162]Figure 3D), perhaps as a result of backcrossing with G[0] ([163]Wendel et al., 1991). Figure 3. [164]Figure 3 [165]Open in a new tab Evolutionary analysis of nuclear and chloroplast genomes of C[1], G[1], G[2], and G[3]. (A) Fiber, leaf, sepal, bract, and petal phenotypes of four species in Gossypium. (B) Nuclear and chloroplast genome phylogenetic trees and divergence time analysis of four species in Gossypium. The phylogenetic tree for the nuclear genome was constructed based on SNPs, and the chloroplast genome was based on 72 single-copy genes from chloroplast genomes. (C) Identity scores of nuclear and chloroplast genomes for four species in Gossypium. The calculation of nuclear genome similarity was based on SNPs, and the chloroplast genome was based on whole-genome pairwise comparison. (D) Model for the evolution of four species in Gossypium. Green and yellow represent nuclear and chloroplast genomes, respectively. The blue numbers represent the divergence times. In addition, the fiber length of G[1] was similar to those of G[2] and G[3], which were longer than that of C[1]. The fiber morphology of G[1] was similar to that of C[1], which differed from the straight and spiky fibers of G[2] and G[3] ([166]Figure 3A). The shapes and sizes of the leaves, sepals, and bracts in G[1] were similar to those in G[2] and G[3] but different from those in C[1]. By contrast, the petal size and color were similar to those of C[1] but differed from those of G[2] and G[3] ([167]Figure 3A). These findings show that some phenotypes of G[1] were intermediate between those of C[1] and G[2]/G[3]. Genes related to gossypol and pigment glands in G. bickii Most of the cotton species formed pigment glands during ovule development, except for the wild diploid Gossypium species from Australia, which had delayed pigment gland morphogenesis in the cotyledon. Cotyledons at the ovule development (10, 20, and 30 days post-anthesis (dpa)) and seed germination (0, 12, 24, and 36 h) stages of Gossypium herbaceum cv. Hongxing (A[1]), A[2], and G[1] were used for observation of gland formation processes ([168]Figure 4A). The ovules of A[1] and A[2] had visible pigment glands, although there were no pigment glands before 20 dpa ([169]Figure 4A and 4B). However, there were no pigment glands during ovule development or at the initial stage of germination (0, 12, and 24 h) in G[1], and visible pigment glands appeared at 36 h after seed germination. Similarly, gossypol was not detected before 10 dpa in A[1] and A[2], although it accumulated at 20 (5.91 mg/g) and 30 dpa (7.85 mg/g); gossypol was not detected in G[1 seeds] until the 12-h seed germination time point (0.054 mg/g) ([170]Figure 4C and [171]Supplemental Table 25). Figure 4. [172]Figure 4 [173]Open in a new tab Morphology, gossypol contents, and differential expression analysis of A[1], A[2], and G[1]. (A) Morphology of ovules at 10, 20, and 30 dpa; dormant mature seed (0 h); and seeds at 12, 24, 36, and 48 h after germination. Scale bars, 0.5 mm. (B) Glanded and glandless cotyledons in A[1], A[2], and G[1]. The glandless cotyledons at 10 dpa and glanded cotyledons at 20 dpa in A[1] and A[2] and the glandless cotyledon at 24 h after germination and glanded cotyledon at 36 h after germination in G[1] are shown. (C) Gossypol contents in ovules at 10, 20, and 30 dpa and in seeds at 0, 12, 24, 36, and 48 h after germination in A[1], A[2], and G[1] (mean ± SD, n = 3). (D) Expression of differentially expressed genes (DEGs) in the gossypol biosynthesis pathway, represented by a heatmap and estimated by computing fragments per kilobase of transcript per million mapped fragments (FPKM) values. (E) Common and specific genes in five DEG lists: four upregulated groups (A[1]_30 dpa versus A[1]_10 dpa, A[2]_30 dpa versus A[2]_10 dpa, G[1]_12 h versus G[1]_0 h, and G[1]_24 h versus G[1]_0 h) and one not differentially expressed group (G[1]_30 dpa versus G[1]_10 dpa). (F) Common and specific genes in the five common DEGs from (D) and DEGs of CCRI12 versus CCRI12gl. (G) Correlations between GoPGF and 18 other DEGs. These genes were selected by computing the Pearson correlation coefficient (r ≥ 0.5, P ≤ 0.05) from 29 common genes in (E). (H) Expression of nine functional genes selected in (G) represented by a heatmap and estimated by computing FPKM values. A total of 48 gossypol genes and 1 pigment gland formation gene were identified in G[1] by a BLAST search using reported gossypol and pigment gland-related genes ([174]Supplemental Table 26) ([175]Ma et al., 2016; [176]Tian et al., 2018; [177]Huang et al., 2020a); they were distributed across the seven chromosomes ([178]Supplemental Figure 14). Transcriptome analysis revealed that the expression patterns of genes related to pigment gland formation and gossypol biosynthesis differed significantly between A[1]/A[2] and G[1]. Most genes were upregulated during the process of ovule development in A[1] and A[2], whereas these genes were not upregulated in G[1] until seed germination ([179]Supplemental Figure 15, [180]Supplemental Tables 27 and [181]28). A heatmap also showed that some genes in the gossypol biosynthesis pathway differed in expression pattern between G[1] and A[1]/A[2] ([182]Figure 4D). The expression patterns of these genes were further verified by qPCR and were consistent with the phenotypes of pigment gland formation and gossypol biosynthesis ([183]Supplemental Figure 16). The other two Australian Gossypium species, G[2] and G[3], were similar to G[1] in gene expression pattern. Gene expression in the glandless near-isogenic line (NIL) CCRI12gl was inhibited at all periods compared with the corresponding glanded NIL CCRI12 ([184]Supplemental Figure 17). Through transcriptome sequencing and comparative analysis, 89 common differentially expressed genes (DEGs) were identified and shared by five groups of DEGs ([185]Figure 4E). These genes were upregulated in ovules of A[1] and A[2] at 30 dpa and in germinated seeds of G[1] at 12 and 24 h after germination; they included GoPGF and 12 gossypol biosynthesis genes ([186]Supplemental Figure 18). Among them, 29 DEGs were also found in the glandless NILs group (CCRI12 versus CCRI12gl) ([187]Figure 4F). Among them, nine gossypol biosynthesis genes and nine other functional genes showed significant positive correlations in expression pattern with GoPGF based on Pearson’s correlation coefficient (r ≥ 0.5, P ≤ 0.05) ([188]Figure 4G and 4H). These nine functional genes were significantly positively correlated with GoPGF and may be related to pigment gland formation and gossypol biosynthesis. Function and mechanism of CYP76B6 VIGS assays were used to verify the functions of the nine genes that were positively correlated with GoPGF ([189]Figure 5A). Among them, the novel gene Gbi08G2110, identified as a homolog of CYP76B6 in G. hirsutum ([190]Smitha et al., 2019), had a significant effect on gossypol biosynthesis in G[1]. The gossypol content in leaves and stems of GbiCYP76B6-silenced seedlings was reduced to 0.347 and 0.049 mg/g, respectively, significantly lower than that in the negative control tobacco rattle virus (TRV):00 seedlings (0.748 mg/g in stems and 0.113 mg/g in leaves) ([191]Figure 5A). The VIGS system verified that gossypol content was completely inhibited in stems and leaves of GoPGF-silenced seedlings, whose gossypol contents were significantly lower than those of TRV:00 ([192]Supplemental Figure 19). Figure 5. [193]Figure 5 [194]Open in a new tab Function and mechanism of GbiCYP76B6 in gossypol biosynthesis. (A) Relative GbiCYP76B6 expression levels and gossypol contents in leaves and stems of seedlings after the silencing of GbiCYP76B6 in G[1]. TRV:00 was used as a negative control. Independent experiments with mean ± SD, n = 3; ∗P < 0.05. (B) Relative expression levels of GbiCYP76B6 in GoPGF-silenced plants and FPKM expression values of GhCYP76B6 in CCRI12 (glanded) and CCRI12gl (glandless). ∗P < 0.05; ∗∗P < 0.01. (C) Relationship between CYP76B6 and gossypol biosynthesis in the terpene biosynthesis pathway. (D)cis-acting elements in the 2000-bp promoter upstream of the GbiCYP76B6 coding region. (E) Mechanistic model of GoPGF and CYP76B6 in gossypol biosynthesis. Providing further evidence of the relationship between GoPGF and CYP76B6, the expression of GbiCYP76B6 was significantly downregulated in GoPGF-silenced plants compared with TRV:00 ([195]Figure 5B). In addition, the homologous genes GhCYP76B6_A12 and GhCYP76B6_D12 were significantly downregulated in the glandless NIL CCRI12gl compared with the glanded NIL CCRI12 ([196]Figure 5B). These findings suggest that the expression of GbiCYP76B6 may be regulated by GoPGF. CYP76B6, a geraniol 8-hydroxylase, belongs to the CYP76 family of cytochrome P450 enzymes, which play a role in catalyzing the single or double oxidation of monoterpenols ([197]Ilc et al., 2016) ([198]Figure 5C). By predicting the function of cis-regulatory elements in the promoter sequence (2000 bp) of GbiCYP76B6 with PlantCARE ([199]Lescot et al., 2002), we found multiple sites related to abiotic stress, including elements that function in responses to light, gibberellin, and MeJA ([200]Figure 5D). Among them, a G-box light-responsive motif was found 1467 bp upstream of the coding region ([201]Figure 5D) and may interact with GoPGF ([202]Ma et al., 2016). GoPGF, from a class of basic helix–loop–helix transcription factors, may regulate the expression of GbiCYP76B6 by combining with its promoter element, thus affecting gossypol biosynthesis ([203]Figure 5E). Discussion The whole genome of G[1] was successfully sequenced and assembled. A total of 1704.25 Mb (96.51%) of sequences were mapped onto 13 pseudochromosomes with a contig N50 of 4.62 Mb and a scaffold N50 of 133.90 Mb. The genome size of G[1] was 1776.06 Mb, close to that of G[2]but larger than that of diploid A and D species ([204]Udall et al., 2019a; [205]Wang et al., 2019; [206]Huang et al., 2020a; [207]Cai et al., 2020). A phylogenetic tree was constructed, and the divergence time of the A and G genomes was estimated to be 5.9 mya, after the divergence of the D and A/G genomes (6.6 mya), showing that the G genome had a closer relationship with the A genome than the D genome. In addition, the content and proportion of TEs and LTRs in the G genome were similar to those in the A genome and were more than twice those in the D genome. These findings show a potential relationship between genome similarities and TEs/LTRs in Gossypium, and previous studies have indicated that LTR insertion events occurred during the evolution of Gossypium, contributing to the divergence and evolution of the genus ([208]Huang et al., 2020a; [209]Cai et al., 2020). In addition, in accordance with other Gossypium species, the formation of the G genome was also accompanied by a specific WGD event within 13–20 mya ([210]Li et al., 2014; [211]Cai et al., 2020) that did not occur in the evolution of T. cacao. This WGD event may have shaped the genomes of Gossypium after their divergence from Theobroma. Structural variations play a key role in the divergence between G. hirsutum and G. barbadense, contributing to the stronger adaptability and higher yield of G. hirsutum and the better fiber quality and disease resistance of G. barbadense ([212]Hu et al., 2019b; [213]Wang et al., 2019). Our findings revealed the presence of unique translocations in the G genome that were formed before the divergence between G[1] and G[2]. These two translocations contained genes related to gossypol biosynthesis, including CDNC and CYP82D113 ([214]Tian et al., 2018). In addition, genes related to fiber elongation and secondary wall thickening, including ADF1, PAG1, Sus1, RDL1, and CFE1A, were also located in this translocation region ([215]Ji et al., 2003; [216]Lee et al., 2007), which might be related to fiber development in G[1]. Moreover, two inversions were found between G[1] and G[2] and contained genes such as MYB25 and FSN1 ([217]Machado et al., 2009; [218]Zhang et al., 2018), which are involved in fiber formation and secondary wall thickening. By comparing the nuclear and chloroplast genomic and phenotypic characteristics among G[1], G[2], G[3], and C[1], we hypothesized that G[1] evolved from interspecific hybridization between the ancestor of C[1] as the female parent and the common ancestor of G[2] and G[3] as the male parent, consistent with the previous view that G[1] has a biphyletic ancestry ([219]Wendel et al., 1991). The Australian Gossypium of C and G genomes had closer genetic relationships and no geographic isolation, which provided them with more opportunities for crossing. This could explain why the phenotypes of G[1] were intermediate between those of C[1] and G[2]/G[3], further confirming that G[1] was the product of hybridization between G and C ancestors. Compared with G[2] and G[3], G[1] was more attractive for potential breeding, not only because of its delayed pigment gland morphogenesis in cotyledons but also because of some other useful traits such as high boll setting, biotic and abiotic stress tolerance, fiber quality, and insensitivity of maturity and flowering to seasonal temperature and light ([220]Zhu et al., 1998). Several genes related to pigment glands have been identified in previous studies ([221]Ma et al., 2016; [222]Janga et al., 2018; [223]Cai et al., 2020; [224]Gao et al., 2020). However, knocking out only GoPGF completely blocks the process of gland formation ([225]Ma et al., 2016; [226]Janga et al., 2018). Eight functional genes (CDNC, CYP706B1, DH1, CYP82D113, CYP71BE79, 2-ODD-1, SPG, and CYP736A196) have been reported to participate in the gossypol biosynthesis pathway ([227]Sunilkumar et al., 2006; [228]Tian et al., 2018; [229]Huang et al., 2020a). The expression patterns of these genes in G[1] were different from those in A[1] and A[2] and were related to its traits of pigment gland formation and gossypol accumulation. Furthermore, the novel gene GbiCYP76B6 was identified and shown to reduce the gossypol content after silencing. It has been identified as a geraniol hydroxylase and is involved in the biosynthesis of monoterpenes ([230]Ilc et al., 2016). GbiCYP76B6 expression was significantly positively correlated with that of GoPGF. GoPGF, in a class of transcription factors, interacts with the upstream cis-acting G-box element ([231]Ma et al., 2016). These results show that GoPGF may affect the biosynthesis of gossypol by regulating the expression of GbiCYP76B6 in G[1]. In addition, GoPGF can inhibit gossypol content by regulating other pathways, such as interacting with TPS and WRKY ([232]Ma et al., 2016). These findings provide insights into the mechanisms by which GoPGF and GbiCYP76B6 function in gossypol biosynthesis. Methods Plant material sampling The experimental materials, including G[1], G[2], G[3], C[1], A[1], and A[2], as well as a pair of glanded and glandless NILs of G. hirsutum, CCRI12 (glanded) and CCRI12gl (glandless), were used in the study. G[1], G[2], G[3], C[1], A[1], and A[2] were provided by the Wild Cotton Species Garden in Sanya, Hainan. CCRI12 and CCRI12gl were developed by crossing Hai-1 (G. barbadense, glandless) with CCRI12, an upland cotton cultivar, and backcrossing to CCRI12, with selection for the glandless trait, for more than 10 generations. All plant materials were planted in the greenhouse of the Agricultural Station, Zhejiang University, Hangzhou, China. Sample preparation and sequencing High-quality genomic DNA was extracted from fresh leaves of G[1] using the cetyl trimethyl ammonium bromide method ([233]Porebski et al., 1997). Two paired-end libraries (250 and 450 bp) were constructed and sequenced on an Illumina HiSeq X Ten platform (Illumina, San Diego, CA, USA) ([234]Nair et al., 2018), and a PCR-free library (450 bp) was constructed and sequenced on an Illumina HiSeq 2500 system ([235]Minoche et al., 2011). For PacBio sequencing, genomic DNA was sheared into approximately 20-kb targeted size, and long-read libraries were constructed with the SMRTbell Template Prep Kit (PacBio) and sequenced on the PacBio RS II platform ([236]Chin et al., 2016). For Hi-C sequencing, genomic DNA was cross-linked, digested with the restriction enzyme MboI, and labeled with biotinylated residues ([237]Belton et al., 2012). Biotinylated circles were then enriched, sheared, and sequenced on the Illumina HiSeq X Ten platform ([238]Nair et al., 2018). Genome survey and assembly Jellyfish ([239]https://genome.umd.edu/jellyfish.html) was used to estimate genome size based on K-mer distribution using the high-quality reads from the short-insert (250 and 450 bp) paired-end libraries. Contig assembly of G[1] was performed with the PacBio subreads using FALCON (v.0.7) with the following parameters: length_cutoff_pr=5,000, max_diff=150, max_cov=150 ([240]Chin et al., 2013). The assembled contigs were then polished using Quiver software (smrtlink 7.0.1) ([241]Chin et al., 2013) with default parameters. To increase the accuracy of the assembly, Illumina short reads were used to perform error correction with default parameters of Pilon (v.1.22) ([242]Walker et al., 2014). To obtain the chromosome-level genome assembly, all Hi-C reads were aligned onto the contigs with HiC-Pro (v.2.10.0) software ([243]Servant et al., 2015). LACHESIS (v.2.0) ([244]Burton et al., 2013) was used to cluster the contigs into chromosomes with the following parameters: CLUSTER_N=13, CLUSTER_MIN_RE_SITES=160, CLUSTER_MAX_LINK_DENSITY=2, CLUSTER_NONINFORMATIVE_RATIO=2, ORDER_MIN_N_RES_IN_TRUNK=50, ORDER_MIN_N_RES_IN_SHREDS=10. Juicebox (v.1.18) ([245]Robinson et al., 2018) was used for manual fine-tuning with a graphical and interactive approach to obtain the final chromosome-level assembly. Genome quality evaluation To evaluate the accuracy of the genome assembly, high-quality short paired-end reads were aligned to the G[1] genome using the Burrows–Wheeler aligner ([246]http://bio-bwa.sourceforge.net/) with parameters -k 32 -w 10 -B 3 -O 11 -E 4. The completeness of the genome assembly was then evaluated using CEGMA (v.2.5) ([247]Parra et al., 2007) and BUSCOs (v.4.1.2) ([248]Kriventseva et al., 2019). Genome annotation To annotate repetitive sequences in the G[1] genome, RepeatModeler (v.1.0.5) ([249]Price et al., 2005), RepeatScout (v.1.0.5) ([250]Lian et al., 2016), Piler (v.1.0) ([251]Edgar and Myers, 2005), and LTR_FINDER (v.1.07) ([252]Xu and Wang, 2007) were used with default settings to build a non-redundant de novo repeat library. RepeatMasker (v.4.0.7) and RepeatProteinMask (v.4.0.7) ([253]Xu and Wang, 2007) were used for homologous repeat identification by running against the Repbase library. Tandem repeats were identified using TRF (v.4.07b) ([254]Benson, 1999). To estimate insertion times, LTR retrotransposon ends were aligned with MUSCLE (v.3.8.31) ([255]Edgar, 2004). The transition–transversion ratio was used to calculate nucleotide distance (D) in EMBOSS ([256]http://emboss.sourceforge.net/), and the insertion time of LTR retrotransposons was estimated by T = D/2r. The r value was obtained from a previous report ([257]Hu et al., 2019b). De novo predictions, homology-based predictions, and transcriptome-based predictions were used for gene structure annotation in the G[1] genome. Five ab initio gene prediction programs, Augustus (v.3.0.2) ([258]Stanke and Waack, 2003), Genescan (v.1.0) ([259]Salamov, 2000), geneid (v.1.4.4) ([260]Parra et al., 2000), GlimmerHMM (v.3.0.2) ([261]Majoros et al., 2004), and SNAP (v.11.29) ([262]Korf, 2004), were used to predict coding regions in the repeat-masked genome. Homologous proteins from sequenced species used for homology-based prediction referred to in [263]Supplemental Table 10 were downloaded from the National Center for Biotechnology Information (NCBI; [264]www.ncbi.nlm.nih.gov) for homology-based predictions and aligned to the G[1] genome using TBLASTN (v.2.2.26) ([265]Altschul et al., 1997) (E value ≤ 1E−05). RNA-sequencing data from different tissues were mapped to the G[1] genome using Cufflinks (v.2.1.1) ([266]Trapnell et al., 2009), and PASA (v.2.1.1) ([267]Trapnell et al., 2010) was used to assemble transcripts and predict gene structures. All predicted genes from the above three methods were integrated into a comprehensive non-redundant gene set with EVidenceModeler (v.1.1.1) ([268]Haas et al., 2008). Functional annotation of the protein-coding genes in the G[1] genome was performed using BLASTP (E value ≤ 1E−05) against the SwissProt ([269]Apweiler et al., 2004), TrEMBL ([270]Bairoch and Apweiler, 1999), and NR databases ([271]ftp://ftp.ncbi.nih.gov/blast/db) and KEGG ([272]Kanehisa and Goto, 2000). The best hits were used to assign homology-based gene functions with the parameter "blastp -max_target_seqs". The software tRNAscan-SE ([273]Lowe and Eddy, 1997) was used with default parameters to predict tRNA genes in the G[1] genome. The rRNA, microRNA, and small nuclear RNA fragments were identified using INFERNAL software ([274]Nawrocki et al., 2009) with searches against the Rfam database ([275]Griffiths-Jones et al., 2005). Genome resequencing High-quality genomic DNA was extracted from fresh leaves of G[2], G[3], and C[1] and sequenced on the Illumina HiSeq X Ten platform (Illumina, San Diego, CA, USA). The raw reads were trimmed using Trimmomatic (v.0.36) ([276]Bolger et al., 2014) with default parameters, and the clean reads were aligned to the G[1] genome using the Burrows–Wheeler aligner (v.0.7.8) with the parameters -t 4 -k 32 -M ([277]http://bio-bwa.sourceforge.net/). SAMtools (v.1.3) ([278]Li and Durbin, 2009) was used with default parameters for sorting and removing duplicates. The alignment results were then used to call variants with the HaplotypeCaller module of GATK (v.4.0) ([279]McKenna et al., 2010). VCFtools (v.0.1.14) ([280]Danecek et al., 2011) was used for further filtering of SNPs with the parameters --max-missing 1 --maf 0.05. Phylogenetic analysis was performed with the filtered SNPs using the software TreeBeST (v.1.9.2) with 1000 bootstraps ([281]https://github.com/Ensembl/treebest). Identity scores were calculated to compare the genome similarity between cotton species according to the SNPs within 100-kb windows and 50-kb step sizes. Divergence time was estimated using the MCMCTree program in PAML (v.4.6) ([282]Yang, 2007) with the main parameters burn-in = 10,000, sample-number = 100,000, and sample-frequency = 2. Chloroplast genome assembly To assemble the chloroplast genomes of G[1], G[2], G[3], and C[1], Illumina short paired-end reads generated from genome resequencing were trimmed using Trimmomatic (v.0.36) ([283]Bolger et al., 2014) with default parameters. ABySS software (v.2.3.4) ([284]http://www.bcgsc.ca/platform/bioinfo/software/abyss) was used for initial assembly. GapCloser (v.1.12) ([285]Luo et al., 2012) was then used for gap filling with default parameters to obtain the final assembled sequences. Subsequently, CHLOROBOX ([286]http://chlorobox.mpimp-golm.mpg.de/index.html) was used with default parameters to annotate the chloroplast genomes, and OGDRAW (v.1.3.1) ([287]Lohse et al., 2007) was used to draw the genome maps. MUSCLE (v.3.8.31) ([288]Edgar, 2004) was used to perform multiple sequence alignment with default parameters for each single-copy gene, and the alignment results were concatenated into a super alignment matrix. The concatenated sequences were then used to infer maximum likelihood trees using RAxML (v.8.0.19) ([289]Stamatakis, 2014) with 1000 bootstraps. Divergence time was estimated using the MCMCTree program in PAML (v.4.6) with the main parameters burn-in = 10,000, sample-number = 100,000, and sample-frequency = 2. Chromosomal collinearity analysis Genome syntenic blocks were detected using NUCmer software (v.4.0.0) ([290]Marcais et al., 2018) with parameters --mum, --maxgap=1000, --mincluster=90, --minmatch=40. Delta-filter was then used with default parameters to obtain one-to-one alignments, and syntenic regions were plotted using MUMmerplot. Inversions and translocations were further confirmed using Hi-C data from G[1] with 3D-DNA software (v.180114) ([291]Dudchenko et al., 2017), which was used to calculate links among different bins with a resolution of 500 kb. Gene synteny blocks were detected using MCScanX software ([292]https://github.com/wyp1125/MCScanX) with parameters --cscore 0.8, --minspan=30. PCR amplification was then used to verify the chromosome structural variations between Gossypium species by designing a pair of PCR primers at both ends of each inverted and translocated fragment ([293]Supplemental Table 29). PCR reactions were performed according to the protocol of 2× Phanta Max Master Mix (Vazyme, Nanjing, China), and the products were examined by 1.0% agarose gel electrophoresis. Gene family clustering and phylogenetic tree construction Whole protein-coding genes from G[1] and eight other species, G[2] (G2_CRI) ([294]Cai et al., 2020), A[1] (A1_WHU) ([295]Huang et al., 2020a), A[2] (A2_WHU) ([296]Huang et al., 2020a), G. raimondii (D5_NSF) ([297]Udall et al., 2019a), G. thurberi (D1-35) ([298]Grover et al., 2019), G. hirsutum (HAU v1) ([299]Wang et al., 2019), G. barbadense (3-79_HAU.2) ([300]Wang et al., 2019), and Gs. kirkii (kirkii_ISU) ([301]Udall et al., 2019b), were downloaded from CottonGen ([302]Yu et al., 2021) ([303]https://www.cottongen.org), and those of T. cacao were downloaded from NCBI ([304]https://www.ncbi.nlm.nih.gov) for gene family analysis. All-against-all BLASTP (E-value ≤ 1E−07) was used to determine the similarities between the retained protein sequences. OrthoFinder (v.2.3.3) ([305]Emms and Kelly, 2019) was used to cluster genes from the above species into gene families with default parameters. Single-copy orthologs from the gene family analysis were then used to construct a phylogenetic tree with default parameters of MUSCLE software (v.3.8.31) ([306]Edgar, 2004). The protein alignments were converted to CDS alignments and then concatenated into a super alignment matrix. RAxML (v.8.0.19) ([307]Stamatakis, 2014) was used for phylogenetic tree construction by maximum likelihood methods with default parameters. The MCMCTree program in PAML (v.4.6) ([308]Yang, 1997) was used for divergence time estimation with the following parameters: burn-in=100,000, sample-number=100,000, and sample-frequency=2. The calibration points used in the study were obtained from previous studies ([309]Huang et al., 2020a; [310]Cai et al., 2020). Expansion and contraction of gene families The expansion and contraction of the gene families were determined using CAFÉ (v.4.2) with default parameters ([311]Bie et al., 2006). A random birth and death model was used to model gene gain and loss along each lineage of the phylogenetic tree. A probabilistic graphical model was introduced to calculate the probability of transitions in gene family size from parent to child nodes in the phylogeny ([312]Bie et al., 2006). Using conditional likelihoods as the test statistics, the corresponding P values in each lineage were calculated, and a P value of 0.05 was used to identify families that were significantly expanded and contracted. Whole-genome duplication analysis and substitution rate estimation To study the evolution of the G[1] genome, we searched for WGD in the assembled genome. The protein sequences from G[1], G[2], A[1], A[2], and D[5] were compared within and between species using BLASTP with an E-value threshold of 1E−07. Syntenic blocks within a genome or between genomes were then detected based on detected gene pairs using MCScanX ([313]Tang et al., 2008). The synonymous substitution rate (Ks) values of gene pairs across syntenic blocks were calculated with the codeml program in PAML (v.4.6) ([314]Yang, 2007). The divergence times between species were calculated with the formula T = Ks/2r (r = 2.6 × 10^−9). The r value was obtained from a previous report ([315]Hu et al., 2019b). qRT-PCR analysis Total RNA was extracted with the RNAprep Pure Plant Kit according to the manufacturer’s instructions (TianGen, Beijing, China) from ovules at 10, 20, and 30 dpa and seeds at 0, 12, 24, and 36 h after germination in G[1], G[2], G[3], A[1], A[2], CCRI12, and CCRI12gl; three biological replicates were obtained for each sample. Total RNA was then reverse transcribed using the PrimeScript First Strand cDNA Synthesis Kit (with gDNA remover) (TaKaRa, Dalian, China). All primers used for qRT-PCR were designed with Primer 6.0 software and are listed in [316]Supplemental Table 29, and UBQ7 ([317]Tan et al., 2013) was used as the internal control gene. qPCR was performed with the protocol of SYBR Premix Ex Taq (TaKaRa) on a LightCycler 96 real-time PCR detection system (Roche). Transcriptome sequencing and analysis Total RNA samples were extracted from ovules at 10, 20, and 30 dpa and seeds at 0, 12, 24, and 36 h after germination in G[1], A[1], and A[2]. Three biological replicates were obtained for each sample. Isolated RNA was used for cDNA library construction with the NEBNext Ultra RNA Library Prep Kit following the manufacturer’s instructions (Illumina, San Diego, CA, USA). All libraries were subjected to paired-end 150-bp sequencing on the Illumina HiSeq 4000 platform. Illumina short paired-end reads produced previously were trimmed using Trimmomatic (v.0.36) with default parameters. The high-quality reads were aligned to the G[1] genome using HISAT2 (v.2.0.4) ([318]Tang et al., 2008). The counts of reads aligned to each transcript were calculated with HTSeq (v.0.6.0) software ([319]Anders and Huber, 2010), and fragments per kilobase of transcript per million mapped fragments were used to estimate the expression level of each gene ([320]Trapnell et al., 2009). The DEGseq program ([321]Wang et al., 2010) was used to identify DEGs, and P values were adjusted with the Benjamini–Hochberg method for multiple testing ([322]Trapnell et al., 2009). Genes with adjusted P < 0.05 and greater than or equal to two-fold changes in expression were considered to be differentially expressed. Gossypol content analysis To measure the gossypol content in ovules and seeds, we collected ovules at 10, 20, and 30 dpa and seeds at 0, 12, 24, and 36 h after germination in G[1], A[1], and A[2]. The protocols for sample treatment method and gossypol content determination are described in a previous study ([323]Tian et al., 2018). Virus-induced gene silencing system A VIGS assay was used to knock down the target genes CYP76B6 and GoPGF amplified from G[1] cDNA. The primers used are listed in [324]Supplemental Table 29. TRV-based vectors (pTRV1 and pTRV2) were used. The pTRV1 and recombinant pTRV2 vectors were introduced into Agrobacterium strain GV3101 and injected into 7-day-old cotyledons of G[1] seedlings. The instructions and protocols for vector construction and virus transformation were described in a previous study ([325]Li et al., 2019). Funding This study was supported by the National Key Technology R&D Program of China (2016YFD0101404), the Chinese Agricultural Research System (CARS-18-25), and the Jiangsu Collaborative Innovation Center for Modern Crop Production. Author contributions Z.S., C.J., and Z.T. designed the project. S.K., S.Y., C.Y., H.Y., W.W., L.H., and H.Y. performed the experiments. S.K., L.M., M.U., L.C., D.M., and Z.T. analyzed the results. S.K., Z.S., and Z.T. wrote or edited the manuscript. All authors read and approved the final manuscript. Published: August 10, 2022 Footnotes Published by the Plant Communications Shanghai Editorial Office in association with Cell Press, an imprint of Elsevier Inc., on behalf of CSPB and CEMPS, CAS. Supplemental information is available at Plant Communications Online. Contributor Information Shuijin Zhu, Email: tlzhao@zju.edu.cn. Jinhong Chen, Email: jinhongchen@zju.edu.cn. Tianlun Zhao, Email: shjzhu@zju.edu.cn. Accession numbers The G. bickii genome assembly and annotation data are available at the CottonGen database ([326]http://www.cottongen.org/). The raw genome and transcriptome reads have been deposited in the NCBI BioProject database. The genome sequencing data for G. bickii have been deposited under BioProject accession no. [327]PRJNA809539; the transcriptome sequencing data for G. bickii, G. herbaceum, and G. arboreum have been deposited under BioProject accession no. [328]PRJNA813940; and the resequencing data for G. australe, G. nelsonii, and G. sturtianum have been deposited under BioProject accession no. [329]PRJNA811802. Supplemental information Document S1. Supplemental Figures 1–19 [330]mmc1.pdf^ (8.3MB, pdf) Document S2. Supplemental Tables 1–29 [331]mmc2.xlsx^ (2.6MB, xlsx) Document S3. Article plus supplemental information [332]mmc3.pdf^ (12.4MB, pdf) References