Abstract

Background

   Cytoplasmic male sterility (CMS) is a maternally inherited trait in
   which plants fail to produce functional pollen. It plays a pivotal
   role in the exploitation of crop heterosis. Specific locus amplified
   fragment sequencing (SLAF-seq), a high-resolution strategy for the
   large-scale identification of new SNPs, is increasingly applied to
   functional gene mining. The current study combined bulked segregant
   analysis (BSA) with SLAF-seq to identify candidate genes associated
   with the fertility restorer gene (Rf) in CMS cotton.

Methods

   The parents were systematically investigated by Illumina sequencing.
   A segregating population comprising 30 + 30 F[2] individuals was
   developed using 3096A (female parent) as the sterile line and 866R
   (male parent) as the restorer. The raw data obtained by dual-index
   sequencing were analyzed to obtain the reads of each sample, which
   were aligned to the reference genome in order to identify SLAF tags
   that were polymorphic between the parent lines and SNPs supported by
   read coverage. Based on the SLAF tags, SNP-index analysis, Euclidean
   distance (ED) correlation analysis, and whole genome resequencing,
   the hot regions were annotated.

Results

   A total of 165,007 high-quality SLAF tags, with an average depth of
   47.90× in the parents and 50.78× in F[2] individuals, were sequenced.
   In addition, a total of 137,741 SNPs were detected: 113,311 and 98,861
   SNPs in the male and female parent, respectively. A correlation
   analysis by SNP-index and ED initially located the candidate region
   within 1.35 Mb of chrD05, and 20 candidate genes were identified.
   These genes carried genetic variations, including single base
   mutations, insertions, and deletions. Moreover, 42 InDel markers were
   developed from the whole genome resequencing data.

Conclusions

   The associated markers identified in this study by super-BSA could
   accelerate the study of CMS in cotton as well as in other crops. The
   preliminary characterization of some of the 20 genes provides useful
   information for further studies on CMS crops.

Electronic supplementary material

   The online version of this article (10.1186/s12864-017-4406-y) contains
   supplementary material, which is available to authorized users.

   Keywords: CMS, High-throughput sequencing, SLAF-seq, Super-BSA, Cotton

Background

   As a maternally inherited characteristic, cytoplasmic male sterility
   (CMS) plays a major role in crop heterosis research and practice.
   Current studies suggest that CMS is caused by mutations in the
   associated genes of the mitochondrial genome and is suppressed by
   fertility restorer genes in the nuclear genome [[41]1]. This
   phenomenon exists in bean [[42]2, [43]3], petunia [[44]4], sorghum
   [[45]5], and rice [[46]6]. The fertility restorer gene (Rf), which is
   often found in these crops, can inhibit the expression of the
   mitochondrial sterility gene. In cotton, the restorer genes are not
   consistent among different sterile lines.

   The main cotton CMS sources with utilization value are the Harknessii
   cytoplasmic male sterile line, the Trilobum cytoplasmic male sterile
   line, and the cytoplasmic sterile lines of upland cotton (104-7A,
   Xiangyuan A, Jin A). The three-line hybrids selected in China were
   primarily derived from the cytoplasmic male sterile lines of upland
   cotton. Since the CMS sources are different, the restorers are also
   different, which has led to various theories on the fertility
   restoration mechanism of CMS in cotton. Fertility restoration of CMS
   in Harknessii cotton is regulated by one dominant gene, Rf1. The
   sterility of Trilobum cotton can be restored by either Rf2 of the
   Trilobum restorer or Rf1 of the Harknessii restorer; however, Rf2 is
   not able to restore the CMS-D2–2 of Harknessii. Rf1 and Rf2 are
   closely linked, with a distance of 0.93 cM [[47]7]. The Chinese
   breeding material differs from the CMS lines of Harknessii and
   Trilobum. Fertility restoration of the upland cotton CMS line is
   regulated by two pairs of independent restorer genes: Rf1, which is
   completely dominant, and Rf2, which is partially dominant. The
   restoring effect of Rf1 is stronger than that of Rf2 [[48]8]. The
   identification of molecular markers and gene mapping of CMS in cotton
   has also progressed. Liu et al. found 3 SSR and 2 RAPD markers closely
   linked to the restorer gene [[49]9]. Feng et al. found that 3 STS
   markers co-segregated with the restorer gene [[50]10]. Yin et al.
   constructed accurate genetic and physical maps of 15 molecular
   markers closely linked to the restorer gene, which was located on
   chromosome 19 (LGD08 linkage group) at a genetic distance of <1 cM;
   physically, it lay within 100 kb between two overlapping BAC clones
   [[51]11]. Wang et al. suggested that the two Rf restorer genes might
   be located on chromosome 19 in the D subgenome, i.e. chromosome D5 of
   cotton [[52]12, [53]13]. However, due to differences in the source of
   sterile cytoplasm and variation in nuclear genotypes, the effects of
   the nuclear genes and the sterile cytoplasm differ. Thus, fine mapping
   and the discovery of new restorer gene candidates in upland cotton are
   essential.

   Large-scale genotyping plays a major role in genetic association
   studies. Specific locus amplified fragment sequencing (SLAF-seq)
   provides a high-resolution strategy for large-scale genotyping and can
   be applied to various species and populations [[54]14]; for instance,
   cucumber [[55]15], Glycine max [[56]16], and sesame [[57]17]. It is
   based on reduced representation library (RRL) and high-throughput
   sequencing. The technology has several distinguishing characteristics:
   i) deep sequencing to ensure genotyping accuracy; ii) reduced
   representation strategy to reduce sequencing costs; iii) pre-designed
   reduced representation scheme to optimize marker efficiency; and iv)
   double barcode system for large populations [[58]14].

   In this study, the materials were the female parent CMS line 3096A of
   the three-line hybrid Ji FRH3018 [[59]18] (derived from CMS 104-7A of
   upland cotton, used as a recurrent parent line and bred with the
   backbone parent line for nucleus replacement), the male parent
   restorer line 866R with strong restoring power, and the F[2]
   segregating population derived from their cross. We studied the fine
   mapping of the restorer gene and its associated candidate genes using
   high-throughput sequencing platforms. A total of 137,741 SNPs were
   detected, and 20 candidate genes were identified within 1.35 Mbp of
   chrD05, 19 of which were annotated in the databases.

Methods

Test material

   The female parent CMS line 3096A from the three-line CMS hybrid Ji
   FRH3018 of upland cotton, the male parent restorer line 866R with
   strong restoring power, and the F[2] segregating population derived
   from their cross (two mixed pools of 30 + 30 individuals with extreme
   phenotypes) were used as research materials.

Test method

Genome resequencing of CMS line 3096A and fertility restorer line 866R

Sample collection and SLAF library preparation

   Fresh leaves were collected from the parent lines and the F[2] plants
   and frozen in liquid nitrogen. Genomic DNA was extracted by the CTAB
   method, and its quality was assessed by 1% agarose gel
   electrophoresis. The purity of the DNA was examined using the
   NanoPhotometer® spectrophotometer (Implen, CA, USA). The DNA
   concentration was estimated using the Qubit® DNA Assay Kit in a
   Qubit® 2.0 Fluorometer (Life Technologies, CA, USA).

   We used 1.5 μg of DNA per sample as input material for sample
   preparation. RsaI was chosen as the restriction enzyme on the basis of
   an in silico digestion of the cotton reference genome sequence.
   Sequencing libraries were generated using the RsaI restriction enzyme
   according to the manufacturer’s recommendations, and index codes were
   added to attribute the sequences to each sample. Briefly, the DNA
   sample was fragmented by sonication to a size of 350 bp. Then, the DNA
   fragments were end-polished, A-tailed, and ligated with the
   full-length adapter for Illumina sequencing, followed by PCR
   amplification. Finally, the PCR products were purified (Agencourt®
   AMPure® XP, USA), and the libraries were analyzed for size
   distribution on an Agilent 2100 Bioanalyzer and quantified by
   real-time PCR.
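
   The in silico digestion used to choose the enzyme can be sketched as
   follows. This is a minimal illustration, not the authors' pipeline:
   it relies only on the known RsaI recognition site (GTAC, cut between
   T and A), while the function names, the toy sequence, and the size
   window passed to count_in_window are hypothetical. In practice the
   window would be the SLAF fragment range chosen for the library.

```python
import re

# RsaI recognises the palindromic site GTAC and cuts between T and A,
# leaving blunt ends. Everything else here is an illustrative assumption.
RSAI_SITE = "GTAC"
CUT_OFFSET = 2  # the cut falls after the second base of the site: GT^AC


def digest(sequence, site=RSAI_SITE, cut_offset=CUT_OFFSET):
    """Return fragment lengths from a complete digest of one linear sequence.

    The first and last fragments are bounded by a sequence end rather
    than by two restriction sites.
    """
    seq = sequence.upper()
    cuts = [m.start() + cut_offset for m in re.finditer(site, seq)]
    bounds = [0] + cuts + [len(seq)]
    return [end - start for start, end in zip(bounds, bounds[1:])]


def count_in_window(lengths, min_len, max_len):
    """Count fragments whose length lies in [min_len, max_len]."""
    return sum(min_len <= n <= max_len for n in lengths)


if __name__ == "__main__":
    toy = "AAGTACCCCGTACTT"  # hypothetical sequence, not cotton data
    lengths = digest(toy)
    print(lengths, count_in_window(lengths, 5, 10))
```

   Running the same count for several candidate enzymes and size windows
   is one way to compare how many tags each scheme would yield.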

Illumina sequencing

   The libraries constructed above were sequenced on an Illumina HiSeq™
   2500 (Illumina, Inc., San Diego, USA) platform at Biomarker
   Technologies Corporation in Beijing ([60]http://www.biomarker.com.cn/),
   and 125 bp paired-end reads were generated with an insert size of
   approximately 350 bp.

Data analysis, data filtering, and alignment

   The recently released genome of Gossypium hirsutum was downloaded from
   the Cotton Research Institute (CRI) of Nanjing Agricultural University
   in China ([61]http://mascotton.njau.edu.cn/Data.htm, v1.1) and used as
   the reference genome [[62]19]. Fastx-toolkit (v 0.0.14–1) was used to
   filter out low-quality reads based on the following criteria: (i)
   reads with ≥10% unidentified nucleotides (N); (ii) reads in which >50%
   of the read length had a Phred quality value ≤10; (iii) reads
   containing adapter sequence. The remaining clean reads were aligned to
   the reference genome using BWA-MEM (0.7.10-r789) [[63]20] with default
   parameters. Sequence Alignment/Map tools (SAMtools) (v1.1) [[64]21]
   were applied to sort and index the resulting binary alignment map
   (BAM) files. Duplicates were excluded using Picard tools (v1.102)
   ([65]http://broadinstitute.github.io/picard/), and the final sorted
   BAM files were used in the downstream analysis.

   Variant calling and filtering were performed so as to reduce the
   inaccuracy of the alignment. Local realignment around insertions and
   deletions, base quality recalibration of the reads, and variant
   calling were conducted using GATK Tools version 3.6. GATK
   HaplotypeCaller (HC) was used for variant calling [[66]22, [67]23].
   Variants that fulfilled the following criteria were retained: (1)
   mapping quality filter equivalent to PASS; (2) quality by depth (QD)
   >2; (3) mapping quality (MQ) >40; (4) QUAL >30. Moreover, variants
   were further filtered out if the coverage was <10, if more than 2 SNPs
   clustered within a 5 bp window, or if a SNP lay within 5 bp of an
   indel.

SV detection and annotation

   BreakDancer was used to predict five types of structural variants
   (SVs): insertions (INSs), deletions (DELs), inversions (INVs),
   intra-chromosomal translocations (ITXs), and inter-chromosomal
   translocations (CTXs) from next-generation paired-end sequencing
   reads, utilizing read pairs mapped with unexpected separation
   distances or orientation. SVs with read depth <2 were filtered out.
   Bedtools was employed to annotate the detected DELs, INSs, and INVs.

CNV detection and annotation

   Copy number variation (CNV) refers to a normal variation in the number
   of copies of one or more sections of the genome. We used CNVnator
   (parameters: -call 100) for the identification of CNVs and bedtools
   for annotation.
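
   The variant hard-filter thresholds given in this section (PASS, QD >2,
   MQ >40, QUAL >30, coverage of at least 10, and no more than 2 SNPs in
   a 5 bp window) can be written down as a small sketch. The Variant
   record and both function names are hypothetical, and two readings are
   assumptions: that "coverage" means per-site read depth, and that a
   5 bp window spans positions p to p + 4. The actual filtering was done
   with GATK, not with this code.

```python
from dataclasses import dataclass


@dataclass
class Variant:
    """Hypothetical record holding the annotations used by the filter."""
    qual: float
    qd: float
    mq: float
    depth: int
    filter_status: str = "PASS"


def passes_hard_filter(v):
    """Apply the thresholds quoted in the text to one variant."""
    return (
        v.filter_status == "PASS"
        and v.qd > 2
        and v.mq > 40
        and v.qual > 30
        and v.depth >= 10  # assumption: coverage < 10 is rejected
    )


def clustered_snps(positions, window=5, max_snps=2):
    """Return positions that belong to a cluster of > max_snps SNPs.

    Assumption: a window starting at position p covers p .. p + window - 1.
    """
    pos = sorted(positions)
    flagged = set()
    for i, start in enumerate(pos):
        group = [p for p in pos[i:] if p < start + window]
        if len(group) > max_snps:
            flagged.update(group)
    return flagged


if __name__ == "__main__":
    print(passes_hard_filter(Variant(qual=50, qd=10, mq=60, depth=20)))
    print(sorted(clustered_snps([100, 101, 103, 200])))
```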

SLAF library construction and high-throughput sequencing

   The target fragments were obtained by PCR amplification, purification,
   sample pooling, and gel excision. After the quality of the library had
   been checked, sequencing was performed on an Illumina HiSeq™ 2500.

SLAF tag development and SNP detection

   The raw reads of each sample were obtained by dual-index sequencing.
   After the sequencing adapters had been removed from the reads, the
   sequencing quality and the data volume were assessed. The digestion
   efficiency of Rsa I on the control data was used to assess the
   accuracy and efficiency of the test procedure. The reads were aligned
   to the reference genome, and SLAF tags were developed in the parent
   lines and mixed pools in order to identify SLAF tags that were
   polymorphic between the parent lines and SNPs with read coverage
   [[68]21]. A correlation analysis was conducted to identify SNPs at
   loci closely related to the trait and to determine the candidate
   regions according to the correlation thresholds. Finally, functional
   annotation and biological pathway enrichment analyses were conducted
   for the genes in the candidate regions.

Correlation analysis

SNP-index analysis

   The SNP-index of each of the two mixed pools was calculated using the
   SNP data of the parent lines, and loci that might be associated with
   the segregation of the trait were assessed through the ΔSNP-index
   [[69]24, [70]25]. The SNP-index is calculated as follows:
   \[
   \begin{aligned}
   \text{SNP-index}(\mathrm{Mut}) &= \rho x / (\rho X + \rho x) \\
   \text{SNP-index}(\mathrm{WT}) &= \rho x / (\rho X + \rho x) \\
   \Delta\text{SNP-index} &= \text{SNP-index}(\mathrm{Mut}) - \text{SNP-index}(\mathrm{WT})
   \end{aligned}
   \]

   Mut and WT are the mutant and wild-type pools of the filial
   generation, respectively. ρX and ρx indicate the number of reads
   carrying the alleles of the wild-type and mutant parent lines,
   respectively, in each pool. The difference at each locus between the
   mutant and wild-type pools can be observed through the ΔSNP-index
   [[71]26]. In order to eliminate false-positive loci, the SNP-index
   values of markers on the same chromosome are fitted according to the
   positions of the markers on the genome. The region above the threshold
   is considered to be correlated with the trait. For a qualitative
   trait, the correlation threshold is the theoretical ΔSNP-index value
   of the corresponding population; for example, the correlation
   threshold of the F[2] population is 0.67. For a quantitative trait,
   the correlation threshold is obtained by a computer simulation
   sampling experiment, and the probability of each marker being
   associated with the target trait is calculated.
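
   As a minimal sketch of the SNP-index and ΔSNP-index defined above,
   the functions below take the read counts ρX and ρx of one pool as a
   (wild-type reads, mutant reads) pair. The function names and the toy
   counts are hypothetical. The last function shows one common way to
   arrive at the 0.67 threshold for an F[2] population, under the
   assumption (not stated in the text) of a single gene with complete
   dominance: the recessive pool is homozygous for the mutant allele,
   while the dominant pool mixes homozygous and heterozygous plants in a
   1:2 ratio.

```python
from fractions import Fraction


def snp_index(wt_reads, mut_reads):
    """SNP-index of one pool: mutant-allele reads over all reads."""
    total = wt_reads + mut_reads
    if total == 0:
        raise ValueError("locus has no read support in this pool")
    return mut_reads / total


def delta_snp_index(mut_pool, wt_pool):
    """Delta SNP-index = SNP-index(Mut) - SNP-index(WT).

    Each pool is a (wild-type allele reads, mutant allele reads) pair.
    """
    return snp_index(*mut_pool) - snp_index(*wt_pool)


def theoretical_f2_delta():
    """Expected delta at a causal locus under the single-gene assumption."""
    recessive_pool = Fraction(1)        # all plants homozygous mutant
    # Dominant pool: 1 XX : 2 Xx, so 2 mutant alleles among 3 * 2 alleles.
    dominant_pool = Fraction(2, 3 * 2)
    return recessive_pool - dominant_pool


if __name__ == "__main__":
    print(delta_snp_index((0, 40), (30, 10)))   # toy counts
    print(round(float(theoretical_f2_delta()), 2))
```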

Euclidean distance (ED) algorithm

   The ED algorithm uses the sequencing data to identify markers that
   differ significantly between the mixed pools and thereby evaluates the
   region associated with the trait [[72]27]. Theoretically, the two
   mixed pools established by BSA differ at the loci related to the
   target trait and tend to be consistent elsewhere; hence, the ED value
   of loci unrelated to the target should tend toward 0. The formula for
   ED is as follows:
   \[
   \mathit{ED} = \sqrt{(\mathit{Amut} - \mathit{Awt})^2
                     + (\mathit{Cmut} - \mathit{Cwt})^2
                     + (\mathit{Gmut} - \mathit{Gwt})^2
                     + (\mathit{Tmut} - \mathit{Twt})^2}
   \]

   The larger the ED value, the greater the difference between the two
   mixed pools. Amut is the frequency of the A base in the mutation pool,
   and Awt is the frequency of the A base in the wild pool; Cmut is the
   frequency of the C base in the mutation pool, Cwt is the frequency of
   the C base in the wild pool; Gmut is the frequency of the G base in the
   mutation pool, Gwt is the frequency of the G base in the wild pool;
   Tmut is the frequency of the T base in the mutation pool, Twt is the
   frequency of the T base in the wild pool.
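
   The ED formula above translates directly into code. The sketch below
   is illustrative only: it assumes that per-base read depths for each
   pool are available as dictionaries keyed by base, and the function
   names and toy depths are hypothetical.

```python
import math

BASES = "ACGT"


def base_frequencies(depths):
    """Convert per-base read depths at one locus into frequencies."""
    total = sum(depths.get(b, 0) for b in BASES)
    if total == 0:
        raise ValueError("locus has no read support in this pool")
    return {b: depths.get(b, 0) / total for b in BASES}


def euclidean_distance(mut_depths, wt_depths):
    """ED between the base frequencies of the mutant and wild pools."""
    f_mut = base_frequencies(mut_depths)
    f_wt = base_frequencies(wt_depths)
    return math.sqrt(sum((f_mut[b] - f_wt[b]) ** 2 for b in BASES))


if __name__ == "__main__":
    # Toy depths: pools fixed for different bases give the maximum ED.
    print(euclidean_distance({"A": 20}, {"G": 20}))
    # Identical pools give an ED of 0.
    print(euclidean_distance({"A": 10, "G": 10}, {"A": 5, "G": 5}))
```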

   In the analysis, the SNP loci whose genotypes differ between the two
   mixed pools are used to calculate the depth of each base in the
   different pools and the ED value of each locus. The original ED value
   is processed so as to exclude background interference. In order to
   eliminate false-positive loci, the ED values of markers on the same
   chromosome are fitted according to the positions of the markers on the
   genome, and the region above the correlation threshold is selected as
   the region related to the fertility restoring gene.

Identification of potential candidate genes

   The reference genome sequence of the AD genome of tetraploid G.
   hirsutum was downloaded. The region related to the target
   characteristics was identified in both genome sequences and scanned for
   annotated genes using the Multiple Sequence Comparison by
   Log-Expectation software.

InDel (insertion-deletion length polymorphism) marker development in the
correlated region

   Eprimer3 in the EMBOSS (v6.4.0) [[73]28] software package was used to
   design primers on both flanks of these loci. The PCR reaction system
   had a volume of 25 μL, containing 2 mmol/L MgCl2, 100 μmol/L dNTP,
   0.2 μmol/L primers, 2 U Taq polymerase, and 50 μL template DNA,
   overlaid with 20 μL mineral oil. The PCR was carried out on a PE480
   DNA amplification instrument with an initial denaturation at 94 °C for
   3 min, followed by 40 cycles of denaturation at 94 °C, annealing at
   58 °C, and extension at 72 °C (step times as reported: 30 s, 40 s, and
   72 s), and a final extension at 72 °C for 10 min. The PCR products
   were resolved by 6% polyacrylamide gel electrophoresis.

Results and analysis

SLAF-seq data analysis and evaluation

   The two parent lines and the F[2] segregating population were
   sequenced by SLAF-seq. Rsa I was selected to construct the SLAF
   library, with SLAF fragments of 364–414 bp; 38.94 M reads were
   obtained. The reads from the samples were aligned to the reference
   genome using the BWA software, with a mapping efficiency of >80%,
   which is within the normal range. Across the sequencing results, the
   average Q30 was 92.01%, and the average GC content was 37.63%. The
   male parent line (R, restorer line) yielded 9,673,045 reads, with a
   Q30 of 90.07% and an average GC content of 37.40%. The female parent
   line (A, sterile line) yielded 9,901,640 reads, with a Q30 of 90.65%
   and an average GC content of 37.41%. The F[2] pools (aa and ab)
   yielded 10,687,924 and 8,679,918 reads, with Q30 values of 93.73% and
   90.04% and average GC contents of 37.96% and 37.73%, respectively
   (Table [74]1).

Table 1.

   Mining results of the high-throughput sequencing data
   Sample ID Total map (%) Properly mapped (%) Total reads Q30 percentage (%) GC percentage (%)
   R         99.11         95.42               9,673,045   90.07              37.40
   A         99.35         95.77               9,901,640   90.65              37.41
   aa        99.18         95.41               10,687,924  93.73              37.96
   ab        99.41         95.76               8,679,918   90.04              37.73

Development of SLAF tag and SNP

   A total of 165,007 SLAF tags were developed. The average sequencing
   depth of the parent lines was 47.90× and that of the mixed pools was
   50.78×. The male parent line yielded 161,173 SLAF tags with an
   average sequencing depth of 46.01×, and the female parent line
   yielded 161,854 SLAF tags with an average sequencing depth of 49.78×.
   The two F[2] pools yielded 163,688 and 163,189 SLAF tags, with average
   sequencing depths of 55.96× and 45.59×, respectively (Table [76]2).

Table 2.

   Sequencing data of the developed SLAF markers
   Sample ID SLAF number Total depth Average depth
   R         161,173     7,415,507   46.01×
   A         161,854     8,057,541   49.78×
   aa        163,688     9,159,461   55.96×
   ab        163,189     7,440,285   45.59×

   SNPs were primarily detected with the GATK software. Based on the
   alignment of the sequencing reads to the reference genome, local
   realignment and variant detection were performed with GATK, variants
   were also detected with samtools, and the variant loci called by both
   GATK and samtools were taken as the final SNP set in order to ensure
   the accuracy of the SNPs. A total of 137,741 SNPs were detected. The
   male parent had 113,311 SNPs, with a heterozygosity of 4.19%; the
   female parent had 98,861 SNPs, with a heterozygosity of 5.37%. The two
   F[2] pools had 82,874 and 75,961 SNPs, with heterozygosities of 20.55%
   and 19.28%, respectively (Table [78]3). The distribution of SLAF tags
   and SNP markers on the different chromosomes was enumerated
   (Additional file [79]1); chrA01 had the maximum number of SLAF tags,
   while chrA08 exhibited the maximum number of SNP markers. According to
   the distribution of SLAF tags and SNPs on the chromosomes, a
   chromosome distribution map was plotted (Fig. [80]1).

Table 3.

   The statistic results of each sample SNP
   Sample ID Total SNP SNP number Heterozygous locus numbers ratio (%)
   R         137,741   113,311    4.19
   A         137,741   98,861     5.37
   aa        137,741   82,874     20.55
   ab        137,741   75,961     19.28

   Note: Total SNP: total number of SNPs detected; SNP number: number of
   SNPs detected in the corresponding sample; Heterozygous locus numbers
   ratio (%): proportion of heterozygous loci among all SNP loci in the
   sample

Fig. 1.


   Distribution of SLAF tags and SNP markers on the chromosomes. Note:
   The abscissa is the length of the chromosome. Each yellow band
   represents a chromosome. The genome is divided into windows of 1 Mbp.
   The more SLAF tags a window contains, the darker its color; the fewer,
   the lighter. The darker areas in the figure are the areas where the
   SLAF tags are concentrated. The left panel shows the distribution of
   the SLAF tags, and the right panel shows the distribution of SNPs

Correlation analysis by SNP-index and ED

   Before the correlation analysis by SNP-index, the 137,741 SNPs were
   filtered. A total of 16 SNP loci with multiple mutations were filtered
   out, as were 102,105 loci with read support <4 in the mixed pools and
   27,289 loci that did not exist in the parent lines. Finally, 8331 SNPs
   were retained for the follow-up analysis. Using the SNP-index method,
   the correlation threshold was 0.67 according to the theoretical
   segregation ratio of the experimental population. 20 association
   regions (Fig. [83]2) containing genes were obtained, located on
   chrD05.
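
   The three filtering steps that precede the SNP-index analysis can be
   sketched as a single predicate, followed by a check of the reported
   bookkeeping. The record fields (n_alt_alleles, pool_depths,
   called_in_parents) are hypothetical names, and two readings are
   assumptions: that "multiple mutations" means more than one alternative
   allele, and that the read-support cutoff applies to each pool.

```python
def keep_for_snp_index(snp, min_reads=4):
    """Return True if a SNP record survives the three filters in the text."""
    if snp["n_alt_alleles"] > 1:           # assumption: multi-allelic locus
        return False
    if min(snp["pool_depths"]) < min_reads:  # read support < 4 in a pool
        return False
    if not snp["called_in_parents"]:       # locus absent from the parents
        return False
    return True


# Bookkeeping with the counts reported in the text.
TOTAL_SNPS = 137_741
REMOVED = {
    "multiple mutations": 16,
    "read support < 4 in the mixed pools": 102_105,
    "not present in the parent lines": 27_289,
}
remaining = TOTAL_SNPS - sum(REMOVED.values())

if __name__ == "__main__":
    example = {"n_alt_alleles": 1, "pool_depths": (12, 9),
               "called_in_parents": True}  # toy record
    print(keep_for_snp_index(example), remaining)
```

   The subtraction reproduces the 8331 SNPs carried into the analysis.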

Fig. 2.


   The distribution of SNP-index-associated values on chromosome. Note:
   The abscissa is the chromosome name. The color point represents the
   calculated SNP-index (or ΔSNP-index) value, and the black line is the
   fitted SNP-index (or ΔSNP-index) value. The top graph illustrates the
   distribution of the SNP-index values in h mixed pool; the middle graph
   is the distribution of the SNP-index values in L mixed pool; the bottom
   graph is the distribution of the ΔSNP-index values, where the magenta
   line represents the theoretical threshold line

   Similarly, before the correlation analysis by ED, the 137,741 SNPs
   were filtered. First, 102,114 loci with read support <4 in any mixed
   pool were filtered out, leaving 35,627 high-quality, reliable loci. Of
   these, a total of 14,226 loci differed between the two mixed pools.
   The correlation value was calculated by ED, and the median + 3SD of
   the fitted values of all loci was taken as the correlation threshold
   of the analysis: 0.4969. A total of 351 correlated genes (Fig. [85]3)
   were obtained according to the correlation threshold.
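
   The median + 3SD rule for the ED threshold can be illustrated with
   the standard library. This is a sketch under stated assumptions: the
   text does not say whether the population or the sample standard
   deviation was used (the population form is chosen here), and the
   function names and toy values are hypothetical.

```python
import statistics


def ed_threshold(fitted_values, n_sd=3):
    """Median of the fitted ED values plus n_sd standard deviations."""
    centre = statistics.median(fitted_values)
    spread = statistics.pstdev(fitted_values)  # assumption: population SD
    return centre + n_sd * spread


def loci_above(fitted_by_position, threshold):
    """Return the positions whose fitted ED value exceeds the threshold."""
    return [pos for pos, value in fitted_by_position if value > threshold]


if __name__ == "__main__":
    toy = [(100, 0.0), (200, 0.0), (300, 0.0), (400, 0.0), (500, 1.0)]
    threshold = ed_threshold([value for _, value in toy])
    print(threshold, loci_above(toy, threshold))
```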

Fig. 3.


   The distribution of ED-associated values on chromosome. Note: The
   abscissa is the chromosome name. The color point represents the ED
   value of each SNP locus. The black line is the fitted ED value, and the
   red dotted line represents the significantly associated threshold. The
   higher the ED value, the better the correlation effect

   Finally, the intersection of the associated regions obtained from the
   above two methods was located within 1.35 Mb of chrD05, and 20
   candidate genes were identified (Table [87]4). The genetic
   information of the associated region is summarized in Additional
   file [88]2.

Table 4.

   The information of the association region
   Association region Chromosome ID  Start       End     Size (Mb) Gene number
   I                 chrD05        37,535,705 37,755,211 0.22      6
   II                chrD05        39,558,551 40,416,294 0.86      12
   III               chrD05        40,531,406 40,804,095 0.27      2
                     Total                               1.35      20
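
   The sizes in Table 4 follow from the start and end coordinates. The
   short check below recomputes them; the variable and function names
   are hypothetical, and the size is taken as end minus start (whether
   the coordinates are inclusive does not change the rounded values).

```python
# Start and end coordinates of the three association regions on chrD05,
# copied from Table 4.
REGIONS = {
    "I": (37_535_705, 37_755_211),
    "II": (39_558_551, 40_416_294),
    "III": (40_531_406, 40_804_095),
}


def size_mb(start, end):
    """Interval length in megabases."""
    return (end - start) / 1_000_000


sizes = {name: round(size_mb(*span), 2) for name, span in REGIONS.items()}
total = round(sum(size_mb(*span) for span in REGIONS.values()), 2)

if __name__ == "__main__":
    print(sizes, total)
```

   The three intervals come to 0.22, 0.86, and 0.27 Mb, for a total of
   1.35 Mb, in agreement with the table.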

Functional annotation of genes in the correlated region

   The 20 genes in the correlated region were compared against the NR,
   SwissProt [[90]29], GO [[91]30], COG, and KEGG [[92]31] databases
   using BLAST. Annotations were obtained for 19 of the 20 genes
   (Additional file [93]3). Of these, 8 genes were annotated in KEGG and
   participated in 10 pathways: plant hormone signal transduction,
   protein export, DNA replication, homologous recombination, mismatch
   repair, nucleotide excision repair, ribosome, nitrogen metabolism,
   purine metabolism, and pyrimidine metabolism. The DNA replication
   pathway was significantly enriched, with an enrichment factor of
   29.37 (p = 0.00186).

Differences in sterile and restorer line on the correlated region

   The genomes of the CMS line 3096A and the fertility restorer line 866R
   were sequenced at 19× and 20× read depth, respectively, by Illumina
   sequencing of paired-end libraries. Using the cotton AD-genome
   sequence as a reference, genetic variations, including single base
   mutations, insertions, and deletions, were identified relative to the
   reference genome. A comparison of the genome structure of the sterile
   and restorer lines in the correlated region revealed SVs relative to
   the reference genome in the restorer line but none in the sterile
   line. Seven SVs were found in the restorer line: 4 deletions and 3
   inter-chromosomal translocations (Table [94]5). A total of 1607 indels
   were found in the correlated region, comprising 1246 intergenic
   indels, 242 intronic indels, 51 upstream indels, 65 downstream indels,
   and 3 exonic indels involving 3 genes: Gh_D05G3001, Gh_D05G3028, and
   Gh_D05G3039. A total of 13,175 SNP loci differed between the sterile
   and restorer lines in the correlated region, including 10,711
   intergenic SNPs, 1858 intronic SNPs, 254 upstream SNPs, 227 downstream
   SNPs, 124 exonic SNPs, and 2 splicing SNPs in the genes Gh_D05G3005
   and Gh_D05G3038. Nonsynonymous SNPs were found in 16 exonic regions, 1
   stop-gain SNP was identified in Gh_D05G3042, and a stop-loss SNP was
   discovered in Gh_D05G3031.

Table 5.

   The SV on the correlated region in restorer lines
   Chr1 Pos1     Orientation1 Chr2             Pos2     Orientation2 Type Size  Score num_Reads
   D05  40000632 14+0-        D05              40005953 0+16-        DEL  5334  99    14
   D05  40285089 15+0-        D05              40286963 1+17-        DEL  1931  99    15
   D05  40356478 17+0-        D05              40356891 0+15-        DEL  473   99    13
   D05  40643903 13+10-       D05              40655906 0+12-        DEL  12091 99    12
   D05  39644894 7+15-        scaffold1082_A05 9062     0+14-        CTX  −318  99    12
   D05  37580919 0+15-        scaffold4041_D05 13444    15+0-        CTX  −318  99    15
   D05  39824830 14+0-        scaffold6268     11207    12+14-       CTX  −318  99    12

InDel (insertion-deletion length polymorphism) markers development on the
correlated region

   Comparison of the genome sequences of the sterile and restorer lines
   in the correlated region revealed 1607 InDel sites. We designed 165
   primers and used them to amplify the genomic DNA of the sterile and
   maintainer lines; 42 of them (Additional file [96]4) clearly detected
   polymorphisms and could therefore be used as InDel markers. Of the 42
   InDel markers developed, 24 were codominant and 18 were dominant
   (Fig. [97]4). These markers lay the foundation for the fine mapping of
   the restorer genes.

Fig. 4.


   The polymorphic graph of primers. Note: 1–24: codominant markers;
   25–42: dominant markers; A: sterile line; R: restorer line

Discussion

The molecular marker discovery and fine mapping of fertility restoring gene
of CMS in cotton

   The molecular marker discovery and fine mapping of fertility restorer
   gene of CMS in cotton are under intensive research. Yin et al.
   established the location of Rf1 on 100 kb between two BAC clone
   overlapping regions and selected 5 SSR in proximity to Rf1 by
   constructing a BAC library of Gossypium harknessii cytoplasmic male
   sterile restorer lines coupled with the genetic and physical maps
   recovering gene linkage [[99]11]. Yang et al. screened out 6 EST-SSR
   markers (NAU2650, NAU2924, NAU3205, NAU3652, NAU3938, and NAU4040) with
   0.327 cM from the fertility restorer Rf1 of CMS in Harknessii cotton
   [[100]32]. Wu et al. found that the fertility of CMS-D2 was regulated
   by a pair of dominant single gene Rf1, and 13 molecular markers closely
   linked to the fertility were screened out. The marker closest to Rf1
   was BNL3535 with a genetic distance of 0.049 cM; on the other side
   NAU3652 was the nearest marker with a genetic distance of 0.078 cM.
   [[101]33]. Wang et al. demonstrated that CIR179–250 was closely linked
   to both Rf1 and Rf2, which were located on linkage group LGD08
   (chromosome D5, i.e., chromosome 19) of the D subgenome, using upland
   cotton CMS-D2 and CMS-D8 restorer lines, respectively, as research
   material [[102]13]. Li et al. mapped the restorer gene Rf of the Jin-A
   cytoplasmic male sterile line to chromosome 19 (LGD08), at distances of
   5.4 and 10.3 cM from markers CM042 and CIR179, respectively
   [[103]34]. You et al. studied three cotton cytoplasmic male sterile
   lines and their corresponding restorers from China, Israel, and the
   USA. The restorers from the USA carried 2 restorer genes: Rf1 was
   positioned between BNL3535 and CIR179 at a distance of 5.3 cM, while
   Rf2 was between STS659 and BNL1045 at a distance of 4.8 cM. The
   restorers from China carried only 1 restorer gene, Rf, which was
   between CIR222 and BNL632 at a distance of 6.7 cM. The restorers from
   Israel likewise carried only 1 restorer gene, Rf, which was between
   STS147 and CIR179 at a distance of 4.3 cM [[104]35]. Based on the SSR
   primers, we located these restorer gene-linked SSR markers on the
   reference genome (Table [105]6). Furthermore, we
   established that, although the sterile lines came from different
   sources, all of these markers mapped to chrD05 of the reference genome,
   between positions 35,690,656 and 59,566,733. The present study located
   the fertility restorer gene in three intervals on chrD05:
   37,535,705–37,755,211 (0.22 Mb), 39,558,551–40,416,294 (0.86 Mb), and
   40,531,406–40,804,095 (0.27 Mb). These intervals fall between the SSR
   markers NAU2924 and NAU4040, which were previously reported to be
   linked to the restorer gene. As the same markers appear near the cotton
   CMS fertility restorer gene in material from different sources, it is
   speculated that the chromosomal segment carrying the restorer gene is
   the same across the various types of restorer lines. The differences in
   the reported genetic distances between these closely linked markers and
   the restorer gene may result from insertions or deletions in the
   restorer gene segment during genetic improvement. The present study
   developed 42 InDel markers in the associated region, which lays a
   foundation for the fine mapping of the cotton fertility restorer genes.
   The present results also showed that SLAF-seq is an efficient and
   high-resolution technique for QTL fine mapping, characterized by a high
   success rate, specificity, stability, and cost-efficiency. The
   combination of SLAF-seq, SNP-index analysis, and BSA provides an
   efficient method for identifying the genomic regions associated with
   the trait described above.
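
   As a quick consistency check on the coordinates quoted above, the
   following minimal Python sketch computes the size of each candidate
   interval and tests whether all three lie between the flanking markers
   NAU2924 and NAU4040, using the marker positions listed in Table 6. The
   function and variable names are illustrative only and are not part of
   the analysis pipeline used in this study; interval size is taken as
   end minus start, which is an assumption about the coordinate
   convention.

```python
# Candidate intervals on chrD05 reported in the Discussion (start, end in bp).
CANDIDATE_INTERVALS = [
    (37_535_705, 37_755_211),
    (39_558_551, 40_416_294),
    (40_531_406, 40_804_095),
]

# Flanking SSR markers and their chrD05 positions as listed in Table 6.
FLANKING_MARKERS = {
    "NAU2924": (35_690_459, 35_690_656),
    "NAU4040": (43_363_683, 43_363_832),
}


def interval_length_bp(start, end):
    """Interval size in bp, taken as end - start (assumed convention)."""
    return end - start


def is_between(interval, left_marker, right_marker):
    """True if the interval lies strictly between the two markers."""
    start, end = sorted(interval)
    left_end = max(left_marker)      # rightmost base of the left marker
    right_start = min(right_marker)  # leftmost base of the right marker
    return left_end < start and end < right_start


lengths = [interval_length_bp(s, e) for s, e in CANDIDATE_INTERVALS]
total_mb = sum(lengths) / 1e6
all_flanked = all(
    is_between(iv, FLANKING_MARKERS["NAU2924"], FLANKING_MARKERS["NAU4040"])
    for iv in CANDIDATE_INTERVALS
)

if __name__ == "__main__":
    for (s, e), n in zip(CANDIDATE_INTERVALS, lengths):
        print(f"chrD05:{s}-{e}  {n / 1e6:.2f} Mb")
    print(f"total: {total_mb:.2f} Mb, flanked by NAU2924/NAU4040: {all_flanked}")
```

   Under these assumptions the three interval sizes round to 0.22, 0.86,
   and 0.27 Mb and sum to about 1.35 Mb, which agrees with the candidate
   region size on chrD05 given in the Abstract.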

Table 6.

   Summary of the genome locations of markers linked to the restorer gene
   Marker   Chromosome ID  Genome location    Source                Restorer gene  Reference
   NAU2924  D5             35690459–35690656  Gossypium harknessii  Rf1            Yang [[106]32]
   NAU3652  D5             37123844–37124070  Gossypium harknessii  Rf1            Yang [[107]32]; Wu [[108]33]
   NAU4040  D5             43363683–43363832  Gossypium harknessii  Rf1            Yang [[109]32]
   NAU2650  D5             44346401–44346571  Gossypium harknessii  Rf1            Yang [[110]32]
   NAU3205  D5             50573886–50573694  Gossypium harknessii  Rf1            Yang [[111]32]
   NAU3938  D5             52546928–52547146  Gossypium harknessii  Rf1            Yang [[112]32]
   BNL3535  D5             54287875–54288016  CMS-D2                Rf1            Wu [[113]33]; You [[114]35]
   CIR222   D5             54288233–54287945  Unknown (China)       Rf             You [[115]35]
   CM042    D5             55139471–55139336  Jin A                 Rf             Li [[116]34]
   BNL632   D5             59566733–59566464  Unknown (China)       Rf             You [[117]35]

Cloning of fertility restorer gene

   The cloning of the fertility restorer gene for cotton CMS is still
   under investigation. Yang et al. identified the genomic region
   containing Rf1 and sequenced it in full. The Rf1 locus was found to
   contain 5 PPR genes and 2 genes highly homologous to PPR genes within a
   region of approximately 130 kb. Based on gene prediction,
   characterization, and phylogenetic sequence analysis, ORF3 was proposed
   as the Rf1 gene; it encodes a PPR protein and contains a mitochondrial
   localization signal. However, ORF3 still requires verification by
   transgenic functional complementation [[119]36]. Using differential
   display analysis, Zhang et al. concluded that the starch synthase gene
   and the phosphoribosylanthranilate transferase (PAT) gene might be
   associated with the Rf2 gene in the Gossypium trilobum cytoplasm
   [[120]37]. Wu and Hou cloned two genes related to cotton fertility
   restoration, GH182Rorf392 and GhPG2, from the upland cotton restorer
   line Y18R. GH182Rorf392 encodes 392 amino acids. The 3′ end of the gene
   contains a 26S rRNA sequence, and the 5′ end is a novel sequence
   [[121]38, [122]39]. The gene product might interact with ribosomes in
   organelles such as mitochondria or chloroplasts. GhPG2 encodes a
   polygalacturonase, which is related to floral organ
   development. In recent years, the Rf genes of crops such as corn
   [[123]40], rice [[124]41], onion [[125]42], and sorghum [[126]43] have
   been cloned successively. Except for corn Rf2 and Rf4 and rice Rf2 and
   Rf17, the known Rf genes belong to the PPR (pentatricopeptide repeat)
   gene family. The proteins encoded by the PPR gene family are considered
   to be single-stranded RNA-binding proteins and play a vital role in the
   processing of organellar RNA [[127]44]. The proteins encoded by Rf
   genes likewise act mainly in organellar RNA processing. Their N-termini
   contain mitochondrial localization sequences, and after maturation in
   the cytoplasm the proteins are transported to the mitochondria, where
   they participate in mitochondrial gene transcription,
   post-transcriptional processing, and translation, thereby regulating
   plant fertility [[128]45]. These studies provided
   further references for exploring the cotton CMS fertility restorer