Abstract

   Relatively, little is known about the genetic variation of woody trees
   during domestication. Castor bean (Ricinus communis L. Euphorbiaceae)
   is a commercially important nonedible annual oilseed crop and differs
   from its wild progenitors that have a perennial woody habit. Although
   castor bean is one of the oldest cultivated crops, its domestication
   origin, genomic variation, and potential targets of selection
   underlying domestication traits remain unknown. Here, we performed a
   phylogenetic analysis, which suggests that the wild accessions were
   distinctively separated from the cultivated accessions. Genome
   sequencing of three accessions (one each wild, landrace, and cultivar)
   showed a large number of genetic variants between wild and cultivated
   castor bean (ZB306 or Hale), and relatively few variants between
   cultivar ZB306 and Hale. Comparative genome analysis revealed many
   candidate genes of selection and key pathways potentially involved in
   the transition from a perennial woody tree to annual crop.
   Interestingly, among 16 oil‐related genes only three showed evidence of
   selection and the remainder showed low genetic variation at the
   population level, suggesting strong purifying selection in both the
   wild and domesticated gene pools. These results extend our
   understanding of the origin, genomic variation, and domestication, and
   provide a valuable resource for future gene–trait associations and
   castor bean breeding.

   Keywords: castor bean, domestication selection, genetic variation, seed
   storage lipids, woody tree

1. INTRODUCTION

   In addition to natural or environmental selection,s the morphology and
   genetic diversity of species can be shaped by human activity (Olsen &
   Wendel, [34]2013; Shi & Lai, [35]2015; Wright, [36]2015). Hundreds of
   wild plants have been selected by humans, giving rise to a multitude of
   major and minor crops with remarkably modified agricultural or
   horticultural traits (Doebley, Gaut, & Smith, [37]2006). Recent
   comparative genetic analyses have greatly advanced our understanding of
   questions concerning the geographic origins of crops, the change in
   genetic diversity, and the molecular basis underlying important
   agricultural traits (Shi & Lai, [38]2015). Only a few studies have
   focused on the evolutionary history and domestication origin of woody
   trees, and have shown that the woody trees possessed a different
   evolutionary rate from annual domesticated crops (Delplancke et al.,
   [39]2013; Miller & Schaal, [40]2006; Verde et al., [41]2013). However,
   in general, relatively little is known about woody tree domestication.
   In particular, it is still unclear how perennial woody trees could be
   modified into annual crops by human cultivation or what extent genomic
   variation is modified in response to artificial selection in perennial
   woody trees.

   Castor bean (Ricinus communis L., Euphorbiaceae) is an important
   nonedible oilseed annual semi‐woody crop. Its seed oil contains a high
   ricinoleic acid concentration (~90%), an unusual hydroxylated fatty
   acid, and is a vital industrial raw material (Ogunniyi, [42]2006).
   Considering the high economic value and the easy growth in unfavorable
   environments, castor bean has been extensively cultivated in tropical
   and subtropical countries. Anthropological studies have showed that
   castor bean was used by humans dating back more than 6,000 years (Chan
   et al., [43]2010; Hayes, [44]1953). However, owing to castor bean
   probably being introduced multiple times in many regions, predominant
   outcrossing under natural conditions (Brigham, [45]1967; Meinders &
   Jones, [46]1950), and its widespread escape from cultivation (Weber,
   [47]2003), the origin of castor bean domestication is still unknown
   (Foster et al., [48]2010). According to morphological features, it is
   supposed that eastern Africa (Ethiopia and Kenya) might be a candidate
   site of origin (Vavilov, [49]1951; Zeven & Zhukovsky, [50]1975);
   however, this supposition lacks supporting evidence. Studies of
   germplasm collected throughout the world reveal low levels of genetic
   variation and lack of geographically structured genetic populations in
   castor bean, regardless of marker system used (Allan et al., [51]2008;
   Foster et al., [52]2010; He, Xu, Li, Wang, & Liu, [53]2017; Qiu, Yang,
   Tian, Yang, & Liu, [54]2010; Rivarola et al., [55]2011). Few studies
   have examined genetic variation and population structure of castor bean
   accessions from eastern Africa due to this wild germplasm being
   unavailable.

   Sampling of wild castor bean from eastern Africa would facilitate an
   understanding of the origin and domestication of this species and help
   reappraise genetic diversity at the genomic level. Based on our
   extensive field investigation, we found that castor bean accessions
   from eastern Africa usually display traits typical of wild species,
   such as extremely small seeds, natural seed dispersal, strong
   disease/stress resistance, and a perennial and woody tree phenotype
   with an elongated stem. In stark contrast, most cultivated castor bean
   accessions are annual, semi‐woody crops, the result of marked
   phenotypic modifications during domestication. Thus, castor bean
   provides an excellent model to dissect the genetic mechanism of trait
   variation in woody trees during domestication, and comparative genomics
   of wild and domesticated castor bean would also provide some new
   insights into understanding fundamental issues of tree biology such as
   perennial versus annual habit, woody growth, and adaptation to the
   environment (Neale, Martínez‐García, De La Torre, Montanari, & Wei,
   [56]2017).

   Here, we investigated the phylogenetic relationship between wild and
   cultivated castor bean, assessed genome‐wide genetic variation during
   domestication, and identified potential domestication loci throughout
   the genome. The results obtained in this study provide novel insights
   into not only genetic diversity and population structure, but also the
   effect of human cultivation on the domestication of castor bean, and
   potentially how a perennial woody tree can evolve into an annual crop.

2. MATERIALS AND METHODS

2.1. Collection of wild accessions and investigation of population structure

   We collected seven wild castor bean accessions from eastern Africa and
   obtained another 76 castor bean accessions, representing the worldwide
   distribution (Table [57]S1). Seeds of each accession were germinated at
   Kunming Institute of Botany (Kunming, Yunnan, China). Three‐week‐old
   seedlings (one to four per accession) were then frozen in liquid
   nitrogen, and nuclear DNA was extracted using the DNeasy Plant Mini Kit
   (Qiagen). We firstly employed EST‐SSR markers (expressed sequence
   tag‐simple sequence repeat), including those developed by Qiu et al.
   ([58]2010) alongside markers newly developed here using the same
   approach as Qiu et al. ([59]2010), to investigate genetic relationships
   among germplasm. The phylogenetic tree was generated with
   neighbor‐joining and UPGMA methods (Tamura et al., [60]2013). Further,
   bootstrapping analysis with 1,000 replicates was carried out using the
   software FREETREE V.0.9.1.50 (Pavlícek, Hrdá, & Flegr, [61]1999) and
   bootstrap values >50 were displayed. Population structure was analyzed
   using the software STRUCTURE 2.3.1 (Falush, Stephens, & Pritchard,
   [62]2003) with values of K (the potential number of clusters) tested
   from 1 to 9 with six independent runs of 10,000 iterations after a
   burn‐in period of 10,000 iterations. The optimal K was determined using
   the delta K method (Evanno, Regnaut, & Goudet, [63]2005) using
   STRUCTURE HARVESTER (Earl & von Holdt, [64]2012).

2.2. Genome resequencing

   Based on phylogenetic analysis, three castor bean accessions
   representing one wild accession (WT001 from Kenya), one landrace (ZB107
   from China), and one cultivar (ZB306 from China) were selected for
   whole‐genome sequencing. ZB107 and ZB306 are genetically divergent and
   are important for castor bean breeding in China. Their morphological
   features including plant height, branching, stem diameter, seed oil
   content, and growth circle were characterized based on three seasons of
   observations under natural conditions at Kunming Institute of Botany.
   Genomic DNA was sonicated to fragments of ca. 500 bp, and sequencing
   libraries were constructed using TruSeq® DNA Library Prep Kits
   following the manufacturer's instructions (Illumina). Libraries were
   sequenced for 90 cycles (paired‐end) on the HiSeq 2000 platform in the
   Shenzhen‐BGI (China). Raw data were preprocessed to filter out adapter
   sequences, contaminated sequences, and low‐quality reads using
   Trimmomatic ver. 0.36 (Bolger, Lohse, & Usadel, [65]2014), with the
   following parameters: ILLUMINACLIP:TruSeq3‐PE.fa:2:30:10:8 LEADING:3
   TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:75 AVGQUAL:20. The quality of
   clean data was checked by FastQC software. The clean data have been
   submitted to NCBI Short Read Archive under accession number SRS5353940.

2.3. Identification of SNPs and short indels

   The clean reads were then mapped to the reference genome (Hale,
   [66]http://castorbean.jcvi.org) using the command “bwasw” in BWA (ver.
   0.5.8) with default parameters (Li & Durbin, [67]2009). Reads which
   mapped to multiple chromosomal positions and unmapped reads were
   discarded. SAMtools (Version 1.3.1, Li et al., [68]2009) was used to
   convert mapping results to bam format and to remove PCR duplicates
   generated during sequencing. We then applied the “mpileup” command of
   SAMtools to call raw SNP variants including homo‐ and heterozygous
   sites. Subsequently, SNPs in each individual were further filtered by
   sufficient base quality (Q values >20), at least five reads supporting
   each SNP and more than 5‐bp interval between two adjacent SNPs using
   vcftools ver.0.1.16 together with a custom Perl script. The remaining
   SNPs were retained for subsequent analysis. The methods and criteria of
   Zheng et al. ([69]2011) were used to identify short insertions and
   deletions (indels) (≤10 bp). Briefly, the paired‐end reads were first
   aligned to the reference sequence by allowing up to 10‐bp gaps using
   BWA ver. 0.5.8 with “aln ‐o 10” parameter, and then the redundant pairs
   were merged before calling indels. Gaps supported by at least five
   paired‐end reads were identified, and only homozygous indels were
   considered when the number of the un‐gapped reads that crossed a
   potential indel was no more than twice that of the gapped reads.

   Based on the previous gene model of the castor bean genome (Chan et
   al., [70]2010), genetic variants identified above (SNPs and indels)
   were further annotated as being present in exons, introns, untranslated
   regions of coding regions, and intergenic regions. The effect of coding
   sequence variants was further classified into so‐called “large‐effect
   variants” (including start/stop codon gains or losses, putative splice
   site mutations, and frameshifts), nonsynonymous variants, and
   synonymous variants.

2.4. Validation of SNP and indels

   To evaluate the accuracy of SNPs and indels identified in this study,
   we used the same method to call SNPs and indels from our previous
   RNA‐seq data from two of the same accessions (ZB107 and ZB306;
   downloaded from NCBI SRX485027, reported by Xu, Dai, Li, & Liu,
   [71]2014). SNPs and indels identified from the two different datasets
   were compared to ascertain the effectiveness of our SNP and indel
   calling pipeline.

2.5. Copy number and structural variation

   We employed readDepth (Miller, Hampton, Coarfa, & Milosavljevic,
   [72]2011) to identify copy number variants (CNVs). ReadDepth uses a
   binning procedure to call CNVs based on read mapping depth,
   recalibrated by mappability and GC content, and then calls segment
   boundaries using a circular binary segmentation algorithm. In order to
   increase the confidence of our CNVs, we calculated the average read
   depth (RD) for the identified regions of the castor bean genome for
   each of the three accessions. Only regions with RD >3 standard
   deviations and significantly different from the genome‐wide mean RD
   (FDR < 0.01) were considered. CNVs were classified based on whether
   they exhibited significantly higher (gain CNVs) or lower (loss CNVs or
   CNVs decrease) read depth than average.

   BreakDancer (Chen et al., [73]2009; Fan, Abbott, Larson, & Chen,
   [74]2014) was used to detect genomic structural variants (SVs)
   including large insertions and deletions, inversions, and
   intrachromosomal translocations, by comparing the genomic coordinates
   of each end of a paired‐end read against the distribution of the insert
   sizes. Any SV for each individual with support from at least three
   abnormal paired‐end reads was considered. SVs were further filtered to
   remove those found in repetitive regions of the genome and those that
   represented (highly implausible) deletions of more than 100 kb.

   Overlaps between CNVs or SVs and gene regions were detected using
   bedtools (ver. 2.17.0,
   [75]http://bedtools.readthedocs.org/en/latest/index.html). To reduce
   false identification of genes affected by CNVs or SVs, we only
   considered CNVs or SVs which overlapped at least 50% of a gene. Gene
   Ontology (GO) enrichment analyses were performed to determine whether
   any functions were enriched in the lists of genes affected by CNVs and
   SVs.

2.6. Identification of candidate selection genes

   The ratio between Ka (the number of nonsynonymous substitutions per
   nonsynonymous site) and Ks (the number of synonymous substitutions per
   synonymous site) is used to estimate the level of selective constraint
   on a locus, where Ka/Ks values >1 and <1 are suggestive of positive
   selection and purifying selection, respectively. Pairwise estimates of
   Ka, Ks, and Ka/Ks were calculated between each pair of accessions using
   the Ka_Ks calculator (Zhang et al., [76]2006) following the YN method.
   Those genes with Ka/Ks > 1 between wild and Hale or between ZB306 and
   Hale were identified for further characterization. To investigate
   whether particular pathways were overrepresented in the list of genes
   with Ka/Ks > 1, KEGG enrichment analyses were performed using the
   OmicShare tools ([77]www.omicshare.com/tools).

2.7. Selection analyses for candidate genes involved in oil biosynthesis and
assembly

   Here, we selected 16 key genes that involved in the synthesis and
   modification of fatty acids as well as triacylglycerol biosynthesis
   (Table [78]S2). For multicopy gene families of proteins, one candidate
   locus was selected based on its high expression in castor bean seeds
   (Brown et al., [79]2012). In addition, five genes with a Ka/Ks ratio
   approximately equal to 1 were considered as presumptively neutral loci
   (neutrality was tested, see below). Primers are listed in Table [80]S2.
   We amplified and sequenced these loci from 12 castor bean accessions,
   comprising six cultivars and six diverse wild‐like lines (Table
   [81]S3). PCR was performed with the following conditions: 94°C for
   3 min, followed by 30 amplification cycles at 94°C for 30 s, 60°C for
   30 s, and 72°C for 60 s, and a final extension at 72°C for 7 min. PCR
   products were purified using a PCR purification kit (Tiangen, Beijing,
   China), cloned into pGEM‐T vectors (Tiangen), and transformed into
   competent Trans5 α E. coli (TransGen, Beijing, China). Six positive
   clones were randomly selected for sequencing. The oil content of seed
   from the 12 castor bean accessions was also investigated according to
   the method of Xu Wang and Liu ([82]2011).

   Maximum‐likelihood Hudson–Kreitman–Aguade tests (ML‐HKA version2.cpp,
   Wright & Charlesworth, [83]2004) were employed to test each locus for
   departure from neutrality, with the parameters number of segregating
   sites, number of silent sites, number of divergent sites, and
   Watterson's ([84]1975) estimate of diversity (θ) estimated using DnaSP6
   (Rozas et al., [85]2017). In brief, two different models (100,000
   simulations each) were performed: one for a strictly neutral model in
   which all loci were considered as to be neutrally evolving; the other
   for a selection model in which each locus was considered in turn as
   being under selection, relative to the other loci. Tests were performed
   separately for the wild and cultivar population. Two times the
   difference in log‐likelihoods of the models was then used in a
   chi‐square test with one degree of freedom to determine statistical
   significance. This was first run to test that each of the five
   putatively neutral loci was in fact not demonstrating departure from
   neutrality, and once this was confirmed, each of the selection
   candidates was tested in turn against the five neutral loci.

3. RESULTS

3.1. Characterization of population structure in castor bean

   In this study, we firstly applied 261 polymorphic EST‐SSR markers
   (Table [86]S4) to analyze the relationships between 83 individuals of
   castor bean (Figure [87]1a and Table [88]S1). The phylogenetic analysis
   revealed that castor bean is divided into two main groups (I and II;
   Figure [89]1b), and this was consistent with the STRUCTURE results
   which suggested that K = 2 was the most strongly supported (Figure
   [90]1c,d). Group I consisted of ten castor bean accessions from Africa,
   including seven lines identified morphologically as wild castor bean
   (see bold branches in Figure [91]1b) plus three wild‐like individuals
   (which display similar phenotype to the wild but the information of
   collection location was missing). Group II contained the remaining
   accessions and was broadly separated into accessions from South Africa
   and the Middle East (Turkey, Iran, Afghanistan, Pakistan), and the rest
   of the world, although there was not complete delineation. STRUCTURE
   showed moderate support for K = 3 which broadly agreed with the
   phylogenetic analysis. For K = 3, the African accessions were
   identified as one cluster (n = 10, numbered 1–10 in Figure [92]1b,d),
   accessions from the Middle East as the second (n = 18, numbered 11–28
   in Figure [93]1b,d; all with > 50% membership to cluster 2, but some
   with admixture), and the remaining accessions as the third (n = 55,
   again several accessions demonstrating admixture). Within each of these
   three groups, there was no obvious geographic structuring of
   accessions.

Figure 1.

   Figure 1
   [94]Open in a new tab

   An overview of the castor bean collection. (a) The core collection of
   83 lines comes from a wide geographic distribution. (b) Analysis of the
   phylogenetic relationships between castor bean accessions based on 261
   EST‐SSR markers developed in this study. Bootstrap support >50% is
   indicated. Two distinct groups (I and II) were evident. Bold black
   branches indicate morphologically wild accessions. The solid black
   triangles indicate wild, landrace (ZB107), and cultivar (ZB306)
   accessions used in the genome sequencing. (c) Delta K plot showing a
   peak at K = 2 for the STRUCTURE results. (d) Model‐based cluster
   analysis of the core set using the program STRUCTURE. Values of K
   (number of clusters) from 2 to 7 are displayed. The most supported
   model was K = 2

3.2. Choice of castor bean lines used for resequencing

   Based on phylogenetic analysis, three accessions (WT001, ZB107, and
   ZB306) were chosen for whole‐genome resequencing. Their morphological
   features are listed in Table [95]S5 based on our field observations for
   three years. The wild and landrace lines are taller with a greater stem
   diameter than the two cultivars (ZB306 and Hale). Meanwhile, the wild
   and landrace lines are perennial and resistant to disease, while the
   two cultivars are annual and susceptible to disease (Table [96]S5).
   Other traits of agronomic benefit are exhibited in the cultivars, for
   example, a many‐branched stem, high oil content (see also later), and
   high seed yield. In addition, the landraces and cultivars lack natural
   seed dispersal that is present in the wild.

3.3. Genome‐wide genetic variation in castor bean

   Genome sequencing of each castor bean line yielded approximately
   10.2 Gb of high‐quality sequence, corresponding to ~30‐fold genome
   coverage for each line (Table [97]S6). We found substantial numbers of
   genetic variants including single nucleotide polymorphism (SNP), short
   (≤10 bp) insertion and deletion (indel), copy number variation (CNV),
   and structure variation (SV) throughout the castor bean genome (Figure
   [98]2). In total, we obtained 1,189,858 homozygous SNPs and 233,955
   heterozygous SNPs among these castor bean genomes using stringent
   filtering criteria (see Table [99]1). Among them, heterozygous sites
   were made up 16.4% of all SNP sites; in subsequent analyses, only
   homozygous SNPs were used. The majority (90.0%) of homozygous SNPs were
   located in intergenic regions (1,070,618), while only 10.0% (119,240)
   were found in genic regions or untranslated regions (UTRs, Table
   [100]1). This corresponds to a SNP density of 3.72 per kb (one SNP per
   269 bp) in the castor bean genome. The genic SNPs comprised 973 and
   1917 SNPs in 5′ UTR and 3′ UTR region, 77,631 in introns, and 38,719 in
   exons (Table [101]1). Of the SNPs in exons, 764 putatively affected
   start/stop codons, 22,297 were nonsynonymous, and 15,658 were
   synonymous SNPs, resulting in average nonsynonymous to synonymous
   substitution ratio (Non‐syn/Syn) of 1.42. On a per individual basis,
   the value of Non‐syn/Syn was slightly lower for cultivated castor bean
   (1.40) than for the wild (1.43) and landrace lines (1.43) (Table
   [102]2). The Non‐syn/Syn value in the high‐confidence genes (i.e.,
   those having transcript support, reported by Brown et al., [103]2012
   and Xu, Li, Ling, & Liu, [104]2013) was smaller than in low‐confidence
   genes (without transcript support; Table [105]2). It was also found
   that in the gene set with Pfam support (retrieved from the castor bean
   gene annotation file,
   [106]https://phytozome.jgi.doe.gov/pz/portal.html#!info?alias=Org_Rcomm
   unis), genes of high confidence had a smaller ratio than those of the
   low‐confidence genes.

Figure 2.

   Figure 2
   [107]Open in a new tab

   Genome‐wide landscape of genetic variation in castor bean. The top ten
   scaffolds, in terms of length, are given with the distribution of
   variants below, including single nucleotide polymorphism (SNP), short
   insertion and deletion (indel), copy number variation (CNV), and
   structure variation (SV)

Table 1.

   Summary of (A) single nucleotide polymorphism (SNP) and (B) short
   insertion and deletion (INDEL) differences between castor bean lines
   A. SNPs
   Comparison Total homozygous Total heterozygous Intergenic Genic 5' UTR
   Exon Nonsynonymous[108]a Synonymous Intron 3' UTR
   Wild versus Hale 722,358 150,120 656,820 65,538 530 21,125 12,591 8,534
   42,844 1,039
   ZB306 versus Wild 933,549 43,221 843,283 90,266 732 29,764 17,600
   12,164 58,312 1,458
   ZB107 versus Wild 197,732 21,237 170,382 27,350 209 9,243 5,687 3,556
   17,465 433
   ZB306 versus ZB107 964,020 43,045 872,906 91,114 735 30,029 17,714
   12,315 58,872 1,478
   ZB107 versus Hale 800,517 149,444 728,676 71,841 591 23,224 13,851
   9,373 46,865 1,161
   ZB306 versus Hale 439,919 79,091 384,933 54,986 462 17,951 10,626 7,325
   35,675 898
   Total 1,189,858 233,955 1,070,618 119,240 973 38,719 23,061 15,658
   77,631 1,917
   B. INDELs
   Comparison Insertions Deletions Intergenic Genic 5' UTR Coding Del
   Coding Ins Frameshift Intron 3' UTR
   Wild versus Hale 44,530 36,034 69,568 10,996 282 193 176 904 9,117 324
   ZB306 versus Wild 38,805 41,633 68,213 12,225 338 288 286 530 10,420
   363
   ZB107 versus Wild 11,429 11,992 19,558 3,863 119 98 98 176 3,257 115
   ZB306 versus ZB107 40,546 43,133 71,327 12,352 336 288 276 548 10,529
   375
   ZB107 versus Hale 50,413 41,706 80,203 11,916 295 224 209 949 9,874 365
   ZB306 versus Hale 31,926 27,000 49,311 9,615 231 179 155 816 7,991 243
   Total 74,638 67,238 122,588 19,288 500 380 368 1,279 16,199 562
   [109]Open in a new tab

   UTR, untranslated region of mRNA.
   ^a

   Includes start/stop mutations.

Table 2.

   Number and ratio of nonsynonymous (Non‐syn) and synonymous (Syn) sites
   in high/low‐confidence genes and genes with and without Pfam annotation
   Samples versus Hale High‐confidence genes Low‐confidence genes
   Pfam‐containing genes Genes without Pfam Total
   Non‐syn Syn Non‐syn/Syn Non‐syn Syn Non‐syn/Syn Non‐syn Syn Non‐syn/Syn
   Non‐syn Syn Non‐syn/Syn Non‐syn/Syn
   Wild 8,906 7,026 1.27 3,271 1,508 2.17 8,620 6,621 1.30 3,557 1,912
   1.86 1.43
   ZB107 9,732 7,717 1.26 3,653 1,656 2.21 9,407 7,258 1.30 3,978 2,115
   1.88 1.43
   ZB306 7,673 6,139 1.25 2,610 1,186 2.20 7,313 5,798 1.26 2,970 1,527
   1.94 1.40
   [110]Open in a new tab

   Additionally, we detected a total of 141,876 small indels (≤10 bp)
   (Table [111]1); the majority (86.4%) were located in intergenic
   regions, with 13.6% (19,288) in genic regions. This corresponds to an
   indel density of one indel per 2.26 kb. Of the indels in genic regions,
   2,027 (10.5%) occurred in coding sequences, 1,062 (5.5%) in the UTRs,
   and 16,199 (84.0%) in introns. Potential frameshifts were predicted for
   1,279 of the coding indels. The distribution of indel length in coding
   regions was significantly different from that in the intergenic regions
   (Mann–Whitney U tests: wild, z = −3.742; p = .000183; ZB107,
   z = −3.745; p = .000181; ZB306, z = −3.742; p = .000183) demonstrating
   enrichment for multiples of three, most likely the result of purifying
   selection against frameshifts (Figure [112]3a). Importantly, SNPs and
   indels identified from our previous RNA‐seq data (Xu et al., [113]2014)
   were greatly consistent with (92% accuracy rate) those identified in
   this study, indicating that these genetic variations (SNPs and indels)
   across the entire castor bean genome were reliable.

Figure 3.

   Figure 3
   [114]Open in a new tab

   Distribution of polymorphisms in the castor bean genome. (a) Number of
   indels of different length in the coding sequences (CDS) and the whole
   genome. (b) Venn diagram indicating the unique and shared SNPs and
   indels between castor bean lines. (c) Venn diagram indicating the
   unique and shared large‐effect mutations between castor bean lines. (d)
   Annotation of large‐effect SNPs. The numbers of large‐effect SNPs for
   selected groups of genes are displayed. All the gene families shown
   were significantly abundant in large‐effect SNPs (Fisher's exact test,
   p < .05). (e) Number of gain and loss CNVs (relative to the Hale
   genome) in the entire genome and in genic regions in each castor bean
   line. (f) Venn diagram indicating the unique and shared genes affected
   by CNV

3.4. Polymorphisms between castor bean lines

   Comparative genomic analysis revealed a considerable amount of genetic
   variation (SNPs and indels) within and between castor bean lines. For
   SNP density in castor bean, we found one SNP per 442 in the wild and
   one SNP per 399 bp in the landrace line, whereas SNP density was lower
   (one SNP per 727 bp) in the cultivated line. Comparing the cultivated
   lines (Hale and ZB306) to the wild line, we found 722,358 and 933,549
   homozygous SNPs, and 80,564 and 80,438 indels, respectively, whereas
   only 197,732 homozygous SNPs and 23,421 indels were detected in
   landrace ZB107 relative to the wild line (Table [115]1). Similarly, we
   detected a large number of differences between ZB107 and Hale (800,517
   SNPs and 92,119 indels) and between ZB107 and ZB306 (964,020 SNPs and
   83,679 indels). While the number of differences between the two
   cultivars was lower, this was still relatively high (439,919 SNPs and
   58,926 indels; Table [116]1). The number of heterozygous SNPs in the
   wild and landrace genomes (150,120 and 149,444, respectively) was
   substantially higher than that in ZB306 (79,091) when all were compared
   with the Hale genome (Table [117]1).

   Of the 1.19 M SNPs, 626,478 (52.7%) were accession‐specific (including
   Hale), a large number (527,590; 44.3%) were shared between the wild and
   landrace (ZB107) lines and a relatively small number were shared
   between the wild and cultivated (ZB306) lines (10,179; 0.9%) or the
   landrace and cultivated lines (25,341; 2.1%) (Figure [118]3b). A
   similar pattern was found with the indels, with 62.4%
   accession‐specific, 31.2% shared between the wild and landrace
   accessions, and only 4.4% shared between wild and ZB306 or between
   landrace and ZB306 (Figure [119]3b).

3.5. Deleterious mutations and gene content variation

   By projecting the SNPs and indels onto the Hale genome gene models, we
   further analyzed the distribution of so‐called large‐effect variants,
   which are predicted to have a potential effect on gene function. In
   total, we found 2,230 large‐effect variations, comprising 212 SNPs that
   were predicted to alter start codons, 552 SNPs that were expected to
   alter stop codons, 187 SNPs that were predicted to disrupt splicing
   donor or acceptor sites, and 1,279 indels that induced a frameshift
   mutation (Table [120]S7). There were 1,429, 1,535, and 1,239
   large‐effect variants identified in the wild, ZB107, and ZB306
   accessions, respectively, relative to the reference genome (Table
   [121]S7). A large number were shared between accessions, but we
   identified 165, 246, and 498 accession‐specific large‐effect variants
   in the wild, ZB107, and ZB306 accessions, respectively (Figure
   [122]3c). Again, the wild and landrace accessions shared a large
   proportion of the mutations (580/2,230; 26.0%), whereas only a small
   proportion of large‐effect variants were shared between wild and ZB306
   or shared between ZB107 and ZB306 (89/2,230; 4.0%). These large‐effect
   variants were located in 1,904 genes, and of them, approximately 41%
   genes had no transcript support and 43% had no Pfam domain. This is
   consistent with their percentage in the genome (40% of genes had no
   transcript support and 39% had no Pfam domain), suggesting large‐effect
   mutations were no more or less common in the high‐confidence genes.
   These large‐effect SNPs are significantly (Fisher's exact test,
   p < .05) enriched in six Pfam families including leucine‐rich repeat
   regions, PPR repeat, and NB‐ARC domain families (Figure [123]3d).

3.6. Copy number variation

   Increasing evidence has shown that copy number variation (CNV) is
   prevalent in many eukaryotic species and makes a significant
   contribution to phenotypic variation in plants (Freeman et al.,
   [124]2006; Muñoz‐Amatriaín et al., [125]2013). We identified 1,776 CNVs
   in at least one genotype relative to Hale, which summed to about ~6% of
   the castor bean genome. Of these CNVs, 1,129 were identified in the
   wild accession, 1,065 in ZB107, and 699 in ZB306 (Figure [126]3e; Table
   [127]S8). Nearly equal numbers of gain and loss (relative to Hale) CNVs
   were discovered in these accessions. We detected a total of 438 genes
   which may potentially be affected by CNVs (Figure [128]3e; Table
   [129]S8). Of these, 72, 74, and 87 CNV‐affecting genes were exclusive
   to wild, ZB107, and ZB306, respectively (i.e., 75.6% of CNVs were
   accession‐specific; Figure [130]3f).

   Gene Ontology (GO) enrichment analysis revealed that there was no
   significant enrichment for these CNV‐affected genes in the wild and
   ZB306 line (Figure [131]S1a). However, for the genes affected by
   wild‐specific CNVs, there were nine GO terms significantly enriched
   (FDR < 0.05, Figure [132]S1b) including oxidoreductase and phosphatase
   activity, and nucleoside binding. These CNV‐affected genes in the wild
   accession have the potential to contribute to environmental adaptation.
   For example, several genes are involved in disease resistance, such as
   GTP‐binding protein beta 1 (AGB1, 30190.m010890) and R gene proteins
   (29838.m001648, 29838.m001649, 29838.m001650, 30090.m000236,
   30131.m007231, 29991.m000654, 29907.m000656, and 30158.m000501) (Table
   [133]S8). Interestingly, we found that a gene (60629.m00002) encoding a
   ricin protein had a greater copy number in the wild line relative to
   cultivated castor bean, though the exact ricin levels have not been
   surveyed in wild and cultivated lines yet. These results provide a
   resource for validating the hypothesis that some phenotypic variation
   in castor bean may be related to CNVs.

3.7. Structural variation

   We detected a large number of structural variants (SVs) in the castor
   bean genomes, including 6,016, 6,197, and 5,240 large (>50 bp)
   deletions and 2,595, 1,796, and 1,194 large insertions in the wild,
   ZB107, and ZB306 accessions, respectively, relative to Hale (Figure
   [134]4a). Although only a small number of SVs (inversions and
   intrachromosomal translocations) were identified in the castor bean
   individuals (Figure [135]4a), their average lengths were much greater
   than the large insertions and deletions, with an average size of 70 and
   15 kb for inversions and translocations, respectively (Figure [136]4b).
   The majority of SVs (about 86%) occurred in intergenic regions with the
   remainder (14%) in genic regions. Due to large insertions and deletions
   within genes likely to have significant functional consequences, these
   genes were further analyzed (Table [137]S9). In total, we identified
   2,443 genes containing 4,492 large insertions and deletions. Of these,
   1,958 (80.1%) were accession‐specific and 334 (13.7%) differentiated
   the wild and landrace lines from the cultivars (ZB306 and Hale) (Figure
   [138]S2a). Although those genes overlapping with large indels were not
   significantly enriched for any Pfam categories (Fisher's exact test,
   p > .05), we did find several Pfam categories involved in plant defense
   (e.g., leucine‐rich repeat and PPR repeat) among the ten most common
   Pfam families (Figure [139]S2b). As an example, the gene 29842.m003667
   (a cysteine‐rich RLK) was affected by an exonic deletion in the wild
   line relative to ZB306 and Hale (Figure [140]4c).

Figure 4.

   Figure 4
   [141]Open in a new tab

   Number and size of SV types. (a) The number of different types of SVs:
   deletion (DEL), insertion (INS), inversion (INV), and intrachromosomal
   translocations (ITX). (b) The size distribution of different types of
   SVs. (c) An example of a large deletion in the coding region of
   homogene 29842.m003667 in the wild line relative to cultivated castor
   bean. The blue tracks indicate reads mapping depth in wild and inbred
   ZB306 castor bean, respectively

3.8. Identification of candidate domestication genes underlying the
phenotypic traits

   To identify genetic variation which might be responsible for the
   phenotypic differentiation between perennial woody tree and annual crop
   of castor bean, we identified a set of genes with nonsynonymous SNPs
   that differentiated the wild and landrace accessions (both perennial
   woody trees) from the two cultivars Hale and ZB306 (annual semi‐woody
   crop). Similarly, we also selected a set of genes with indels that were
   identified in both wild and landrace lines but not in two cultivars.
   This identified 4,072 genes differentiating the wild tree and
   domesticated crop, a subset of which could underlie the domestication
   phenotypes. Kyoto Encyclopedia of Genes and Genomes (KEGG) enrichment
   analysis for these genes revealed that they were significantly present
   in five categories (p‐value < .05): (a) “Plant–pathogen interaction”
   including many disease resistance protein and NBS‐LRR genes; (b) “Plant
   hormone signal transduction” including auxin, gibberellin (GA),
   abscisic acid (ABA), and ethylene‐responsive protein; (c)
   “Phenylpropanoid biosynthesis,” including lignin synthesis pathway, for
   example, the hydroxycinnamoyl‐CoA shikimate/quinate hydroxycinnamoyl
   transferase (HCT), cinnamyl‐alcohol dehydrogenase (CAD), cinnamoyl‐CoA
   reductase (CCR), cinnamate 4‐monooxygenase (C4H), and 4‐coumarate‐‐CoA
   ligase 1‐like (4Cl); (d) “Carbon metabolism” including the citrate
   cycle, glycolysis/gluconeogenesis, and carbon fixation; and (e)
   “Biosynthesis of amino acids” (Table [142]S10).

   To further identify genes with evolutionary signatures of selection
   (i.e., Ka/Ks > 1) in the castor bean genome, we compared coding regions
   between the wild and Hale genomes to identify candidate domestication
   genes and between ZB306 and Hale to identify candidate diversification
   genes. As expected, the majority of genes exhibited Ka/Ks ≤ 1, while
   there were 238 and 202 candidate domestication and diversification
   genes, respectively, with Ka/Ks > 1 (Figure [143]5a; Table [144]S11).
   KEGG enrichment analysis for these genes revealed that they were mainly
   present in the pathways mentioned above. We found that six genes encode
   an LRR receptor‐like serine/threonine‐protein kinase, and nine genes
   encoding disease‐related proteins including many nucleotide‐binding
   site–leucine‐rich repeat genes (NBS‐LRR). In phenylpropanoid
   biosynthesis, there were six genes, including two encoding HCT, three
   encoding benzyl alcohol O‐benzoyltransferase, and one encoding CCR.
   Subsequently, we performed polymerase chain reaction (PCR) sequencing
   of the entire coding sequence of 28617.m000209 (encoding HCT) for the
   same six cultivars and six wild lines (Table [145]S3), due to its key
   function in lignin biosynthesis of Arabidopsis (Gallego‐Giraldo,
   Escamilla‐Trevino, Jackson, & Dixon, [146]2011). This revealed that the
   three nonsynonymous SNPs that differentiate the wild and Hale genomes
   were fixed within each population (Figure [147]5b). Additionally,
   several genes putatively under selection during domestication were
   found to participate in plant hormone signal transduction and
   photosynthesis (Table [148]S11). For example, four of these putatively
   encode a DELLA protein, cytokinin receptor (CRE1), auxin response
   factor (ARF2), and xyloglucan:xyloglucosyl transferase (XTH25), all of
   which are involved in plant hormone recognition or signal transduction.
   In addition, two genes putatively encode homologs of PsaA/PsaB protein
   and photosystem II reaction center protein B, which may function as a
   lighter receptor to promote photosynthesis.

Figure 5.

   Figure 5
   [149]Open in a new tab

   Selection pressure and genetic diversity analysis in castor bean. (a)
   Synonymous (Ks) and nonsynonymous (Ka) substitution rate between wild
   and cultivated (Hale) castor bean and between cultivated varieties
   (ZB306 vs. Hale). (b) Distribution of nonsynonymous sites within the
   candidate positively selected gene 28617.m000209. (c) Schematic
   representation of fatty acids and TAG biosynthesis. (d) Watterson's θ
   of oil‐related genes within wild and cultivated castor bean population

   Divergence between the cultivars ZB306 and Hale was also analyzed due
   to their phenotypic variations and two distinctly different purpose of
   domestication. One (ZB306) form of selection was for better seed oil
   content and yield, and the other (Hale) was for lower plant height and
   dwarf. KEGG pathway enrichment analysis revealed that pathways related
   to unsaturated fatty acids and photosynthesis were overrepresented,
   which may contribute to differences in seed oil composition and yield
   between the cultivars.

3.9. Nucleotide variation and domestication at the oil‐related gene loci

   Seed oil content is one of the most important economic traits in castor
   bean, and we found significant difference in oil content between wild
   and cultivated lines (p‐value < .01, see Table [150]S3). However, we
   did not find evidence that oil‐related genes showed nonsynonymous
   differences between the wild and cultivated types or evidence for
   selection. One potential reason is that the only one wild and one
   cultivated castor bean was considered, but recent selection may be
   detectable as a selective sweep without the need for a high Ka/Ks
   ratio. Thus, we extended our research to analyze population‐level
   polymorphism and divergence using six wild lines and six cultivated
   castor bean lines as above (Table [151]S3). We selected 16 key genes
   mainly involving into the fatty acids biosynthesis pathway (including
   GPDH: glycerol‐3‐phosphate dehydrogenase, ACCase: acetyl CoA
   carboxylase, KASI: 3‐ketoacyl‐ACP synthase I, KASII: 3‐ketoacyl‐ACP
   synthase II, and KASIII: 3‐ketoacyl‐ACP synthase III), the fatty acid
   chain modification processes (such as FAH12: oleate 12‐hydroxylase and
   FAD3: omega‐3 fatty acid desaturase), and the assembly pathway of
   triacylglycerol (including GPA: glycerol‐3‐phosphate acyltransferase,
   LPA: lysophosphatidic acid acyltransferase, PAP: phosphatidate
   phosphatase, PDCT: phosphatidylcholine:diacylglycerol choline
   phosphotransferase, CPT: diacylglycerol cholinephosphotransferase,
   DGAT1: type 1 diacylglycerol acyltransferase, DGAT2: type 2
   diacylglycerol acyltransferase, PDAT2:
   phosphatidylcholine:diacylglycerol acyltransferase 2, and oleosin2)
   (Figure [152]5c, the full name listed in Table [153]S2), and sequenced
   these from the panel and analyzed their sequence diversity.

   The regions analyzed covered a total of 12.48 kb, corresponding to
   5.54 kb of coding sequence and 6.94 kb of noncoding regions (including
   intron and 3' UTR). We found that seven genes (ACCase, KASIII, GPA,
   FAH12, PDAT2, DGAT2, and PDCT) harbored no nucleotide polymorphism in
   the tested regions, whereas for the remaining nine genes (GPDH, KASI,
   KASII, LPA, PAP, CPT, FAD3, DGAT1, and oleosin2) we uncovered 38 SNPs
   and 40 indels across all accessions (Table [154]S12). Two of these (PAP
   and DGAT1) contained no SNPs in the coding regions, and none of the 38
   SNPs were nonsynonymous. Nucleotide diversity for these nine loci is
   shown in Table [155]S12.

   We compared genetic variation at these nine genes between wild and
   domesticated castor bean lines to that found in five neutral genes
   (Figure [156]5d). On average, a reduction in sequence diversity
   (estimated here as Watterson's θ) was found in the cultivars relative
   to the wild population for both neutral genes and candidate genes
   (Figure [157]5d; Table [158]3); however, this value was quite variable
   in the oil‐related genes. For four genes (DGAT1, KASI, LPA, and PAP),
   we found reduced nucleotide diversity in the cultivars in both the
   coding and noncoding regions. This was significant for KASI and LPA in
   the maximum‐likelihood Hudson–Kreitman–Aguade (ML‐HKA) test (p < .05).
   For the remaining genes, we observed similar or slightly higher
   nucleotide diversity in the cultivars, relative to the wild lines
   (Figure [159]5d). For FAD3, this slight increase was observed in the
   coding region, whereas for oleosin2 this was in the intronic region.
   For oleosin2, the greater polymorphism in the cultivars was significant
   in the ML‐HKA test (p < .05; Table [160]3). It appears therefore that
   the oil‐related genes have experienced a range of selection pressures,
   in some cases being swept of diversity during domestication and in
   others diversity has been maintained during crop improvement.

Table 3.

   Genetic diversity (Watterson's θ) for five neutral genes (N) and nine
   putative domestication genes (D) sampled from wild and cultivated
   castor bean populations
   Type Locus Annotation Watterson's θ ML‐HKA p‐values
   Wild Cultivar Wild Cultivar
   N 27894.m000793 Conserved hypothetical protein 0.0025 0.0019 .1590
   .2168
   N 30078.m002350 Conserved hypothetical protein 0.0014 0.0013 .2160
   .3113
   N 30042.m000473 Polygalacturonase precursor 0.0097 0.0097 .0716 .0799
   N 29790.m000826 Conserved hypothetical protein 0.0032 0.0029 .3001
   .2506
   N 30170.m013951 Conserved hypothetical protein 0.0025 0.0025 .1607
   .1178
   D 30138.m003845 Diacylglycerol cholinephosphotransferase (CPT) 0.0019
   0.0019 .1142 .1145
   D 29912.m005373 Type 1 diacylglycerol acyltransferase (DGAT1) 0.0019
   0.0005 .5231 .1082
   D 29681.m001360 Omega‐3 fatty acid desaturase (FAD3) 0.0007 0.0008
   .1510 .1414
   D 29745.m000373 Glycerol‐3‐phosphate dehydrogenase (GPDH) 0.0016 0.0016
   .1857 .1068
   D 29693.m002034 3‐Ketoacyl‐ACP synthase I (KAS I) 0.0038 0.0024 .3970
   .0349*
   D 29739.m003711 3‐Ketoacyl‐ACP synthase II (KAS II) 0.0019 0.0020 .0538
   .1042
   D 27810.m000646 Lysophosphatidic acid acyltransferase (LPA) 0.0008
   0.0000 .3786 .0196*
   D 29660.m000760 Phosphatidate phosphatase (PAP) 0.0004 0.0000 .1278
   .0701
   D 30147.m014333 Oleosin2 (Ole2) 0.0014 0.0018 .1444 .0228*
   [161]Open in a new tab

   p‐value of candidate gene was obtained from a maximum‐likelihood
   Hudson–Kreitman–Aguade (ML‐HKA) test against five neutral genes.
   Asterisks represent a significant ML‐HKA result (p < .05).

4. DISCUSSION

4.1. Population structure analysis of castor bean

   Castor bean has been agriculturally cultivated as an annual semi‐woody
   oilseed crop in various regions of the world. Owing to the scarcity of
   wild castor bean samples for analysis in previous studies, the
   understanding of the domestication origin and evolution remains
   unknown. In this study, we remedied this by collecting and analyzing
   seven wild castor bean accessions from eastern Africa and investigated
   the population structure of wild and cultivated castor bean from
   throughout its global distribution. We revealed that wild lines form a
   distinct group from cultivated accessions, strongly implying that
   eastern Africa is likely the center of origin and that genetic
   divergence between the African and cultivated accessions has taken
   place. The cultivated castor bean accessions worldwide were divided
   into two clades with some, but not complete, distinction based on
   geography that was consistent with previous studies (Foster et al.,
   [162]2010; Qiu et al., [163]2010; Rivarola et al., [164]2011). This
   genetic structure could be due to a small number of wild plants being
   introduced and domesticated, alongside extensive subsequent
   international movement of domesticated castor bean (Foster et al.,
   [165]2010).

4.2. Genetic diversity in castor bean genome

   Whole‐genome sequencing for representative lines (one wild, one
   landrace, and one cultivar) reveals a large number of genetic variants
   including SNPs, indels, CNVs, and SVs within castor bean genome. SNP
   density across our four castor bean genomes (3.72 per kb) was greater
   than that reported in woody trees such as in bamboo (1.0 per kb, Peng
   et al., [166]2013), peach (1.5 per kb, Verde et al., [167]2013), and
   poplar (2.6 per kb, Tuskan et al., [168]2006), while the density was
   low relative to grape (4.2 per kb, Velasco et al., [169]2007). Also, we
   found a disparity in the Non‐syn/Syn ratio in coding regions (Table
   [170]2) between the high‐confidence genes (having transcript support)
   and the low‐confidence genes (having no transcript support), which
   could suggest relaxed selection in the latter. Because there was no
   transcript support, this could mean that they are lowly expressed, or
   expressed in tissues not previously reported (Brown et al., [171]2012;
   Xu et al., [172]2013), or alternatively these predicted genes are in
   fact not true genes and the increased Non‐syn/Syn ratio was elevated
   because a subset of these loci are noncoding regions of the genome.
   Taken together, these results suggest that a large amount of genetic
   variation is present in the castor bean genome, unlike previous reports
   where low genetic diversity was revealed (Foster et al., [173]2010; Qiu
   et al., [174]2010; Rivarola et al., [175]2011), and that so far wild
   germplasm is largely untapped in castor bean.

4.3. Large‐effect mutations in castor bean genome

   We found that several genes may be affected by large‐effect mutations
   including SVs, and these were enriched in disease/stress‐related gene
   families such as NB‐ARC and leucine‐rich repeat domains, consistent
   with previous reports of these categories of genes harboring greater
   polymorphism than other categories of genes (Clark et al., [176]2007;
   Lai et al., [177]2010; McNally et al., [178]2009; Zheng et al.,
   [179]2011). Because of their role in pathogen response, it is not
   unexpected that these proteins are among the most variable. Our results
   also uncovered substantial numbers of indels, and those that are
   multiples of 3 bp were significantly enriched in coding regions
   relative to noncoding regions, presumably because nontriplet indels in
   coding regions interrupt the reading frame and are deleterious. CNVs
   usually affected the individual phenotype by altering gene expression
   dosage (Stranger et al., [180]2007). We identified thousands of CNVs in
   castor bean genome, and genes that were affected by wild‐specific CNVs
   were significantly enriched in GO terms related to disease resistance
   such as GTP‐binding protein beta 1 gene and R genes, which could confer
   strong pest resistance in the wild.

4.4. Genetic variation associated with the transition from a wild perennial
tree to a cultivated annual crop

   Comparative genomic analyses revealed a considerable amount of genomic
   variation (SNPs and indels) between wild and cultivars (ZB107 and Hale)
   (Table [181]1), highlighting the genetic divergence between wild and
   cultivars. Additionally, there was a relatively high density of SNPs
   and high percentage of heterozygous SNPs in the wild genome relative to
   cultivated castor bean, supporting the hypothesis that cultivated
   castor bean is most likely introduced from a small number of wild
   progenitors and has undergone a selective sweep limiting the genetic
   variation during domestication. Meanwhile, during domestication the
   growth habit of this plant was substantially changed from perennial
   woody trees to annual semi‐woody cultivars. Our results showed that
   genes strongly differentiated between tree and crop were significantly
   enriched in the process of “Plant–pathogen interaction,” consistent
   with their phenotype difference in disease resistance. Moreover, nine
   NBS‐LRR genes in this pathway showed strong positive selection. This
   may more likely be due to the different growth condition: Cultivated
   crops usually grow under milder biotic stresses than wild trees in
   natural environments. Similarly, studies on perennial woody plant
   genome have revealed overrepresentation of genes responsive to biotic
   stress, especially genes in the NBS‐LRR family (Neale et al.,
   [182]2017). Alternatively, it is possible that the cultivated castor
   bean genomes experienced bottleneck or intense inbreed during
   domestication, resulting in disease‐related genes differentiation
   between wild and cultivated castor bean. Woody growth is another
   typical characteristic of trees, exemplified by wild castor bean. A
   large number of genes identified as putatively under diversifying
   selection are involved in lignin synthesis. Among them, two HCT genes
   in the phenylpropanoid biosynthesis pathway exhibited evolutionary
   signatures in castor bean genome, and previous evidence demonstrates
   that HCT plays a central role in lignin biosynthesis in Arabidopsis
   (Gallego‐Giraldo et al., [183]2011); these represent important
   candidates to follow up in relation to adaptive trait evolution during
   domestication. Comparative genome analyses have revealed the remarkable
   differentiation in lignin synthetic genes between woody and herbaceous
   plants, suggesting that these genes might be associated with the
   difference in woody hardness between woody and herbaceous plants (Neale
   et al., [184]2017), and this deserves further study. Additionally, we
   found that many of differentiated genes between wild and cultivated
   castor bean were enriched in the pathways such as “plant hormone signal
   transduction” and “carbon metabolism,” and some genes such as CRE1,
   ARF2, and XTH25 were identified as candidate domestication genes. It is
   still unclear whether these genes were associated directly or
   indirectly with continuous growth and flowering in wild castor bean
   tree. In short, these identified genes involved in certain pathways
   likely play a critical role in genetically distinguishing trees from
   nontree plants (Neale et al., [185]2017), providing new insights into
   understanding the unique features, complexity, and domestication of
   tree genomes. It should be noted that more wild and cultivated lines
   should be selected for genome resequencing to better understand the
   origin and domestication process of castor bean tree genomes (Kantar,
   Nashoba, Anderson, Blackman, & Rieseberg, [186]2017).

4.5. Potential selection of oil‐related genes during domestication

   We did not see significant sequence differentiation or signatures of
   selection in oil‐related genes between wild and cultivated castor bean
   lines, though they have significant difference in seed oil content.
   Using population genetic ML‐HKA tests, we only found evidence for a
   significant difference in diversity in three (KASI, LPA, and oleosin2)
   of the 16 genes (seven were monomorphic and could not be tested in the
   ML‐HKA). The signature of selection found for the genes KASI and LPA
   represents a significant drop in diversity during domestication and may
   be due to human selection and result in the altered oil content of
   cultivated castor bean. Interestingly, we found a significantly
   increased diversity of oleosin2 gene in cultivated lines, though the
   biological relevance and molecular basis remain unknown. Our results
   provide clear statistical evidence that these three genes exhibited
   significantly departure from neutrality, but it should be noted that
   these genes are not single copy in the castor bean genome (Chan et al.,
   [187]2010) and their contributions to oil production need to be
   inquired by further genetic analysis. A broader sampling of cultivated
   castor bean may indeed reveal further candidate genes under selection
   between groups with different oil content. Key genes for castor bean
   oil production include ACCase, GPA (Brown et al., [188]2012), FAH12
   (Smith, Moon, Chowrira, & Kunst, [189]2003; Van de Loo, Broun, Turner,
   & Somerville, [190]1995), DGAT (Burgal et al., [191]2008; He, Turner,
   Chen, Lin, & McKeon, [192]2004; Kroon, Wei, Simon, & Slabas,
   [193]2006), and PDAT (Kim et al., [194]2011); however, these exhibited
   no variation in our sample of 12 accessions and hence may be under
   strong purifying selection in both the wild and cultivated gene pools.
   In sunflower, another oilseed crop bred for its oil content, the ML‐HKA
   test was used to test for selection during domestication on genes
   involved in fatty acid biosynthesis and identified a number of
   candidate genes having exhibited selective sweeps (Chapman & Burke,
   [195]2012). In addition, it should be noted that we only placed
   considerable focus in finding intragenic sequence polymorphisms and
   relatively little focus on polymorphisms in the promoter region that
   may be regulatory in nature and affecting gene expression levels.
   Genome‐wide association studies employing more broad samples will be
   necessary to identify the domestication loci related to seed oil
   content in castor bean (Kantar et al., [196]2017).

   Overall, we have revealed a rich amount of genetic variation in the
   castor bean genome and have identified many candidate genes and key
   pathways potentially involved in the transition from a tall and
   perennial woody tree to a dwarf and annual semi‐woody plant. ML‐HKA
   tests showed only three oil‐related genes with the evidence of
   selection during castor bean domestication. In particular, we found
   genetic differentiation between the wild and cultivated accessions,
   suggesting that castor bean has undergone a genetic bottleneck during
   domestication. These results largely extend our understanding of the
   domestication origin, genomic variation, and evolution of castor bean.
   The genomic variation identified within wild and cultivated castor bean
   will provide a valuable resource for future gene–trait associations and
   as tools for castor bean breeding.

CONFLICT OF INTEREST

   The authors declare no conflict of interest associated with the work
   described in this manuscript.

AUTHOR CONTRIBUTIONS

   AL and WX conceived the project. TY and LQ contributed to the analysis
   of population structure between wild and cultivated castor bean and the
   development of EST‐SSR markers. WX, DZL, and AL collected the samples
   and investigated the phenotypic variation among castor bean accessions.
   WX and MAC analyzed all the data, and WX prepared figures and tables.
   WX, MAC, and AL wrote the paper. All authors read and approved the
   final manuscript.

Supporting information


   [197]Click here for additional data file.^ (264.1KB, pdf)


   [198]Click here for additional data file.^ (136.6KB, pdf)


   [199]Click here for additional data file.^ (13.8KB, xlsx)


   [200]Click here for additional data file.^ (12KB, xlsx)


   [201]Click here for additional data file.^ (27.5KB, docx)


   [202]Click here for additional data file.^ (32.1KB, xlsx)


   [203]Click here for additional data file.^ (27KB, docx)


   [204]Click here for additional data file.^ (19.8KB, docx)


   [205]Click here for additional data file.^ (20.8KB, docx)


   [206]Click here for additional data file.^ (68.2KB, xlsx)


   [207]Click here for additional data file.^ (635.1KB, xlsx)


   [208]Click here for additional data file.^ (17.2KB, xlsx)


   [209]Click here for additional data file.^ (42.5KB, xlsx)


   [210]Click here for additional data file.^ (22.9KB, docx)

ACKNOWLEDGMENTS