Abstract Relatively, little is known about the genetic variation of woody trees during domestication. Castor bean (Ricinus communis L. Euphorbiaceae) is a commercially important nonedible annual oilseed crop and differs from its wild progenitors that have a perennial woody habit. Although castor bean is one of the oldest cultivated crops, its domestication origin, genomic variation, and potential targets of selection underlying domestication traits remain unknown. Here, we performed a phylogenetic analysis, which suggests that the wild accessions were distinctively separated from the cultivated accessions. Genome sequencing of three accessions (one each wild, landrace, and cultivar) showed a large number of genetic variants between wild and cultivated castor bean (ZB306 or Hale), and relatively few variants between cultivar ZB306 and Hale. Comparative genome analysis revealed many candidate genes of selection and key pathways potentially involved in the transition from a perennial woody tree to annual crop. Interestingly, among 16 oil‐related genes only three showed evidence of selection and the remainder showed low genetic variation at the population level, suggesting strong purifying selection in both the wild and domesticated gene pools. These results extend our understanding of the origin, genomic variation, and domestication, and provide a valuable resource for future gene–trait associations and castor bean breeding. Keywords: castor bean, domestication selection, genetic variation, seed storage lipids, woody tree 1. INTRODUCTION In addition to natural or environmental selection,s the morphology and genetic diversity of species can be shaped by human activity (Olsen & Wendel, [34]2013; Shi & Lai, [35]2015; Wright, [36]2015). Hundreds of wild plants have been selected by humans, giving rise to a multitude of major and minor crops with remarkably modified agricultural or horticultural traits (Doebley, Gaut, & Smith, [37]2006). Recent comparative genetic analyses have greatly advanced our understanding of questions concerning the geographic origins of crops, the change in genetic diversity, and the molecular basis underlying important agricultural traits (Shi & Lai, [38]2015). Only a few studies have focused on the evolutionary history and domestication origin of woody trees, and have shown that the woody trees possessed a different evolutionary rate from annual domesticated crops (Delplancke et al., [39]2013; Miller & Schaal, [40]2006; Verde et al., [41]2013). However, in general, relatively little is known about woody tree domestication. In particular, it is still unclear how perennial woody trees could be modified into annual crops by human cultivation or what extent genomic variation is modified in response to artificial selection in perennial woody trees. Castor bean (Ricinus communis L., Euphorbiaceae) is an important nonedible oilseed annual semi‐woody crop. Its seed oil contains a high ricinoleic acid concentration (~90%), an unusual hydroxylated fatty acid, and is a vital industrial raw material (Ogunniyi, [42]2006). Considering the high economic value and the easy growth in unfavorable environments, castor bean has been extensively cultivated in tropical and subtropical countries. Anthropological studies have showed that castor bean was used by humans dating back more than 6,000 years (Chan et al., [43]2010; Hayes, [44]1953). However, owing to castor bean probably being introduced multiple times in many regions, predominant outcrossing under natural conditions (Brigham, [45]1967; Meinders & Jones, [46]1950), and its widespread escape from cultivation (Weber, [47]2003), the origin of castor bean domestication is still unknown (Foster et al., [48]2010). According to morphological features, it is supposed that eastern Africa (Ethiopia and Kenya) might be a candidate site of origin (Vavilov, [49]1951; Zeven & Zhukovsky, [50]1975); however, this supposition lacks supporting evidence. Studies of germplasm collected throughout the world reveal low levels of genetic variation and lack of geographically structured genetic populations in castor bean, regardless of marker system used (Allan et al., [51]2008; Foster et al., [52]2010; He, Xu, Li, Wang, & Liu, [53]2017; Qiu, Yang, Tian, Yang, & Liu, [54]2010; Rivarola et al., [55]2011). Few studies have examined genetic variation and population structure of castor bean accessions from eastern Africa due to this wild germplasm being unavailable. Sampling of wild castor bean from eastern Africa would facilitate an understanding of the origin and domestication of this species and help reappraise genetic diversity at the genomic level. Based on our extensive field investigation, we found that castor bean accessions from eastern Africa usually display traits typical of wild species, such as extremely small seeds, natural seed dispersal, strong disease/stress resistance, and a perennial and woody tree phenotype with an elongated stem. In stark contrast, most cultivated castor bean accessions are annual, semi‐woody crops, the result of marked phenotypic modifications during domestication. Thus, castor bean provides an excellent model to dissect the genetic mechanism of trait variation in woody trees during domestication, and comparative genomics of wild and domesticated castor bean would also provide some new insights into understanding fundamental issues of tree biology such as perennial versus annual habit, woody growth, and adaptation to the environment (Neale, Martínez‐García, De La Torre, Montanari, & Wei, [56]2017). Here, we investigated the phylogenetic relationship between wild and cultivated castor bean, assessed genome‐wide genetic variation during domestication, and identified potential domestication loci throughout the genome. The results obtained in this study provide novel insights into not only genetic diversity and population structure, but also the effect of human cultivation on the domestication of castor bean, and potentially how a perennial woody tree can evolve into an annual crop. 2. MATERIALS AND METHODS 2.1. Collection of wild accessions and investigation of population structure We collected seven wild castor bean accessions from eastern Africa and obtained another 76 castor bean accessions, representing the worldwide distribution (Table [57]S1). Seeds of each accession were germinated at Kunming Institute of Botany (Kunming, Yunnan, China). Three‐week‐old seedlings (one to four per accession) were then frozen in liquid nitrogen, and nuclear DNA was extracted using the DNeasy Plant Mini Kit (Qiagen). We firstly employed EST‐SSR markers (expressed sequence tag‐simple sequence repeat), including those developed by Qiu et al. ([58]2010) alongside markers newly developed here using the same approach as Qiu et al. ([59]2010), to investigate genetic relationships among germplasm. The phylogenetic tree was generated with neighbor‐joining and UPGMA methods (Tamura et al., [60]2013). Further, bootstrapping analysis with 1,000 replicates was carried out using the software FREETREE V.0.9.1.50 (Pavlícek, Hrdá, & Flegr, [61]1999) and bootstrap values >50 were displayed. Population structure was analyzed using the software STRUCTURE 2.3.1 (Falush, Stephens, & Pritchard, [62]2003) with values of K (the potential number of clusters) tested from 1 to 9 with six independent runs of 10,000 iterations after a burn‐in period of 10,000 iterations. The optimal K was determined using the delta K method (Evanno, Regnaut, & Goudet, [63]2005) using STRUCTURE HARVESTER (Earl & von Holdt, [64]2012). 2.2. Genome resequencing Based on phylogenetic analysis, three castor bean accessions representing one wild accession (WT001 from Kenya), one landrace (ZB107 from China), and one cultivar (ZB306 from China) were selected for whole‐genome sequencing. ZB107 and ZB306 are genetically divergent and are important for castor bean breeding in China. Their morphological features including plant height, branching, stem diameter, seed oil content, and growth circle were characterized based on three seasons of observations under natural conditions at Kunming Institute of Botany. Genomic DNA was sonicated to fragments of ca. 500 bp, and sequencing libraries were constructed using TruSeq® DNA Library Prep Kits following the manufacturer's instructions (Illumina). Libraries were sequenced for 90 cycles (paired‐end) on the HiSeq 2000 platform in the Shenzhen‐BGI (China). Raw data were preprocessed to filter out adapter sequences, contaminated sequences, and low‐quality reads using Trimmomatic ver. 0.36 (Bolger, Lohse, & Usadel, [65]2014), with the following parameters: ILLUMINACLIP:TruSeq3‐PE.fa:2:30:10:8 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:75 AVGQUAL:20. The quality of clean data was checked by FastQC software. The clean data have been submitted to NCBI Short Read Archive under accession number SRS5353940. 2.3. Identification of SNPs and short indels The clean reads were then mapped to the reference genome (Hale, [66]http://castorbean.jcvi.org) using the command “bwasw” in BWA (ver. 0.5.8) with default parameters (Li & Durbin, [67]2009). Reads which mapped to multiple chromosomal positions and unmapped reads were discarded. SAMtools (Version 1.3.1, Li et al., [68]2009) was used to convert mapping results to bam format and to remove PCR duplicates generated during sequencing. We then applied the “mpileup” command of SAMtools to call raw SNP variants including homo‐ and heterozygous sites. Subsequently, SNPs in each individual were further filtered by sufficient base quality (Q values >20), at least five reads supporting each SNP and more than 5‐bp interval between two adjacent SNPs using vcftools ver.0.1.16 together with a custom Perl script. The remaining SNPs were retained for subsequent analysis. The methods and criteria of Zheng et al. ([69]2011) were used to identify short insertions and deletions (indels) (≤10 bp). Briefly, the paired‐end reads were first aligned to the reference sequence by allowing up to 10‐bp gaps using BWA ver. 0.5.8 with “aln ‐o 10” parameter, and then the redundant pairs were merged before calling indels. Gaps supported by at least five paired‐end reads were identified, and only homozygous indels were considered when the number of the un‐gapped reads that crossed a potential indel was no more than twice that of the gapped reads. Based on the previous gene model of the castor bean genome (Chan et al., [70]2010), genetic variants identified above (SNPs and indels) were further annotated as being present in exons, introns, untranslated regions of coding regions, and intergenic regions. The effect of coding sequence variants was further classified into so‐called “large‐effect variants” (including start/stop codon gains or losses, putative splice site mutations, and frameshifts), nonsynonymous variants, and synonymous variants. 2.4. Validation of SNP and indels To evaluate the accuracy of SNPs and indels identified in this study, we used the same method to call SNPs and indels from our previous RNA‐seq data from two of the same accessions (ZB107 and ZB306; downloaded from NCBI SRX485027, reported by Xu, Dai, Li, & Liu, [71]2014). SNPs and indels identified from the two different datasets were compared to ascertain the effectiveness of our SNP and indel calling pipeline. 2.5. Copy number and structural variation We employed readDepth (Miller, Hampton, Coarfa, & Milosavljevic, [72]2011) to identify copy number variants (CNVs). ReadDepth uses a binning procedure to call CNVs based on read mapping depth, recalibrated by mappability and GC content, and then calls segment boundaries using a circular binary segmentation algorithm. In order to increase the confidence of our CNVs, we calculated the average read depth (RD) for the identified regions of the castor bean genome for each of the three accessions. Only regions with RD >3 standard deviations and significantly different from the genome‐wide mean RD (FDR < 0.01) were considered. CNVs were classified based on whether they exhibited significantly higher (gain CNVs) or lower (loss CNVs or CNVs decrease) read depth than average. BreakDancer (Chen et al., [73]2009; Fan, Abbott, Larson, & Chen, [74]2014) was used to detect genomic structural variants (SVs) including large insertions and deletions, inversions, and intrachromosomal translocations, by comparing the genomic coordinates of each end of a paired‐end read against the distribution of the insert sizes. Any SV for each individual with support from at least three abnormal paired‐end reads was considered. SVs were further filtered to remove those found in repetitive regions of the genome and those that represented (highly implausible) deletions of more than 100 kb. Overlaps between CNVs or SVs and gene regions were detected using bedtools (ver. 2.17.0, [75]http://bedtools.readthedocs.org/en/latest/index.html). To reduce false identification of genes affected by CNVs or SVs, we only considered CNVs or SVs which overlapped at least 50% of a gene. Gene Ontology (GO) enrichment analyses were performed to determine whether any functions were enriched in the lists of genes affected by CNVs and SVs. 2.6. Identification of candidate selection genes The ratio between Ka (the number of nonsynonymous substitutions per nonsynonymous site) and Ks (the number of synonymous substitutions per synonymous site) is used to estimate the level of selective constraint on a locus, where Ka/Ks values >1 and <1 are suggestive of positive selection and purifying selection, respectively. Pairwise estimates of Ka, Ks, and Ka/Ks were calculated between each pair of accessions using the Ka_Ks calculator (Zhang et al., [76]2006) following the YN method. Those genes with Ka/Ks > 1 between wild and Hale or between ZB306 and Hale were identified for further characterization. To investigate whether particular pathways were overrepresented in the list of genes with Ka/Ks > 1, KEGG enrichment analyses were performed using the OmicShare tools ([77]www.omicshare.com/tools). 2.7. Selection analyses for candidate genes involved in oil biosynthesis and assembly Here, we selected 16 key genes that involved in the synthesis and modification of fatty acids as well as triacylglycerol biosynthesis (Table [78]S2). For multicopy gene families of proteins, one candidate locus was selected based on its high expression in castor bean seeds (Brown et al., [79]2012). In addition, five genes with a Ka/Ks ratio approximately equal to 1 were considered as presumptively neutral loci (neutrality was tested, see below). Primers are listed in Table [80]S2. We amplified and sequenced these loci from 12 castor bean accessions, comprising six cultivars and six diverse wild‐like lines (Table [81]S3). PCR was performed with the following conditions: 94°C for 3 min, followed by 30 amplification cycles at 94°C for 30 s, 60°C for 30 s, and 72°C for 60 s, and a final extension at 72°C for 7 min. PCR products were purified using a PCR purification kit (Tiangen, Beijing, China), cloned into pGEM‐T vectors (Tiangen), and transformed into competent Trans5 α E. coli (TransGen, Beijing, China). Six positive clones were randomly selected for sequencing. The oil content of seed from the 12 castor bean accessions was also investigated according to the method of Xu Wang and Liu ([82]2011). Maximum‐likelihood Hudson–Kreitman–Aguade tests (ML‐HKA version2.cpp, Wright & Charlesworth, [83]2004) were employed to test each locus for departure from neutrality, with the parameters number of segregating sites, number of silent sites, number of divergent sites, and Watterson's ([84]1975) estimate of diversity (θ) estimated using DnaSP6 (Rozas et al., [85]2017). In brief, two different models (100,000 simulations each) were performed: one for a strictly neutral model in which all loci were considered as to be neutrally evolving; the other for a selection model in which each locus was considered in turn as being under selection, relative to the other loci. Tests were performed separately for the wild and cultivar population. Two times the difference in log‐likelihoods of the models was then used in a chi‐square test with one degree of freedom to determine statistical significance. This was first run to test that each of the five putatively neutral loci was in fact not demonstrating departure from neutrality, and once this was confirmed, each of the selection candidates was tested in turn against the five neutral loci. 3. RESULTS 3.1. Characterization of population structure in castor bean In this study, we firstly applied 261 polymorphic EST‐SSR markers (Table [86]S4) to analyze the relationships between 83 individuals of castor bean (Figure [87]1a and Table [88]S1). The phylogenetic analysis revealed that castor bean is divided into two main groups (I and II; Figure [89]1b), and this was consistent with the STRUCTURE results which suggested that K = 2 was the most strongly supported (Figure [90]1c,d). Group I consisted of ten castor bean accessions from Africa, including seven lines identified morphologically as wild castor bean (see bold branches in Figure [91]1b) plus three wild‐like individuals (which display similar phenotype to the wild but the information of collection location was missing). Group II contained the remaining accessions and was broadly separated into accessions from South Africa and the Middle East (Turkey, Iran, Afghanistan, Pakistan), and the rest of the world, although there was not complete delineation. STRUCTURE showed moderate support for K = 3 which broadly agreed with the phylogenetic analysis. For K = 3, the African accessions were identified as one cluster (n = 10, numbered 1–10 in Figure [92]1b,d), accessions from the Middle East as the second (n = 18, numbered 11–28 in Figure [93]1b,d; all with > 50% membership to cluster 2, but some with admixture), and the remaining accessions as the third (n = 55, again several accessions demonstrating admixture). Within each of these three groups, there was no obvious geographic structuring of accessions. Figure 1. Figure 1 [94]Open in a new tab An overview of the castor bean collection. (a) The core collection of 83 lines comes from a wide geographic distribution. (b) Analysis of the phylogenetic relationships between castor bean accessions based on 261 EST‐SSR markers developed in this study. Bootstrap support >50% is indicated. Two distinct groups (I and II) were evident. Bold black branches indicate morphologically wild accessions. The solid black triangles indicate wild, landrace (ZB107), and cultivar (ZB306) accessions used in the genome sequencing. (c) Delta K plot showing a peak at K = 2 for the STRUCTURE results. (d) Model‐based cluster analysis of the core set using the program STRUCTURE. Values of K (number of clusters) from 2 to 7 are displayed. The most supported model was K = 2 3.2. Choice of castor bean lines used for resequencing Based on phylogenetic analysis, three accessions (WT001, ZB107, and ZB306) were chosen for whole‐genome resequencing. Their morphological features are listed in Table [95]S5 based on our field observations for three years. The wild and landrace lines are taller with a greater stem diameter than the two cultivars (ZB306 and Hale). Meanwhile, the wild and landrace lines are perennial and resistant to disease, while the two cultivars are annual and susceptible to disease (Table [96]S5). Other traits of agronomic benefit are exhibited in the cultivars, for example, a many‐branched stem, high oil content (see also later), and high seed yield. In addition, the landraces and cultivars lack natural seed dispersal that is present in the wild. 3.3. Genome‐wide genetic variation in castor bean Genome sequencing of each castor bean line yielded approximately 10.2 Gb of high‐quality sequence, corresponding to ~30‐fold genome coverage for each line (Table [97]S6). We found substantial numbers of genetic variants including single nucleotide polymorphism (SNP), short (≤10 bp) insertion and deletion (indel), copy number variation (CNV), and structure variation (SV) throughout the castor bean genome (Figure [98]2). In total, we obtained 1,189,858 homozygous SNPs and 233,955 heterozygous SNPs among these castor bean genomes using stringent filtering criteria (see Table [99]1). Among them, heterozygous sites were made up 16.4% of all SNP sites; in subsequent analyses, only homozygous SNPs were used. The majority (90.0%) of homozygous SNPs were located in intergenic regions (1,070,618), while only 10.0% (119,240) were found in genic regions or untranslated regions (UTRs, Table [100]1). This corresponds to a SNP density of 3.72 per kb (one SNP per 269 bp) in the castor bean genome. The genic SNPs comprised 973 and 1917 SNPs in 5′ UTR and 3′ UTR region, 77,631 in introns, and 38,719 in exons (Table [101]1). Of the SNPs in exons, 764 putatively affected start/stop codons, 22,297 were nonsynonymous, and 15,658 were synonymous SNPs, resulting in average nonsynonymous to synonymous substitution ratio (Non‐syn/Syn) of 1.42. On a per individual basis, the value of Non‐syn/Syn was slightly lower for cultivated castor bean (1.40) than for the wild (1.43) and landrace lines (1.43) (Table [102]2). The Non‐syn/Syn value in the high‐confidence genes (i.e., those having transcript support, reported by Brown et al., [103]2012 and Xu, Li, Ling, & Liu, [104]2013) was smaller than in low‐confidence genes (without transcript support; Table [105]2). It was also found that in the gene set with Pfam support (retrieved from the castor bean gene annotation file, [106]https://phytozome.jgi.doe.gov/pz/portal.html#!info?alias=Org_Rcomm unis), genes of high confidence had a smaller ratio than those of the low‐confidence genes. Figure 2. Figure 2 [107]Open in a new tab Genome‐wide landscape of genetic variation in castor bean. The top ten scaffolds, in terms of length, are given with the distribution of variants below, including single nucleotide polymorphism (SNP), short insertion and deletion (indel), copy number variation (CNV), and structure variation (SV) Table 1. Summary of (A) single nucleotide polymorphism (SNP) and (B) short insertion and deletion (INDEL) differences between castor bean lines A. SNPs Comparison Total homozygous Total heterozygous Intergenic Genic 5' UTR Exon Nonsynonymous[108]a Synonymous Intron 3' UTR Wild versus Hale 722,358 150,120 656,820 65,538 530 21,125 12,591 8,534 42,844 1,039 ZB306 versus Wild 933,549 43,221 843,283 90,266 732 29,764 17,600 12,164 58,312 1,458 ZB107 versus Wild 197,732 21,237 170,382 27,350 209 9,243 5,687 3,556 17,465 433 ZB306 versus ZB107 964,020 43,045 872,906 91,114 735 30,029 17,714 12,315 58,872 1,478 ZB107 versus Hale 800,517 149,444 728,676 71,841 591 23,224 13,851 9,373 46,865 1,161 ZB306 versus Hale 439,919 79,091 384,933 54,986 462 17,951 10,626 7,325 35,675 898 Total 1,189,858 233,955 1,070,618 119,240 973 38,719 23,061 15,658 77,631 1,917 B. INDELs Comparison Insertions Deletions Intergenic Genic 5' UTR Coding Del Coding Ins Frameshift Intron 3' UTR Wild versus Hale 44,530 36,034 69,568 10,996 282 193 176 904 9,117 324 ZB306 versus Wild 38,805 41,633 68,213 12,225 338 288 286 530 10,420 363 ZB107 versus Wild 11,429 11,992 19,558 3,863 119 98 98 176 3,257 115 ZB306 versus ZB107 40,546 43,133 71,327 12,352 336 288 276 548 10,529 375 ZB107 versus Hale 50,413 41,706 80,203 11,916 295 224 209 949 9,874 365 ZB306 versus Hale 31,926 27,000 49,311 9,615 231 179 155 816 7,991 243 Total 74,638 67,238 122,588 19,288 500 380 368 1,279 16,199 562 [109]Open in a new tab UTR, untranslated region of mRNA. ^a Includes start/stop mutations. Table 2. Number and ratio of nonsynonymous (Non‐syn) and synonymous (Syn) sites in high/low‐confidence genes and genes with and without Pfam annotation Samples versus Hale High‐confidence genes Low‐confidence genes Pfam‐containing genes Genes without Pfam Total Non‐syn Syn Non‐syn/Syn Non‐syn Syn Non‐syn/Syn Non‐syn Syn Non‐syn/Syn Non‐syn Syn Non‐syn/Syn Non‐syn/Syn Wild 8,906 7,026 1.27 3,271 1,508 2.17 8,620 6,621 1.30 3,557 1,912 1.86 1.43 ZB107 9,732 7,717 1.26 3,653 1,656 2.21 9,407 7,258 1.30 3,978 2,115 1.88 1.43 ZB306 7,673 6,139 1.25 2,610 1,186 2.20 7,313 5,798 1.26 2,970 1,527 1.94 1.40 [110]Open in a new tab Additionally, we detected a total of 141,876 small indels (≤10 bp) (Table [111]1); the majority (86.4%) were located in intergenic regions, with 13.6% (19,288) in genic regions. This corresponds to an indel density of one indel per 2.26 kb. Of the indels in genic regions, 2,027 (10.5%) occurred in coding sequences, 1,062 (5.5%) in the UTRs, and 16,199 (84.0%) in introns. Potential frameshifts were predicted for 1,279 of the coding indels. The distribution of indel length in coding regions was significantly different from that in the intergenic regions (Mann–Whitney U tests: wild, z = −3.742; p = .000183; ZB107, z = −3.745; p = .000181; ZB306, z = −3.742; p = .000183) demonstrating enrichment for multiples of three, most likely the result of purifying selection against frameshifts (Figure [112]3a). Importantly, SNPs and indels identified from our previous RNA‐seq data (Xu et al., [113]2014) were greatly consistent with (92% accuracy rate) those identified in this study, indicating that these genetic variations (SNPs and indels) across the entire castor bean genome were reliable. Figure 3. Figure 3 [114]Open in a new tab Distribution of polymorphisms in the castor bean genome. (a) Number of indels of different length in the coding sequences (CDS) and the whole genome. (b) Venn diagram indicating the unique and shared SNPs and indels between castor bean lines. (c) Venn diagram indicating the unique and shared large‐effect mutations between castor bean lines. (d) Annotation of large‐effect SNPs. The numbers of large‐effect SNPs for selected groups of genes are displayed. All the gene families shown were significantly abundant in large‐effect SNPs (Fisher's exact test, p < .05). (e) Number of gain and loss CNVs (relative to the Hale genome) in the entire genome and in genic regions in each castor bean line. (f) Venn diagram indicating the unique and shared genes affected by CNV 3.4. Polymorphisms between castor bean lines Comparative genomic analysis revealed a considerable amount of genetic variation (SNPs and indels) within and between castor bean lines. For SNP density in castor bean, we found one SNP per 442 in the wild and one SNP per 399 bp in the landrace line, whereas SNP density was lower (one SNP per 727 bp) in the cultivated line. Comparing the cultivated lines (Hale and ZB306) to the wild line, we found 722,358 and 933,549 homozygous SNPs, and 80,564 and 80,438 indels, respectively, whereas only 197,732 homozygous SNPs and 23,421 indels were detected in landrace ZB107 relative to the wild line (Table [115]1). Similarly, we detected a large number of differences between ZB107 and Hale (800,517 SNPs and 92,119 indels) and between ZB107 and ZB306 (964,020 SNPs and 83,679 indels). While the number of differences between the two cultivars was lower, this was still relatively high (439,919 SNPs and 58,926 indels; Table [116]1). The number of heterozygous SNPs in the wild and landrace genomes (150,120 and 149,444, respectively) was substantially higher than that in ZB306 (79,091) when all were compared with the Hale genome (Table [117]1). Of the 1.19 M SNPs, 626,478 (52.7%) were accession‐specific (including Hale), a large number (527,590; 44.3%) were shared between the wild and landrace (ZB107) lines and a relatively small number were shared between the wild and cultivated (ZB306) lines (10,179; 0.9%) or the landrace and cultivated lines (25,341; 2.1%) (Figure [118]3b). A similar pattern was found with the indels, with 62.4% accession‐specific, 31.2% shared between the wild and landrace accessions, and only 4.4% shared between wild and ZB306 or between landrace and ZB306 (Figure [119]3b). 3.5. Deleterious mutations and gene content variation By projecting the SNPs and indels onto the Hale genome gene models, we further analyzed the distribution of so‐called large‐effect variants, which are predicted to have a potential effect on gene function. In total, we found 2,230 large‐effect variations, comprising 212 SNPs that were predicted to alter start codons, 552 SNPs that were expected to alter stop codons, 187 SNPs that were predicted to disrupt splicing donor or acceptor sites, and 1,279 indels that induced a frameshift mutation (Table [120]S7). There were 1,429, 1,535, and 1,239 large‐effect variants identified in the wild, ZB107, and ZB306 accessions, respectively, relative to the reference genome (Table [121]S7). A large number were shared between accessions, but we identified 165, 246, and 498 accession‐specific large‐effect variants in the wild, ZB107, and ZB306 accessions, respectively (Figure [122]3c). Again, the wild and landrace accessions shared a large proportion of the mutations (580/2,230; 26.0%), whereas only a small proportion of large‐effect variants were shared between wild and ZB306 or shared between ZB107 and ZB306 (89/2,230; 4.0%). These large‐effect variants were located in 1,904 genes, and of them, approximately 41% genes had no transcript support and 43% had no Pfam domain. This is consistent with their percentage in the genome (40% of genes had no transcript support and 39% had no Pfam domain), suggesting large‐effect mutations were no more or less common in the high‐confidence genes. These large‐effect SNPs are significantly (Fisher's exact test, p < .05) enriched in six Pfam families including leucine‐rich repeat regions, PPR repeat, and NB‐ARC domain families (Figure [123]3d). 3.6. Copy number variation Increasing evidence has shown that copy number variation (CNV) is prevalent in many eukaryotic species and makes a significant contribution to phenotypic variation in plants (Freeman et al., [124]2006; Muñoz‐Amatriaín et al., [125]2013). We identified 1,776 CNVs in at least one genotype relative to Hale, which summed to about ~6% of the castor bean genome. Of these CNVs, 1,129 were identified in the wild accession, 1,065 in ZB107, and 699 in ZB306 (Figure [126]3e; Table [127]S8). Nearly equal numbers of gain and loss (relative to Hale) CNVs were discovered in these accessions. We detected a total of 438 genes which may potentially be affected by CNVs (Figure [128]3e; Table [129]S8). Of these, 72, 74, and 87 CNV‐affecting genes were exclusive to wild, ZB107, and ZB306, respectively (i.e., 75.6% of CNVs were accession‐specific; Figure [130]3f). Gene Ontology (GO) enrichment analysis revealed that there was no significant enrichment for these CNV‐affected genes in the wild and ZB306 line (Figure [131]S1a). However, for the genes affected by wild‐specific CNVs, there were nine GO terms significantly enriched (FDR < 0.05, Figure [132]S1b) including oxidoreductase and phosphatase activity, and nucleoside binding. These CNV‐affected genes in the wild accession have the potential to contribute to environmental adaptation. For example, several genes are involved in disease resistance, such as GTP‐binding protein beta 1 (AGB1, 30190.m010890) and R gene proteins (29838.m001648, 29838.m001649, 29838.m001650, 30090.m000236, 30131.m007231, 29991.m000654, 29907.m000656, and 30158.m000501) (Table [133]S8). Interestingly, we found that a gene (60629.m00002) encoding a ricin protein had a greater copy number in the wild line relative to cultivated castor bean, though the exact ricin levels have not been surveyed in wild and cultivated lines yet. These results provide a resource for validating the hypothesis that some phenotypic variation in castor bean may be related to CNVs. 3.7. Structural variation We detected a large number of structural variants (SVs) in the castor bean genomes, including 6,016, 6,197, and 5,240 large (>50 bp) deletions and 2,595, 1,796, and 1,194 large insertions in the wild, ZB107, and ZB306 accessions, respectively, relative to Hale (Figure [134]4a). Although only a small number of SVs (inversions and intrachromosomal translocations) were identified in the castor bean individuals (Figure [135]4a), their average lengths were much greater than the large insertions and deletions, with an average size of 70 and 15 kb for inversions and translocations, respectively (Figure [136]4b). The majority of SVs (about 86%) occurred in intergenic regions with the remainder (14%) in genic regions. Due to large insertions and deletions within genes likely to have significant functional consequences, these genes were further analyzed (Table [137]S9). In total, we identified 2,443 genes containing 4,492 large insertions and deletions. Of these, 1,958 (80.1%) were accession‐specific and 334 (13.7%) differentiated the wild and landrace lines from the cultivars (ZB306 and Hale) (Figure [138]S2a). Although those genes overlapping with large indels were not significantly enriched for any Pfam categories (Fisher's exact test, p > .05), we did find several Pfam categories involved in plant defense (e.g., leucine‐rich repeat and PPR repeat) among the ten most common Pfam families (Figure [139]S2b). As an example, the gene 29842.m003667 (a cysteine‐rich RLK) was affected by an exonic deletion in the wild line relative to ZB306 and Hale (Figure [140]4c). Figure 4. Figure 4 [141]Open in a new tab Number and size of SV types. (a) The number of different types of SVs: deletion (DEL), insertion (INS), inversion (INV), and intrachromosomal translocations (ITX). (b) The size distribution of different types of SVs. (c) An example of a large deletion in the coding region of homogene 29842.m003667 in the wild line relative to cultivated castor bean. The blue tracks indicate reads mapping depth in wild and inbred ZB306 castor bean, respectively 3.8. Identification of candidate domestication genes underlying the phenotypic traits To identify genetic variation which might be responsible for the phenotypic differentiation between perennial woody tree and annual crop of castor bean, we identified a set of genes with nonsynonymous SNPs that differentiated the wild and landrace accessions (both perennial woody trees) from the two cultivars Hale and ZB306 (annual semi‐woody crop). Similarly, we also selected a set of genes with indels that were identified in both wild and landrace lines but not in two cultivars. This identified 4,072 genes differentiating the wild tree and domesticated crop, a subset of which could underlie the domestication phenotypes. Kyoto Encyclopedia of Genes and Genomes (KEGG) enrichment analysis for these genes revealed that they were significantly present in five categories (p‐value < .05): (a) “Plant–pathogen interaction” including many disease resistance protein and NBS‐LRR genes; (b) “Plant hormone signal transduction” including auxin, gibberellin (GA), abscisic acid (ABA), and ethylene‐responsive protein; (c) “Phenylpropanoid biosynthesis,” including lignin synthesis pathway, for example, the hydroxycinnamoyl‐CoA shikimate/quinate hydroxycinnamoyl transferase (HCT), cinnamyl‐alcohol dehydrogenase (CAD), cinnamoyl‐CoA reductase (CCR), cinnamate 4‐monooxygenase (C4H), and 4‐coumarate‐‐CoA ligase 1‐like (4Cl); (d) “Carbon metabolism” including the citrate cycle, glycolysis/gluconeogenesis, and carbon fixation; and (e) “Biosynthesis of amino acids” (Table [142]S10). To further identify genes with evolutionary signatures of selection (i.e., Ka/Ks > 1) in the castor bean genome, we compared coding regions between the wild and Hale genomes to identify candidate domestication genes and between ZB306 and Hale to identify candidate diversification genes. As expected, the majority of genes exhibited Ka/Ks ≤ 1, while there were 238 and 202 candidate domestication and diversification genes, respectively, with Ka/Ks > 1 (Figure [143]5a; Table [144]S11). KEGG enrichment analysis for these genes revealed that they were mainly present in the pathways mentioned above. We found that six genes encode an LRR receptor‐like serine/threonine‐protein kinase, and nine genes encoding disease‐related proteins including many nucleotide‐binding site–leucine‐rich repeat genes (NBS‐LRR). In phenylpropanoid biosynthesis, there were six genes, including two encoding HCT, three encoding benzyl alcohol O‐benzoyltransferase, and one encoding CCR. Subsequently, we performed polymerase chain reaction (PCR) sequencing of the entire coding sequence of 28617.m000209 (encoding HCT) for the same six cultivars and six wild lines (Table [145]S3), due to its key function in lignin biosynthesis of Arabidopsis (Gallego‐Giraldo, Escamilla‐Trevino, Jackson, & Dixon, [146]2011). This revealed that the three nonsynonymous SNPs that differentiate the wild and Hale genomes were fixed within each population (Figure [147]5b). Additionally, several genes putatively under selection during domestication were found to participate in plant hormone signal transduction and photosynthesis (Table [148]S11). For example, four of these putatively encode a DELLA protein, cytokinin receptor (CRE1), auxin response factor (ARF2), and xyloglucan:xyloglucosyl transferase (XTH25), all of which are involved in plant hormone recognition or signal transduction. In addition, two genes putatively encode homologs of PsaA/PsaB protein and photosystem II reaction center protein B, which may function as a lighter receptor to promote photosynthesis. Figure 5. Figure 5 [149]Open in a new tab Selection pressure and genetic diversity analysis in castor bean. (a) Synonymous (Ks) and nonsynonymous (Ka) substitution rate between wild and cultivated (Hale) castor bean and between cultivated varieties (ZB306 vs. Hale). (b) Distribution of nonsynonymous sites within the candidate positively selected gene 28617.m000209. (c) Schematic representation of fatty acids and TAG biosynthesis. (d) Watterson's θ of oil‐related genes within wild and cultivated castor bean population Divergence between the cultivars ZB306 and Hale was also analyzed due to their phenotypic variations and two distinctly different purpose of domestication. One (ZB306) form of selection was for better seed oil content and yield, and the other (Hale) was for lower plant height and dwarf. KEGG pathway enrichment analysis revealed that pathways related to unsaturated fatty acids and photosynthesis were overrepresented, which may contribute to differences in seed oil composition and yield between the cultivars. 3.9. Nucleotide variation and domestication at the oil‐related gene loci Seed oil content is one of the most important economic traits in castor bean, and we found significant difference in oil content between wild and cultivated lines (p‐value < .01, see Table [150]S3). However, we did not find evidence that oil‐related genes showed nonsynonymous differences between the wild and cultivated types or evidence for selection. One potential reason is that the only one wild and one cultivated castor bean was considered, but recent selection may be detectable as a selective sweep without the need for a high Ka/Ks ratio. Thus, we extended our research to analyze population‐level polymorphism and divergence using six wild lines and six cultivated castor bean lines as above (Table [151]S3). We selected 16 key genes mainly involving into the fatty acids biosynthesis pathway (including GPDH: glycerol‐3‐phosphate dehydrogenase, ACCase: acetyl CoA carboxylase, KASI: 3‐ketoacyl‐ACP synthase I, KASII: 3‐ketoacyl‐ACP synthase II, and KASIII: 3‐ketoacyl‐ACP synthase III), the fatty acid chain modification processes (such as FAH12: oleate 12‐hydroxylase and FAD3: omega‐3 fatty acid desaturase), and the assembly pathway of triacylglycerol (including GPA: glycerol‐3‐phosphate acyltransferase, LPA: lysophosphatidic acid acyltransferase, PAP: phosphatidate phosphatase, PDCT: phosphatidylcholine:diacylglycerol choline phosphotransferase, CPT: diacylglycerol cholinephosphotransferase, DGAT1: type 1 diacylglycerol acyltransferase, DGAT2: type 2 diacylglycerol acyltransferase, PDAT2: phosphatidylcholine:diacylglycerol acyltransferase 2, and oleosin2) (Figure [152]5c, the full name listed in Table [153]S2), and sequenced these from the panel and analyzed their sequence diversity. The regions analyzed covered a total of 12.48 kb, corresponding to 5.54 kb of coding sequence and 6.94 kb of noncoding regions (including intron and 3' UTR). We found that seven genes (ACCase, KASIII, GPA, FAH12, PDAT2, DGAT2, and PDCT) harbored no nucleotide polymorphism in the tested regions, whereas for the remaining nine genes (GPDH, KASI, KASII, LPA, PAP, CPT, FAD3, DGAT1, and oleosin2) we uncovered 38 SNPs and 40 indels across all accessions (Table [154]S12). Two of these (PAP and DGAT1) contained no SNPs in the coding regions, and none of the 38 SNPs were nonsynonymous. Nucleotide diversity for these nine loci is shown in Table [155]S12. We compared genetic variation at these nine genes between wild and domesticated castor bean lines to that found in five neutral genes (Figure [156]5d). On average, a reduction in sequence diversity (estimated here as Watterson's θ) was found in the cultivars relative to the wild population for both neutral genes and candidate genes (Figure [157]5d; Table [158]3); however, this value was quite variable in the oil‐related genes. For four genes (DGAT1, KASI, LPA, and PAP), we found reduced nucleotide diversity in the cultivars in both the coding and noncoding regions. This was significant for KASI and LPA in the maximum‐likelihood Hudson–Kreitman–Aguade (ML‐HKA) test (p < .05). For the remaining genes, we observed similar or slightly higher nucleotide diversity in the cultivars, relative to the wild lines (Figure [159]5d). For FAD3, this slight increase was observed in the coding region, whereas for oleosin2 this was in the intronic region. For oleosin2, the greater polymorphism in the cultivars was significant in the ML‐HKA test (p < .05; Table [160]3). It appears therefore that the oil‐related genes have experienced a range of selection pressures, in some cases being swept of diversity during domestication and in others diversity has been maintained during crop improvement. Table 3. Genetic diversity (Watterson's θ) for five neutral genes (N) and nine putative domestication genes (D) sampled from wild and cultivated castor bean populations Type Locus Annotation Watterson's θ ML‐HKA p‐values Wild Cultivar Wild Cultivar N 27894.m000793 Conserved hypothetical protein 0.0025 0.0019 .1590 .2168 N 30078.m002350 Conserved hypothetical protein 0.0014 0.0013 .2160 .3113 N 30042.m000473 Polygalacturonase precursor 0.0097 0.0097 .0716 .0799 N 29790.m000826 Conserved hypothetical protein 0.0032 0.0029 .3001 .2506 N 30170.m013951 Conserved hypothetical protein 0.0025 0.0025 .1607 .1178 D 30138.m003845 Diacylglycerol cholinephosphotransferase (CPT) 0.0019 0.0019 .1142 .1145 D 29912.m005373 Type 1 diacylglycerol acyltransferase (DGAT1) 0.0019 0.0005 .5231 .1082 D 29681.m001360 Omega‐3 fatty acid desaturase (FAD3) 0.0007 0.0008 .1510 .1414 D 29745.m000373 Glycerol‐3‐phosphate dehydrogenase (GPDH) 0.0016 0.0016 .1857 .1068 D 29693.m002034 3‐Ketoacyl‐ACP synthase I (KAS I) 0.0038 0.0024 .3970 .0349* D 29739.m003711 3‐Ketoacyl‐ACP synthase II (KAS II) 0.0019 0.0020 .0538 .1042 D 27810.m000646 Lysophosphatidic acid acyltransferase (LPA) 0.0008 0.0000 .3786 .0196* D 29660.m000760 Phosphatidate phosphatase (PAP) 0.0004 0.0000 .1278 .0701 D 30147.m014333 Oleosin2 (Ole2) 0.0014 0.0018 .1444 .0228* [161]Open in a new tab p‐value of candidate gene was obtained from a maximum‐likelihood Hudson–Kreitman–Aguade (ML‐HKA) test against five neutral genes. Asterisks represent a significant ML‐HKA result (p < .05). 4. DISCUSSION 4.1. Population structure analysis of castor bean Castor bean has been agriculturally cultivated as an annual semi‐woody oilseed crop in various regions of the world. Owing to the scarcity of wild castor bean samples for analysis in previous studies, the understanding of the domestication origin and evolution remains unknown. In this study, we remedied this by collecting and analyzing seven wild castor bean accessions from eastern Africa and investigated the population structure of wild and cultivated castor bean from throughout its global distribution. We revealed that wild lines form a distinct group from cultivated accessions, strongly implying that eastern Africa is likely the center of origin and that genetic divergence between the African and cultivated accessions has taken place. The cultivated castor bean accessions worldwide were divided into two clades with some, but not complete, distinction based on geography that was consistent with previous studies (Foster et al., [162]2010; Qiu et al., [163]2010; Rivarola et al., [164]2011). This genetic structure could be due to a small number of wild plants being introduced and domesticated, alongside extensive subsequent international movement of domesticated castor bean (Foster et al., [165]2010). 4.2. Genetic diversity in castor bean genome Whole‐genome sequencing for representative lines (one wild, one landrace, and one cultivar) reveals a large number of genetic variants including SNPs, indels, CNVs, and SVs within castor bean genome. SNP density across our four castor bean genomes (3.72 per kb) was greater than that reported in woody trees such as in bamboo (1.0 per kb, Peng et al., [166]2013), peach (1.5 per kb, Verde et al., [167]2013), and poplar (2.6 per kb, Tuskan et al., [168]2006), while the density was low relative to grape (4.2 per kb, Velasco et al., [169]2007). Also, we found a disparity in the Non‐syn/Syn ratio in coding regions (Table [170]2) between the high‐confidence genes (having transcript support) and the low‐confidence genes (having no transcript support), which could suggest relaxed selection in the latter. Because there was no transcript support, this could mean that they are lowly expressed, or expressed in tissues not previously reported (Brown et al., [171]2012; Xu et al., [172]2013), or alternatively these predicted genes are in fact not true genes and the increased Non‐syn/Syn ratio was elevated because a subset of these loci are noncoding regions of the genome. Taken together, these results suggest that a large amount of genetic variation is present in the castor bean genome, unlike previous reports where low genetic diversity was revealed (Foster et al., [173]2010; Qiu et al., [174]2010; Rivarola et al., [175]2011), and that so far wild germplasm is largely untapped in castor bean. 4.3. Large‐effect mutations in castor bean genome We found that several genes may be affected by large‐effect mutations including SVs, and these were enriched in disease/stress‐related gene families such as NB‐ARC and leucine‐rich repeat domains, consistent with previous reports of these categories of genes harboring greater polymorphism than other categories of genes (Clark et al., [176]2007; Lai et al., [177]2010; McNally et al., [178]2009; Zheng et al., [179]2011). Because of their role in pathogen response, it is not unexpected that these proteins are among the most variable. Our results also uncovered substantial numbers of indels, and those that are multiples of 3 bp were significantly enriched in coding regions relative to noncoding regions, presumably because nontriplet indels in coding regions interrupt the reading frame and are deleterious. CNVs usually affected the individual phenotype by altering gene expression dosage (Stranger et al., [180]2007). We identified thousands of CNVs in castor bean genome, and genes that were affected by wild‐specific CNVs were significantly enriched in GO terms related to disease resistance such as GTP‐binding protein beta 1 gene and R genes, which could confer strong pest resistance in the wild. 4.4. Genetic variation associated with the transition from a wild perennial tree to a cultivated annual crop Comparative genomic analyses revealed a considerable amount of genomic variation (SNPs and indels) between wild and cultivars (ZB107 and Hale) (Table [181]1), highlighting the genetic divergence between wild and cultivars. Additionally, there was a relatively high density of SNPs and high percentage of heterozygous SNPs in the wild genome relative to cultivated castor bean, supporting the hypothesis that cultivated castor bean is most likely introduced from a small number of wild progenitors and has undergone a selective sweep limiting the genetic variation during domestication. Meanwhile, during domestication the growth habit of this plant was substantially changed from perennial woody trees to annual semi‐woody cultivars. Our results showed that genes strongly differentiated between tree and crop were significantly enriched in the process of “Plant–pathogen interaction,” consistent with their phenotype difference in disease resistance. Moreover, nine NBS‐LRR genes in this pathway showed strong positive selection. This may more likely be due to the different growth condition: Cultivated crops usually grow under milder biotic stresses than wild trees in natural environments. Similarly, studies on perennial woody plant genome have revealed overrepresentation of genes responsive to biotic stress, especially genes in the NBS‐LRR family (Neale et al., [182]2017). Alternatively, it is possible that the cultivated castor bean genomes experienced bottleneck or intense inbreed during domestication, resulting in disease‐related genes differentiation between wild and cultivated castor bean. Woody growth is another typical characteristic of trees, exemplified by wild castor bean. A large number of genes identified as putatively under diversifying selection are involved in lignin synthesis. Among them, two HCT genes in the phenylpropanoid biosynthesis pathway exhibited evolutionary signatures in castor bean genome, and previous evidence demonstrates that HCT plays a central role in lignin biosynthesis in Arabidopsis (Gallego‐Giraldo et al., [183]2011); these represent important candidates to follow up in relation to adaptive trait evolution during domestication. Comparative genome analyses have revealed the remarkable differentiation in lignin synthetic genes between woody and herbaceous plants, suggesting that these genes might be associated with the difference in woody hardness between woody and herbaceous plants (Neale et al., [184]2017), and this deserves further study. Additionally, we found that many of differentiated genes between wild and cultivated castor bean were enriched in the pathways such as “plant hormone signal transduction” and “carbon metabolism,” and some genes such as CRE1, ARF2, and XTH25 were identified as candidate domestication genes. It is still unclear whether these genes were associated directly or indirectly with continuous growth and flowering in wild castor bean tree. In short, these identified genes involved in certain pathways likely play a critical role in genetically distinguishing trees from nontree plants (Neale et al., [185]2017), providing new insights into understanding the unique features, complexity, and domestication of tree genomes. It should be noted that more wild and cultivated lines should be selected for genome resequencing to better understand the origin and domestication process of castor bean tree genomes (Kantar, Nashoba, Anderson, Blackman, & Rieseberg, [186]2017). 4.5. Potential selection of oil‐related genes during domestication We did not see significant sequence differentiation or signatures of selection in oil‐related genes between wild and cultivated castor bean lines, though they have significant difference in seed oil content. Using population genetic ML‐HKA tests, we only found evidence for a significant difference in diversity in three (KASI, LPA, and oleosin2) of the 16 genes (seven were monomorphic and could not be tested in the ML‐HKA). The signature of selection found for the genes KASI and LPA represents a significant drop in diversity during domestication and may be due to human selection and result in the altered oil content of cultivated castor bean. Interestingly, we found a significantly increased diversity of oleosin2 gene in cultivated lines, though the biological relevance and molecular basis remain unknown. Our results provide clear statistical evidence that these three genes exhibited significantly departure from neutrality, but it should be noted that these genes are not single copy in the castor bean genome (Chan et al., [187]2010) and their contributions to oil production need to be inquired by further genetic analysis. A broader sampling of cultivated castor bean may indeed reveal further candidate genes under selection between groups with different oil content. Key genes for castor bean oil production include ACCase, GPA (Brown et al., [188]2012), FAH12 (Smith, Moon, Chowrira, & Kunst, [189]2003; Van de Loo, Broun, Turner, & Somerville, [190]1995), DGAT (Burgal et al., [191]2008; He, Turner, Chen, Lin, & McKeon, [192]2004; Kroon, Wei, Simon, & Slabas, [193]2006), and PDAT (Kim et al., [194]2011); however, these exhibited no variation in our sample of 12 accessions and hence may be under strong purifying selection in both the wild and cultivated gene pools. In sunflower, another oilseed crop bred for its oil content, the ML‐HKA test was used to test for selection during domestication on genes involved in fatty acid biosynthesis and identified a number of candidate genes having exhibited selective sweeps (Chapman & Burke, [195]2012). In addition, it should be noted that we only placed considerable focus in finding intragenic sequence polymorphisms and relatively little focus on polymorphisms in the promoter region that may be regulatory in nature and affecting gene expression levels. Genome‐wide association studies employing more broad samples will be necessary to identify the domestication loci related to seed oil content in castor bean (Kantar et al., [196]2017). Overall, we have revealed a rich amount of genetic variation in the castor bean genome and have identified many candidate genes and key pathways potentially involved in the transition from a tall and perennial woody tree to a dwarf and annual semi‐woody plant. ML‐HKA tests showed only three oil‐related genes with the evidence of selection during castor bean domestication. In particular, we found genetic differentiation between the wild and cultivated accessions, suggesting that castor bean has undergone a genetic bottleneck during domestication. These results largely extend our understanding of the domestication origin, genomic variation, and evolution of castor bean. The genomic variation identified within wild and cultivated castor bean will provide a valuable resource for future gene–trait associations and as tools for castor bean breeding. CONFLICT OF INTEREST The authors declare no conflict of interest associated with the work described in this manuscript. AUTHOR CONTRIBUTIONS AL and WX conceived the project. TY and LQ contributed to the analysis of population structure between wild and cultivated castor bean and the development of EST‐SSR markers. WX, DZL, and AL collected the samples and investigated the phenotypic variation among castor bean accessions. WX and MAC analyzed all the data, and WX prepared figures and tables. WX, MAC, and AL wrote the paper. All authors read and approved the final manuscript. Supporting information [197]Click here for additional data file.^ (264.1KB, pdf) [198]Click here for additional data file.^ (136.6KB, pdf) [199]Click here for additional data file.^ (13.8KB, xlsx) [200]Click here for additional data file.^ (12KB, xlsx) [201]Click here for additional data file.^ (27.5KB, docx) [202]Click here for additional data file.^ (32.1KB, xlsx) [203]Click here for additional data file.^ (27KB, docx) [204]Click here for additional data file.^ (19.8KB, docx) [205]Click here for additional data file.^ (20.8KB, docx) [206]Click here for additional data file.^ (68.2KB, xlsx) [207]Click here for additional data file.^ (635.1KB, xlsx) [208]Click here for additional data file.^ (17.2KB, xlsx) [209]Click here for additional data file.^ (42.5KB, xlsx) [210]Click here for additional data file.^ (22.9KB, docx) ACKNOWLEDGMENTS