Abstract Background Copy number variations (CNVs) are crucial in plant evolution, adaptation, and domestication. In this study, we explored how CNVs contribute to genetic diversity, evolution, and adaptation during apple domestication. We examined the genome-wide CNV profiles and segmental duplications (SDs) in 116 Malus accessions, including domesticated apple (Malus domestica) and its primary progenitor species (M. sieversii and M. sylvestris). Results On average, two accessions of the same species showed differences in at least 7,000 genes with varying copy number (CN) profiles. In contrast, accessions from different species had at least 20,000 genes with differing CN profiles. Notably, 700 genes exhibited distinct CN profiles between M. domestica and M. sieversii, with an enrichment in defense response genes. Genes related to fruit ripening, flavor, and anthocyanin biosynthesis had higher copy numbers in M. domestica. Additionally, 360 genes showed differential CN profiles between M. domestica and M. sylvestris, with enrichment in polygalacturonase activity, which may influence differences in fruit flavor. The study also identified 3,000 genes with significant CN differentiation (V[ST] > 0.28) between M. domestica rootstock and scion cultivars enriched in lignin metabolic pathways, underscoring their role in stress resistance and mechanical support. Segmental duplications were particularly enriched in genes related to sorbitol metabolism, fruit development, and fruit quality traits, highlighting their evolutionary importance in defining apple morphology and physiology. Conclusions These findings offer valuable insights into the evolutionary mechanisms driving apple domestication and adaptation and provide a comprehensive resource for future research and apple breeding. Supplementary Information The online version contains supplementary material available at 10.1186/s12864-025-11677-9. Keywords: Copy number variations (CNVs), Segmental duplications, Apples, Population structure, Selection signatures Introduction Genome sequence diversity ranges from single nucleotide polymorphism (SNPs) to whole chromosomal changes. Structural Variations (SVs) refer to alterations affecting more than 50 nucleotides up to several megabases. These alterations include deletions, insertions, duplications, inversions, and translocations [[28]1, [29]2]. SVs can be either balanced or unbalanced genome rearrangements. Copy Number Variants (CNVs) are unbalanced structural variants resulting from the insertion, loss (deletion), or gain (duplication) of bases ranging in size from 50 bp to several kilobases [[30]3, [31]4]. SVs impact a larger fraction of bases in the genome compared to SNPs or small variations (indels and microsatellites) and are more likely to cause large-scale genomic perturbations, contributing significantly to phenotypic variation [[32]5]. In the past, array comparative genomic hybridization (aCGH), fluorescent in situ hybridization (FISH), and quantitative PCR (qPCR) were the primary methods used to detect CNVs [[33]6, [34]7]. Whole-genome CNV profiling was primarily conducted using the aCGH array platform. This method relies on fluorescence in situ hybridization to measure copy numbers based on the intensity of fluorescence ratios resulting from the hybridization of differently labeled target and reference DNAs. However, the lower detection limit of CNVs of lengths approximately 5 kb to 25 kb, along with low resolution and sensitivity, limited their utility [[35]8, [36]9]. Advancements in high-throughput sequencing (HTS) technologies have led to the development of different algorithms based on read-pair (RP), split-read (SR), read depth (RD), and de novo assembly (AS) to identify CNVs from HTS data [[37]2]. Read depth (RD)-based tools analyze the average coverage of sequencing reads aligning across specific genomic positions and are particularly helpful in characterizing large (> 10 kb) duplications and deletions [[38]4]. CNVs in plants can have their origins in both ancient and recent genome duplication and polyploidization events, which result in extensive genome rearrangements [[39]10–[40]12]. Changes in SV architecture in different plant genomes have been implicated in generating new genotypes with novel traits. SVs can modify gene expression by altering the gene copy number or the architecture of cis-regulatory regions of the gene [[41]13]. They can act as potent phenotypic modifiers by unmasking recessive alleles, affecting gene dosage sensitivity, or causing truncation or fusion of genes [[42]14, [43]15]. Gene SVs are the most extensively studied type of SVs due to their direct impact on phenotypes. Large-scale segmental chromosome exchanges can cause gene presence-absence variations and CNVs [[44]15]. Genic CNVs (g-CNVs), which are CNVs involving gene sequences, have been shown to influence both disease susceptibility and normal phenotypic variations in humans, and they often map to duplicated sequences [[45]16]. In plants, numerous gene families, including MADS-box transcription factor genes, exhibit evidence of duplication through whole genome duplication (WGD) events in their evolutionary history [[46]17–[47]20]. Natural and artificial selection impose selective pressure on genomes, altering the genomic landscape of crops during adaptation and domestication [[48]21, [49]22]. These evolutionary forces can influence patterns of gene copy numbers, akin to other forms of polymorphism, such as SNPs and INDELs, by exerting selective pressures on genes, thereby leaving traces of adaptation on the genes [[50]13]. For instance, ethnic-specific variability in the copy numbers of salivary and pancreatic amylase genes, ranging from 2 to 18, reflects human biological adaptation to low-starch or high-starch diets [[51]23]. Similarly, species-specific shifts in dietary requirements between black bears and polar bears (from omnivorous to primarily carnivorous) are marked by changes in copy number (CN) of several genes involved in fatty acid metabolism and starch digestion, as well as alterations in CN of olfactory receptors, reflecting the biological and ecological adaptations of polar bears to polar regions [[52]24]. Over the past two decades, CNVs have been widely recognized as playing a key role in the adaptation and evolution of humans, cattle, turkeys, and apes, among others [[53]25–[54]28]. CNVs are crucial factors of variation and evolution in various crop species such as rice, tomato, and Arabidopsis [[55]13]. Divergence in genic copy number variation (g-CNVs) between species, leading to adaptive evolution, has been observed in plant species like Amaranthus palmeri and Arabidopsis halleri. An increased copy number of the EPSPS gene in the weed A. palmeri confers resistance to the commonly used herbicide glyphosate [[56]29]. Similarly, metal hyperaccumulation resulting from an expansion in the gene copy number of metal homeostasis genes provides a defense against biotic stress in A. halleri [[57]30]. The domestication history of the cultivated apple (M. domestica Borkh., family Rosaceae, tribe Pyreae), one of the most economically important fruit crops, began in the Tian Shan mountains of Central Asia about 4000—10,000 years ago [[58]31]. Molecular evidence suggests that the domesticated apple originated from a wild progenitor, M. sieversii, in Central Asia, followed by its dispersal to Central Asia and Europe along the Silk Road. Subsequent hybridization with small, sour wild apples from Europe, M. sylvestris Mill., contributed to the development of modern-day apples [[59]31–[60]34]. Additionally, there are nearly 25–47 different wild Malus species native to various regions, including North America (M. ioensis, M. angustifolia, M. fusca, and M. coronaria), Europe (M. sylvestris Mill.) and Asia (M. baccata and M. hupehensis) [[61]17]. Previously, a detailed model of apple speciation and domestication was proposed based on SNP data [[62]33]. CNVs could provide another valuable resource beyond traditional microsatellite and SNP variations for pursuing genomic studies [[63]35, [64]36]. In an initial attempt to study CNVs in the apple genome, Boocock et al. [[65]37] identified 876 CNV regions across 30 accessions of M. domestica. These regions spanned 3.5% of the apple genome and were notably enriched for genes involved in resistance against pathogens. Subsequently, another study identified CNVs from 346 diverse apple accessions, including 289 cultivars and 57 wild relatives, accounting for 10.03% of the apple genome. CNVs accurately reflected population structure and distinguished wild relatives from cultivars. The CNVs were found in genes involved in biological processes such as defense response, reproduction, and metabolic processes. It was suggested that CNVs are widely distributed throughout the apple genome, highlighting the importance of studying this form of variation [[66]38]. While the identification of CNVs is important, it is imperative to identify the changes in the integer copy number of genes, their associated phenotypic effects, and how evolutionary forces have influenced these changes over time [[67]36]. Here, we performed a sequence read depth-based whole genome CNV and segmental duplication (SD) detection and generated genome-wide gene CN and SD profiles across 116 Malus accessions with whole-genome short read sequences. We have assessed the genetic relationships between M. domestica and its wild progenitors (M. sylvestris and M. sieversii), between M. domestica rootstock and scion cultivars, and among other wild Malus species based on CNVs and SDs. This study provides a comprehensive insight into gene CNVs and SDs and their potential role in genetic diversity, domestication, and adaptation of apples. Material and methods Read depth-based CNV detection A publicly available whole-genome short-read re-sequencing dataset comprising 116 Malus accessions [[68]33] was utilized for CNV discovery. These genome sequences encompass 24 species, including 34 accessions of the domesticated apple (23 scion cultivars and 11 rootstock genotypes), 29 accessions of its wild progenitor M. sieversii, and ten accessions of M. sylvestris. The remaining 21 species include M. robusta (9 accessions), M. baccata (6 accessions), M. asiatica (4 accessions), M. hupehensis (4 accessions), and others (Supplementary Table 1). The data were retrieved from the NCBI Sequence Read Archive (SRA) database with accession number SRP075497. Based on read-depth information, the mrCaNaVaR algorithm was used to predict integer copy numbers for genes and large segmental deletions and duplications (> 10 Kb). In this study, we followed the method of detection previously described [[69]4]. The apple reference genome GDDH13 [[70]39] was downloaded from the Genome Database for Rosaceae (GDR) ([71]https://www.rosaceae.org/species/malus/malus_x_domestica/genome_GD DH13_v1.1) and subsequently masked for common and tandem repeats using RepeatMasker [[72]40] and Tandem Repeats Finder tools [[73]41]. Illumina reads were mapped to the repeat-masked apple reference genome GDDH13 using the mrsFAST aligner [[74]42] and the mismatch rate was limited to 5% of the length of the read. The reads were also mapped to the unmasked reference genome for quality control using BWA-MEM, which returned a mapping rate of 98.1%. mrCaNaVaR identifies segmental duplications and deletions based on excess (reduced for deletions) depth of coverage. It first calculates read depth distribution in sliding windows of size 5 kb. Since this distribution will include “tails” in low and high read depth due to CNVs, mrCaNaVaR finds the distribution in non-CNV areas by fitting the data into the Poisson distribution. It then proceeds to correct for GC% bias found in Illumina sequencing data using the LOESS smoothing, and identifies expected average read depth and standard deviation. Finally, mrCaNaVaR identifies those regions with high (> average + 4 stdev) as duplications and low (< average-4 stdev) as deletions. It then applies the same procedure to shorter windows of size 1 kb, to refine estimated CNV breakpoints. Genome-wide absolute copy numbers were determined using a similar approach on non-overlapping windows of size 1 kb calculated as the ratio of observed read depth to average read depth. In short, this method counts sequence reads and addresses the GC bias of the Illumina platform using Locally Weighted Scatter-plot Smoother (LOESS) to achieve uniform coverage, uses a sliding window approach for the identification of SD and deletions, and finally, calculates the genome-wide absolute CN as the ratio of observed read depth to average read depth as outlined by [[75]43]. A CNVs map of six major Malus species was generated using the Circos software package [[76]44]. Segmental deletion and duplication maps SDs were defined as genomic DNA blocks with over 90% similarity and lengths greater than one kb [[77]45]. To identify potential hotspots for SDs in apples, duplications and deletions were analyzed across 116 Malus accessions. It should be noted that the term'deletion'here refers to a segment of DNA present in the reference genome GDDH13 [[78]39] but missing in the accession with a deletion. The SDs and deletions were classified into two categories: shared (common to multiple accessions) and unique (specific to a single accession). The analysis was conducted using R version 4.3.2. It was noted that some SDs and deletions had overlapping regions at their start or end sites across different accessions, suggesting that the same genomic segments were duplicated or deleted in multiple genotypes, albeit with slight variations in exact coordinates between accessions. To count those deletions and SDs as single events, GRanges objects were created for SDs and deletions for all 116 apple accessions using the GenomicRanges package [[79]46] in R version 4.3.2. Overlapping genomic ranges were merged using the reduce function. Accessions within the current genomic range were identified, and SDs or deletions were classified as unique or shared based on the number of accessions. The distribution of SDs and deletions across chromosomes was visualized using the ggplot2 function [[80]47] in R v4.3.2. Histograms and bar plots were created to show the number of deletions or SDs per chromosome, with color scales applied using the viridis package in R version 4.3.2 [[81]48]. The gene model data from GDDH13 [[82]39] available on the GDR (M. domestica GDDH13 v1.1 Whole Genome Assembly & Annotation | GDR (rosaceae.org)) were imported using the rtracklayer package [[83]49] in R version 4.3.2. These data were used to identify overlaps between genes and genomic deletions and duplications. Overlaps between deletions and genes were calculated, and the percentage of gene overlap by deletions was determined. The analysis, conducted in R version 4.3.2 using a customized script, allowed for parsing, comparison, gene overlap analysis, and visualization of segmental duplication and deletion events across different Malus accessions, ensuring accurate identification of unique and shared genomic alterations. CNV-based population genetic properties of Malus CN-based population clustering Hierarchical clustering was performed on gene copy numbers (CN) to evaluate clustering based on the integer copy number of genes, using the hclust function in R version 4.3.2. CN counts were first scaled and centered before computation, and a Euclidean distance matrix was used for cluster analysis. The resulting hierarchical clustering data was used to create a phylogenetic tree using the as.phylo function of the ape package [[84]50] in R version 4.3.2. To assess the robustness of the clustering, bootstrapping was performed with 1000 replicates using the boot.phylo function in R version 4.3.2. This approach generated bootstrap support values for the branches, which were subsequently added to the phylogenetic tree as node labels. The phylogenetic tree, including bootstrap values, was exported in newick format and further visualized using iTOL (Interactive Tree Of Life) [[85]51]. Multidimensional Scaling (MDS) was performed using the cmd scale function in R version 4.3.2. Population structure analysis A model-based Bayesian clustering software, STRUCTURE (v2.3.4), was used to evaluate the CN-based genome-wide admixture pattern of apple populations [[86]52, [87]53]. Each admixture analysis was conducted with 5000 burn-ins and 10,000 MCMC (Markov chain Monte Carlo) iterations under admixture and allele frequencies correlated models for each K value (K = 2–10); CLUMPP v. 1.1.2 [[88]54], and Distruct v. 1.1 [[89]55] was used to infer the best K value. For STRUCTURE analysis, CNs were recoded as CN = 1 for heterozygous deletions, CN = 2 for neutral events (homozygous for no change relative to the reference genome), CN = 3 for heterozygous duplications, and CN = 4 for homozygous duplication events. Note: CN profiles are relative to the reference genome. Estimation of group-specific divergent gene copy numbers To assess the role of divergent CN in the adaptation and evolution of Malus, selection signature analyses to assess the divergence between groups were performed within five groups: 1) M. domestica scion cultivars compared to rootstock genotypes; 2) M. sieversii accessions from Kazakhstan compared to M. sieversii accessions from Xinjiang, China; 3) M. domestica compared to M. sieversii; 4) M. domestica compared to M. sylvestris; 5) Wild Malus species compared to M. domestica, M. sieversii, and M. sylvestris. V[ST] values were calculated to measure the population genetic differentiation across all genes using CN profiles of all individuals within the group. V[ST] is analogous to F[ST] for multiallelic genotype data such as CN variants and microsatellites. It measures population differentiation and considers how genetic variation can partition groups (populations or sub-populations). The values range from 0 (no differentiation between groups) to 1 (complete differentiation between groups) [[90]24]. [MATH: VST=VtotalVgrou p1×Ng roup1+Vgrou< mi>p2×Ng< /mi>roup2/NtotalVtotal :MATH] Where V[total] is the total variance in normalized copy numbers among all unrelated individuals; V[group1] and V[group2] are the CN variance for two groups or populations, respectively; N[group1] and N[group2] is the sample size for the two groups or populations, respectively; and N[total] is the total sample size. Permutation tests on the CN counts were conducted to identify genes exhibiting strong interspecific CN variation while minimizing the impact of sampling bias. All group 1 and 2 individuals were randomly reassigned, and the V[ST] for each gene was recalculated. The permutation was repeated 1,000 times, generating a distribution of V[ST]values for each gene. Genes with observed V[ST] values exceeding the 95 th and 99 th percentiles of the permuted V[ST] distribution were selected. These genes showed strong intraspecific CN homogeneity along with high interspecific differentiation. Finally, for subsequent analysis, genome-wide standard cutoffs were determined based on the maximum permuted V[ST] values observed in the 95 th and 99 th percentiles of all genes. Gene V[ST] values were deemed significant if they surpassed the maximum 99% confidence interval cutoff [[91]24]. The analysis was conducted using a customized script in R version 4.3.2. Ontology enrichment analysis for genes showing CN variability Gene enrichment analysis of CN-variable genes was performed using the Singular Enrichment Analysis (SEA) function of the agriGO v2.0 online platform with default settings [[92]56]. Gene ontology (GO) terms were deemed significant based on Fisher's exact test with FDR correction, with significance defined as an adjusted p-value less than 0.05. Subsequently, the resulting information was utilized to generate pathway enrichment plots using the ggplot function in R version 4.3.2. Results CNV detection A total of 1,060 Gb of high-quality whole genome short read resequencing data was retrieved from 116 Malus accessions, averaging 9.06 Gb per accession, representing 12.2 × coverage of the Malus genome. CNV calls for all ~ 46,000 genes across 116 Malus accessions, particularly for six main species with sample sizes above 10, showed highly divergent CN profiles (Fig. [93]1a; Supplementary Table 2). The most common CN was a neutral event, homozygous for no change relative to the reference genome, one gene copy per chromosome (CN = 2), followed by heterozygous duplication, which accounts for a gain of one additional copy (CN = 3), and heterozygous deletion, which accounts for a loss of one copy (CN = 1) (Fig. [94]1b). Using whole genome CN estimates, we found that, on average, two accessions belonging to the same species have at least 7000 genes with different CN profiles, and two accessions from different species have at least 20,000 genes with different CN profiles. We next identified SDs (≥ 1 kb in length, ≥ 90% sequence identity) and deletions in 5 kb and 1 kb windows across the reference genome. We discovered 1,700–2,100 SDs per accession spanning 66 to 70 Mb (10%) of the Malus genome across different accessions (Supplementary Figure S1). The SDs ranged in size from 10,000 bp to 2,032,417 bp, with an average size of 36,044 bp (Supplementary Table 3; S1). Similarly, we calculated segmental deletions and discovered a range of 0 to 45 deletions, spanning 0.7 Mb (0.1175%) of the apple genome across different Malus accessions (Supplementary Figure S2). The deletions ranged from 0 to 79,785 bp, with an average size of 15,996 bp (Supplementary Table 4; S1). Fig. 1. [95]Fig. 1 [96]Open in a new tab Comparative analysis of copy number (CN) variations in Malus. a) Circos plot of CN distribution across 17 chromosomes in six Malus species. This Circos plot visualizes the comparative CN variations across 17 chromosomes for six Malus species. The plot is structured in concentric rings, each representing a different species. Within each ring, CN states are depicted as histograms. From the outermost to the innermost ring, the histograms of CN states of species are ordered as follows: Malus domestica (red), M. sieversii (green), M. robusta (orange), M. hupehensis (blue), M. baccata (dark purple), and M. sylvestris (light purple). b) Frequency plot of frequent genome-wide gene CNs across 116 Malus accessions. This plot illustrates the distribution of frequent gene copy numbers across the genomes of 116 Malus accessions, with the x-axis representing gene CN and the y-axis showing the frequency of each CN state Segmental deletion and duplication maps Approximately 48,000 SDs were observed across 116 Malus accessions, ranging in size from 10 kb to 2 Mb (Fig. [97]2a; Supplementary Table 3; S2). Additionally, 289 deletions were identified across all accessions, ranging in size from 10 to 79 kb (Fig. [98]3a; Supplementary Table 4; S2). Genomic coordinates of some SDs and deletions exhibited overlapping regions at their start or end sites across different accessions, indicating recurrent duplication or deletion of the same genomic segments. Fig. 2. [99]Fig. 2 [100]Open in a new tab Distribution and Characteristics of Segmental Duplications by Chromosome. a) Histogram of segmental duplication (SD) lengths across various chromosomes. SDs, ranging from 10 to 500 kb, are displayed with the x-axis representing duplication length (bp) and the y-axis representing frequency. b) Bar graph showing the number of shared and unique duplications per chromosome. The x-axis represents each chromosome, and the y-axis represents the number of duplications, highlighting the distribution of shared versus unique duplications (shown in red) across the genome. A black box highlights the region representing the unique duplications across all chromosomes Fig. 3. [101]Fig. 3 [102]Open in a new tab Distribution and Characteristics of Segmental Deletions by Chromosome. a) Histogram of segmental deletion lengths across various chromosomes. Deletions, ranging from 10 to 80 kb, are displayed with the x-axis representing deletion length (bp) and the y-axis representing frequency. b) Bar graph showing the number of shared and unique deletions per chromosome. The x-axis represents each chromosome, and the y-axis represents the number of deletions, highlighting the distribution of shared versus unique deletions (shown in red) across the genome. A black box highlights the unique deletions across all chromosomes The SD count was refined to approximately 4,900 post filtering, including 4,325 shared SDs and 563 unique ones (Supplementary Table 3; S3, S4). Among the deletions, 83 were shared, and 20 were unique, totaling 103 deletions (Supplementary Table 4; S3, S4). The distribution of shared SDs varied significantly across chromosomes, with counts ranging from 2 to 114, with the highest count of 114 on chromosome 7. Deletion counts varied from 2 to 73, with the highest count of 73 on chromosome 7. Unique SDs were most abundant on chromosome 15, especially in the M. florentina genotype, followed by M. fusca. The highest number of SDs was observed on chromosome 15, followed by chromosomes 5, 11, and 10. Conversely, chromosome 1 had the lowest number of SDs (Fig. [103]2b). The highest number of shared deletions was observed on chromosome 10, while unique deletions were most frequent on chromosome 16 (Supplementary Figure S3). Similarly, most deletions were observed on chromosome 10, and the most unique deletions were observed on chromosome 16 (Fig. [104]3b). The gene model data from GDDH13, available on the GDR, identified overlaps between genes and genomic deletions and duplications (Supplementary Table 3; S5). Multiple copies of several key genes were identified in the duplicated regions of the Malus genome. These include Agamous-like MADS-box, sorbitol dehydrogenase, sorbitol transporters, and ALMT genes. Additionally, multiple copies of genes related to flavonoids and isoflavones were found, specifically Flavanone 3-hydroxylase, 2-hydroxyisoflavanone dehydratase, and isoflavone reductase. Several genes involved in various biosynthetic pathways for synthesizing volatiles, aromatics, pigments, and antioxidants were also identified in these duplicated regions. These pathways include phenylpropanoids, carotenoids/norisoprenoids, esters/aldehydes/alcohols, and anthocyanins/anthocyanidins/flavonols/isoflavones (Supplementary Table S3; S6). Population genetic properties of Malus based on CNVs Hierarchical clustering and multi-dimensional scaling (MDS) analysis Hierarchical clustering identified two distinct groups in the 116 Malus accessions based on CNVs (Fig. [105]4a). Group I comprised domesticated apple (M. domestica) and its wild progenitors (M. sieversii and M. sylvestris), while group II included all other wild species. Within group I, two sub-groups emerged: one consisting mainly of 11 M. domestica rootstocks, and the other containing 23 scion accessions of M. domestica, 15 accessions of M. sieversii from Kazakhstan, 10 accessions of M. sylvestris, and one accession of M. pumila. Group II is further divided into four sub-groups: the first represents wild species native to North America (M. ioensis, M. angustifolia, and M. fusca); the second includes accessions from wild species such as M. hupehensis, M. baccata, M. robusta, and M. asiatica; the third comprising a few other wild species; and the fourth representing M. sieversii accessions collected from natural forests in Xinjiang, China (Fig. [106]4a). Interestingly, two accessions of M. sieversii (M. sieversii K12 and M. sieversii K15), two accessions of M. sylvestris (M. sylvestris 08 and M. sylvestris 09), one accession of M. orientalis (M. orientalis) and one accession of M. baccata (Malus baccata 01) clustered with M. domestica. Accessions of M. robusta clustered together with accessions of M. asiatica, while M. hupehensis clustered closely with M. baccata. Accessions of M. sieversii from Xinjiang, China, formed a separate cluster. The MDS analysis showed that M. domestica scion cultivars clustered tightly with M. sieversii from Kazakhstan and M. sylvestris. In contrast, M. sieversii from Xinjiang formed a distinct cluster closer to other wild accessions (Fig. [107]4b). Fig. 4. [108]Fig. 4 [109]Open in a new tab Population genetic structure of 116 domesticated apples and wild Malus species based on copy number variations (CNVs). a) CNV-based cluster dendrogram of 116 accessions of domesticated and wild Malus species. Bootstrap values are shown on the branches of the tree. Accessions belonging to the same species are labeled in the same color, and wild Malus species with one or few accessions are highlighted in brown. b) Multi-dimensional scaling analysis highlighting the genetic diversity within the genus Malus. The M. domestica scion accessions, along with the accessions of M. sylvestris and M. sieversii from Kazakhstan, are highlighted within the black circle to emphasize their grouping and genetic relationships c) Bayesian model-based clustering with numbers of ancestry kinship (K) from 3 to 4. Each vertical bar represents one Malus accession, and each color represents one putative ancestral background. Abbreviations: Dom = M. domestica, Syl = M. sylvestris, Sie_K = M. sieversii from Kazakhstan, Sie_X = M. sieversii from Xinjiang, China, N.A = North American Malus species and Wild = all other wild Malus species Admixture analysis Bayesian clustering identified four populations (K = 4) that best represent the ancestry of the 116 Malus accessions. At K = 3, M. domestica, M. sieversii, and M. sylvestris were clearly separated from the other wild Malus species. At K = 4, the North American wild species formed a distinct cluster, showing a clear distinction among M. domestica, M. sieversii, M. sylvestris, North American species, and other wild Malus species (Fig. [110]4c). Species-specific divergent gene copy numbers A total of 3,000 genes showing significant differentiation (V[ST at 99 th percentile] > 0.28, 0.6% of all genes) between M. domestica rootstocks and scion cultivars were identified (Supplementary Table 5; S1). Gene enrichment analysis of these highly differentiated genes revealed significant enrichment in the phenylpropanoid pathway (p = 1.20E^−06), lignin metabolic pathway (p = 1.20E^−06), secondary metabolic pathway (p = 9.20E^−07), and carbohydrate pathway (p = 1.10E^−05) (Fig. [111]5a; Supplementary Table 5; S2). CN of laccase and cinnamyl alcohol dehydrogenase (CAD) genes was higher in rootstocks compared to scions (Supplementary Table 5; S3). Malus sieversii accessions from Kazakhstan and Xinjiang exhibited approximately 2,000 genes with V[ST] values exceeding the maximum 99 th percentile permuted V[ST] value (V[ST] > 0.35, 0.04% of all genes) (Supplementary Table 5; S4). The enriched terms for these genes included electron transport (p = 1.80E-07), cellular respiration (p = 1.10E-05), defense response (p = 8.70E-05), response to stress (p = 0.00015), and chromatin assembly (p = 0.00018) (Fig. [112]5b) (Supplementary Table 5; S5). Fig. 5. [113]Fig. 5 [114]Open in a new tab Pathway Enrichment Analysis for Copy Number Variable Genes Pathway enrichment analysis for copy number variable genes across five groups: a) Rootstocks vs. scion cultivars, b) M. sieversii from Kazakhstan vs. M. sieversii from Xinjiang c) M. domestica vs. M. sylvestris d) M. domestica vs. M. sylvestris e) Wild vs. M. domestica, M. sieversii, and M. sylvestris. The x-axis represents the rich factor, indicating the ratio of observed to expected genes in a pathway, and the y-axis shows the pathway names. The size of the circles corresponds to the number of genes, while the color indicates the -log10(p-value), with a gradient from red to blue representing increasing significance Approximately 700 genes had differential CN between M. domestica and M. sieversii (V[ST] > 0.26 at 99 th percentile, 0.015% of all genes) (Supplementary Table 5; S6). Gene enrichment analysis of these extremely differentiated gene sets revealed a significant enrichment of genes in the defense response pathways (p = 0.0003) (Supplementary Table 5; S7). In addition, genes such as beta-galactosidase, flavonol synthase, beta-glucosidase, lipoxygenase, 1-aminocyclopropane-1-carboxylate oxidase, lycopene cyclase, dihydroflavonol 4-reductase, phosphofructokinase, MYB5, and MYB12 had higher average copy numbers in M. domestica. Similarly, about 360 genes were found to have differential gene CN between M. domestica and M. sylvestris (V[ST] > 0.36, 0.007% of all genes) (Supplementary Table 5; S8), with significant enrichment in polygalacturonase activity (p = 0.0011) (Fig. [115]5d; Supplementary Table 5; S9). Malus domestica has higher CN of Malate dehydrogenase (MDH), Phosphofructokinase (PFK), and beta-galactosidase genes than M. sylvestris. A comparison of wild Malus species with M. domestica, M. sieversii, and M. sylvestris revealed approximately 5,000 genes with different CN profiles (V[ST] > 0.1, 0.1% of all genes) (Supplementary Table 5; S10). Significant enrichment was observed in the electron transport chain (p = 7.00E-06), pollination (p = 0.00015), response to oxidative stress (p = 0.00025), and response to stress (p = 0.0015) (Fig. [116]5e) (Supplementary Table 5; S10). Higher copy numbers of NB-ARC domain-containing disease resistance genes (e.g., MD11G1020200, MD11G1020500, MD00G1172900, MD08G1096800, MD11G1034100, MD11G1040600, MD03G1015300, MD11G1033600, MD07G1259200, MD07G1008400, MD03G1049200, MD15G1391000) and NBS-LRR resistance proteins (e.g., MD02G1278900, MD15G1375200, MD05G1018900) were found in wild Malus species. Discussion CNVs and SDs and their potential role in shaping traits of apples We have comprehensively characterized the genome-wide gene CNV and SD profiles of 116 Malus accessions representing a large number of Malus species, including the domesticated apple (Malus domestica) and its main progenitor species (M. sieversii and M. sylvestris). We used mrsFAST aligner to map Illumina reads against apple reference genome GDDH13 [[117]39]. The selection of a mapping tool to characterize segmental duplications is important, as most of the mapping algorithms fail to characterize the recent segmental duplications in the genome. The current alignment tools fail to map multireads to the multiple locations of segmental duplications, particularly when short sequence reads are used. Therefore, mrsFAST was used to capture all possible mapping locations. It can handle user-defined specific numbers of mismatches better than other tools, making it suitable for studies like detecting copy number polymorphisms and segmental duplications [[118]57]. Following this, read mapping absolute copy numbers of genes were calculated in 1 kb windows along the genome using mrCaNaVaR. Developed within the 1000 Genomes Project, it still remains the only method that estimates integer copy numbers for duplicated gene families. Other CNV detection methods such as CNVnator and Control-FREEC only determine “new duplications” where there is no duplication in the reference genome, therefore they miss copy number polymorphism. CNVnator does not predict integer copy numbers, and Control-FREEC limits copy numbers to 4, and was developed mainly for finding somatic CNVs in tumor genomes [[119]3, [120]24, [121]38, [122]60]. It has been successfully used to identify CNVs and segmental duplications in grape, ape and cattle genomes as well as ancient human DNA [[123]26, [124]28, [125]36, [126]61]. Long reads can offer more effective mapping to large, repetitive, or structurally complex genomic regions, enabling better resolution of segmental duplications and large CNVs. They can also distinguish closely related paralogs by spanning entire genes [[127]58, [128]59]. However, long-read sequencing of plants remains expensive, especially when applied to many accessions with large genomes. The in-depth analysis of CNVs and SDs across different species indicates the significant role of these structural variations in the domestication of apples from their main progenitors, including both neutral and adaptive evolution during the domestication process. CNVs comprise an important source of genetic variation alongside SNPs [[129]62–[130]64]. Highly divergent CN profiles between Malus accessions belonging to the same species (~ 7,000) or different species (~ 20,000) can be attributed to the genetic diversity, high heterozygosity, and two genome-wide duplication events that have shaped the apple genome [[131]17]. Reports suggest a strong association between SDs and CNVs; SDs are hotspots for genomic rearrangements, expansions, and CNV formation, and they serve as a source of gene function diversification [[132]65–[133]68]. Approximately 10% of the apple genome was found to be duplicated across different Malus accessions. This is consistent with a previous study by Xu et al. [[134]38], which identified CNVs from 346 diverse apple accessions, including 289 cultivars and 57 wild relatives, accounting for 10.03% of the apple genome [[135]38]. In comparison, a higher proportion of the grape genome (26%) has been characterized as being covered by highly duplicated SDs [[136]36]. These SDs could explain the variable CN and higher CN profiles of apple genes across different Malus accessions. Analysis of the gene content revealed that SDs harbor multiple copies of genes that contribute to the distinctive characteristics of apples. These SDs were predominantly shared among different Malus accessions. Notably, apple-specific subclades of genes were identified within these SDs. Among these, Agamous-like MADS-box transcription factors (39 genes) were highly over-represented in the duplicated regions. Gene duplication is considered a driving force for biological evolution [[137]69]. These findings align with previous studies reporting multiple copies of MADS-box genes in the genome of M. domestica [[138]17]. MADS-box genes play a crucial role in the development of apple's characteristic pome fruit, a feature unique to the Pyreae tribe [[139]17]. Interestingly, another study on Pyrus, which also produces pome fruits, suggests that WGD events led to the expansion of the MADS-box gene family [[140]70]. Among the MADS-box genes, AGAMOUS (AG, C-function) genes are essential for determining floral organ identity, as described in the ABC model of flower development [[141]71, [142]72]. In Malus, AG genes are expressed in flowers and fruits, highlighting their significant role in fruit development [[143]73]. We have identified duplicated regions harboring multiple copies of genes that regulate sugar and acid content in apples, which are crucial for the fruit's taste and flavor. The composition and concentration of sugars and their balance with acids strongly influence the taste and flavor of apples [[144]74, [145]75]. Sorbitol, the main carbohydrate produced during photosynthesis in apples, is transported via the phloem to the fruits. There, it is taken up by plasma membrane-bound sorbitol transporters (SOTs) and converted to fructose by sorbitol dehydrogenase [[146]76]. Subsequently, multiple genes encoding sorbitol dehydrogenase and sorbitol transporters identified in these duplications suggest that these duplications played an important role in defining the distinctive characteristics of the domesticated apple fruit. Malic acid, the main organic acid in apples, accumulates during the early stages of fruit development and largely controls fruit acidity and the perception of sweetness [[147]77, [148]78]. Multiple copies of the ALMT gene, which influences malic acid accumulation, were also found in these duplicated regions. Taken together, these findings suggest that several important genes underlying these duplications contribute to the distinctive traits of apples. These SDs underscore the genetic complexity and evolutionary processes that might have shaped the apple genome, contributing to its unique characteristics and adaptability. Genome-wide CNVs and genetic relationship of Malus species Overall, accessions of the same Malus species and from the same geographical areas clustered together based on CNV profiles. Accessions of M. domestica grouped with the accessions of its main progenitors, M. sieversii and M. sylvestris. Clustering of the CNV matrix identified a structure mirroring the previously SNP-based apple phylogeny, with accessions clustering within their known taxonomic groups. The population structure inferred from CNVs was consistent with the SNP-based apple phylogeny reported by Duan et al. [[149]33]. M. domestica clustered closely with its main progenitor, M. sieversii, forming a domestication-related clade distinct from other wild species, similar to observations from the SNP-based study [[150]33]. North American wild species (M. ioensis, M. angustifolia, M. fusca) formed a distinct cluster closely related to the Asian wild species M. baccata and M. hupehensis, consistent with the patterns observed in the SNP-based study [[151]33]. M. sieversii from Xinjiang, China, formed a separate cluster outside the domestication-related clade and grouped closer to the wild Malus species in the CNV study. This indicates that the genetic variation captured by CNVs can accurately reflect the population structure in Malus. Results from the structure analysis (K = 3) further support the hypothesis that the primary genetic contributor to M. domestica is M. sieversii from Kazakhstan with significant contributions from M. sylvestris. Furthermore, M. sieversii accessions from Xinjiang, China are distinct from M. sieversii from Kazakhstan likely due to the geographic barrier imposed by the Tian Shan Mountains. Admixture is evident in M. sieversii from Kazakhstan, likely resulting from the hybridization with other wild Malus species or M. domestica growing nearby. At K = 4, North American wild species still form an independent group, likely due to long-term geographic isolation from other wild Malus species. Studies have demonstrated that CNVs accurately reflect population structure in humans, plants, and animals [[152]16, [153]36, [154]79, [155]80]. However, two accessions of M. sieversii (M. sieversii K12 and M. sieversii K15), two accessions of M. sylvestris (M. sylvestris 08 and M. sylvestris 09), one accession of M. orientalis (M. orientalis) and one accession of M. baccata (M. baccata 01) clustered with M. domestica outside of their respective species group. This anomaly could have arisen from incorrect clustering or mislabeling in the USDA-ARS germplasm repository. Previously, DNA profiling of M. sieversii, M. orientalis, and M. sylvestris accessions from the USDA-ARS National Plant Germplasm System (NPGS) revealed admixture. The study indicated that some seeds collected from wild species in the species'center of diversity were hybrids or completely M. domestica [[156]81]. To verify whether this anomaly arose from incorrect clustering or mislabeling, the dataset from this study was searched to match the accession numbers to those of the admixed accessions. It was found that M. sieversii K15 (PI 657101), M. sylvestris 08 (PI 619168), and M. orientalis (PI 644252) were, in fact, completely M. domestica. These findings suggest that gene CNVs represent an important source of genetic variation and could help explain the missing heritability that SNPs do not account for. A growing body of evidence suggests that natural selection acts on CNVs, facilitating adaptive refinements during domestication, evolution, and ecological adaptation [[157]24, [158]82–[159]84]. Selection pressure can act on gene CNs; given the prevalence and dynamic nature of CNVs within Malus accessions, it was hypothesized that this form of genetic variation might have been influenced during the domestication of apples to drive changes in fruit quality and size. Domesticated apples are firmer and sweeter than their primary progenitor, M. sieversii, and larger and less acidic than their introgression contributor, M. sylvestris. During domestication, key genes involved in fruit size, flavor, flesh firmness, sweetness, and acidity were under constant selection [[160]33]. Notable differences in the CN of genes involved in fruit ripening, softness, and flavor were observed between M. domestica and M. sieversii. Significant genes with higher CNs in M. domestica include β-Gal, LOX, and ACO. β-Gal have been reported to play an important role in fruit ripening and softening in apples [[161]85]. Two β-galactosidase genes, which underlie fruit weight QTL fw1, were reported to be under human selection during apple domestication. This selection pressure might have led to an increase in the CN of β-Gal. LOX and ACO have been reported to contribute to fruit flavor formation [[162]86] and fruit ripening, respectively. ACO catalyzes the final step in the biosynthesis of ethylene in flowering plants and plays a crucial role in fruit ripening in apples. ACO has been identified in the selective sweep across M. domestica and M. sylvestris [[163]33]. Additionally, genes involved in anthocyanin biosynthesis, such as flavonol synthase and dihydroflavonol 4-reductase, were found in higher CNs. M. domestica had higher CN of malate dehydrogenase (MDH), phosphofructokinase (PFK), and beta-galactosidase genes than M. sylvestris. MDH regulates malate content in apples [[164]87] and thus could be under positive selection in bigger fruits of M. domestica. Overall, the results suggest that the increase in CN of genes related to fruit ripening, softness, and flavor may have played an important role in the development of larger and more flavorful domesticated apples. These domestication-driven changes in CNVs underscore the role of selection pressure in shaping key traits such as fruit size, firmness, and flavor. Over evolutionary history, wild apples have faced diverse and challenging environments, exposing them to selection pressures that likely influenced CNVs associated with stress and disease tolerance genes. In other plant species, CNVs have also played a significant role in adaptive evolution. For instance, an increased copy number of the EPSPS gene in Amaranthus palmeri confers resistance to the widely used herbicide glyphosate [[165]29]. Similarly, in Arabidopsis halleri, the expansion of metal homeostasis gene copies enables metal hyperaccumulation, providing defense against biotic stress [[166]30]. These examples highlight how natural habitats have driven adaptive evolution in plants, shaping their genetic diversity and resilience. This pressure may have driven the adaptive evolution observed in wild Malus species. Genes associated with stress and defense responses were found to exhibit copy number variability between domesticated and wild Malus species. The NB-ARC domain-containing disease resistance proteins, in particular, play a crucial role in pathogen recognition and the subsequent activation of immune responses [[167]88]. Higher copy numbers of NB-ARC domain-containing disease resistance proteins in wild Malus species likely facilitated the effective recognition of pathogens present in their natural habitats. This adaptation might have helped wild Malus species withstand disease pressures in the wild. This is supported by the findings of a previous CNV study, which identified two disease resistance proteins, MD03G1049200 and MD15G1391000, as being present in higher copy numbers [[168]38]. Additionally, the study evaluated the expression data of one cultivar and two wild accessions, revealing higher expression levels of MD15G1391000 in the wild accessions. Peroxidases play a crucial role in stress responses [[169]89] and subsequently two genes belonging to Peroxidase superfamily protein (MD13G1097600, MD15G1252300) were found to have higher copies in wild Malus species. Taken together, these findings suggest that disease and stress response genes may have been subject to selective pressure in the wild, ultimately leading to an increase in their copy numbers. This increase in copy numbers likely contributed to the survival of wild Malus species in their natural environments, driving the adaptive evolution of Malus species in the wild. MADS-box genes play a crucial role in the development of apple's characteristic pome fruit [[170]17]. Notably, the presence of higher copy numbers of a MADS-box transcription factor (MD03G1135000) in domesticated apples suggests its potential role in contributing to the larger fruit size observed in cultivated varieties compared to wild Malus species. Rootstocks have been used for over 2000 years to enhance productivity and influence architectural traits such as early bearing and dwarfing in apple cultivars [[171]90]. Altered anatomical, metabolic, hormonal, and carbohydrate profiles have been implicated in rootstock-induced dwarfing in apples [[172]91]. Differences in the copy number profiles of genes involved in auxin biosynthesis, carbohydrate metabolism and secondary metabolic pathways have been observed between rootstock and scion accessions. Our results showed significant enrichment in the phenylpropanoid pathway and lignin metabolic pathway between rootstock and scion cultivars. Among them, laccase and cinnamyl alcohol dehydrogenase (CAD) genes were found to have higher CN in rootstocks than in scions. Laccase genes are involved in lignin biosynthesis. Lignins constitute an abundant component of plant cell walls and provide mechanical support and resistance to phytopathogens and environmental stresses [[173]92–[174]94]. Cell wall lignification at the infection site is hypothesized to delay pathogen entry, allowing plants to activate defense responses via the biosynthesis of phytoalexins and resistance proteins [[175]95]. A previous report also highlighted that lignin biosynthesis constitutes an important part of the defense network in apple roots against Pythium ultimum infection [[176]96]. Apple rootstocks can have a variety of desirable characteristics, such as resistance to biotic and abiotic stresses, dwarfing, increased precocity (early fruitfulness), fruit yield, and branching modifications [[177]97]. These findings indicate the potential importance of CNVs in understanding the genetic basis of traits in apple rootstocks. Conclusion In this study, we determined the genome-wide gene copy number and segmental duplication profiles for 116 Malus accessions representing 24 apple species using read-depth analysis of whole-genome short-read resequencing data. The distribution patterns of genome-wide copy number variations (CNVs) revealed an evolutionary map of apple species. These CNV profiles, derived from diverse apple accessions, including domesticated varieties (M. domestica), progenitors (M. sylvestris and M. sieversii), and wild species from North America, Europe, and Asia, reflect the genetic diversity shaped by different geographical habitats and neutral processes. Additionally, they highlight the evolutionary relationships among these apple species, illustrating how admixture and adaptations have occurred over time in response to varying environmental conditions and human selection pressures. Thus, the CNV data provide insights into apples’ historical dispersal and adaptive evolution across different continents and habitats. Analysis of segmental duplications (SDs) revealed that genes encoding MADS-box transcription factors, ALMT, and sorbitol metabolism have played crucial roles in imparting apple-specific characteristics, such as the development of pome fruit and fruit quality traits. These genes are pivotal in shaping the unique attributes of apples, highlighting the evolutionary significance of SDs in defining apple morphology and physiology. CNVs revealed the potential molecular basis for apple domestication and ecological adaptation. Domesticated apples and their progenitors show distinct copy number variations in genes related to reproduction and pollination compared to other wild species. In conclusion, our study on CNVs in domesticated apples and their wild progenitors highlights their role in shaping genetic diversity essential for adaptation and cultivation. These findings underscore the significance of CNVs in understanding the evolutionary mechanisms governing apple domestication and adaptation, providing a valuable resource for future research aimed at enhancing apple phenotypic traits through breeding strategies. Supplementary Information [178]Supplementary Material 1.^ (138.3KB, pptx) [179]Supplementary Material 2.^ (409.9KB, xlsx) [180]Supplementary Material 3.^ (12.3MB, csv) [181]Supplementary Material 4.^ (22.8MB, xlsx) [182]Supplementary Material 5.^ (91.9KB, xlsx) [183]Supplementary Material 6.^ (6.8MB, xlsx) Acknowledgements