Abstract

Background

High-throughput genotyping technology has become an indispensable tool for advancing molecular breeding and genetic research in plants, facilitating large-scale exploration of genomic variation. Genotyping technology based on liquid-phase arrays uses streptavidin-coated nanomagnetic beads to capture biotin-modified probes, which in turn capture the target sequences in the genome for genotyping. This study aims to develop a novel liquid-phase array for tea plant that can be used for cultivar identification, genetic map construction, quantitative trait locus (QTL) mapping of key agronomic traits, and genetic evolution analysis.

Result

We developed a highly efficient multiple-SNP array, the TEA5K mSNP array, which comprises 5,781 liquid-phase probes based on the Genotyping by Target Sequencing (GBTS) system. Using this array, we genotyped 231 developed tea cultivars, revealing that genetic similarity within the same cultivar ranged from 92.53% to 97.95%, whereas genetic similarity between different cultivars generally remained below 82.36%. Furthermore, using this array, we constructed a high-density genetic map consisting of 3,274 markers, covering a total genetic distance of 2,225.19 cM, with an average marker interval of 0.76 cM. The high-resolution genetic map facilitated the identification of multiple QTLs linked to eight amino acid components, as well as two molecular markers strongly associated with the albino-leaf trait in the ‘Huangjinya’ cultivar, both mapped to chromosome 8. Moreover, we applied the array to analyze the population structure and phylogenetic relationships of 519 tea germplasm accessions, classifying them into three major groups: wild accessions, landraces, and modern cultivars. Notably, modern cultivars exhibited lower genetic diversity than landraces.
Additionally, we observed substantial genetic differentiation between wild resources and modern cultivars, with minimal to no gene flow from wild populations into domesticated cultivars. These findings suggest that modern tea breeding faces an “improvement bottleneck,” a challenge similar to that encountered in other perennial crops.

Conclusion

The TEA5K mSNP array is presented as a flexible, cost-effective, and low-maintenance genotyping tool that significantly enhances both genetic research and molecular breeding in tea plants. By providing a robust platform for genome-wide analysis and facilitating the identification of key QTLs, this tool offers valuable insights for improving the genetic diversity and agronomic performance of tea cultivars.

[Graphical Abstract]

Supplementary Information

The online version contains supplementary material available at 10.1186/s12951-025-03533-5.

Keywords: Genotyping by target sequencing, Cultivar identification, Genetic evolution, Genetic maps, Gene mapping

Introduction

The history of plant breeding has been shaped by three major innovations: traditional cross-pollination techniques, transgenic technology, and more recently, marker-assisted selection (MAS). Unlike conventional breeding, which relies on field selection based on phenotypic traits, MAS significantly shortens breeding cycles and enhances selection efficiency [1]. A variety of DNA markers, such as amplified fragment length polymorphism (AFLP), simple sequence repeats (SSRs), and single nucleotide polymorphisms (SNPs), have been developed and applied in MAS [2]. However, AFLP and SSR markers pose several limitations, including high costs, low throughput, and a high false-positive rate, making them less scalable for large breeding programs [3].
In contrast, SNP markers are highly abundant, genetically stable, and widely used in genetic studies across various crops, including rice (Oryza sativa), mulberry (Morus spp.), and tea (Camellia sinensis), making them ideal for MAS applications [4–6]. SNP markers are commonly developed and applied using high-throughput genotyping platforms, including SNP arrays and sequencing-based technologies [7–9]. Commercial SNP arrays, such as the Affymetrix GeneChip and Illumina Infinium platforms, are widely employed in plants, where SNP detection occurs through hybridization of genomic DNA with oligonucleotide probes [10]. For example, the Affymetrix 44 K SNP array has been applied in genome-wide association studies (GWAS) in rice [11, 12], and the Brassica 60 K Illumina Infinium array has facilitated fine genetic mapping [13, 14]. Nevertheless, the development and utilization of such SNP chips in tea plants are rarely reported. To date, only a 200 K Affymetrix solid SNP chip has been developed and applied to analyze the population structure and evolutionary relationships between Camellia sinensis var. assamica (CSA) and Camellia sinensis var. sinensis (CSS) [15]. Although solid SNP chip genotyping offers high reliability and reproducibility, it involves substantial development costs and is limited by the fixed content of commercial arrays, which significantly restricts flexibility in marker selection and utilization [2, 3]. On the other hand, sequencing-based approaches, such as whole-genome sequencing (WGS) and genotyping by sequencing (GBS), have gained popularity in plant genomics. WGS provides high-resolution genome-wide variation detection [16], but its high cost restricts its practical application in large-scale breeding programs.
GBS, which selectively sequences partial genomic regions through restriction enzyme digestion, is a more cost-effective approach; however, it provides lower SNP resolution and demands substantial bioinformatics processing [6]. Given these limitations, there is a growing demand for more flexible, cost-efficient genotyping platforms tailored for commercial breeding applications. A recently developed method, Genotyping by Target Sequencing (GBTS), has emerged as a promising alternative for species such as maize (Zea mays) and wheat (Triticum aestivum) [17, 18]. GBTS captures specific genomic regions using solution-based probes or primers, effectively integrating the strengths of both solid-phase SNP arrays and GBS. This hybrid method provides enhanced flexibility, cost-efficiency, and high sequencing depth, making it particularly suitable for commercial breeding and genetic research.

Tea is an economically significant crop worldwide, with its cultivation dating back thousands of years. Despite the availability of rich genetic resources, their effective utilization in breeding remains challenging due to the species’ self-incompatibility, large genome size, and prolonged breeding cycles [19]. Whole-genome resequencing is not ideal for MAS in tea breeding because of its high cost. GBTS, by contrast, offers a more affordable sequencing strategy while maintaining high SNP resolution, making it an attractive alternative for genotyping in tea plants.

In this study, we developed a novel SNP array for tea using GBTS technology, termed the TEA5K mSNP liquid-phase array, comprising 5,781 core SNP sites along with more than 30,000 flanking SNP sites. We further demonstrate the utility of this platform in a range of applications, including cultivar identification, evolutionary analysis, genetic map construction, and QTL mapping.
Materials and methods

Development of TEA5K mSNPs liquid-phase array

Liquid-phase array technology is based on the principle of in-solution hybrid capture, in which specifically designed probes hybridize with complementary target sequences, allowing for high-throughput sequencing of predefined genomic regions. In this study, liquid-phase probes were designed based on a set of SNPs identified from the resequencing data of 811 tea accessions archived in our previously established Tea Genomic Variation Database (TeaGVD) [20]. The SNP identification pipeline began with quality trimming of raw reads using Sickle (https://github.com/najoshi/sickle) with a Phred quality threshold of Q20, followed by alignment to the ‘Shuchazao’ v2.0 reference genome via Burrows-Wheeler Aligner [21, 22]. PCR duplicates were removed using Sambamba [23] with parameters “--overflow-list-size 1000000 --hash-table-size 1000000”. The SNPs and InDels were then called using FreeBayes [24], and filtered with PLINK and VCFtools [25] under the criteria: minor allele frequency (MAF) > 25%, heterozygosity < 15%, and missing data < 10%. This yielded 13,760 candidate regions. A 100 bp sliding window was applied to quantify SNP density, and regions with fewer than 15 SNPs, as well as repetitive sequences, were excluded. Regions were selected based on uniform chromosomal distribution and GC content (25–75%), resulting in 6,477 target loci for probe design. Probes were created using GenoBaits Probe Designer (Molbreeding Biotech) and synthesized via semiconductor-based in-situ biotinylated synthesis [17]. To assess probe performance, 268 tea accessions were tested.
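The site-level filtering criteria stated above (MAF > 25%, heterozygosity < 15%, missing data < 10%) can be illustrated with a minimal sketch. The study applied these thresholds with PLINK and VCFtools; the code below is not that pipeline. It assumes biallelic genotypes coded as alternate-allele counts (0, 1, 2) with `None` for a missing call, and it assumes heterozygosity and MAF are computed over called genotypes only. The function names `site_stats` and `passes_filter` are hypothetical.

```python
from typing import List, Optional, Tuple

Genotype = Optional[int]  # 0, 1 or 2 alternate alleles; None = missing call


def site_stats(genotypes: List[Genotype]) -> Tuple[float, float, float]:
    """Return (MAF, heterozygosity, missing rate) for one biallelic site."""
    if not genotypes:
        raise ValueError("no genotypes supplied")
    called = [g for g in genotypes if g is not None]
    missing = 1.0 - len(called) / len(genotypes)
    if not called:
        return 0.0, 0.0, missing
    alt_freq = sum(called) / (2 * len(called))   # alternate-allele frequency
    maf = min(alt_freq, 1.0 - alt_freq)          # minor allele frequency
    het = sum(1 for g in called if g == 1) / len(called)
    return maf, het, missing


def passes_filter(genotypes: List[Genotype],
                  min_maf: float = 0.25,
                  max_het: float = 0.15,
                  max_missing: float = 0.10) -> bool:
    """Apply the three thresholds quoted in the text (assumed strict)."""
    maf, het, missing = site_stats(genotypes)
    return maf > min_maf and het < max_het and missing < max_missing


# Toy sites (illustrative data, not from the study)
balanced_site = [0] * 6 + [2] * 4                 # MAF 0.40, no hets, no missing
het_rich_site = [0, 1, 1, 1, 2, 0, 2, 0, 2, 0]    # heterozygosity 0.30
rare_site = [0] * 9 + [2]                         # MAF 0.10
sparse_site = [0, 0, 2, 2, None, None, 0, 2, 0, 2]  # 20% missing
```

In this toy data only `balanced_site` passes all three thresholds; each of the other sites fails exactly one criterion.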
Probes were excluded if their capture uniformity (the ratio of sequencing depth in the target region to the average sequencing depth of the sample) was < 10%, their on-target rate (the proportion of reads covering the target region among all reads captured by the probe) was < 50%, or their missing data rate was > 20%.

DNA extraction, library construction, sequencing and genotyping

A total of 519 tea accessions were sequenced and genotyped using the TEA5K mSNP array to assess basic characteristics, enable cultivar identification, and perform population genetic analysis. These accessions encompassed a broad range of taxa, including Camellia sinensis var. sinensis, Camellia sinensis var. assamica, Camellia sinensis var. pubilimba, Camellia crassicolumna, Camellia tachangensis and Camellia fangchengensis (Supplementary Table 1; Additional file 2). Additionally, 98 F₁ offspring derived from a controlled cross between ‘Huangjinya’ and ‘Longjing 43’ were used for genetic map construction (Additional file 2). C. crassicolumna and C. fangchengensis were collected from Yunnan and Guangxi provinces, respectively, while all other germplasm were sourced from the National Tea Germplasm Repository @ Hangzhou at the Tea Research Institute of the Chinese Academy of Agricultural Sciences (TRICAAS), the National Large-leaf Tea Germplasm Repository @Menghai at the Tea Research Institute of Yunnan Academy of Agricultural Sciences, and the National Medium- and Small-leaf Tea Germplasm Repository @Changsha at the Tea Research Institute of Hunan Academy of Agricultural Sciences. All plant samples were stored at −80 °C before DNA extraction. Genomic DNA was extracted using the Plant Genomic DNA Extraction Kit (Aidelai Biotech, Beijing, China). DNA quality was assessed via 1% agarose gel electrophoresis, and DNA concentration was quantified using a Qubit 2.0 Fluorometer (Thermo Fisher Scientific, CA, USA).
Subsequently, DNA libraries were constructed and quantified using a Qubit ssDNA Assay Kit (Thermo Fisher Scientific, CA, USA). The libraries were loaded onto the flow cell and sequenced using paired-end 150 bp (PE150) reads on the MGISEQ-2000 platform (MGI, Shenzhen, China) [2]. The sequencing data were mapped to the ‘Shuchazao’ v2.0 reference genome [21] using the Burrows-Wheeler Aligner (BWA) MEM algorithm [22]. Variant calling was performed using the Genome Analysis Toolkit (GATK), generating genotype datasets for subsequent analyses [26].

Functional annotations of SNP markers

Using genotyping data from 519 tea accessions and the annotation of the ‘Shuchazao’ v2.0 reference genome [21], functional genes containing SNP markers were subjected to Gene Ontology (GO) and Kyoto Encyclopedia of Genes and Genomes (KEGG) enrichment analyses. These analyses were conducted using the agriGO v2.0 platform [27] and OmicShare Tools (https://www.omicshare.com/), respectively. GO terms and KEGG pathways with P-values < 0.05 were visualized using GraphPad Prism 10 and the ggplot2 package in R.

Cultivar identification

To evaluate the effectiveness of the TEA5K mSNP array for cultivar authentication, genotypic data from 231 modern tea cultivars were analyzed. For threshold determination, eight cultivars were selected, with four individuals per cultivar and three branches per individual (Supplementary Fig. 2). Genetic similarity (GS) was calculated as the proportion of matching markers between any two samples relative to the total number of markers. Each of the three marker panels yielded 26,795 GS values. As shown in Fig. 4, GS values between different cultivars clustered on the left, while those within the same cultivar grouped on the right, demonstrating a clear distinction. This pattern enables the establishment of a reliable GS threshold for differentiating tea cultivars.

Fig. 4. Modern cultivar and bud mutant identification analysis. (A) Phenotypic variation among modern cultivars; (B) Vegetative propagation-based phenotypic variation in eight cultivars; (C) Bud mutant phenotype analysis, comparing green-leaf branches and albino mutant branches; (D–F) Genetic similarity (GS) analysis of cultivars, vegetative propagation individuals, and bud mutants, evaluated using the 5 K core SNP panel (D), 36 K SNP panel (E), and 5 K mSNP panel (F). Sample pair counts increase from blue to yellow.

Construction of linkage map

The genotype data from 98 F₁ offspring and their parents ‘Huangjinya’ and ‘Longjing 43’ were used to construct a linkage map and perform QTL mapping. High-quality SNPs were selected based on specific criteria to ensure data accuracy and reliability. The detection rate of each marker was required to exceed 75% across all samples, ensuring that only consistently observed markers were retained. SNP markers conforming to the segregation types ab × cd, lm × ll, nn × np, hk × hk, and ef × eg were included in the analysis, as these segregation patterns are suitable for the population structure of tea plants. Markers exhibiting significant segregation distortion were removed based on a threshold of P < 0.01 to prevent potential bias in the mapping process. The remaining high-confidence SNP markers were subjected to linkage analysis using JoinMap software to construct the genetic map [28].

Amino acid extraction and quantification

During the spring seasons of 2022 and 2023, “two leaves and a bud” samples were collected from each individual in the F₁ hybrid population derived from the tea cultivars ‘Huangjinya’ and ‘Longjing 43’. Fresh samples were immediately steamed using a hot-air dryer at 120 °C for several minutes, then dried at 80 °C for 1 h until fully dehydrated. The dried material was finely ground using a tissue lyser (MM 400, Retsch, Germany).
For amino acid extraction, 0.1 g of powdered tissue was mixed with 10 mL of distilled water and incubated in a 90 °C water bath for 30 min, with gentle shaking every 10 min. The mixture was centrifuged at 4,000 rpm for 10 min. The resulting supernatant was filtered through a 0.22 μm Millipore membrane and diluted 1:1 with sample diluent. Quantitative analysis of amino acids was conducted using an automatic amino acid analyzer (Sykam, Germany), following the protocol of Liu et al. [29]. The analysis used a Sykam LCA K07/Li column (7 μm, 4.6 mm × 150 mm), with the reactor maintained at 130 °C, an injection volume of 50 µL, and detection at 570 nm. The flow rate of the ninhydrin reagent in the S2100 module was 0.25 mL/min, while the mobile phase in the S4300 module flowed at 0.45 mL/min. Each run had a total elution time of 130 min. Amino acid concentrations (theanine, serine, glutamate, glutamine, aspartate, arginine, alanine, and total amino acids) were calculated based on the relative response factors of standard amino acids. All results were reported as the mean of three biological replicates.

QTL mapping of eight amino acid components

The contents of eight amino acids in the F₁ population were integrated with genotypic data for QTL analysis using QTL IciMapping software. The Inclusive Composite Interval Mapping of Additive function (ICIM-ADD) was employed to identify significant associations [30], with a logarithm of odds (LOD) threshold of 3.00 set to ensure statistical significance. The final genetic map and the distribution of QTLs across the linkage groups were visualized using MapChart software [31, 32].

Measurement of chlorophyll and carotenoid contents

Chlorophyll a, chlorophyll b, and carotenoid contents were measured in F₁ individuals of the ‘Huangjinya’ population using three biological replicates.
Pigments were extracted from 0.1 g of young leaf tissue using 10 mL of an extraction solvent composed of acetone, ethanol, and water (4.5:4.5:1) and incubated for 24 h in darkness. Absorbance was measured at 663 nm, 645 nm, and 470 nm using ultraviolet spectrophotometry [33]. Pigment concentrations were calculated using empirical formulas described by Lichtenthaler et al. Statistical comparisons between sample groups were performed using Student’s t-test.

Map-based cloning of albino leaf color trait

Due to the high heterozygosity of tea plants, their F₁ progeny often exhibit significant phenotypic segregation. In this study, the F₁ population derived from a cross between the green-leaf cultivar ‘Longjing 43’ and the albino cultivar ‘Huangjinya’ included 54 albino individuals and 44 green individuals. This segregating population served as the basis for mapping the albino leaf color trait. To facilitate linkage analysis, individuals with green leaves (and ‘Longjing 43’) were assigned a value of ‘1’, while those with albino leaves (and ‘Huangjinya’) were labeled as ‘0’. Linkage analysis was conducted using JoinMap software [28], integrating phenotypic data with molecular markers used in genetic map construction. The Kosambi mapping function was applied to convert recombination frequencies into genetic distances, allowing the determination of marker positions and identification of those closely linked to the albino leaf color trait. The physical interval of the candidate region was then established based on the genomic coordinates of the flanking markers.

Bulking and whole genome resequencing

The genomic DNA of 60 F₁ offspring derived from a controlled cross between ‘Huangjinya’ and ‘Longjing 43’ was extracted using the Plant Genomic DNA Extraction Kit (Aidelai Biotech, Beijing, China). Two bulks were generated by mixing equal amounts of DNA from 30 F₁ individuals with green leaves and from 30 F₁ individuals with albino leaves, respectively.
The two bulks were subjected to whole genome resequencing using the Illumina HiSeq platform.

Population genetic analysis

SNP datasets for 519 tea germplasm accessions, obtained from the TEA5K mSNP array, were used for population genetic analysis in R. To assess the population structure, we performed ADMIXTURE (v1.3.0) analysis, setting the number of ancestral populations (K) between 2 and 8, with each iteration consisting of 10,000 cycles [34]. To further explore genetic relationships, principal component analysis (PCA) was conducted in R. Additionally, a neighbor-joining phylogenetic tree was constructed using PLINK and FigTree, with 1,000 bootstrap replicates to ensure robust genome-wide phylogenetic relationships [25].

Result

Basic characteristics of the GBTS liquid-phase array

Ultimately, 696 probes were eliminated due to suboptimal performance, defined as a missing rate greater than 20%, a sequencing depth within the amplicon less than 10% of the average sequencing depth across all samples, or a full coverage read ratio below 50%. As a result, 5,781 high-quality target sequences were retained to construct the TEA5K mSNP liquid-phase array for tea plants. These mSNP markers are evenly distributed across the 15 chromosomes of Camellia sinensis (Fig. 1A and B).

Fig. 1. Marker site analysis of TEA5K mSNP array. (A) Design pipeline of the GenoBaits TEA5K mSNP array; (B) Distribution of 5 K mSNP markers across the 15 chromosomes of tea plant.
Marker density is represented by bar color, with each bar corresponding to a 1-Mb genomic window; (C) Statistical analysis of sequencing depth across 519 samples genotyped using the TEA5K mSNP array; (D) Statistical analysis of the missing rate across 519 samples, based on 5,781 SNPs and 36,357 SNPs.

The mSNP marker selection strategy aimed to maximize variability within each target region (amplicon), enabling the amplification of multiple SNPs per amplicon. However, the actual number of SNPs detected per amplicon depended on the sequencing depth of the target regions. To assess the sequencing depth of each sample and marker, we analyzed a diverse set of tea germplasm, including 281 modern cultivars, 168 landraces, and 70 wild accessions, ensuring broad genetic representation (Supplementary Table 1). A total of 5,781 (5 K) mSNP markers and 36,357 (36 K) SNP markers were identified through sequencing at an average depth of 155.73× (Fig. 1C). Importantly, the average missing rates of mSNP and SNP markers per sample were 0.050 and 0.036, respectively (Fig. 1D), indicating a high calling rate for the developed markers. Additionally, since MAF is a crucial parameter reflecting genetic variation within a population, we set a minimum threshold of 5% to exclude markers with insufficient genotyping information. Among the identified markers, 5,764 (5 K) mSNPs and 32,550 (32 K) SNPs had MAF > 5%, whereas only 17 mSNP markers exhibited MAF below 5% (Supplementary Table 2, Supplementary Fig. 1). To further evaluate marker distribution, we analyzed the number of SNPs per amplicon, which ranged from 1 to 45 SNPs per amplicon, with an average of 6.29 SNPs, predominantly falling within the 2–9 SNPs range (Fig. 2). Additionally, the average amplicon coverage length was 149.45 bp (Supplementary Table 2).

Fig. 2. Frequency distribution of TEA5K mSNP across 15 chromosomes.
The number of SNPs per amplicon is color-coded to indicate variation in marker coverage across the genome.

The theoretical number of haplotypes per amplicon is 2^n, where n represents the number of SNPs in the amplicon. Based on the SNP count per mSNP locus (Table 1), we inferred a large number of theoretical haplotypes. However, empirical data from 519 tea germplasm accessions revealed 417 K actual haplotypes, with 404 K haplotypes exhibiting MAF > 5%. On average, each mSNP contained 72.48 realized haplotypes, of which 70.25 were high-frequency haplotypes (MAF > 5%). Chromosome-level analysis indicated that chromosome 1 harbored the highest number of mSNPs, SNPs, and realized haplotypes, whereas chromosome 14 exhibited the lowest values (Supplementary Table 2). These findings suggest that the mSNP array developed in this study possesses strong discriminatory power, making it a highly effective tool for tea genetic research and molecular breeding.

Table 1. Number of SNPs, PIC and GD values for the 5 K core SNPs and 36 K SNPs in different genomic regions, evaluated in 519 tea plant resources

| Metric | Marker type | UTR5 | UTR3 | Exonic | Intronic | Intergenic | Other region |
|---|---|---|---|---|---|---|---|
| Number of SNPs | 5 K core SNPs | 6 | 13 | 68 | 447 | 4,635 | 612 |
| Number of SNPs | 36 K SNPs | 35 | 32 | 250 | 2,085 | 29,970 | 3,985 |
| PIC value | 5 K core SNPs | 0.375 | 0.377 | 0.388 | 0.393 | 0.405 | 0.402 |
| PIC value | 36 K SNPs | 0.301 | 0.298 | 0.328 | 0.303 | 0.297 | 0.295 |
| GD value | 5 K core SNPs | 0.493 | 0.482 | 0.488 | 0.491 | 0.499 | 0.497 |
| GD value | 36 K SNPs | 0.400 | 0.362 | 0.390 | 0.359 | 0.348 | 0.350 |

Analysis of SNP and indel loci in genomic regions

To develop SNP markers that provide comprehensive coverage across the whole genome, we selected markers based on their high or moderate polymorphism and their even distribution across the genome, rather than specifically targeting functional SNP markers or gene coding regions (CDS).
This approach ensures the broad applicability of these markers for diverse genetic analyses, including genetic diversity assessments, evaluation of specific genetic variations, variety identification, genetic map construction, and QTL mapping. Additionally, some indel sites were detected at specific SNP loci in certain tea cultivars. Based on the genotyping data from 519 tea accessions, a total of 30,292 SNP sites and 6,065 indel sites were identified across 36,357 SNP markers (Fig. 3C), spanning 2,961 functional genes. Gene Ontology (GO) enrichment analysis revealed that a subset of these genes was significantly enriched in pathways related to phloem and xylem histogenesis, auxin polar transport, mitochondrial respiratory chain, and transporter activity (Fig. 3A). Furthermore, Kyoto Encyclopedia of Genes and Genomes (KEGG) enrichment analysis indicated significant involvement of these genes in metabolic pathways such as valine, leucine, and isoleucine degradation; pantothenate and coenzyme A (CoA) biosynthesis; and photosynthesis (Fig. 3B). These findings suggest that the identified SNP markers may play a crucial role in regulating important botanical and agronomic traits.

Fig. 3. Annotation analysis of 36,357 SNP sites. (A) GO enrichment analysis of genes including 36,357 SNP sites; (B) KEGG pathway enrichment analysis of genes including 36,357 SNP sites; (C) Genomic distribution of 36,357 SNP sites developed from different genomic regions; (D) Mutation type classification of 36,357 SNPs, distinguishing between transition and transversion mutations.

To further investigate the genomic distribution of the identified SNPs, we classified them based on their locations within the genome. The vast majority of mSNPs (82.43%) were located in intergenic regions, while 5.73% were in intronic regions.
Additionally, 10.96% of mSNPs were located upstream or downstream of coding genes, and only 0.87% were found in exonic regions, 3’ untranslated regions (UTR3), and 5’ untranslated regions (UTR5) (Fig. 3C). We also counted the different SNP types. The most frequent SNP types were [A/G] (35.95%) and [C/T] (35.83%), followed by [A/T] (8.17%), [A/C] (7.21%) and [G/T] (7.05%), while the [C/G] SNP type was the least common at 5.56% (Fig. 3D). To assess the effectiveness of the developed markers in detecting DNA variation, we analyzed their polymorphism information content (PIC) and gene diversity (GD) based on their genomic location and marker type. In the 5 K core SNP panel, intergenic SNPs exhibited relatively high PIC and GD values, indicating a strong ability to capture genetic variation. Conversely, in the 36 K SNP panel, exonic SNPs displayed the highest average PIC and GD values (Table 1, Supplementary Fig. 1). These results suggest that both intergenic and exonic SNPs provide strong discriminatory power for detecting DNA variation, making them valuable tools for genetic studies and molecular breeding applications.

High accuracy identification of different tea cultivars

China possesses extensive germplasm resources of tea plant, with the tea cultivars currently used in commercial production predominantly derived from individual selection or crossbreeding. A smaller number of cultivars have been developed through radiation mutagenesis. These tea cultivars exhibit varying degrees of genomic similarity, ranging from closely related to genetically distant accessions. To evaluate the identification power of the TEA5K mSNP array in distinguishing tea cultivars, we applied it to genotype 231 modern cultivars. The resulting genotyping data were analyzed using three panels: the TEA5K core SNP panel, the TEA36K SNP panel, and the TEA5K mSNP panel (Supplementary Fig. 1).
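Genetic similarity (GS) is defined in the Methods as the proportion of matching markers between two samples relative to the total number of markers. The sketch below illustrates that definition at two marker levels: individual SNP sites, and whole amplicons treated as multi-SNP haplotypes. It is an illustration rather than the authors' implementation: the rule that an amplicon matches only when every SNP call within it matches is an assumption, and the function names, amplicon labels, and genotype strings are hypothetical toy values.

```python
from typing import Dict, List, Sequence, Tuple


def genetic_similarity(a: Sequence, b: Sequence) -> float:
    """Proportion of markers with identical calls in two samples."""
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("samples must share the same non-empty marker set")
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return matches / len(a)


def amplicon_haplotypes(snp_calls: Sequence[str],
                        amplicon_ids: Sequence[str]) -> List[Tuple[str, ...]]:
    """Collapse per-SNP calls into one tuple of calls per amplicon."""
    grouped: Dict[str, List[str]] = {}
    for call, amplicon in zip(snp_calls, amplicon_ids):
        grouped.setdefault(amplicon, []).append(call)
    # Sort by amplicon label so both samples use the same marker order.
    return [tuple(grouped[amplicon]) for amplicon in sorted(grouped)]


# Toy data: six SNPs spread over three amplicons (illustrative only)
amplicon_ids = ["amp1", "amp1", "amp1", "amp2", "amp2", "amp3"]
sample_x = ["AA", "AG", "CC", "TT", "CT", "GG"]
sample_y = ["AA", "GG", "CC", "TT", "CT", "GG"]

# SNP-level GS: five of six sites match.
snp_gs = genetic_similarity(sample_x, sample_y)

# Amplicon-level GS: one differing SNP makes all of amp1 a mismatch,
# so only two of three amplicons match.
amp_gs = genetic_similarity(amplicon_haplotypes(sample_x, amplicon_ids),
                            amplicon_haplotypes(sample_y, amplicon_ids))
```

Under this assumption a single differing SNP turns a whole amplicon into a mismatch, so the amplicon-level value can never exceed the SNP-level value for the same pair of samples. That is consistent with the lower GS values reported for the mSNP panel.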
The genetic similarity (GS) between cultivars was calculated using all three panels, generating 26,795 GS values across the 231 modern cultivars (Fig. 4A). Since GS calculation in the TEA5K core SNP and TEA36K SNP panels was based on single SNP sites, the results from these two panels were similar, though the values from the TEA5K core SNP panel were slightly lower than those from the TEA36K SNP panel. In the TEA5K core SNP panel, GS values between different modern cultivars ranged from 14.36 to 82.36%, with an average of 51.45% (Table 2; Fig. 4D; Supplementary Table 3). In the TEA36K SNP panel, GS ranged from 46.53 to 82.69%, with an average of 62.12% (Table 2; Fig. 4E; Supplementary Table 3). However, in the TEA5K mSNP panel, where GS was calculated based on multiple SNP sites within a single amplicon, the values were significantly lower, with an average GS of 27.62%, ranging from 7.69 to 60.97% (Table 2; Fig. 4F; Supplementary Table 3).

Table 2. The GS analysis of different tea plant samples, based on 5,781 core SNP sites, 36,357 SNP sites, and 5,781 mSNP sites

| Tea cultivars | Core SNP min. GS | Core SNP max. GS | Core SNP average GS | 36 K SNP min. GS | 36 K SNP max. GS | 36 K SNP average GS | mSNP min. GS | mSNP max. GS | mSNP average GS |
|---|---|---|---|---|---|---|---|---|---|
| Baiye 1 | 97.18% | 98.15% | 97.76% | 96.89% | 97.58% | 97.23% | 89.88% | 91.60% | 90.76% |
| Cuifeng | 97.13% | 98.69% | 97.84% | 96.68% | 98.18% | 97.31% | 89.82% | 94.20% | 91.76% |
| Fuding Dabaicha | 97.28% | 98.65% | 97.88% | 96.21% | 98.01% | 97.95% | 88.52% | 93.58% | 90.75% |
| Longjing 43 | 96.37% | 97.79% | 97.02% | 95.96% | 97.19% | 96.59% | 87.27% | 91.02% | 89.04% |
| Shuchazao | 93.57% | 95.85% | 94.70% | 94.68% | 96.10% | 95.38% | 82.91% | 88.00% | 85.20% |
| Zhongcha 108 | 96.42% | 97.66% | 97.05% | 96.06% | 97.39% | 96.74% | 87.36% | 91.47% | 89.43% |
| Zhonghuang 1 | 97.27% | 98.65% | 97.95% | 97.50% | 98.44% | 97.83% | 91.63% | 94.51% | 92.53% |
| Zhonghuang 3 | 96.61% | 97.86% | 97.31% | 96.83% | 97.69% | 97.23% | 89.44% | 91.77% | 90.69% |
| Bud mutation materials | 97.08% | 98.20% | 97.72% | 96.74% | 97.68% | 97.25% | 90.36% | 92.86% | 91.74% |
| 231 developed cultivars | 14.36% | 82.36% | 51.45% | 46.53% | 82.69% | 62.12% | 7.69% | 60.97% | 27.62% |

Table 3. The information and statistics of genetic map

| Linkage group | Number of markers | Genetic length (cM) | Average distance (cM) |
|---|---|---|---|
| LG1 | 438 | 166.07 | 0.38 |
| LG2 | 266 | 140.74 | 0.53 |
| LG3 | 144 | 147.16 | 1.02 |
| LG4 | 330 | 199.11 | 0.60 |
| LG5 | 141 | 152.53 | 1.08 |
| LG6 | 211 | 144.73 | 0.69 |
| LG7 | 211 | 142.46 | 0.68 |
| LG8 | 231 | 135.2 | 0.59 |
| LG9 | 207 | 165.47 | 0.80 |
| LG10 | 229 | 150.5 | 0.66 |
| LG11 | 161 | 172.71 | 1.07 |
| LG12 | 267 | 119.39 | 0.45 |
| LG13 | 190 | 108.6 | 0.57 |
| LG14 | 103 | 105.52 | 1.02 |
| LG15 | 145 | 175.00 | 1.21 |
| Average | 218.27 | 148.35 | 0.76 |
| Total | 3274 | 2225.19 | / |

Since these tea cultivars are mainly propagated vegetatively, we analyzed the GS distribution between different individual plants within a cultivar using the three panels mentioned above. For this, three biological replicates from four individual plants were tested for each of eight modern cultivars: ‘Baiye 1’, ‘Cuifeng’, ‘Fuding Dabaicha’, ‘Longjing 43’, ‘Shuchazao’, ‘Zhongcha 108’, ‘Zhonghuang 1’ and ‘Zhonghuang 3’ (Fig. 4B; Supplementary Fig. 2). A high level of GS was observed among biological replicates when using the TEA5K core SNP and TEA36K SNP panels.
Although ‘Shuchazao’ exhibited the lowest average GS values in both panels, it still reached 94.70% and 95.38%, respectively. ‘Zhonghuang 1’ and ‘Fuding Dabaicha’ exhibited the highest average GS values in the TEA5K core SNP and TEA36K SNP panels, respectively, each reaching 97.95% (Table 2; Fig. 4D and E). However, when analyzed using the TEA5K mSNP panel, slightly lower GS values were observed for the same tea plant samples. The lowest average GS (85.20%) was found in ‘Shuchazao’, while the highest (92.53%) was observed in ‘Zhonghuang 1’ (Table 2; Fig. 4F). These findings indicate that the mSNP-based analysis (TEA5K mSNP panel) has stronger discrimination ability between different cultivars compared to single-SNP-based analyses (TEA5K core SNP and TEA36K SNP panels). However, when assessing genetic similarity within a single cultivar, the TEA5K core SNP and TEA36K SNP panels provided more consistent results than the mSNP panel. To further verify these conclusions, we analyzed two tea cultivars, ‘Longjing 43’ and ‘Zhongcha 108’, which exhibit significant phenotypic differences. ‘Zhongcha 108’ is well known to have been derived from ‘Longjing 43’ through ⁶⁰Co γ-ray radiation mutagenesis. Radiation mutagenesis is characterized by a high mutation frequency and random genetic modifications, leading to genetic divergence between mutant and original cultivars. Here, we determined that the GS value between ‘Zhongcha 108’ and ‘Longjing 43’ calculated using the TEA5K mSNP panel was significantly lower than the values obtained using the TEA5K core SNP and TEA36K SNP panels, suggesting that the mSNP approach provides a more precise distinction between genetically related but phenotypically distinct cultivars. Additionally, a bud mutation tea plant resource was analyzed using the TEA5K mSNP array and evaluated using the three panels (Fig. 4C).
Bud mutation is a common phenomenon in plants, where point mutations occurring in a single branch or leaf are triggered by environmental factors, leading to phenotypic changes such as alterations in leaf color or shape. The genetic distance between normal plants and their bud mutants is therefore expected to be very small. Our analysis confirmed this hypothesis, as the GS values between bud mutants and normal tea plants were extremely high when calculated using the TEA5K core SNP and TEA36K SNP panels, reaching 97.72% and 97.25%, respectively (Table 2). However, in the TEA5K mSNP panel, the GS value was slightly lower, averaging 91.74% (Table 2). These results suggest that for bud mutation analysis, using single SNP sites provides a more accurate GS calculation, as such mutations generally result in minimal genetic divergence.

QTL mapping of amino acid and map-based cloning of leaf color

Tea leaf color mutants have garnered growing interest in the market due to their attractive appearance and fresh flavor, largely attributed to elevated amino acid content. To assess the utility of the TEA5K mSNP array for mapping important agronomic traits in tea, we conducted both QTL mapping of free amino acid components and map-based cloning of the leaf color trait using an F₁ population derived from a cross between ‘Huangjinya’ and ‘Longjing 43’. In this population, ‘Huangjinya’ (a light-sensitive albino cultivar with high amino acid content) served as the male parent, while ‘Longjing 43’ (a cultivar with typical green leaf color) served as the female parent (Fig. 5A). Among the segregating individuals in the F₁ population, genomic DNA was extracted from 54 albino individuals and 44 green individuals, along with both parental cultivars. Genotyping was performed using the TEA5K mSNP array, which identified 17,248 polymorphic SNPs after excluding missing data. For genetic map construction, only segregation types ab × cd, lm × ll, nn × np, hk × hk, and ef × eg
were used. After filtering out low-integrity markers and removing those with segregation distortion based on a chi-square test (P < 0.001), a final set of 8,882 markers remained. Further refinement resulted in 3,274 markers being used for genetic map construction. These markers were mapped into 15 linkage groups (LGs), corresponding to the chromosome number of tea plants (Fig. 5B). Linkage analysis revealed that the total length of the genetic map was 2,225.19 cM, with individual linkage groups ranging from 105.52 cM (LG14) to 199.11 cM (LG4). The average number of markers per linkage group was 218, ranging from 103 (LG14) to 438 (LG1). The smallest average marker distance was observed in LG1 (0.38 cM), while LG15 had the largest average marker distance (1.21 cM) (Table 3).

Fig. 5. Gene mapping of the albino-leaf trait using the TEA5K mSNP array. (A) Parental and offspring phenotypic variation; (B) Genetic map constructed using the TEA5K mSNP array; (C) Map-based cloning results for the albino-leaf trait; (D) QTL mapping results for various amino acid components; (E) Results of Bulked Segregant Analysis mapping for albino leaf color

Using the constructed genetic map and the biochemical data for eight amino acid components (theanine, serine, glutamate, glutamine, aspartate, arginine, alanine, total amino acids) from each individual in the F₁ population, we conducted QTL analysis with QTL IciMapping software (Supplementary Table 4). QTLs with an LOD score above 3 were considered statistically significant. As a result, a total of 33 QTLs were identified for the eight amino acid components. These QTLs were distributed on 12 linkage groups, excluding LG3, LG11, and LG13, with LOD scores ranging from 3.55 to 8.48. The number of QTLs for each trait ranged from one to twelve. Two QTLs each were associated with theanine, arginine, and total amino acids.
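The summary figures quoted for the genetic map can be rechecked from the per-group values in Table 3. The following is a minimal Python sketch for that check, not part of the authors' pipeline: the variable and function names are hypothetical, and it assumes that the "average distance" column is the genetic length divided by the number of markers, which is the convention that reproduces the tabulated values.

```python
# Per-linkage-group values transcribed from Table 3:
# name -> (number of markers, genetic length in cM).
TABLE3 = {
    "LG1": (438, 166.07), "LG2": (266, 140.74), "LG3": (144, 147.16),
    "LG4": (330, 199.11), "LG5": (141, 152.53), "LG6": (211, 144.73),
    "LG7": (211, 142.46), "LG8": (231, 135.2), "LG9": (207, 165.47),
    "LG10": (229, 150.5), "LG11": (161, 172.71), "LG12": (267, 119.39),
    "LG13": (190, 108.6), "LG14": (103, 105.52), "LG15": (145, 175.00),
}


def average_distance(markers, length_cm):
    """Average marker distance for one linkage group.

    Assumption: length divided by marker count (not by markers - 1),
    chosen because it matches the values printed in Table 3.
    """
    return length_cm / markers


group_avg = {lg: average_distance(n, length) for lg, (n, length) in TABLE3.items()}

total_markers = sum(n for n, _ in TABLE3.values())
total_length = sum(length for _, length in TABLE3.values())

# Two different "average interval" summaries of the same map.
mean_of_group_avgs = sum(group_avg.values()) / len(group_avg)
pooled_avg = total_length / total_markers

densest = min(group_avg, key=group_avg.get)    # smallest average distance
sparsest = max(group_avg, key=group_avg.get)   # largest average distance

print(f"markers={total_markers}, length={total_length:.2f} cM")
print(f"mean of per-group averages={mean_of_group_avgs:.2f} cM")
print(f"pooled average={pooled_avg:.2f} cM")
print(f"densest={densest}, sparsest={sparsest}")
```

Under this assumption, the reported 0.76 cM corresponds to the mean of the 15 per-group averages, whereas dividing the total map length by the total marker count gives about 0.68 cM; which summary to quote depends on whether linkage groups or individual markers are treated as the unit.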
In addition, three and five QTLs control glutamine and glutamate, respectively. The phenotypic variance explained by these QTLs ranges from 15.8 to 33.7% (Fig. 5D; Supplementary Table 5).

The distributions of chlorophyll and carotenoid contents in 54 albino and 44 green F₁ individuals showed a clear bimodal pattern, characteristic of qualitative traits, suggesting that the albino leaf color is likely controlled by a few major genes (Supplementary Fig. 3). To pinpoint the genetic locus responsible for this trait, linkage analysis was conducted using 3,274 SNP markers from the constructed genetic map, with albino individuals coded as ‘0’ and green individuals as ‘1’. The albino trait was ultimately mapped to a region between markers chr8_SNP222 and chr8_SNP223 on chromosome 8, with genetic distances of 1.1 cM and 1.0 cM, respectively, and a physical span of 23.15 Mb. To validate the accuracy of the SNP markers and the map-based cloning results, Bulked Segregant Analysis (BSA) was performed using randomly selected DNA samples from 30 albino and 30 green individuals. As shown in Fig. 5E, BSA also identified a candidate region on chromosome 8 spanning from 76.25 Mb to 163.07 Mb, which is broadly consistent with the interval identified through linkage mapping. Based on genome annotations, two candidate genes within this region were preliminarily identified: Photosystem I reaction center subunit PsaK (CSS0007721) and Coproporphyrinogen-III oxidase 1 (CSS0003887), both of which are implicated in chlorophyll biosynthesis and light-responsive pathways (Supplementary Table 6). These results underscore the TEA5K mSNP array’s efficacy in both high-resolution genetic map construction and precise QTL mapping of key agronomic traits in tea plants.

Population structure and phylogenetic analyses of 519 tea germplasm across China

To investigate the phylogenetic relationships among 519 tea accessions, we used C. sasanqua as an outgroup and analyzed 5,443 core SNPs.
These accessions included 328 C. sinensis var. sinensis, 49 C. sinensis var. assamica, 68 C. sinensis var. pubilimba, 29 C. fangchengensis, and 46 C. crassicolumna, collected from 15 provinces across four major tea-growing regions of China (Fig. 6A; Supplementary Table 1). Phylogenetic analysis revealed that these tea germplasm were primarily grouped into three major clusters. The wild group comprised C. fangchengensis and C. crassicolumna, while the landrace group consisted of C. sinensis var. assamica, C. sinensis var. pubilimba, and a subset of C. sinensis var. sinensis. The modern cultivar group was predominantly composed of C. sinensis var. sinensis (Fig. 6D). These clustering patterns were further validated by principal component analysis (PCA), which consistently identified the same three major groups (Fig. 6B). To further assess the genetic relationships among these accessions, we conducted population structure analysis using ADMIXTURE, testing K values from 2 to 4. At K = 2, clear introgression was observed among the three major groups. At K = 3, the landrace group exhibited additional differentiation, aligning with the phylogenetic classification. Specifically, Subgroup I consisted mainly of C. sinensis var. pubilimba and C. sinensis var. assamica, while Subgroup II included C. sinensis var. sinensis and C. sinensis var. assamica. At K = 4, further separation of wild accessions became evident, with the closely related germplasm C. fangchengensis from the wild forests of Guangxi and C. crassicolumna from Yunnan province forming distinct clusters (Fig. 6C, Supplementary Fig. 4).

Fig. 6. Distribution and population evolution analysis of 519 tea plant resources. (A) Geographical distribution of 519 tea plant accessions, highlighting the collection regions across multiple Chinese provinces. (B) Principal Component Analysis (PCA) of 519 tea germplasm.
(C) Population structure analysis of 519 tea plant accessions, evaluated under K = 2 and K = 3 conditions. (D) Phylogenetic tree of 519 tea resources. The green section represents the majority of modern tea cultivars, the red section corresponds to most landrace resources, and the blue section encompasses almost all wild tea resources. (E) Simulated domestication pathway of tea plants, illustrating the evolutionary transition from wild tea populations to modern cultivars

Discussion

Advantages of TEA5K mSNPs array

The liquid-phase array based on in-solution hybrid capture is a widely utilized genotyping technology in both plant and animal research. This method enables large-scale parallel sequencing, specifically targeting predefined genomic regions, thereby enhancing efficiency and reducing sequencing redundancy [3]. This technology has been successfully applied to various species, including wheat (T. aestivum), rice (O. sativa), maize (Z. mays), soybean (Glycine max), peanut (Arachis hypogaea), tomato (Solanum lycopersicum), pepper (Capsicum spp.), and slash pines (Pinus elliottii), playing a pivotal role in functional genomics research, molecular marker-assisted breeding, and genomic selection [2, 35, 36]. The tea plant is a globally valued non-alcoholic beverage crop, yet its genetic diversity and differentiation, particularly regarding key agronomic and metabolomic traits, remain insufficiently understood. A recent large-scale de novo sequencing effort involving 802 tea accessions has provided important insights into the genetic diversity and domestication history of ancient tea germplasm [37]. While de novo sequencing enables comprehensive detection of genomic variation across diverse collections, its application is often limited by the high cost associated with the large genome size of tea (~3.0 Gb).
In this study, we introduce the TEA5K mSNP array as the first liquid-phase genotyping platform developed specifically for tea plants, offering a cost-effective and scalable alternative for high-resolution genetic analysis.

When compared to existing genotyping methods, such as genotyping-by-sequencing (GBS), Kompetitive Allele-Specific PCR (KASP), and solid-phase arrays like Illumina Infinium and Affymetrix platforms, the TEA5K mSNP array offers several distinct advantages. One of its key strengths is its ability to capture multiple SNP markers within each target region, significantly increasing the number of detected SNPs compared to single-SNP technologies. In this study, an average of 6.29 SNPs per target region was identified in tea plants, a value consistent with findings in wheat (5.50 SNPs), rice (4.50 SNPs), and maize (6.50 SNPs) [2]. This high marker density enhances the TEA5K mSNP array’s capacity for high-resolution genetic analysis. Another major advantage of the TEA5K mSNP array is its ability to generate multiple marker datasets from the same genotyping system. For example, it can be used to construct within-amplicon haplotypes or to select high-PIC SNP sets, providing a flexible approach for breeding applications. Previous studies have demonstrated that by combining SNPs across different amplicons or prioritizing SNPs with high PIC values, researchers have successfully generated 690 K haplotype sets and 40 K high-PIC SNP sets in maize [2]. In our study, we identified 417 K haplotypes and 5,781 high-PIC SNPs across 519 tea accessions, highlighting the array’s ability to capture extensive genetic variation and tailor marker sets for specific breeding and genetic studies. The TEA5K mSNP array also facilitates the integration of genotyping data across multiple time points and sequencing platforms, a challenge commonly encountered with other genotyping techniques.
For instance, in GBS-based approaches, the use of restriction enzyme digestion results in the random sampling of genomic regions, making it difficult to reproduce datasets across sequencing runs [38]. In contrast, targeted sequencing approaches have demonstrated high consistency, as shown in wheat, where an 85.7% average similarity was achieved between results from two different platforms [35]. In our study, genotyping of the same tea cultivars at different time points demonstrated a consistency rate exceeding 90%, with very low error rates. For example, the similarity between two sequencing runs of the cultivar ‘Huangjinya’ was 97.66%, with missing rates as low as 0.016 and 0.011, respectively. This high reproducibility underscores the robustness and reliability of the TEA5K mSNP array in tea plant genotyping.

Furthermore, the liquid-phase array design of the TEA5K mSNP array is highly adaptable. The number of mSNP markers can be increased by incorporating additional markers as needed. This scalability ensures long-term applicability and enables continuous updates to the marker panel as new genomic information becomes available. Moreover, the cost-effectiveness of the TEA5K array makes it a highly economical alternative to conventional solid-phase genotyping platforms, significantly reducing genotyping expenses without compromising data accuracy or reproducibility [39]. Compared to de novo sequencing, the TEA5K array lowers genotyping costs by at least 75%, while providing a six-fold increase in the number of SNP markers included in the mSNP panels relative to solid-chip platforms. In this study, the per-sample genotyping cost using the TEA5K array was as low as $7, underscoring its suitability as a high-throughput and budget-friendly solution for tea plant research [2].
Collectively, these advantages establish the TEA5K mSNP array as a powerful, flexible, and cost-efficient genotyping platform, offering high marker density, excellent reproducibility, and broad utility for genetic research and molecular breeding in tea.

TEA5K mSNPs array as a powerful tool for cultivar identification

With the increasing number of newly developed elite tea cultivars, a common issue in the tea industry is the misidentification of plant materials. Tea plants with the same genotype are often assigned different names, while plants with different genotypes are mistakenly classified under the same cultivar name. This misclassification not only affects the accuracy of cultivar identification but also undermines breeders’ rights and intellectual property protection [40, 41]. Therefore, developing an efficient, accurate, and cost-effective genotyping tool for cultivar authentication is essential to ensure the traceability and authenticity of tea seedlings [42]. The advancement of liquid-phase array technology provides a promising solution by enabling high-throughput genotyping of target genomic regions [43, 44]. For example, in melon (Cucumis melo), a 2 K liquid-phase SNP array was developed to assess genetic diversity, revealing an average of 754 polymorphic SNPs per plant pair [44]. Similarly, in this study, the TEA5K mSNP array was applied to evaluate genetic variation within and between tea cultivars, as well as the genetic consistency within individual cultivars. Based on the genotyping data, GS values were calculated for all cultivar pairs using three marker panels: the 5 K core SNP panel, the 36 K SNP panel, and the 5 K mSNP panel. Each panel produced 26,795 GS values, with minimum GS values of 14.36%, 46.53%, and 7.59%, and maximum values of 82.36%, 82.69%, and 60.97%, respectively.
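The pairwise GS calculation described above can be illustrated with a minimal Python sketch. The definition used here is an assumption, since the GS formula is not restated in this section: GS is taken as the percentage of loci with identical genotype calls among loci called in both samples. The function names, the missing-data code "NA", and the toy genotypes are all hypothetical.

```python
from itertools import combinations


def genetic_similarity(calls_a, calls_b, missing="NA"):
    """Percent of loci with identical calls (assumed GS definition).

    Loci missing in either sample are ignored.
    """
    shared = [(x, y) for x, y in zip(calls_a, calls_b)
              if x != missing and y != missing]
    if not shared:
        raise ValueError("no loci called in both samples")
    identical = sum(1 for x, y in shared if x == y)
    return 100.0 * identical / len(shared)


def pairwise_gs(samples):
    """GS for every unordered pair of samples, keyed by (name_1, name_2)."""
    return {
        (a, b): genetic_similarity(samples[a], samples[b])
        for a, b in combinations(sorted(samples), 2)
    }


# Toy genotype calls at five loci for three hypothetical cultivars.
toy = {
    "cv_A": ["AA", "AG", "GG", "CT", "NA"],
    "cv_B": ["AA", "AG", "GG", "CC", "TT"],
    "cv_C": ["AG", "GG", "GG", "CC", "TT"],
}

gs = pairwise_gs(toy)
for pair, value in gs.items():
    print(pair, f"{value:.2f}%")
print("min:", min(gs.values()), "max:", max(gs.values()))
```

For an mSNP-style panel, the same function could in principle be applied with one haplotype string per amplicon as the "call" instead of one genotype per SNP. A single mismatching SNP then makes the whole amplicon differ, which is one plausible reason for the lower GS values seen with that panel.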
These differences reflect the unique characteristics of each panel, particularly the inclusion of multiple linked SNPs within an amplicon in the TEA5K mSNP panel versus the single-site focus of the TEA5K core SNP and TEA36K SNP panels. Notably, genetic linkage among SNPs was not accounted for in the TEA5K core SNP and TEA36K SNP panels. Despite these methodological differences, all three panels effectively distinguished between tea cultivars. For instance, using the TEA5K core SNP panel, the maximum GS observed between different cultivars was 82.36%, whereas the minimum GS within the same cultivar (across eight representative cultivars) was 93.57%, providing a well-defined gap suitable for threshold determination. Given the high heterozygosity of the tea genome and the distribution of GS values observed across the dataset, we recommend a 90% GS threshold for confirming cultivar authenticity using the TEA5K core SNP panel. To further enhance discriminatory power, particularly among closely related genotypes, incorporating SNPs from highly variable regions such as the internal transcribed spacer (ITS) may be beneficial, as this region has shown strong potential for resolving cultivar identity [45].

Application of TEA5K mSNPs array for gene mapping

In recent years, liquid-phase array technology has been increasingly used in genetic mapping and trait analysis due to its cost efficiency and rapid processing time. For instance, the WheatSNP16K liquid array was applied in wheat to construct a genetic map and identify four QTLs associated with stripe rust resistance [46]. Similarly, a 51 K liquid-phase array was used in slash pine (P. elliottii), identifying 95 SNPs significantly associated with growth and wood quality traits through GWAS [36].
In pepper (Capsicum spp.), a 45 K GBTS liquid-phase gene array was used to analyze helical fruit shape, leading to the identification of three key QTLs and a candidate gene encoding the tubulin alpha chain, which regulates fruit curvature [47]. To evaluate the utility of the TEA5K mSNP array in dissecting key agronomic traits in tea plants, we conducted QTL mapping of amino acid components and map-based cloning of the albino leaf color trait using a hybrid F₁ population derived from ‘Huangjinya’. A total of 33 QTLs related to various amino acid components were identified, while the albino leaf color locus was precisely mapped to the interval between markers chr8_SNP222 and chr8_SNP223 on chromosome 8. Notably, the results of map-based cloning were consistent with those obtained from BSA, further validating the reliability of the TEA5K array for trait mapping. Within the identified candidate region, two genes were preliminarily annotated: CsPsaK (CSS0007721) and CsHEMF (CSS0003887). Prior research has shown that downregulation of PsaK disrupts the photosynthetic system, leading to albino leaf phenotypes in pecan (Carya illinoinensis) [48]. Similarly, the differential expression of HEMF affects chlorophyll biosynthesis and degradation under high-temperature stress, leading to leaf albinism in alfalfa (Medicago sativa) [49]. These findings indicate that the TEA5K mSNP array is a powerful tool for identifying regulatory loci involved in agronomic traits, further expanding its applicability in tea breeding programs.

The improvement bottleneck in modern tea breeding in China

Advancements in high-throughput sequencing technologies have significantly expanded tea genomics research, with multiple tea genomes now available. A recent pan-genome analysis involving 22 tea genomes represents a major milestone in the field [50]. However, despite these advancements, the practical application of such genomic resources in tea breeding remains largely unexplored.
One of the critical challenges is understanding the genetic structure of elite tea cultivars, which plays a key role in breeding efficiency. The formal approval process for tea cultivars in China began in 1985, and to date, over 300 cultivars have been officially registered [51]. In this study, the TEA5K mSNP array was used to analyze the genetic structure of these cultivars at the molecular level, making this the first large-scale genomic study of Chinese tea cultivars. The results revealed that genetic diversity among these cultivars is relatively low, primarily due to the prevalence of hybridization and asexual reproduction. Current breeding practices often rely on “cultivar-to-cultivar” and “landrace-to-cultivar” crosses, with minimal genetic contributions from wild tea populations. This lack of wild introgression suggests that modern tea breeding is facing an “improvement bottleneck,” a limitation similar to those observed in many perennial crops (Fig. 6E) [52]. Furthermore, this study provides insights into the evolutionary status of C. fangchengensis, a rare and endangered tea species. Genomic analysis suggests that C. fangchengensis is a potential ancestral species of modern cultivated tea, yet its evolutionary significance remains largely unknown. Further research, integrating genomic and botanical evidence, is needed to fully explore the evolutionary history and potential breeding value of this species.

Conclusion

In brief, we developed a novel, flexible, and high-resolution liquid-phase array for tea genetic resources and used it for cultivar identification, genetic map construction, gene mapping of important traits in tea plants, and genetic evolution analysis. In particular, we have set a 90% GS threshold that can effectively differentiate between different tea genetic resources by utilizing the liquid-phase array.
We believe that our research offers a reliable technology for genetic research and MAS breeding in tea plants.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary Material 1 (3.3MB, xlsx)
Supplementary Material 2 (75.9MB, xlsx)
Supplementary Material 3 (7.7MB, docx)

Acknowledgements