Abstract

Background

   High-throughput genotyping technology has become an indispensable tool
   for advancing molecular breeding and genetic research in plants,
   facilitating large-scale exploration of genomic variation. Genotyping
   technology based on liquid-phase arrays uses streptavidin-coated
   nanomagnetic beads to capture biotin-modified probes and, with them,
   the hybridized target sequences from the genome for genotyping.
   This study aims to develop a novel liquid-phase array for tea plant, which
   can be used for cultivar identification, genetic map construction,
   Quantitative Trait Locus (QTL) mapping of key agronomic traits in tea
   plants, and genetic evolution analysis.

Results

   We developed a highly efficient multiple-SNP array, the TEA5K mSNP
   array, which comprises 5,781 liquid-phase probes based on the
   Genotyping by Target Sequencing (GBTS) system. Using this array, we
   genotyped 231 developed tea cultivars, revealing that genetic
   similarity within the same cultivar ranged from 92.53% to 97.95%, whereas
   genetic similarity between different cultivars generally remained below
   82.36%. Furthermore, utilizing this array, we constructed a
   high-density genetic map consisting of 3,274 markers, covering a total
   genetic distance of 2,225.19 cM, with an average marker interval of
   0.76 cM. The high-resolution genetic map facilitated the identification
   of multiple QTLs linked to eight amino acid components, as well as two
   molecular markers strongly associated with the albino-leaf trait in the
   ‘Huangjinya’ cultivar, both mapped to chromosome 8. Moreover, we
   applied the array to analyze the population structure and phylogenetic
   relationships of 519 tea germplasm, classifying them into three major
   groups: wild accessions, landraces, and modern cultivars. Notably,
   modern cultivars exhibited lower genetic diversity compared to
   landraces. Additionally, we observed substantial genetic
   differentiation between wild resources and modern cultivars, with
   minimal to no gene flow from wild populations into domesticated
   cultivars. These findings suggest that modern tea breeding faces an
   “improvement bottleneck,” a challenge similar to that encountered in
   other perennial crops.

Conclusion

   The TEA5K mSNP array is presented as a flexible, cost-effective, and
   low-maintenance genotyping tool that significantly enhances both
   genetic research and molecular breeding in tea plants. By providing a
   robust platform for genome-wide analysis and facilitating the
   identification of key QTLs, this tool offers valuable insights for
   improving the genetic diversity and agronomic performance of tea
   cultivars.

Supplementary Information

   The online version contains supplementary material available at
   10.1186/s12951-025-03533-5.

   Keywords: Genotyping by target sequencing, Cultivar identification,
   Genetic evolution, Genetic maps, Gene mapping

Introduction

   The history of plant breeding has been shaped by three major
   innovations: traditional cross-pollination techniques, transgenic
   technology, and more recently, marker-assisted selection (MAS). Unlike
   conventional breeding, which relies on field selection based on
   phenotypic traits, MAS significantly shortens breeding cycles and
   enhances selection efficiency [[47]1]. A variety of DNA markers—such as
   amplified fragment length polymorphism (AFLP), simple sequence repeats
   (SSRs), and single nucleotide polymorphisms (SNPs)—have been developed
   and applied in MAS [[48]2]. However, AFLP and SSR markers pose several
   limitations, including high costs, low throughput, and a high
   false-positive rate, making them less scalable for large breeding
   programs [[49]3]. In contrast, SNP markers are highly abundant,
   genetically stable, and widely used in genetic studies across various
   crops, including rice (Oryza sativa), mulberry (Morus spp.), and tea
   (Camellia sinensis), making them ideal for MAS applications
   [[50]4–[51]6].

   SNP markers are commonly developed and applied using high-throughput
   genotyping platforms, including SNP arrays and sequencing-based
   technologies [[52]7–[53]9]. Commercial SNP arrays, such as the
   Affymetrix GeneChip and Illumina Infinium platforms, are widely
   employed in plants, where SNP detection occurs through hybridization of
   genomic DNA with oligonucleotide probes [[54]10]. For example, the
   Affymetrix 44 K SNP array has been applied in genome-wide association
   studies (GWAS) in rice [[55]11, [56]12], and the Brassica 60 K Illumina
   Infinium array has facilitated fine genetic mapping [[57]13, [58]14].
   Nevertheless, the development and utilization of such SNP chips in tea
   plants are rarely reported. To date, only a 200 K Affymetrix solid SNP
   chip has been developed and applied to analyze the population structure
   and evolutionary relationships between Camellia sinensis var. assamica
   (CSA) and Camellia sinensis var. sinensis (CSS) [[59]15]. Although
   solid SNP chip genotyping offers high reliability and reproducibility,
   it involves substantial development costs and is limited by the fixed
   content of commercial arrays, which significantly restricts flexibility
   in marker selection and utilization [[60]2, [61]3]. On the other hand,
   sequencing-based approaches, such as whole-genome sequencing (WGS) and
   genotyping by sequencing (GBS), have gained popularity in plant
   genomics. WGS provides high-resolution genome-wide variation detection
   [[62]16], but its high cost restricts its practical application in
   large-scale breeding programs. GBS, which selectively sequences partial
   genomic regions through restriction enzyme digestion, is a more
   cost-effective approach; however, it provides lower SNP resolution and
   demands substantial bioinformatics processing [[63]6]. Given these
   limitations, there is a growing demand for more flexible,
   cost-efficient genotyping platforms tailored for commercial breeding
   applications.

   A recently developed method, Genotyping by Target Sequencing (GBTS),
   has emerged as a promising alternative for species such as maize (Zea
   mays) and wheat (Triticum aestivum) [[64]17, [65]18]. GBTS captures
   specific genomic regions using solution-based probes or primers,
   effectively integrating the strengths of both solid-phase SNP arrays and
   GBS. This hybrid method provides enhanced flexibility, cost-efficiency,
   and high sequencing depth, making it particularly suitable for
   commercial breeding and genetic research. Tea is an economically
   significant crop worldwide, with its cultivation dating back thousands
   of years. Despite the availability of rich genetic resources, their
   effective utilization in breeding remains challenging due to the
   species’ self-incompatibility, large genome size, and prolonged
   breeding cycles [[66]19]. Whole-genome resequencing is not ideal
   for MAS in tea breeding due to its high cost. GBTS, by contrast, offers
   a more affordable sequencing strategy while maintaining high SNP
   resolution, making it an attractive alternative for genotyping in tea
   plants. In this study, we developed a novel SNP array for tea using
   GBTS technology, termed the TEA5K mSNP liquid-phase array, comprising
   5,781 core SNP sites along with more than 30,000 flanking SNP sites. We
   further demonstrate the utility of this platform in a range of
   applications, including cultivar identification, evolutionary analysis,
   genetic map construction, and QTL mapping.

Materials and methods

Development of TEA5K mSNPs liquid-phase array

   Liquid-phase array technology is based on the principle of in-solution
   hybrid capture, in which specifically designed probes hybridize with
   complementary target sequences, allowing for high-throughput sequencing
   of predefined genomic regions. In this study, liquid-phase probes were
   designed based on a set of SNPs identified from the resequencing data
   of 811 tea accessions archived in our previously established Tea
   Genomic Variation Database (TeaGVD) [[67]20]. The SNP identification
   pipeline began with quality trimming of raw reads using Sickle
   ([68]https://github.com/najoshi/sickle) with a Phred quality threshold
   of Q20, followed by alignment to the ‘Shuchazao’ v2.0 reference genome
   via Burrows-Wheeler Aligner [[69]21, [70]22]. PCR duplicates were
   removed using Sambamba [[71]23] with parameters “--overflow-list-size
   1000000 --hash-table-size 1000000”. The SNPs and InDels were then
   called using FreeBayes [[72]24], and filtered with PLINK and VCFtools
   [[73]25] under the criteria: minor allele frequency (MAF) > 25%,
   heterozygosity < 15%, and missing data < 10%. This yielded 13,760
   candidate regions. A 100 bp sliding window was applied to quantify SNP
   density, and regions with fewer than 15 SNPs, as well as repetitive
   sequences, were excluded. Regions were selected based on uniform
   chromosomal distribution and GC content (25–75%), resulting in 6,477
   target loci for probe design. Probes were created using GenoBaits Probe
   Designer (Molbreeding Biotech) and synthesized via semiconductor-based
   in-situ biotinylated synthesis [[74]17]. To assess probe performance,
   268 tea accessions were tested. Probes were excluded if they exhibited
   a capture uniformity (defined as the ratio of sequencing depth in the
   target region to the average sequencing depth of the sample) of < 10%,
   an on-target rate (the proportion of reads covering the target region
   among all reads captured by the probe) of < 50%, or a missing data
   rate of > 20%.
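
   The three exclusion criteria above can be sketched as a simple
   filter. This is only an illustration: the ProbeStats container, its
   field names, and the example values are hypothetical and are not taken
   from the GenoBaits pipeline.

```python
from dataclasses import dataclass


@dataclass
class ProbeStats:
    """Per-probe summary statistics (hypothetical container)."""
    name: str
    target_depth: float       # sequencing depth in the probe's target region
    sample_mean_depth: float  # average sequencing depth of the sample
    on_target_reads: int      # captured reads that cover the target region
    captured_reads: int       # all reads captured by the probe
    missing_rate: float       # fraction of accessions without a genotype call


def passes_qc(p: ProbeStats) -> bool:
    """Keep a probe only if none of the three exclusion criteria applies."""
    uniformity = p.target_depth / p.sample_mean_depth
    on_target = p.on_target_reads / p.captured_reads
    return uniformity >= 0.10 and on_target >= 0.50 and p.missing_rate <= 0.20


# Made-up example values, one probe per failure mode.
probes = [
    ProbeStats("probe_A", 120.0, 150.0, 900, 1000, 0.02),  # passes all criteria
    ProbeStats("probe_B", 10.0, 150.0, 900, 1000, 0.02),   # uniformity below 10%
    ProbeStats("probe_C", 120.0, 150.0, 400, 1000, 0.02),  # on-target rate below 50%
    ProbeStats("probe_D", 120.0, 150.0, 900, 1000, 0.35),  # missing rate above 20%
]
kept = [p.name for p in probes if passes_qc(p)]
```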

DNA extraction, library construction, sequencing and genotyping

   A total of 519 tea accessions were sequenced and genotyped using the
   TEA5K mSNP array to assess basic characteristics, enable cultivar
   identification, and perform population genetic analysis. These
   accessions encompassed a broad range of taxa, including Camellia
   sinensis var. sinensis, Camellia sinensis var. assamica, Camellia
   sinensis var. pubilimba, Camellia crassicolumna, Camellia
   tachangensis, and Camellia fangchengensis (Supplementary Table [75]1;
   Additional file 2). Additionally, 98 F₁ offspring derived from a
   controlled cross between ‘Huangjinya’ and ‘Longjing 43’ were used for
   genetic map construction (Additional file 2). C. crassicolumna and C.
   fangchengensis were collected from Yunnan and Guangxi provinces,
   respectively, while all other germplasm were sourced from the National
   Tea Germplasm Repository @ Hangzhou at the Tea Research Institute of
   the Chinese Academy of Agricultural Sciences (TRICAAS), the National
   Large-leaf Tea Germplasm Repository @Menghai at the Tea Research
   Institute of Yunnan Academy of Agricultural Sciences, and the National
   Medium- and Small-leaf Tea Germplasm Repository @Changsha at the Tea
   Research Institute of Hunan Academy of Agricultural Sciences. All plant
   samples were stored at -80 °C before DNA extraction.

   Genomic DNA was extracted using the Plant Genomic DNA Extraction Kit
   (Aidelai Biotech, Beijing, China). DNA quality was assessed via 1%
   agarose gel electrophoresis, and DNA concentration was quantified using
   a Qubit 2.0 Fluorometer (Thermo Fisher Scientific, CA, USA).
   Subsequently, DNA libraries were constructed and quantified using a
   Qubit ssDNA Assay Kit (Thermo Fisher Scientific, CA, USA). The
   libraries were loaded onto the flow cell and sequenced using paired-end
   150 bp (PE150) reads on the MGISEQ-2000 platform (MGI, Shenzhen, China)
   [[76]2]. The sequencing data were mapped to the ‘Shuchazao’ v2.0
   reference genome [[77]21] using the Burrows-Wheeler Aligner (BWA) MEM
   algorithm [[78]22]. Variant calling was performed using the Genome
   Analysis Toolkit (GATK), generating genotype datasets for subsequent
   analyses [[79]26].

Functional annotations of SNP markers

   Using genotyping data from 519 tea accessions and the annotation of the
   ‘Shuchazao’ v2.0 reference genome [[80]21], functional genes containing
   SNP markers were subjected to Gene Ontology (GO) and Kyoto Encyclopedia
   of Genes and Genomes (KEGG) enrichment analyses. These analyses were
   conducted using the agriGO v2.0 platform [[81]27] and OmicShare Tools
   ([82]https://www.omicshare.com/), respectively. GO terms and KEGG
   pathways with P-values < 0.05 were visualized using GraphPad Prism 10
   and the ggplot2 package in R.

Cultivar identification

   To evaluate the effectiveness of the TEA5K mSNP array for cultivar
   authentication, genotypic data from 231 modern tea cultivars were
   analyzed. For threshold determination, eight cultivars were selected,
   with four individuals per cultivar and three branches per individual
   (Supplementary Fig. [83]2). Genetic similarity (GS) was calculated as
   the proportion of matching markers between any two samples relative to
   the total number of markers. Each of the three marker panels yielded
   26,795 GS values. As shown in Fig. [84]4, GS values between different
   cultivars clustered on the left, while those within the same cultivar
   grouped on the right—demonstrating a clear distinction. This pattern
   enables the establishment of a reliable GS threshold for
   differentiating tea cultivars.
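
   The GS definition above (matching markers divided by the total number
   of markers) can be written as a short function. This is a minimal
   sketch: the genotype encoding and sample names are hypothetical, and
   missing calls are not modeled.

```python
from itertools import combinations


def genetic_similarity(a, b):
    """Proportion of markers whose genotype calls match in two samples.

    Assumption: both samples are typed at the same markers, in the same
    order, with no missing calls.
    """
    if len(a) != len(b) or not a:
        raise ValueError("samples must share the same non-empty marker set")
    return sum(x == y for x, y in zip(a, b)) / len(a)


# Toy genotypes at four markers (hypothetical data).
samples = {
    "cv1_rep1": ["AA", "AG", "CC", "TT"],
    "cv1_rep2": ["AA", "AG", "CC", "TT"],
    "cv2": ["AA", "GG", "CT", "TT"],
}

# GS for every unordered pair of samples.
pairwise = {
    (s, t): genetic_similarity(samples[s], samples[t])
    for s, t in combinations(samples, 2)
}
```

   For an mSNP-style comparison, each list element could instead hold the
   combined genotype of a whole amplicon, so that a single differing SNP
   makes the amplicon a mismatch; this reading is an assumption about how
   the mSNP panel is scored.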

Fig. 4.


   Modern cultivar and bud mutant identification analysis. (A) Phenotypic
   variation among modern cultivars; (B) Vegetative propagation-based
   phenotypic variation in eight cultivars; (C) Bud mutant phenotype
   analysis, comparing green-leaf branches and albino mutant branches.
   (D, E, F) Genetic similarity (GS) analysis of cultivars, vegetative
   propagation individuals, and bud mutants, evaluated using the 5 K core
   SNP panel, 36 K SNP panel, and 5 K mSNP panel, respectively. Sample
   pair counts increase from blue to yellow

Construction of linkage map

   The genotype data from 98 F₁ offspring and their parents ‘Huangjinya’
   and ‘Longjing 43’ were used to construct a linkage map and perform QTL
   mapping. High-quality SNPs were selected based on specific criteria to
   ensure data accuracy and reliability. The detection rate of each marker
   was required to exceed 75% across all samples, ensuring that only
   consistently observed markers were retained. SNP markers conforming to
   the segregation types ab × cd, lm × ll, nn × np, hk × hk, and ef × eg
   were included in the analysis, as these segregation patterns are
   suitable for the population structure of tea plants. Markers exhibiting
   significant segregation distortion were removed based on a threshold of
   P < 0.01 to prevent potential bias in the mapping process. The
   remaining high-confidence SNP markers were subjected to linkage
   analysis using JoinMap software to construct the genetic map [[87]28].
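
   The detection-rate and segregation-distortion filters can be
   illustrated with a chi-square goodness-of-fit test. This sketch is not
   the JoinMap procedure itself. The expected ratios (1:1 for lm × ll and
   nn × np, 1:2:1 for hk × hk, 1:1:1:1 for ab × cd and ef × eg) are the
   usual Mendelian expectations and are assumed here, and the genotype
   counts are invented.

```python
import math


def chi_square_p(observed, expected_ratio):
    """Chi-square goodness-of-fit statistic and P-value for genotype counts.

    Closed-form survival functions are used, so only 1 to 3 degrees of
    freedom (2 to 4 genotype classes) are supported.
    """
    n = sum(observed)
    ratio_total = sum(expected_ratio)
    expected = [n * r / ratio_total for r in expected_ratio]
    stat = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    df = len(observed) - 1
    if df == 1:
        p = math.erfc(math.sqrt(stat / 2))
    elif df == 2:
        p = math.exp(-stat / 2)
    elif df == 3:
        p = (math.erfc(math.sqrt(stat / 2))
             + math.sqrt(2 * stat / math.pi) * math.exp(-stat / 2))
    else:
        raise ValueError("only 2 to 4 genotype classes are supported")
    return stat, p


def keep_marker(observed, expected_ratio, n_samples,
                min_detection=0.75, alpha=0.01):
    """Apply the detection-rate (> 75%) and distortion (P < 0.01) filters."""
    detection_rate = sum(observed) / n_samples
    _, p = chi_square_p(observed, expected_ratio)
    return detection_rate > min_detection and p >= alpha


# Invented counts for a population of 98 offspring.
balanced_lm_ll = keep_marker([50, 46], [1, 1], 98)       # close to 1:1
distorted_lm_ll = keep_marker([75, 21], [1, 1], 98)      # strongly distorted
balanced_hk_hk = keep_marker([25, 48, 23], [1, 2, 1], 98)
poorly_detected = keep_marker([35, 35], [1, 1], 98)      # about 71% detected
```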

Amino acid extraction and quantification

   During the spring seasons of 2022 and 2023, “two leaves and a bud” samples
   were collected from each individual in the F₁ hybrid population derived
   from the tea cultivars ‘Huangjinya’ and ‘Longjing 43’. Fresh samples
   were immediately steamed using a hot-air dryer at 120 °C for several
   minutes, then dried at 80 °C for 1 h until fully dehydrated. The dried
   material was finely ground using a tissue lyser (MM 400, Retsch,
   Germany). For amino acid extraction, 0.1 g of powdered tissue was mixed
   with 10 mL of distilled water and incubated in a 90 °C water bath for
   30 min, with gentle shaking every 10 min. The mixture was centrifuged
   at 4,000 rpm for 10 min. The resulting supernatant was filtered through
   a 0.22 μm Millipore membrane and diluted 1:1 with sample diluent.

   Quantitative analysis of amino acids was conducted using an automatic
   amino acid analyzer (Sykam, Germany), following the protocol of Liu et
   al. [[88]29]. The analysis used a Sykam LCA K07/Li column (7 μm, 4.6 mm
   × 150 mm), with the reactor maintained at 130 °C, an injection volume
   of 50 µL, and detection at 570 nm. The flow rate of the ninhydrin
   reagent in the S2100 module was 0.25 mL/min, while the mobile phase in
   the S4300 module flowed at 0.45 mL/min. Each run had a total elution
   time of 130 min.

   Amino acid concentrations—including theanine, serine, glutamate,
   glutamine, aspartate, arginine, alanine, and total amino acids—were
   calculated based on the relative response factors of standard amino
   acids. All results were reported as the mean of three biological
   replicates.

QTL mapping of eight amino acid components

   The contents of eight amino acid components in the F₁ population were
   integrated with genotypic data for QTL analysis using QTL IciMapping
   software. Inclusive composite interval mapping for additive QTL
   (ICIM-ADD) was employed to identify significant associations [[89]30],
   with a logarithm of odds (LOD) threshold of 3.00 set to ensure
   statistical significance. The final genetic map and the distribution of
   QTLs across the linkage groups were visualized using MapChart software
   [[90]31, [91]32].
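
   For orientation, the LOD score is the base-10 logarithm of the
   likelihood ratio between a model with a QTL at a given position and a
   model without one, so a threshold of 3.00 corresponds to odds of
   1000:1. The sketch below only illustrates this definition; it does not
   reproduce the ICIM-ADD procedure, and the log-likelihood values are
   hypothetical.

```python
import math

LOD_THRESHOLD = 3.0  # threshold stated in the text


def lod_score(loglik_qtl: float, loglik_null: float) -> float:
    """LOD = log10(L_qtl / L_null), computed from natural-log likelihoods."""
    return (loglik_qtl - loglik_null) / math.log(10)


# Hypothetical (QTL model, null model) log-likelihoods at two positions.
positions = {"pos_A": (-120.0, -128.5), "pos_B": (-125.0, -128.5)}
significant = {
    name: lod_score(l_qtl, l_null) >= LOD_THRESHOLD
    for name, (l_qtl, l_null) in positions.items()
}
```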

Measurement of chlorophyll and carotenoid contents

   Chlorophyll a, chlorophyll b, and carotenoid contents were measured in
   F₁ individuals of the ‘Huangjinya’ population using three biological
   replicates. Pigments were extracted from 0.1 g of young leaf tissue
   using 10 mL of an extraction solvent composed of acetone, ethanol, and
   water (4.5:4.5:1) and incubated for 24 h in darkness. Absorbance was
   measured at 663 nm, 645 nm, and 470 nm using ultraviolet
   spectrophotometry [[92]33]. Pigment concentrations were calculated
   using empirical formulas described by Lichtenthaler et al. Statistical
   comparisons between sample groups were performed using Student’s
   t-test.

Map-based cloning of albino leaf color trait

   Due to the high heterozygosity of tea plants, their F₁ progeny often
   exhibit significant phenotypic segregation. In this study, the F₁
   population derived from a cross between the green-leaf cultivar
   ‘Longjing 43’ and the albino cultivar ‘Huangjinya’ included 54 albino
   individuals and 44 green individuals. This segregating population
   served as the basis for mapping the albino leaf color trait. To
   facilitate linkage analysis, individuals with green leaves (and
   ‘Longjing 43’) were assigned a value of ‘1’, while those with albino
   leaves (and ‘Huangjinya’) were labeled as ‘0’. Linkage analysis was
   conducted using JoinMap software [[93]28], integrating phenotypic data
   with molecular markers used in genetic map construction. The Kosambi
   mapping function was applied to convert recombination frequencies into
   genetic distances, allowing the determination of marker positions and
   identification of those closely linked to the albino leaf color trait.
   The physical interval of the candidate region was then established
   based on the genomic coordinates of the flanking markers.

Bulking and whole genome resequencing

   The genomic DNA of 60 F₁ offspring derived from a controlled cross
   between ‘Huangjinya’ and ‘Longjing 43’ was extracted using the Plant
   Genomic DNA Extraction Kit (Aidelai Biotech, Beijing, China). The bulks
   were generated by mixing equal amounts of DNA from 30 F₁ offspring
   individuals with green leaves and 30 F₁ offspring individuals with albino
   leaves. The two bulks were subjected to whole genome resequencing using
   the Illumina HiSeq platform.

Population genetic analysis

   SNP datasets for 519 tea germplasm accessions, obtained with the TEA5K
   mSNP array, were used for population genetic analysis in R.
   To assess the population structure, we performed ADMIXTURE
   (v1.3.0) analysis, setting the number of ancestral populations (K)
   between 2 and 8, with each iteration consisting of 10,000 cycles
   [[94]34]. To further explore genetic relationships, principal component
   analysis (PCA) was conducted in R. Additionally, a
   neighbor-joining phylogenetic tree was constructed using PLINK and
   FigTree, with 1,000 bootstrap replicates to ensure robust genome-wide
   phylogenetic relationships [[95]25].

Results

Basic characteristics of the GBTS liquid-phase array

   Ultimately, 696 probes were eliminated due to suboptimal
   performance—defined as a missing rate greater than 20%, a sequencing
   depth within the amplicon less than 10% of the average sequencing depth
   across all samples, or a full coverage read ratio below 50%. As a
   result, 5,781 high-quality target sequences were retained to construct
   the TEA5K mSNP liquid-phase array for tea plants. These mSNP markers
   are evenly distributed across the 15 chromosomes of Camellia sinensis
   (Fig. [96]1A and B).

Fig. 1.


   Marker site analysis of TEA5K mSNP array. (A) Design pipeline of the
   GenoBaits TEA5K mSNP array; (B) Distribution of 5 K mSNP markers across
   the 15 chromosomes of tea plant. Marker density is represented by bar
   color, with each bar corresponding to a 1-Mb genomic window; (C)
   Statistical analysis of sequencing depth across 519 samples genotyped
   using the TEA5K mSNP array; (D) Statistical analysis of the missing
   rate across 519 samples, based on 5,781 SNPs and 36,357 SNPs

   The mSNP marker selection strategy aimed to maximize variability within
   each target region (amplicon), enabling the amplification of multiple
   SNPs per amplicon. However, the actual number of SNPs detected per
   amplicon depended on the sequencing depth of the target regions. To
   assess the sequencing depth of each sample and marker, we analyzed a
   diverse set of tea germplasm, including 281 modern cultivars, 168
   landraces, and 70 wild accessions, ensuring broad genetic
   representation (Supplementary Table [99]1). A total of 5,781 (5 K) mSNP
   markers and 36,357 (36 K) SNP markers were identified through
   sequencing at an average depth of 155.73X (Fig. [100]1C). Importantly,
   the average missing rates of mSNP and SNP markers per sample were 0.050
   and 0.036, respectively (Fig. [101]1D), indicating a high calling rate
   for the developed markers. Additionally, since MAF is a crucial
   parameter reflecting genetic variation within a population, we set a
   minimum threshold of 5% to exclude markers with insufficient genotyping
   information. Among the identified markers, 5,764 (5 K) mSNPs and 32,550
   (32 K) SNPs had MAF > 5%, whereas only 17 mSNP markers exhibited MAF
   below 5% (Supplementary Table [102]2, Supplementary Fig. [103]1). To
   further evaluate marker distribution, we analyzed the number of SNPs
   per amplicon, which ranged from 1 to 45 SNPs per amplicon, with an
   average of 6.29 SNPs, predominantly falling within the 2–9 SNPs range
   (Fig. [104]2). Additionally, the average amplicon coverage length was
   149.45 bp (Supplementary Table [105]2).
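
   The per-marker missing rate and MAF used above can be computed
   directly from genotype calls. The sketch below assumes biallelic
   markers encoded as two-character strings, with None for a missing
   call; this encoding and the example data are hypothetical.

```python
from collections import Counter

MAF_MIN = 0.05  # minimum MAF threshold stated in the text


def marker_stats(genotypes):
    """Return (missing_rate, maf) for one biallelic marker."""
    called = [g for g in genotypes if g is not None]
    missing_rate = 1 - len(called) / len(genotypes)
    alleles = Counter(allele for g in called for allele in g)
    if len(alleles) < 2:
        # Monomorphic (or entirely missing) marker: no minor allele.
        return missing_rate, 0.0
    maf = min(alleles.values()) / sum(alleles.values())
    return missing_rate, maf


# Toy markers (hypothetical data).
common = ["AA", "AG", "GG", "AA", None]  # A: 5 copies, G: 3 copies
rare = ["CC"] * 19 + ["CT"]              # T: 1 of 40 allele copies

common_missing, common_maf = marker_stats(common)
rare_missing, rare_maf = marker_stats(rare)
informative = {"common": common_maf > MAF_MIN, "rare": rare_maf > MAF_MIN}
```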

Fig. 2.


   Frequency distribution of TEA5K mSNP across 15 chromosomes. The number
   of SNPs per amplicon is color-coded to indicate variation in marker
   coverage across the genome

   The theoretical number of haplotypes per amplicon is determined by 2^n,
   where n represents the number of SNPs in the amplicon. Based on the SNP
   count per mSNP locus (Table [108]1), we inferred a large number of
   theoretical haplotypes. However, empirical data from 519 tea germplasm
   revealed 417 K actual haplotypes, with 404 K haplotypes exhibiting
   MAF > 5%. On average, each mSNP contained 72.48 realized haplotypes, of
   which 70.25 were high-frequency haplotypes (MAF > 5%). Chromosome-level
   analysis indicated that chromosome 1 harbored the highest number of
   mSNPs, SNPs, and realized haplotypes, whereas chromosome 14 exhibited
   the lowest values (Supplementary Table [109]2). These findings suggest
   that the mSNP array developed in this study possesses strong
   discriminatory power, making it a highly effective tool for tea genetic
   research and molecular breeding.
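
   The contrast between theoretical and realized haplotypes can be
   illustrated for a single amplicon. The sketch assumes phased
   haplotypes written as strings of alleles, one character per SNP; the
   data are invented, and the way haplotypes were actually called for the
   array is not specified here.

```python
from collections import Counter


def theoretical_haplotypes(n_snps: int) -> int:
    """Upper bound of 2^n haplotypes for n biallelic SNPs in an amplicon."""
    return 2 ** n_snps


def realized_haplotypes(haplotypes, min_freq=0.05):
    """Return (all observed haplotypes, those with frequency > min_freq)."""
    counts = Counter(haplotypes)
    total = len(haplotypes)
    frequent = {h for h, c in counts.items() if c / total > min_freq}
    return set(counts), frequent


# Invented haplotype calls for an amplicon with three SNPs (40 chromosomes).
observed = ["ACG"] * 20 + ["ATG"] * 18 + ["GCG"] + ["ACT"]
all_haps, frequent_haps = realized_haplotypes(observed)
upper_bound = theoretical_haplotypes(3)
```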

Table 1.

   Number of SNPs, PIC values, and GD values for the 5 K core SNPs and
   36 K SNPs in different genomic regions, based on 519 tea accessions

   Metric          Marker type    UTR5   UTR3   Exonic  Intronic  Intergenic  Other region
   Number of SNPs  5 K core SNPs  6      13     68      447       4,635       612
                   36 K SNPs      35     32     250     2,085     29,970      3,985
   PIC value       5 K core SNPs  0.375  0.377  0.388   0.393     0.405       0.402
                   36 K SNPs      0.301  0.298  0.328   0.303     0.297       0.295
   GD value        5 K core SNPs  0.493  0.482  0.488   0.491     0.499       0.497
                   36 K SNPs      0.400  0.362  0.390   0.359     0.348       0.350

Analysis of SNP and indel loci in genomic regions

   To develop SNP markers that provide comprehensive coverage across the
   whole genome, we selected markers based on their high or moderate
   polymorphism and their even distribution across the genome, rather than
   specifically targeting functional SNP markers or gene coding regions
   (CDS). This approach ensures the broad applicability of these markers
   for diverse genetic analyses, including genetic diversity assessments,
   evaluation of specific genetic variations, variety identification,
   genetic map construction, and QTL mapping. Additionally, some indel
   sites were detected at specific SNP loci in certain tea cultivars.
   Based on the genotyping data from 519 tea accessions, a total of 30,292
   SNP sites and 6,065 indel sites were identified across 36,357 SNP
   markers (Fig. [111]3C), spanning 2,961 functional genes. Gene Ontology
   (GO) enrichment analysis revealed that a subset of these genes was
   significantly enriched in pathways related to phloem and xylem
   histogenesis, auxin polar transport, mitochondrial respiratory chain,
   and transporter activity (Fig. [112]3A). Furthermore, Kyoto
   Encyclopedia of Genes and Genomes (KEGG) enrichment analysis indicated
   significant involvement of these genes in metabolic pathways such as
   valine, leucine, and isoleucine degradation; pantothenate and coenzyme
   A (CoA) biosynthesis; and photosynthesis (Fig. [113]3B). These findings
   suggest that the identified SNP markers may play a crucial role in
   regulating important botanical and agronomic traits.

Fig. 3.


   Annotation analysis of 36,357 SNP sites. (A) GO enrichment analysis of
   genes including 36,357 SNP sites; (B) KEGG pathway enrichment analysis
   of genes including 36,357 SNP sites; (C) Genomic distribution of 36,357
   SNP sites developed from different genomic regions; (D) Mutation type
   classification of 36,357 SNPs, distinguishing between transition and
   transversion mutations

   To further investigate the genomic distribution of the identified SNPs,
   we classified them based on their locations within the genome. The vast
   majority of SNPs (82.43%) were located in intergenic regions, while
   5.73% were in intronic regions. Additionally, 10.96% of SNPs were
   located upstream or downstream of coding genes, and only 0.87% were
   found in exonic regions, 3’ untranslated regions (UTR3), and 5’
   untranslated regions (UTR5) (Fig. [116]3C). We also counted the
   different SNP types. The most frequent SNP types were
   [A/G] (35.95%) and [C/T] (35.83%), followed by [A/T] (8.17%), [A/C]
   (7.21%) and [G/T] (7.05%), while the [C/G] type was the least common at
   5.56% (Fig. [117]3D).
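
   The transition/transversion split behind these percentages can be
   made explicit: a change within purines (A, G) or within pyrimidines
   (C, T) is a transition, and any other change is a transversion. The
   sketch below applies this rule to the percentages reported above; the
   function name is illustrative.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}
BASES = PURINES | PYRIMIDINES


def mutation_class(allele_1: str, allele_2: str) -> str:
    """Classify a biallelic SNP as a transition or a transversion."""
    pair = {allele_1.upper(), allele_2.upper()}
    if len(pair) != 2 or not pair <= BASES:
        raise ValueError("expected two different bases from A, C, G, T")
    if pair <= PURINES or pair <= PYRIMIDINES:
        return "transition"
    return "transversion"


# Percentages reported in the text for each SNP type.
reported = {"A/G": 35.95, "C/T": 35.83, "A/T": 8.17,
            "A/C": 7.21, "G/T": 7.05, "C/G": 5.56}

totals = {"transition": 0.0, "transversion": 0.0}
for snp_type, pct in reported.items():
    a1, a2 = snp_type.split("/")
    totals[mutation_class(a1, a2)] += pct

ts_tv_ratio = totals["transition"] / totals["transversion"]
```

   With the reported values, transitions sum to 71.78% and transversions
   to 27.99%, giving a transition/transversion ratio of about 2.56.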

   To assess the effectiveness of the developed markers in detecting DNA
   variation, we analyzed their polymorphism information content (PIC) and
   gene diversity (GD) based on their genomic location and marker type. In
   the 5 K core SNP panel, intergenic SNPs exhibited relatively high PIC
   and GD values, indicating a strong ability to capture genetic
   variation. Conversely, in the 36 K SNP panel, exonic SNPs displayed the
   highest average PIC and GD values (Table [118]1, Supplementary
   Fig. [119]1). These results suggest that both intergenic and exonic
   SNPs provide strong discriminatory power for detecting DNA variation,
   making them valuable tools for genetic studies and molecular breeding
   applications.
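
   The formulas used for PIC and GD are not stated in this section. A
   minimal sketch using the commonly applied definitions is given below,
   assuming GD is Nei's gene diversity (one minus the sum of squared
   allele frequencies) and PIC follows the definition of Botstein et al.;
   the allele frequencies are illustrative.

```python
from itertools import combinations


def gene_diversity(freqs):
    """GD = 1 - sum(p_i^2) over the allele frequencies of one marker."""
    return 1.0 - sum(p * p for p in freqs)


def pic(freqs):
    """PIC = 1 - sum(p_i^2) - sum over i<j of 2 * p_i^2 * p_j^2."""
    correction = sum(2.0 * (p * q) ** 2 for p, q in combinations(freqs, 2))
    return gene_diversity(freqs) - correction


# Illustrative allele frequencies for two biallelic markers.
balanced = [0.5, 0.5]
skewed = [0.9, 0.1]
summary = {
    "balanced": (gene_diversity(balanced), pic(balanced)),
    "skewed": (gene_diversity(skewed), pic(skewed)),
}
```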

High accuracy identification of different tea cultivars

   China possesses extensive germplasm resources of tea plant, with the
   tea cultivars currently used in commercial production predominantly
   derived from individual selection or crossbreeding. A smaller number of
   cultivars have been developed through radiation mutagenesis. These tea
   cultivars exhibit varying degrees of genomic similarity, ranging from
   closely related to genetically distant accessions. To evaluate the
   identification power of the TEA5K mSNP array in distinguishing tea
   cultivars, we applied it to genotype 231 modern cultivars. The
   resulting genotyping data were analyzed using three panels: the TEA5K
   core SNP panel, the TEA36K SNP panel, and the TEA5K mSNP panel
   (Supplementary Fig. [120]1). The genetic similarity (GS) between
   cultivars was calculated using all three panels, generating 26,795 GS
   values across the 231 modern cultivars (Fig. [121]4A). Since GS
   calculation in the TEA5K core SNP and TEA36K SNP panels was based on
   single SNP sites, the results from these two panels were similar,
   though the values from the TEA5K core SNP panel were slightly lower
   than those from the TEA36K SNP panel. In the TEA5K core SNP panel, GS
   values between different modern cultivars ranged from 14.36 to 82.36%,
   with an average of 51.45% (Table [122]2; Fig. [123]4D; Supplementary
   Table 3). In the TEA36K SNP panel, GS ranged from 46.53 to 82.69%, with
   an average of 62.12% (Table [124]2; Fig. [125]4E; Supplementary Table
   [126]3). However, in the TEA5K mSNP panel, where GS was calculated
   based on multiple SNP sites within a single amplicon, the values were
   significantly lower, with an average GS of 27.62%, ranging from 7.69 to
   60.97% (Table [127]2; Fig. [128]4F; Supplementary Table [129]3).

Table 2.

   GS analysis of different tea plant samples

   Tea cultivar              5,781 core SNP sites                36,357 SNP sites                    5,781 mSNP sites
                             Min. GS     Max. GS     Average GS  Min. GS     Max. GS     Average GS  Min. GS     Max. GS     Average GS
   Baiye 1                   97.18%      98.15%      97.76%      96.89%      97.58%      97.23%      89.88%      91.60%      90.76%
   Cuifeng                   97.13%      98.69%      97.84%      96.68%      98.18%      97.31%      89.82%      94.20%      91.76%
   Fuding Dabaicha           97.28%      98.65%      97.88%      96.21%      98.01%      97.95%      88.52%      93.58%      90.75%
   Longjing 43               96.37%      97.79%      97.02%      95.96%      97.19%      96.59%      87.27%      91.02%      89.04%
   Shuchazao                 93.57%      95.85%      94.70%      94.68%      96.10%      95.38%      82.91%      88.00%      85.20%
   Zhongcha 108              96.42%      97.66%      97.05%      96.06%      97.39%      96.74%      87.36%      91.47%      89.43%
   Zhonghuang 1              97.27%      98.65%      97.95%      97.50%      98.44%      97.83%      91.63%      94.51%      92.53%
   Zhonghuang 3              96.61%      97.86%      97.31%      96.83%      97.69%      97.23%      89.44%      91.77%      90.69%
   Bud mutation materials    97.08%      98.20%      97.72%      96.74%      97.68%      97.25%      90.36%      92.86%      91.74%
   231 developed cultivars   14.36%      82.36%      51.45%      46.53%      82.69%      62.12%      7.69%       60.97%      27.62%

Table 3.

   The information and statistics of the genetic map

   Linkage group  Number of markers  Genetic length (cM)  Average distance (cM)
   LG1            438                166.07               0.38
   LG2            266                140.74               0.53
   LG3            144                147.16               1.02
   LG4            330                199.11               0.60
   LG5            141                152.53               1.08
   LG6            211                144.73               0.69
   LG7            211                142.46               0.68
   LG8            231                135.2                0.59
   LG9            207                165.47               0.80
   LG10           229                150.5                0.66
   LG11           161                172.71               1.07
   LG12           267                119.39               0.45
   LG13           190                108.6                0.57
   LG14           103                105.52               1.02
   LG15           145                175.00               1.21
   Average        218.27             148.35               0.76
   Total          3274               2225.19              /

   Since the current breeding mode of these tea cultivars is mainly
   vegetative propagation, we analyzed the GS distribution between
   different individual plants within a cultivar using the three panels
   mentioned above. For this, three biological replicates from four
   individual plants were tested for each of eight modern cultivars:
   ‘Baiye 1’, ‘Cuifeng’, ‘Fuding Dabaicha’, ‘Longjing 43’, ‘Shuchazao’,
   ‘Zhongcha 108’, ‘Zhonghuang 1’ and ‘Zhonghuang 3’ (Fig. [132]4B;
   Supplementary Fig. [133]2). A high level of GS was observed among
   biological replicates when using the TEA5K core SNP and TEA36K SNP
   panels. Although ‘Shuchazao’ exhibited the lowest average GS values in
   both panels, it still reached 94.70% and 95.38%, respectively. The
   highest average GS values, both 97.95%, were observed for ‘Zhonghuang
   1’ with the TEA5K core SNP panel and for ‘Fuding Dabaicha’ with the
   TEA36K SNP panel (Table [134]2; Fig. [135]4D and E). However, when
   analyzed using the TEA5K mSNP panel, slightly lower GS values were
   observed for the same tea plant samples. The lowest average GS (85.20%)
   was found in ‘Shuchazao’, while the highest (92.53%) was observed in
   ‘Zhonghuang 1’ (Table [136]2; Fig. [137]4F). These findings indicate
   that the mSNP-based analysis (TEA5K mSNP panel) has stronger
   discrimination ability between different cultivars compared to
   single-SNP-based analyses (TEA5K core SNP and TEA36K SNP panels).
   However, when assessing genetic similarity within a single cultivar,
   the TEA5K core SNP and TEA36K SNP panels provided more consistent
   results than the mSNP panel.
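
   The contrast between the panels can be illustrated with a small Python
   sketch. It assumes that GS is the percentage of loci with identical
   genotype calls among loci called in both samples, and that an mSNP
   locus counts as identical only when every SNP within the amplicon
   matches. This section does not state the GS formula, so both
   definitions, the function names, and the toy genotypes are
   illustrative assumptions.

```python
from collections import defaultdict


def gs_single_snp(a, b):
    """Percent of SNP sites with identical calls (sites missing in either sample are skipped)."""
    shared = [s for s in a if s in b and a[s] is not None and b[s] is not None]
    if not shared:
        raise ValueError("no SNP sites called in both samples")
    same = sum(a[s] == b[s] for s in shared)
    return 100.0 * same / len(shared)


def gs_msnp(a, b, amplicon_of):
    """Percent of amplicons in which all shared SNP calls are identical."""
    matches = defaultdict(list)
    for s in a:
        if s in b and a[s] is not None and b[s] is not None:
            matches[amplicon_of[s]].append(a[s] == b[s])
    if not matches:
        raise ValueError("no amplicons called in both samples")
    same = sum(all(flags) for flags in matches.values())
    return 100.0 * same / len(matches)


# Toy data (hypothetical): 8 SNPs spread over 3 amplicons.
AMPLICON_OF = {
    "s1": "amp1", "s2": "amp1", "s3": "amp1",
    "s4": "amp2", "s5": "amp2",
    "s6": "amp3", "s7": "amp3", "s8": "amp3",
}
SAMPLE_X = {"s1": "AA", "s2": "CT", "s3": "GG", "s4": "TT",
            "s5": "AG", "s6": "CC", "s7": "AA", "s8": "GT"}
# SAMPLE_Y differs from SAMPLE_X at two SNPs located in different amplicons.
SAMPLE_Y = dict(SAMPLE_X, s2="CC", s7="AT")

if __name__ == "__main__":
    print(gs_single_snp(SAMPLE_X, SAMPLE_Y))          # 6 of 8 sites match: 75.0
    print(gs_msnp(SAMPLE_X, SAMPLE_Y, AMPLICON_OF))   # 1 of 3 amplicons match
```

   In this toy case two mismatching SNPs lower the single-SNP GS to 75%
   but the amplicon-level GS to about 33%, because one mismatch is enough
   to mark a whole amplicon as different. The same mechanism would make
   the mSNP panel more sensitive both to true differences between
   cultivars and to small differences between plants of one cultivar.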

   To further verify these conclusions, we analyzed two tea cultivars,
   ‘Longjing 43’ and ‘Zhongcha 108’, which exhibit significant phenotypic
   differences. ‘Zhongcha 108’ is well known to have been derived from
   ‘Longjing 43’ through ^60Co γ-ray radiation mutagenesis. Radiation
   mutagenesis is characterized by a high mutation frequency and random
   genetic modifications, leading to genetic divergence between mutant and
   original cultivars. Here, we determined that the GS value between
   ‘Zhongcha 108’ and ‘Longjing 43’ calculated using the TEA5K mSNP panel
   was significantly lower than the values obtained using the TEA5K core
   SNP and TEA36K SNP panels, suggesting that the mSNP approach provides a
   more precise distinction between genetically related but phenotypically
   distinct cultivars.

   Additionally, a bud mutation tea plant resource was analyzed using the
   TEA5K mSNP array and evaluated using the three panels (Fig. [138]4C).
   Bud mutation is a common phenomenon in plants, where point mutations
   occurring in a single branch or leaf are triggered by environmental
   factors, leading to phenotypic changes such as alterations in leaf
   color or shape. This suggests that the genetic distance between normal
   plants and their bud mutants is typically very small. Our analysis
   confirmed this hypothesis, as the GS values between bud mutants and
   normal tea plants were extremely high when calculated using the TEA5K
   core SNP and TEA36K SNP panels, reaching 97.72% and 97.25%,
   respectively (Table [139]2). However, in the TEA5K mSNP panel, the GS
   value was slightly lower, averaging 91.74% (Table [140]2). These
   results suggest that for bud mutation analysis, using single SNP sites
   provides a more accurate GS calculation, as such mutations generally
   result in minimal genetic divergence.

QTL mapping of amino acid and map-based cloning of leaf color

   Tea leaf color mutants have garnered growing interest in the market due
   to their attractive appearance and fresh flavor, largely attributed to
   elevated amino acid content. To assess the utility of the TEA5K mSNP
   array for mapping important agronomic traits in tea, we conducted both
   QTL mapping of free amino acid components and map-based cloning of the
   leaf color trait using an F₁ population derived from a cross between
   ‘Huangjinya’ and ‘Longjing 43’. In this population, ‘Huangjinya’—a
   light-sensitive albino cultivar with high amino acid content—served as
   the male parent, while ‘Longjing 43’—a cultivar with typical green leaf
   color—served as the female parent (Fig. [141]5A). Among the segregating
   individuals in the F₁ population, genomic DNA was extracted from 54
   albino individuals and 44 green individuals, along with both parental
   cultivars. Genotyping was performed using the TEA5K mSNP array, which
   identified 17,248 polymorphic SNPs after excluding missing data. For
   genetic map construction, only segregation types ab × cd, lm × ll, nn ×
   np, hk × hk, and ef × eg were used. After filtering out low-integrity
   markers and removing those with segregation distortion based on a
   chi-square test (P < 0.001), a final set of 8,882 markers remained.
   Further refinement resulted in 3,274 markers being used for genetic map
   construction. These markers were mapped into 15 linkage groups (LGs),
   corresponding to the chromosome number of tea plants (Fig. [142]5B).
   Linkage analysis revealed that the total length of the genetic map was
   2,225.19 cM, with individual linkage groups ranging from 105.52 cM
   (LG14) to 199.11 cM (LG4). The average number of markers per linkage
   group was 218, ranging from 103 (LG14) to 438 (LG1). The smallest
   average marker distance was observed in LG1 (0.38 cM), while LG15 had
   the largest average marker distance (1.21 cM) (Table [143]3).
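
   The segregation-distortion filter described above can be sketched in
   Python as follows. The expected ratios are the standard ones for these
   cross-pollinated segregation types (1:1 for lm × ll and nn × np, 1:2:1
   for hk × hk, 1:1:1:1 for ab × cd and ef × eg). The function names, the
   dictionary keys, and the example counts are hypothetical, and the
   software actually used for this step is not stated in this section.

```python
import math

# Expected genotype-class ratios per segregation type (standard CP codes).
EXPECTED_RATIOS = {
    "lmxll": (1, 1),
    "nnxnp": (1, 1),
    "hkxhk": (1, 2, 1),
    "efxeg": (1, 1, 1, 1),
    "abxcd": (1, 1, 1, 1),
}


def chi_square_sf(stat, df):
    """Upper-tail probability of a chi-square statistic, closed forms for df 1 to 3."""
    half = stat / 2.0
    if df == 1:
        return math.erfc(math.sqrt(half))
    if df == 2:
        return math.exp(-half)
    if df == 3:
        return math.erfc(math.sqrt(half)) + math.sqrt(2.0 * stat / math.pi) * math.exp(-half)
    raise ValueError("only 1 to 3 degrees of freedom are supported")


def segregation_test(observed, seg_type):
    """Return (chi-square statistic, P value) for observed genotype-class counts."""
    ratio = EXPECTED_RATIOS[seg_type]
    if len(observed) != len(ratio):
        raise ValueError("number of genotype classes does not match the segregation type")
    n = sum(observed)
    expected = [n * r / sum(ratio) for r in ratio]
    stat = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    return stat, chi_square_sf(stat, len(ratio) - 1)


def is_distorted(observed, seg_type, alpha=0.001):
    """Flag a marker for removal when P < alpha (0.001 in the text)."""
    _, p_value = segregation_test(observed, seg_type)
    return p_value < alpha


if __name__ == "__main__":
    # Hypothetical counts for 98 F1 individuals.
    print(is_distorted([50, 48], "lmxll"))      # close to 1:1, kept
    print(is_distorted([80, 18], "lmxll"))      # strongly skewed, removed
    print(is_distorted([25, 49, 24], "hkxhk"))  # close to 1:2:1, kept
```

   With a threshold as strict as P < 0.001, only strongly skewed markers
   are discarded, which retains more markers than the more common 0.05
   cut-off would.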

Fig. 5.


   Gene mapping of the albino-leaf trait using the TEA5K mSNP array. (A)
   Parental and offspring phenotypic variation; (B) Genetic map
   constructed using the TEA5K mSNP array; (C) Map-based cloning results
   for the albino-leaf trait; (D) QTL mapping results for various amino
   acid components; (E) Results of Bulked Segregant Analysis mapping for
   albino leaf color

   Using the constructed genetic map and the biochemical data for eight
   amino acid traits (theanine, serine, glutamate, glutamine, aspartate,
   arginine, alanine, and total amino acids) measured in each individual
   of the F₁ population, QTL analysis was conducted with QTL IciMapping
   software (Supplementary Table 4). QTLs with a LOD score above 3 were
   considered statistically significant. In total, 33 QTLs were detected
   for the eight traits. These QTLs were distributed across 12 linkage
   groups (all except LG3, LG11, and LG13), with LOD scores ranging from
   3.55 to 8.48. The number of QTLs per trait ranged from one to twelve.
   Two QTLs each were associated with theanine, arginine, and total amino
   acids, while three and five QTLs were associated with glutamine and
   glutamate, respectively. The phenotypic variance explained by these
   QTLs ranged from 15.8% to 33.7% (Fig. [146]5D; Supplementary Table 5).

   The distributions of chlorophyll and carotenoid contents in 54 albino
   and 44 green F₁ individuals showed a clear bimodal
   pattern—characteristic of qualitative traits—suggesting that the albino
   leaf color is likely controlled by a few major genes (Supplementary
   Fig. [147]3). To pinpoint the genetic locus responsible for this trait,
   linkage analysis was conducted using 3,274 SNP markers from the
   constructed genetic map, with albino individuals coded as ‘0’ and green
   individuals as ‘1’. The albino trait was ultimately mapped to a region
   between markers chr8_SNP222 and chr8_SNP223 on chromosome 8, with
   genetic distances of 1.1 cM and 1.0 cM, respectively, and a physical
   span of 23.15 Mb. To validate the accuracy of the SNP markers and the
   map-based cloning results, Bulked Segregant Analysis (BSA) was
   performed using randomly selected DNA samples from 30 albino and 30
   green individuals. As shown in Fig. [148]5E, BSA also identified a
   candidate region on chromosome 8 spanning from 76.25 Mb to 163.07 Mb,
   which is broadly consistent with the interval identified through
   linkage mapping. Based on genome annotations, two candidate genes
   within this region were preliminarily identified: Photosystem I
   reaction center subunit PsaK (CSS0007721) and Coproporphyrinogen-III
   oxidase 1 (CSS0003887), both of which are implicated in chlorophyll
   biosynthesis and light-responsive pathways (Supplementary Table 6).
   These results underscore the TEA5K mSNP array’s efficacy in both
   high-resolution genetic map construction and precise QTL mapping of key
   agronomic traits in tea plants.

Population structure and phylogenetic analyses of 519 tea germplasm across
China

   To investigate the phylogenetic relationships among 519 tea accessions,
   we used C. sasanqua as an outgroup and analyzed 5,443 core SNPs. These
   accessions included 328 C. sinensis var. sinensis, 49 C. sinensis var.
   assamica, 68 C. sinensis var. pubilimba, 29 C. fangchengensis, and 46
   C. crassicolumna, collected from 15 provinces across four major
   tea-growing regions of China (Fig. [149]6A; Supplementary Table
   [150]1). Phylogenetic analysis revealed that these tea germplasm were
   primarily grouped into three major clusters. The wild group comprised
   C. fangchengensis and C. crassicolumna, while the landrace group
   consisted of C. sinensis var. assamica, C. sinensis var. pubilimba, and
   a subset of C. sinensis var. sinensis. The modern cultivar group was
   predominantly composed of C. sinensis var. sinensis (Fig. [151]6D).
   These clustering patterns were further validated by principal component
   analysis (PCA), which consistently identified the same three major
   groups (Fig. [152]6B). To further assess the genetic relationships
   among these accessions, we conducted population structure analysis
   using ADMIXTURE, testing K values from 2 to 4. At K = 2, clear
   introgression was observed among the three major groups. At K = 3, the
   landrace group exhibited additional differentiation, aligning with the
   phylogenetic classification. Specifically, Subgroup I consisted mainly
   of C. sinensis var. pubilimba and C. sinensis var. assamica, while
   Subgroup II included C. sinensis var. sinensis and C. sinensis var.
   assamica. At K = 4, further separation of wild accessions became
   evident, with closely related germplasm C. fangchengensis from the wild
   forests of Guangxi and C. crassicolumna from Yunnan Province forming
   distinct clusters (Fig. [153]6C, Supplementary Fig. [154]4).

Fig. 6.


   Distribution and population evolution analysis of 519 tea plant
   resources. (A) Geographical distribution of 519 tea plant accessions,
   highlighting the collection regions across multiple Chinese provinces.
   (B) Principal Component Analysis (PCA) of 519 tea germplasm. (C)
   Population structure analysis of 519 tea plant accessions, evaluated
   under K = 2 and K = 3 conditions. (D) Phylogenetic tree of 519 tea
   resources. The green section represents the majority of modern tea
   cultivars, the red section corresponds to most landrace resources, and
   the blue section encompasses almost all wild tea resources. (E)
   Simulated domestication pathway of tea plants, illustrating the
   evolutionary transition from wild tea populations to modern cultivars

Discussion

Advantages of TEA5K mSNPs array

   The liquid-phase array based on in-solution hybrid capture is a widely
   utilized genotyping technology in both plant and animal research. This
   method enables large-scale parallel sequencing, specifically targeting
   predefined genomic regions, thereby enhancing efficiency and reducing
   sequencing redundancy [[157]3]. This technology has been successfully
   applied to various species, including wheat (T. aestivum), rice (O.
   sativa), maize (Z. mays), soybean (Glycine max), peanut (Arachis
   hypogaea), tomato (Solanum lycopersicum), pepper (Capsicum spp.), and
   slash pine (Pinus elliottii), playing a pivotal role in functional
   genomics research, molecular marker-assisted breeding, and genomic
   selection [[158]2, [159]35, [160]36]. The tea plant is a globally
   valued non-alcoholic beverage crop, yet its genetic diversity and
   differentiation—particularly regarding key agronomic and metabolomic
   traits—remain insufficiently understood. A recent large-scale de novo
   sequencing effort involving 802 tea accessions has provided important
   insights into the genetic diversity and domestication history of
   ancient tea germplasm [[161]37]. While de novo sequencing enables
   comprehensive detection of genomic variation across diverse
   collections, its application is often limited by the high cost
   associated with the large genome size of tea (~3.0 Gb). In this study,
   we introduce the TEA5K mSNP array as the first liquid-phase genotyping
   platform developed specifically for tea plants, offering a
   cost-effective and scalable alternative for high-resolution genetic
   analysis. When compared to existing genotyping methods—such as
   genotyping-by-sequencing (GBS), Kompetitive Allele-Specific PCR (KASP),
   and solid-phase arrays like Illumina Infinium and Affymetrix
   platforms—the TEA5K mSNP array offers several distinct advantages. One
   of its key strengths is its ability to capture multiple SNP markers
   within each target region, significantly increasing the number of
   detected SNPs compared to single-SNP technologies. In this study, an
   average of 6.29 SNPs per target region was identified in tea plants, a
   value consistent with findings in wheat (5.50 SNPs), rice (4.50 SNPs),
   and maize (6.50 SNPs) [[162]2]. This high marker density enhances the
   TEA5K mSNP array’s capacity for high-resolution genetic analysis.

   Another major advantage of the TEA5K mSNP array is its ability to
   generate multiple marker datasets from the same genotyping system. For
   example, it can be used to construct within-amplicon haplotypes or to
   select high-PIC SNP sets, providing a flexible approach for breeding
   applications. Previous studies have demonstrated that by combining SNPs
   across different amplicons or prioritizing SNPs with high PIC values,
   researchers have successfully generated 690 K haplotype sets and 40 K
   high-PIC SNP sets in maize [[163]2]. In our study, we identified 417 K
   haplotypes and 5,781 high-PIC SNPs across 519 tea accessions,
   highlighting the array’s ability to capture extensive genetic variation
   and tailor marker sets for specific breeding and genetic studies. The
   TEA5K mSNP array also facilitates the integration of genotyping data
   across multiple time points and sequencing platforms, a challenge
   commonly encountered with other genotyping techniques. For instance, in
   GBS-based approaches, the use of restriction enzyme digestion results
   in the random sampling of genomic regions, making it difficult to
   reproduce datasets across sequencing runs [[164]38]. In contrast,
   targeted sequencing approaches have demonstrated high consistency, as
   shown in wheat, where an 85.7% average similarity was achieved between
   results from two different platforms [[165]35]. In our study,
   genotyping of the same tea cultivars at different time points
   demonstrated a consistency rate exceeding 90%, with very low error
   rates. For example, the similarity between two sequencing runs of the
   cultivar ‘Huangjinya’ was 97.66%, with missing rates as low as 0.016
   and 0.011, respectively. This high reproducibility underscores the
   robustness and reliability of the TEA5K mSNP array in tea plant
   genotyping. Furthermore, the liquid-phase array design of the TEA5K
   mSNP array is highly adaptable. The number of mSNP markers can be
   increased by incorporating additional markers as needed. This
   scalability ensures long-term applicability and enables continuous
   updates to the marker panel as new genomic information becomes
   available. Moreover, the cost-effectiveness of the TEA5K array makes it
   a highly economical alternative to conventional solid-phase genotyping
   platforms, significantly reducing genotyping expenses without
   compromising data accuracy or reproducibility [[166]39]. Compared to de
   novo sequencing, the TEA5K array lowers genotyping costs by at least
   75%, while providing a six-fold increase in the number of SNP markers
   included in the mSNP panels relative to solid-chip platforms. In this
   study, the per-sample genotyping cost using the TEA5K array was as low
   as $7, underscoring its suitability as a high-throughput and
   budget-friendly solution for tea plant research [[167]2]. Collectively,
   these advantages establish the TEA5K mSNP array as a powerful,
   flexible, and cost-efficient genotyping platform—offering high marker
   density, excellent reproducibility, and broad utility for genetic
   research and molecular breeding in tea.

TEA5K mSNPs array as a powerful tool for cultivar identification

   With the increasing number of newly developed elite tea cultivars, a
   common issue in the tea industry is the misidentification of plant
   materials. Tea plants with the same genotype are often assigned
   different names, while plants with different genotypes are mistakenly
   classified under the same cultivar name. This misclassification not
   only affects the accuracy of cultivar identification but also
   undermines breeders’ rights and intellectual property protection
   [[168]40, [169]41]. Therefore, developing an efficient, accurate, and
   cost-effective genotyping tool for cultivar authentication is essential
   to ensure the traceability and authenticity of tea seedlings [[170]42].
   The advancement of liquid-phase array technology provides a promising
   solution by enabling high-throughput genotyping of target genomic
   regions [[171]43, [172]44]. For example, in melon (Cucumis melo), a
   2K liquid-phase SNP array was developed to assess genetic diversity,
   revealing an average of 754 polymorphic SNPs per plant pair [[173]44].
   Similarly, in this study, the TEA5K mSNP array was applied to evaluate
   genetic variation within and between tea cultivars, as well as the
   genetic consistency within individual cultivars.

   Based on the genotyping data, GS values were calculated for all
   cultivar pairs using three marker panels: the TEA5K core SNP panel,
   the TEA36K SNP panel, and the TEA5K mSNP panel. Each panel produced
   26,795 GS values, with minimum GS values of 14.36%, 46.53%, and 7.69%,
   and maximum values of 82.36%, 82.69%, and 60.97%, respectively. These
   differences reflect the unique characteristics of each
   panel—particularly the inclusion of multiple linked SNPs within an
   amplicon in the TEA5K mSNP panel versus the single-site focus of the
   TEA5K core SNP and TEA36K SNP panels. Notably, genetic linkage among
   SNPs was not accounted for in the TEA5K core SNP and TEA36K SNP panels.
   Despite these methodological differences, all three panels effectively
   distinguished between tea cultivars. For instance, using the TEA5K core
   SNP panel, the maximum GS observed between different cultivars was
   82.36%, whereas the minimum GS within the same cultivar (across eight
   representative cultivars) was 93.57%, providing a well-defined gap
   suitable for threshold determination. Given the high heterozygosity of
   the tea genome and the distribution of GS values observed across the
   dataset, we recommend a 90% GS threshold for confirming cultivar
   authenticity using the TEA5K core SNP panel. To further enhance
   discriminatory power, particularly among closely related genotypes,
   incorporating SNPs from highly variable regions such as the internal
   transcribed spacer (ITS) may be beneficial, as this region has shown
   strong potential for resolving cultivar identity [[174]45].
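
   The threshold reasoning above can be written as a short Python check.
   The GS values are those reported in Table 2. The function names and
   the choice to treat a GS exactly equal to the threshold as a match are
   assumptions made for illustration.

```python
# Values reported for the TEA5K core SNP panel (Table 2), in percent.
MAX_GS_BETWEEN_CULTIVARS = 82.36  # highest GS between different cultivars
MIN_GS_WITHIN_CULTIVAR = 93.57    # lowest GS within one cultivar
GS_THRESHOLD = 90.0               # threshold recommended in the text


def threshold_separates(threshold, max_between, min_within):
    """True when the threshold lies in the gap between the two GS distributions."""
    return max_between < threshold <= min_within


def same_cultivar(gs_percent, threshold=GS_THRESHOLD):
    """Assumption: a GS at or above the threshold is treated as the same cultivar."""
    return gs_percent >= threshold


if __name__ == "__main__":
    print(threshold_separates(GS_THRESHOLD, MAX_GS_BETWEEN_CULTIVARS,
                              MIN_GS_WITHIN_CULTIVAR))
    print(same_cultivar(94.70))  # average within-cultivar GS of 'Shuchazao'
    print(same_cultivar(82.36))  # highest between-cultivar GS
```

   The check is panel-specific. With the TEA5K mSNP values in Table 2
   (between-cultivar maximum 60.97%, within-cultivar minimum 82.91%), a
   90% cut-off would fall above the lowest within-cultivar GS, so a
   different threshold would be needed for that panel.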

Application of TEA5K mSNPs array for gene mapping

   In recent years, liquid-phase array technology has been increasingly
   used in genetic mapping and trait analysis due to its cost efficiency
   and rapid processing time. For instance, the WheatSNP16K liquid array
   was applied in wheat to construct a genetic map and identify four QTLs
   associated with stripe rust resistance [[175]46]. Similarly, a 51 K
   liquid-phase array was used in slash pine (P. elliottii), identifying
   95 SNPs significantly associated with growth and wood quality traits
   through GWAS [[176]36]. In pepper (Capsicum spp.), a 45 K GBTS
   liquid-phase gene array was used to analyze helical fruit shape,
   leading to the identification of three key QTLs and a candidate gene
   encoding the tubulin alpha chain, which regulates fruit curvature
   [[177]47]. To evaluate the utility of the TEA5K mSNP array in
   dissecting key agronomic traits in tea plants, we conducted QTL mapping
   of amino acid components and map-based cloning of the albino leaf color
   trait using a hybrid F₁ population derived from ‘Huangjinya’. A total
   of 33 QTLs related to various amino acid components were identified,
   while the albino leaf color locus was precisely mapped to the interval
   between markers chr8_SNP222 and chr8_SNP223 on chromosome 8. Notably,
   the results of map-based cloning were consistent with those obtained
   from BSA, further validating the reliability of the TEA5K array for
   trait mapping. Within the identified candidate region, two genes were
   preliminarily annotated: CsPsaK (CSS0007721) and CsHEMF (CSS0003887).
   Prior research has shown that downregulation of PsaK disrupts the
   photosynthetic system, leading to albino leaf phenotypes in pecan
   (Carya illinoinensis) [[178]48]. Similarly, the differential expression
   of HEMF affects chlorophyll biosynthesis and degradation under
   high-temperature stress, leading to leaf albinism in alfalfa (Medicago
   sativa) [[179]49]. These findings indicate that the TEA5K mSNP array is
   a powerful tool for identifying regulatory loci involved in agronomic
   traits, further expanding its applicability in tea breeding programs.

The improvement bottleneck in modern tea breeding in China

   Advancements in high-throughput sequencing technologies have
   significantly expanded tea genomics research, with multiple tea genomes
   now available. A recent pan-genome analysis involving 22 tea genomes
   represents a major milestone in the field [[180]50]. However, despite
   these advancements, the practical application of such genomic resources
   in tea breeding remains largely unexplored. One of the critical
   challenges is understanding the genetic structure of elite tea
   cultivars, which plays a key role in breeding efficiency. The formal
   approval process for tea cultivars in China began in 1985, and to date,
   over 300 cultivars have been officially registered [[181]51]. In this
   study, the TEA5K mSNP array was used to analyze the genetic structure
   of these cultivars at the molecular level, making it the first
   large-scale genomic study of Chinese tea cultivars. The results
   revealed that genetic diversity among these cultivars is relatively
   low, primarily due to the prevalence of hybridization and asexual
   reproduction. Current breeding practices often rely on
   “cultivar-to-cultivar” and “landrace-to-cultivar” crosses, with minimal
   genetic contributions from wild tea populations. This lack of wild
   introgression suggests that modern tea breeding is facing an
   “improvement bottleneck,” a limitation similar to those observed in
   many perennial crops (Fig. [182]6E) [[183]52].

   Furthermore, this study provides insights into the evolutionary status
   of C. fangchengensis, a rare and endangered tea species. Genomic
   analysis suggests that C. fangchengensis is a potential ancestral
   species of modern cultivated tea, yet its evolutionary significance
   remains largely unknown. Further research, integrating genomic and
   botanical evidence, is needed to fully explore the evolutionary history
   and potential breeding value of this species.

Conclusion

   In brief, we developed the first liquid-phase array for tea plants, a
   flexible and high-resolution tool for analyzing tea genetic resources,
   and used it for cultivar identification, genetic map construction,
   gene mapping of important traits, and genetic evolution analysis. In
   particular, we propose a 90% GS threshold that effectively
   distinguishes between tea cultivars when using the liquid-phase array.
   We believe that this work offers a reliable technology for genetic
   research and marker-assisted selection (MAS) breeding in tea plants.

Electronic supplementary material

   Below is the link to the electronic supplementary material.
   [184]Supplementary Material 1 (3.3MB, xlsx)
   [185]Supplementary Material 2 (75.9MB, xlsx)
   [186]Supplementary Material 3 (7.7MB, docx)

Acknowledgements