Abstract Background The Rhododendron sanguineum complex is endemic to alpine mountains of northwest Yunnan and southeast Tibet of China. Varieties in this complex exhibit distinct flower colors even at the bud stage. However, the underlying molecular regulations for the flower color variation have not been well characterized. Here, we investigated this via measuring flower reflectance profiles and comparative transcriptome analyses on three coexisting varieties of the R. sanguineum complex, with yellow flush pink, bright crimson, and deep blackish crimson flowers respectively. We compared the expression levels of differentially-expressed-genes (DEGs) of the anthocyanin / flavonoid biosynthesis pathway using RNA-seq and qRT-PCR data. We performed clustering analysis based on transcriptome-derived Single Nucleotide Polymorphisms (SNPs) data, and finally analyzed the promoter architecture of DEGs. Results Reflectance spectra of the three color morphs varied distinctively in the range between 400 and 700 nm, with distinct differences in saturation, brightness, hue, and saturation/hue ratio, an indirect measurement of anthocyanin content. We identified 15,164 orthogroups that were shared among the three varieties. The SNP clustering analysis indicated that the varieties were not monophyletic. A total of 40 paralogous genes encoding 12 enzymes contributed to the flower color polymorphism. These anthocyanin biosynthesis-related genes were associated with synthesis, modification and transportation properties (RsCHS, RsCHI, RsF3H, RsF3′H, RsFLS, RsANS, RsAT, RsOMT, RsGST), as well as genes involved in catabolism and degradation (RsBGLU, RsPER, RsCAD). Variations in sequence and cis-acting elements of these genes might correlate with the anthocyanin accumulation, thus may contribute to the divergence of flower color in the R. sanguineum complex. Conclusions Our results suggested that the varieties are very closely related and flower color variations in the R. sanguineum complex correlate tightly with the differential expression levels of genes involved in the anabolic and catabolic synthesis network of anthocyanin. Our study provides a scenario involving intricate relationships between genetic mechanisms for floral coloration accompanied by gene flow among the varieties that may represent an early case of pollinator-mediated incipient sympatric speciation. Supplementary Information The online version contains supplementary material available at 10.1186/s12870-021-02977-9. Keywords: Anthocyanin synthesis, Comparative transcriptomics, Flower coloration, Gene expression, Rhododendron sanguineum complex, Sympatric speciation Background The remarkable diversity of flower colors, especially in wild plants has fascinated botanists, ecologists, and horticulturists for centuries [[45]1–[46]3]. The coloring of floral organs, a remarkable character of flowering plants, is a striking feature of the angiosperm radiation [[47]4, [48]5]. Flower color diversity is recognized to be one of key adaptive traits correlated predominantly with pollinators (e.g. insects, birds) and animals for seed dispersal [[49]6, [50]7]. Moreover, the flower color phenotype is an important feature for plants used for their classification by taxonomists. However, flower color appears evolutionarily to be one of the most labile traits, down to populations in the same species [[51]7, [52]8]. The cellular compounds of flowers that contribute to the color profile and visually perceived by humans are generally referred to as “pigments”. A group of secondary metabolites belonging to flavonoids are the main determinants of pigments for coloration in plants, where anthocyanins are responsible for red orange to red, purple to violet pigments found in flowers, leaves, fruits, seeds and other tissues [[53]9, [54]10]. Anthocyanins are the predominant compounds of floral coloration, existing in over 90% of angiosperms [[55]11]. The flavonoid biosynthetic pathway leading to accumulation of anthocyanins is highly conserved and well characterized, and has been extensively studied in many species, most of which are in model plants or agriculturally and horticulturally important plants [[56]12–[57]15]. Few studies have examined the molecular basis underlying the formation and accumulation of anthocyanin in wild species [[58]16, [59]17]. Based on these studies, three major associated factors have been proposed to be involved in anthocyanin accumulation, including transcription regulatory genes (MYB-bHLH-WD40 complex) that occur in the nucleus, structural genes (CHS, FLS, DFR, ANS) acting in the biosynthetic pathway, and transporter genes (GST) transferring anthocyanin from the cytosol into the vacuole [[60]10, [61]18, [62]19]. The expression of these genes could also be affected by natural variation in sequences and cis-regulatory elements as well as epigenetic modifications (such as DNA methylation) in the promoter regions [[63]18, [64]20]. Moreover, the color of flowers can be stabilized and enhanced by co-pigmentation of anthocyanins by flavonols, where it is observed as hyperchromic effect, in which the intensity of an anthocyanin content is fortified [[65]21]. For instance, the DFR gene along with the FLS gene can compete for a substrate leading to the production of different anthocyanins and flavanols through two primary branches [[66]22, [67]23], thus resulting in co-pigmentation. In contrast to the biosynthesis pathways, knowledge of anthocyanin catabolism in plants is limited. Some catabolic genes like BGLU and PER have been shown to be responsible for anthocyanin degradation [[68]24, [69]25]. Nevertheless, the molecular mechanism regulating anthocyanin synthesis has been shown to vary among plant species resulting in structural diversity of anthocyanins, because the biosynthesis pathway is regulated by multiple factors through regulatory networks [[70]26]. Color is a form of electromagnetic radiation in the range of the visible spectrum. The wavelengths reflected by pigments determine the color of a flower [[71]27]. Color can be defined and classified in terms of Brightness (the intensity of a signal, B), Saturation (the purity of a color, S) and Hue (the spectral descriptor of color, H), and those features are commonly used to distinguish colors [[72]27, [73]28]. Brightness refers to the color intensity that is determined by the amount of anthocyanin [[74]29, [75]30], and different color component combinations such as B/H, S/H were found to be significantly correlated with anthocyanin content as well [[76]31]. Liu et al. [[77]32] proposed that the color brightness decreased as the total anthocyanin content increased. It was also demonstrated that a correlation exists between the saturation/hue ratio (S/H) and anthocyanin content [[78]31]. With these parameters, the anthocyanin content can be rapidly and non-destructively determined. In evergreen azaleas (Rhododendron), anthocyanins and flavanols are the main flower pigments, and especially the composition of anthocyanin constituents (i.e. cyanidin, delphinidin, malvidin, pelargonidin, peonidin, and petunidin), and their quantities determine their flower color that ranges from light pink to violet [[79]11, [80]33]. Some studies have reported that R. kiusianum with purple-colored flowers contain derivatives of both anthocyanidins cyanidin and delphinidin, whereas the red-colored flowers of R. kaempferi contain only cyanidin derivatives [[81]34]. Le Maitre et al. [[82]35, [83]36] studying Erica species, belonging to the same family Ericaceae as Rhododendron, used qRT-PCR and UPLC-MS, unraveled the anthocyanin genetic network of floral color shifts between red or pink and white or yellow flowered species and found losses of single pathway gene expression, abrogation of the entire pathway due to loss of the expression of a transcription factor or loss of function mutations in pathway genes resulted in striking floral color shifts. Here, we investigated the genetic basis of flower coloration using a highly color polymorphic Rhododendron sanguineum complex. The complex (R. subgen. Hymenanthes) includes plants with yellow to pink or crimson to blackish crimson flowers that are classified into six varieties mainly based on their flower color differences [[84]37]. Members of this complex are basically located at high elevations (> 3000 m) associated with snow cover [[85]37]. They are endemic to northwest Yunnan and southeast Tibet, one of the global biodiversity hotspots [[86]38]. This region is also recognized as one of the centers for diversification and differentiation of Rhododendron [[87]37, [88]39]. The flower color polymorphisms of this genus have been traditionally viewed as an ecologically adaptive trait that is essential in attracting specific pollinators [[89]40–[90]42], and may also be the response to environmental variation, such as UV radiations at different elevations, temperatures, and soil conditions [[91]32]. Although there are studies published on the anthocyanin components and contents in Rhododendron flowers, most were solely dedicated to the identification of the pigment constituents in the petals of some wild and cultivated azaleas using thin-layer chromatography (TLC) and high-performance liquid chromatography (HPLC) [[92]11, [93]33]. No study so far focused on the molecular mechanisms underlying infraspecific color polymorphisms in Rhododendron. The study of closely related entities such as a species complex has the advantage of a fairly homogeneous genetic background where flower color genes vary and cases of homoplasy are limited. Previous studies mainly focused on color shifts at different developmental stages of a single species [[94]14, [95]18], or covered a number of related species [[96]26, [97]35]. In the present study, we combined transcriptome sequencing (RNA-seq) and genome resequencing with reflectance spectra analyses to elucidate molecular and anthocyanin content differences among three differently colored naturally occurring varieties of the R. sanguineum complex, with yellow flushed pink to deep blackish crimson colored flowers. We aimed at studying the correlation between infraspecific flower color variation and the expression of candidate genes of the anthocyanin / flavonoid biosynthesis pathway. Our findings may allow the proposal of a hypothesis for the genetic mechanism of the expression of flower color variation and a representative case of pollinator-mediated incipient sympatric speciation in the R. sanguineum complex. In addition, it is the first study to compare transcriptome profiles in a natural system of a non-model species of Rhododendron. Results Reflectance spectra and color morph differences The sampled individuals can be grouped into three color categories representing R. sanguineum var. sanguineum (RsS) with bright crimson flowers (Fig. [98]1d), R. var. haemaleum (RsH) where the flowers were deep blackish crimson (Fig. [99]1e), and R. var. didymoides (RsD) with yellow flushed pink tubes and red petals (Fig. [100]1f). The reflectance spectra of the three color morphs varied distinctively across the wavelength range, especially between 400 and 620 nm (Fig. [101]2a). RsS showed one discrete peak in the red spectrum (~ 620 nm), while RsH had a peak further into the far red range (> 700 nm). The reflectance of RsD increased across the blue to yellow spectrum and peaked in the orange-red range (~ 600 nm) (Fig. [102]2a). The flower colors of the three varieties exhibited marked differences in brightness, saturation and hue. RsD had the highest brightness values followed by RsS, and RsH with the lowest values, while RsH had the highest saturation, then RsS with RsD the lowest (Fig. [103]2b, c). The hue values were highest in samples of RsD, then RsS and RsH (Fig. [104]2b). The ratio of saturation and hue (an indirect measurement of anthocyanin content) of RsH was the highest compared to those of the RsD and RsS samples (Fig. [105]2d). The ratio values of the RsD samples varied much wider than those of the other two varieties that maybe linked to variation in the color composition of its bicolored flowers (Fig. [106]1). Fig. 1. [107]Fig. 1 [108]Open in a new tab Morphology and sampling stages of the three different flower morphs found in the Rhododendron sanguineum complex in the wild. a Habitat of the sampling site; sampling stages of the two tissues for RNA-seq, including late flower bud (b) and leaf bud (c) of the varieties (left to right) R. sanguineum var. sanguineum, R. var. haemaleum, and R. var. didymoides; open flowers of R. var. sanguineum (d), R. var. haemaleum (e), and R. var. didymoides (f) Fig. 2. [109]Fig. 2 [110]Open in a new tab Comparisons of the reflectance spectra of three varieties of the Rhododendron sanguineum complex. a Reflectance spectra were measured at the midpoint of the corolla tubes of the three varieties. b Three-dimensional plots of the visible spectra that classified by brightness (x-axis), hue (y-axis), and saturation (z-axis) of the three varieties. c Boxplots of brightness values of the reflectance spectra of the three varieties. d Boxplots of the ratios of saturation over hue from reflectance spectra for the three varieties. Boxplots show the median, and the box edges represent the 25th and 75th percentiles of values for a total of 15 individuals for each variety. Statistical significance was determined by a two-tailed Student’s t test. The significant differences are noted as asterisks (*P < 0.05; **P < 0.01; ***P < 0.001). Red represents R. var. sanguineum (RsS); Purple represents R. var. haemaleum (RsH); Yellow represents R. var. didymoides (RsD) De novo transcriptome assembly and quality assessment We sequenced a total of 18 RNA-seq libraries from two tissues (flower buds and leaf buds) for three individuals for each of the three varieties in the R. sanguineum complex using Illumina paired-end sequencing. After quality control, approximately 655.8 million (M) clean reads (~ 96 gigabase pairs, Gbp) remained with a very uniform number of reads between the libraries ranging between 32,111,674 and 43,353,842 (Table [111]S1). The contig N50 values for the CORSET across the three varieties had similar lengths and ranged from 903 to 1125 base pairs (bp) and the numbers of transcripts ranged from 117,976 to 171,725 (Table [112]S1). The reads of the 18 individual libraries were aligned by mapping the reads back to their variety-specific references,