Abstract Houttuynia cordata, also known as Yuxingcao in Chinese, is a perennial herb in the Saururaceae family. It is highly regarded for its medicinal properties, particularly in treating respiratory infections and inflammatory conditions, as well as boosting the human immune system. However, a lack of genomic information has hindered research on the functional genomics and potential improvements of H. cordata. In this study, we present a near-complete assembly of H. cordata genome and investigate the biosynthetic pathway of flavonoids, specifically quercetin, using genomics, transcriptomics, and metabolomics analyses. The genome of H. cordata diverged from that of Saururus chinensis around 33.4 million years ago; it consists of 2.24 Gb with 76 chromosomes (4n = 76) and has undergone three whole-genome duplication (WGD) events. These WGDs played a crucial role in shaping the H. cordata genome and influencing the gene families associated with its medicinal properties. Through metabolomics and transcriptomics analyses, we identified key genes involved in the β-oxidation process for biosynthesis of houttuynin, one of the volatile oils responsible for the plant’s fishy smell. In addition, using the reference genome, we identified genes involved in flavonoid biosynthesis, particularly quercetin metabolism, in H. cordata. This discovery has important implications for understanding the regulatory mechanisms that underlie production of active pharmaceutical ingredients in traditional Chinese medicine. Overall, the high-quality genome assembly of H. cordata serves as a valuable resource for future functional genomics research and provides a solid foundation for genetic improvement of H. cordata for the benefit of human health. Key words: Houttuynia cordata, flavonoid biosynthesis, genome assembly, houttuynin, quercetin, whole-genome duplication __________________________________________________________________ This study reports a chromosome-level genome assembly for Houttuynia cordata and reveals its evolution and whole-genome duplication history. A combination of genome, transcriptome, and metabolome analyses provide insights into the regulatory mechanism of flavonoid biosynthesis and elucidate possible enzymatic processes involved in houttuynin biosynthesis in H. cordata. Introduction Houttuynia cordata, commonly known as Yuxingcao or fishy-smelling herb in China, is a perennial herbaceous plant from the Saururaceae family ([43]Bahadur Gurung et al., 2021; [44]Yin et al., 2023). It is the only species in the Houttuynia genus of the Saururaceae family ([45]Han, 1995) and is found in various regions, including China, Japan, Korea, northeastern India, and Southeast Asia ([46]Xu et al., 2021; [47]Luo et al., 2022; [48]Wei et al., 2024). It typically grows in moist to soggy soils on shady hillsides, waysides, and ridges at altitudes ranging from 300 to 2600 m ([49]Luo et al., 2022). H. cordata has a unique reproductive method, relying primarily on underground rhizome formation and parthenogenesis instead of traditional sexual reproduction ([50]Laldinsangi, 2022). It holds great value as both a culinary ingredient and a medicinal herb, with numerous studies showing its notable antiviral activity ([51]Yuan et al., 2022; [52]Jiu et al., 2023). The medicinal properties of H. cordata are largely attributed to its volatile oils and flavonoid compounds, which have been extensively studied for their antibacterial, anti-inflammatory, antiviral, and antioxidant effects ([53]Laldinsangi, 2022; [54]Rafiq et al., 2022; [55]Pradhan et al., 2023; [56]Wei et al., 2024). The primary components of the volatile oil in H. cordata include houttuynin, β-pinene, 2-undecanone (methyl nonyl ketone), ethyl caprylate, and α-pinene ([57]Yang et al., 2019; [58]Lin et al., 2022). Houttuynin is particularly noteworthy for its unique fishy smell and potent antibacterial properties. Its derivative, sodium houttuyfonate (SH), is water stable and exhibits a broad range of antibacterial, anti-inflammatory, antioxidant, and antitumor activities against various pathogens ([59]Liu et al., 2021b; [60]Shen et al., 2021; [61]Cheng et al., 2023; [62]He et al., 2023). In addition, SH has been found to modulate the immune system and intestinal flora ([63]Zhang et al., 2020). During the process of steam distillation, SH undergoes a transformation into 2-undecanone, which further enhances its therapeutic potential by activating the Nrf2/HO-1/NQO-1 signaling pathway ([64]Lou et al., 2019). Despite the significant medical benefits of SH, its biosynthetic pathway remains unknown, leading to the need for artificial synthesis. Flavonoids are vital plant secondary metabolites synthesized via the phenylpropanoid pathway by enzymes such as chalcone synthase (CHS) and flavonoid 3-hydroxylase (F3H) ([65]Nabavi et al., 2020; [66]Dong and Lin, 2021). These compounds, which include flavonols, flavones, and anthocyanins, exhibit antioxidant, antibacterial, and antiviral properties ([67]Dias et al., 2021; [68]Ekalu and Habila, 2020; [69]Liu et al., 2021a). Research has demonstrated that flavonoids from H. cordata can alleviate lung inflammation and mitigate H1N1-induced lung injury in mice by inhibiting influenza virus and Toll-like receptor signaling ([70]Hung et al., 2015; [71]Lee et al., 2015; [72]Zhou et al., 2022). Quercetin, a flavonol found in H. cordata, is efficiently absorbed in the human digestive system, particularly in the small intestine and stomach ([73]Michala and Pritsa, 2022). This compound forms glycosides such as quercitrin, isoquercitrin, baimaside, hyperoside, rutin, and isohyperoside, which have demonstrated remarkable medicinal properties. Recent studies have highlighted quercetin’s protective effects against UVB-induced skin damage and lung and liver injuries, as well as its potential to inhibit tumor growth ([74]Mapoung et al., 2021; [75]Wang et al., 2022a; [76]Pradhan et al., 2023). Advances in sequencing technology have made it possible and affordable to sequence the genomes of medicinal plants with large genome sizes. This breakthrough has greatly enhanced research on medicinal plants at the molecular and genetic levels ([77]Cheng et al., 2021b). In line with this progress, the “1K Medicinal Plant Genome Project” has been proposed, aiming to complete the collection and sequencing of 1000 important medicinal plant genomes within 3–5 years ([78]Su et al., 2022). The 1K Medicinal Plant Genome Database (1K-MPGD) is now available at [79]http://www.herbgenome.com/. Within the Saururaceae family are six species spread across four relictual genera: Saururus, Gymnotheca, Anemopsis, and Houttuynia ([80]Han, 1995). Many of these species are herbaceous plants. However, only the diploid genome of Saururus chinensis (2n = 22) has been reported to date ([81]Xue et al., 2023). H. cordata is a perennial polyploid herb, existing in diploid or tetraploid form, and features a diverse chromosome count, ranging from 18 to 108 ([82]Luo et al., 2022). Despite its broad genetic diversity, the precise chromosome number and ploidy level of H. cordata remain undefined at the genomic level. This absence of detailed genomic data limits the effective utilization of this medicinal plant. Therefore, it is crucial to decode the complete genome of H. cordata to gain a comprehensive understanding of its genome structure. This information will also facilitate the dissection of biosynthetic pathways for bioactive compounds, aiding genomics-assisted breeding and herbal synthetic biology. In the current study, we sequenced and assembled a high-quality genome of H. cordata, achieving an exceptional chromosome-level assembly in its field. We then examined the phylogenetic relationships and evolution of H. cordata, explored whole-genome duplications (WGDs) in this species, and identified crucial components of the flavonoid biosynthetic pathway. Leveraging the assembled genome, together with transcriptomic and metabolomic analyses, we identified key genes responsible for houttuynin biosynthesis and quercetin metabolism in H. cordata. This study not only broadens our understanding of the genetic and metabolic complexity of H. cordata but also establishes a foundational genome resource that can catalyze future functional genomics studies and enhance the pharmacological exploitation of this important herbaceous plant. Results Sequencing, assembly, and annotation of the genome H. cordata consists of underground white rhizomes, ovate-shaped leaves above ground, four white bracts, and a pale-yellow spike inflorescence without petals ([83]Figure 1A). According to the genome survey, the identification of multiple coverage peaks in the k-mer depth distribution indicated the potential polyploidy of H. cordata ([84]Supplemental Figure 1A). Moreover, the coverage of the AAAB (0.32) type was greater than that of the AABB (0.07) type in Smudgeplot, providing strong evidence for autotetraploidy ([85]Supplemental Figure 1B). The haplotype size (n) was estimated as approximately 600 Mb by k-mer analysis ([86]Supplemental Table 1), which was consistent with the total genome size (4n) estimated by flow cytometry (1.9–2.5 Gb) ([87]Supplemental Figure 2). Figure 1. [88]Figure 1 [89]Open in a new tab Morphological and genomic features of H. cordata. (A and B) (A) Overview of the entire H. cordata plant, with enlarged images of the leaf, inflorescences, bract, and rhizome. Scale bar, 10 cm. (B) Features of the assembled H. cordata genome. The outer to inner circles represent chromosome-scale pseudochromosomes (chr01–19[a–d]), class I transposable element (TE) density, class II TE density, coding gene density, GC content, collinear blocks, and window size. Each linking line in the center of the plot represents a pair of homologous genes. TE density refers to the density of TEs in a specific genomic region, and coding gene density refers to the density of protein-coding genes. The proportion of tandemly repeated DNA sequences in a given genomic region is represented by the tandem repeat proportion. GC content indicates the proportion of guanine (G) and cytosine (C) nucleotides in a specific genomic region. Collinear blocks, with a minimum length of 100 kb, represent collinear genomic segments that share homologous DNA sequences. The genomic region being analyzed is divided into windows of 500 kb each, represented by the window size. The four sets of haplotype chromosomes are denoted by a–d. (C) Locations of telomere and centromere regions in selected chromosomes. The purple line in each subgraph represents the distribution of “TTTAGGG” copies on each chromosome, with the positions of telomeres marked by purple triangles. The gray line in each subgraph represents the copy number of tandem repeats (50–1000 bp monomer) on each chromosome, with possible centromere positions marked by green rectangles. TRF, Telomere Repeats Finder. Using a multi-platform approach, we sequenced and de novo assembled the genome of H. cordata ([90]Figure 1B). The selected plant was identified as an autotetraploid with a predicted heterozygosity rate exceeding 2% and an estimated genome size of 2.4 Gb ([91]Supplemental Figure 1), consistent with the range of 1.9–2.5 Gb estimated by flow cytometry ([92]Supplemental Figure 2). We obtained 65 Gb of PacBio high-fidelity (HiFi) reads by optimizing circular consensus sequencing (CCS) technology, which enabled us to generate highly accurate (99.94%) long HiFi reads with an average length of 16 kb. These assembled HiFi reads were then grouped, ordered, and oriented into pseudomolecules (hereafter referred to as Hi-C pseudomolecules) This initial assembly yielded a draft genome of 2.24 Gb, consisting of 86 contigs with a contig N50 length of 28.73 Mb. Subsequently, we used 111 Gb (∼50×) of chromosome conformation capture (Hi-C) data to scaffold these contigs into 78 pseudomolecules, representing 76 chromosomes, one mitochondrial genome, and one chloroplast genome ([93]Supplemental Table 1 and [94]Figure 1B). The resulting scaffolded assembly was found to be consistent with the karyotype analysis ([95]Supplemental Figure 3), confirming the presence of chr01 [abcd] to chr19 [abcd] (4n = 76) ([96]Supplemental Figures 4–6). At this stage, eight gaps were identified within the pseudomolecules. To enhance the assembly quality, we polished the pseudochromosome sequences using Illumina short reads and attempted gap filling using HiFi long reads. Ultimately, this process produced a 2.24-Gb genome of H. cordata, consisting of 78 pseudomolecules with a pseudomolecule N50 length of 29.19 Mb and a GC content of 39.57% ([97]Supplemental Table 1). Comparison of the haplotype sequences of the autotetraploid revealed a high level of comparability among the homologous chromosomes, with an alignment rate of 84%–86%, and abundant structural variations such as inversions, translocation, and duplications were also detected ([98]Supplemental Table 1 and [99]Supplemental Figure 7). To predict gene models, we used a combination of de novo prediction, homology-based searches, and RNA-sequencing data, resulting in the prediction of 126 864 protein-coding genes in H. cordata. The primary gene models exhibited a mean length of 1629 bp, with an average exon length of 293 bp and an average intron length of 589 bp ([100]Supplemental Table 1). On the basis of a Benchmarking Universal Single-Copy Orthologs (BUSCO) assessment, the completeness of the gene annotations was approximately 99.1% ([101]Supplemental Table 2). We functionally annotated the protein-coding genes using various databases, including Gene Ontology (GO), Kyoto Encyclopedia of Genes and Genomes (KEGG), Swiss-Prot, and others, an over 98.66% of the genes could be mapped to corresponding functional items, with only 0.82% remaining unannotated ([102]Supplemental Table 3). These analyses indicate a relatively comprehensive structural annotation of the reference genome. RepeatMasker was used to identify repetitive sequences, which accounted for 53.60% of the genome. Long terminal repeat (LTR) retrotransposons were the predominant type of repetitive element, comprising 33.93% of the genome, including 4.12% Copia, 22.64% Gypsy, and 7.18% unknown. The LTR assembly index (LAI) was estimated as 15.49, consistent with the criteria for a reference genome ([103]Ou et al., 2018) ([104]Supplemental Table 4). Telomeres and centromeres play essential roles in maintaining genome stability. Telomeres, located at the ends of chromosomes, consist of repeated sequences (TTTAGGG) and help to protect the integrity of chromosomes. We identified telomere sequences at both ends of 72 out of the 76 chromosomes. This enabled us to assemble the four sets of haplotype genomes, providing further evidence for the presence of intact telomeres in the genome ([105]Figure 1C, [106]Supplemental Figure 8, and [107]Supplemental Table 5). However, four chromosomes—chr7b, chr7c, chr7d, and chr8d—displayed telomere sequences at only one terminus. This unusual pattern suggests potential structural variations, such as telomere loss or rearrangements. This finding emphasizes the need for further investigation of the genomic stability and integrity of these specific regions ([108]Supplemental Figure 8 and [109]Supplemental Table 5). Centromeres consist of tandem repeat sequences of various lengths that can vary among closely related species. Using a tool called Telomere Repeats Finder (TRF), we identified potential centromere positions based on the distribution of tandem repeat sequences. We identified candidate centromere regions in 35 of the 76 chromosomes ([110]Figure 1C, [111]Supplemental Figure 8, and [112]Supplemental Table 5). Notably, some chromosomes exhibited distinctive patterns of highly enriched tandem repeats, whereas others did not show clear clusters of enriched sequences ([113]Figure 1C). The centromere regions in H. cordata were relatively short, ranging from 2 to 71 kb in length. In addition, there were variations in the lengths of the repeating monomers within these regions, which ranged from 66 to 754 bp ([114]Supplemental Table 5). For instance, the centromeres of chr18 [a, c, d] were the shortest, consisting of a repeating monomer length of 94 bp that was repeated approximately 29.5 times ([115]Supplemental Figure 8 and [116]Supplemental Table 5). By contrast, the centromere of chr14c had a larger monomer length of 754 bp that was repeated 45.9 times ([117]Supplemental Figure 8 and [118]Supplemental Table 5). These variations suggest potential differences in the organization and structure of centromeres among different chromosomes. The identification and characterization of centromere regions in H. cordata provides valuable insights into the genome dynamics and stability of this plant species. To assess the completeness of the genome assembly, we examined the presence of ribosomal RNA (rRNA) sequences, specifically the 18S and 5S rRNA genes. These sequences are highly repetitive and are useful indicators of genome quality. We identified the distribution of the 18S rRNA gene primarily at the ends of chr06 [a–d] and chr07 [a–d], confirming the presence of these rRNA sequences in the assembled genome ([119]Supplemental Figure 9A). The 5S rRNA gene arrays were predominantly located on chr09 [a–d] ([120]Supplemental Figure 9B). The presence of these rRNA sequences further supports the completeness and reliability of the assembled genome. Evolution and whole-genome duplication of the H. cordata genome The genome of H. cordata provides a valuable resource for studying the evolution of the magnoliid branch ([121]Soltis et al., 2009). To examine the evolutionary position of H. cordata, we performed a comparative analysis of 21 angiosperm species, including representatives from Magnoliaceae, eudicots, monocots, and basal angiosperms such as Amborella trichopoda ([122]Amborella Genome Project, 2013) and Nymphaea colorata ([123]Dong et al., 2018). We identified a total of 19 455 gene families, 25 of which contained 86 genes specific to H. cordata ([124]Supplemental Table 6). To establish phylogenetic relationships and estimate divergence times, we selected 150 single-copy orthologous gene families to construct a phylogenetic tree ([125]Figure 2A). H. cordata clustered together with other magnoliid species, including Aristolochia fimbriata, Aristolochia contorta, Piper nigrum, and Saururus chinensis. The divergence between H. cordata and Aristolochia species occurred approximately 106.8 million years ago (MYA), whereas the divergence between H. cordata and S. chinensis took place approximately 33.7 MYA ([126]Figure 2A and [127]Supplemental Table 7). Figure 2. [128]Figure 2 [129]Open in a new tab Phylogenetic analysis and whole-genome duplications of the H. cordata genome. (A) Phylogenetic tree depicting the relationships among H. cordata and 20 other angiosperm species. The pie charts show the proportions of gene families with expansions (pink) and contractions (light blue) at each node or species. The estimated divergence time (in million years ago, MYA) is indicated at each node, with bars representing 95% confidence. Nodes that have undergone whole-genome duplications (WGDs) are marked with a star. Purple stars indicate well-known ancient WGD events, and red stars represent WGD events specific to certain species. (B and C) Synonymous substitution rate (K[s]) distributions. The K[s] distributions of syntenic blocks for orthologous genes (B) and paralogous genes (C) are shown for H. cordata, haplotype A (HapA) of H. cordata, Amborella trichopoda, Aristolochia contorta, Piper nigrum, and Schisandra chinensis. (D) Collinear relationships among A. contorta, S. chinensis, and H. cordata chromosomes. The gray lines in the background indicate syntenic blocks between the genomes that span more than 15 genes, with some representative blocks highlighted in red. (E) Phylogenetic tree of homologous genes from S. chinensis, P. nigrum, and H. cordata. The genes from H. cordata, S. chinensis, and P. nigrum are highlighted in green, orange, and purple, respectively. The branch color indicates the bootstrap support value, decreasing from red to green. We next analyzed gene family expansion and contraction across the selected species. We found that 74.0% of the gene families (4974) had experienced expansion in H. cordata, and 26.0% (1752) of the gene families had undergone contraction ([130]Supplemental Table 8). The expanded gene families were enriched in GO terms associated with transcriptional regulation and the cell cycle ([131]Supplemental Table 9), indicating their involvement in gene expression and cell division. In the KEGG pathway enrichment analysis, the expanded gene families were primarily associated with plant hormone signal transduction and thiamine, cysteine, and methionine metabolism ([132]Supplemental Table 10). Among the expanded gene families, particular emphasis should be placed on those rapidly expanded gene families that clustered in the GO enrichment analysis categories of DNA-binding transcription factor activity, secondary metabolic processes, defense response to other organisms, and UDP-glucosyltransferase activity ([133]Supplemental Table 11). In the KEGG enrichment analysis, these rapidly expanded gene families were involved in phenylpropanoid and tryptophan metabolism, glucosinolate biosynthesis, flavonoid biosynthesis, and diterpenoid biosynthesis ([134]Supplemental Table 12). WGDs and the amplification of repetitive sequences are recognized as key drivers of species evolution. To explore the potential occurrence of WGDs in H. cordata, we analyzed the synonymous substitution rates (K[s]) of orthologous gene pairs between H. cordata and four other species. The K[s] peak distributions between H. cordata and A. trichopoda, between H. cordata and A. fimbriata, and between H. cordata and S. chinensis generally aligned with their predicted evolutionary relationships ([135]Figure 2B). We identified WGDs in P. nigrum, S. chinensis, and H. cordata by analyzing the density distribution of K[s] of paralogous gene pairs ([136]Figure 3C). In the H. cordata genome, we detected three distinct K[s] peaks of 1.22, 0.16, and 0.01 ([137]Figure 2C), indicating the occurrence of three WGDs in H. cordata. The most recent WGD, WGD3, as a polyploidization event, resulted in the formation of a tetraploid species of H. cordata from a diploid species ([138]Figure 2D). However, the collinear gene pairs between haplotype A of H. cordata (HcoHapA) and A. contorta, which did not experience independent WGD events, exhibited a 4:1 ratio ([139]Supplemental Figure 10A and 10B), reflecting the doubling of WGD1 and WGD2. The detected K[s] peak was approximately 1.12 in S. chinensis, and a notable number of gene pairs exhibited K[s] values concentrated below 0.25, suggesting that S. chinensis has undergone only one WGD, consistent with previous findings ([140]Xue et al., 2023) ([141]Figure 2C). The syntenic depth between S. chinensis and HcoHapA showed a 2:4 pattern ([142]Supplemental Figure 11A and 11B), confirming an additional WGD event in H. cordata. By contrast, two distinct peaks, measuring around 0.88 and 0.16, were observed in the P. nigrum genome ([143]Figure 2C). In the collinearity analysis between HcoHapA and P. nigrum, the syntenic depths showed a 4:2 pattern. However, some of the HcoHapA blocks exhibited alignments with five to eight corresponding P. nigrum blocks ([144]Supplemental Figure 12A and 12B); for example, chr01a, chr02a, chr04a, and chr11a aligned to Pn1, Pn3, Pn5, Pn6, Pn7, Pn13, Pn15, and Pn24 ([145]Supplemental Figure 12A and 12B). This indicates that P. nigrum has undergone at least two WGDs. However, relying solely on evaluation of the density distribution of K[s], it is challenging to determine whether P. nigrum, S. chinensis, and H. cordata have experienced a common WGD. Alternatively, these data suggest a shared WGD1 event between H. cordata and P. nigrum. Our analysis of paralogous genes further supports this observation, as we detected paralogous genes in P. nigrum that originated from WGD1 (block K[s] ≈ 1.0–2.0; [146]Supplemental Figures 13–15). On the basis of genes within the identified blocks and orthologous genes of H. cordata, we deduced the evolutionary relationships ([147]Figure 2D, [148]Supplemental Figure 16, and [149]Supplemental Table 13). Notably, orthologous gene pairs tended to cluster together, rather than paralogous gene pairs ([150]Figure 2A and 2E), further supporting the occurrence of WGD1 in the ancestor of H. cordata and P. nigrum. Figure 3. [151]Figure 3 [152]Open in a new tab Inference of karyotype evolution in H. cordata. (A) The dot plot displays collinear blocks within haplotype A (HapA) of H. cordata. Chromosomes have been sorted and colored on the basis of homology, with pairs of similar colors indicating highly homologous chromosomes resulting from whole-genome duplication (WGD). (B) Speculative processes of chromosome duplications and three end-to-end joining (EEJ) events following WGD2 and WGD3. (C) Dot plot of collinear blocks among nine chromosomes in H. cordata, representing the ancestral karyotype prior to WGD2. Chromosome colors correspond to those in (A) and (B). (D) Dot plot of collinear blocks among nine chromosomes of H. cordata and A. contorta. Chromosomes are color-coded according to A. contorta. (E) Speculative models for chromosomal changes in the WGD1 event. Inference of karyotype evolution in the H. cordata genome Advances in genome sequencing in multiple organisms have opened up new possibilities for understanding the evolution of ancestral karyotypes ([153]Murat et al., 2017). After multiple WGD events, the H. cordata genome still exhibits traces of duplication, suggesting the preservation of ancient genomic information. By analyzing conserved blocks in HcoHapA, we identified duplications and inferred chromosome rearrangement events following WGD2 ([154]Figure 3A). Notably, Chr1a and Chr11a displayed high homology, indicating their duplication through WGD2. Three evident chromosome rearrangement events occurred in H. cordata after WGD2. Prominent homology between the lower half of Chr9a and Chr5a, as well as the upper half of Chr9a and Chr15a, suggested end-to-end joining (EEJ) following duplications from two ancestral chromosomes ([155]Figure 3A and 3B). Similarly, Chr19a exhibited high homology with the lower half of Chr15a and Chr8a, indicating its gradual evolution through EEJ from two ancestral chromosomes ([156]Figure 3A and 3B). On the basis of this inference, the chromosome count initially started at n = 11. It then doubled through WGD2, resulting in a count of n = 22. Subsequently, three EEJ events occurred, leading to a final count of n = 19. Thereafter, the H. cordata species underwent WGD3 (polyploidization), leading to the formation of a tetraploid genome. (4n = 19 × 4) ([157]Figure 3B). To detect duplications caused by WGD1, we selected nine chromosomes (Chr1a/2a/3a/6a/7a/9a/10a/14a/19a) to avoid interference from duplications that resulted from WGD2. Collinearity analysis revealed clear duplications between Chr1a and Chr2a as well as between Chr6a and Chr7a ([158]Figure 3C). In addition, Chr9a was formed through EEJ of an ancestral chromosome following duplication ([159]Figure 3B). Incomplete reciprocal homologous fragments were also detected among Chr10a, Chr14a, Chr19a, and Chr3a ([160]Figure 3C), indicating subsequent changes to WGD1. Insertion of a fragment from Chr19a into the middle of Chr14a ([161]Figure 3C) resulted in chromosomal reduction, and fragment exchanges between Chr10a and Chr3a were observed. On the basis of these findings, the chromosome count of the common ancestor species of H. cordata and P. nigrum was estimated to be 6 before WGD1. A. contorta (n = 7), an outgroup species of H. cordata and P. nigrum, did not undergo independent WGDs. This suggests that chromosome fusions occurred in the ancestral species, resulting in a change in chromosome number from 7 to 6. Collinearity analysis between H. cordata and A. contorta revealed that the collinearity of almost all chromosomes was not well maintained, likely owing to the distant evolutionary relationship between A. contorta and H. cordata ([162]Figure 3D). However, it is evident that Chr6a and Chr7a in H. cordata evolved through the fusion of Ac3 and Ac5 in A. contorta ([163]Figure 3D). In addition, the fusion of Ac3 and Ac5 was also detected in the collinearity between A. contorta and P. nigrum ([164]Supplemental Figure 17A and 17B). Overall, a fusion event occurred in the ancestral species of H. cordata and P. nigrum, resulting in a shift from 7 to 6 chromosomes ([165]Figure 3E). Following WGD1 in the ancestral species (n = 6), a replica of a Chr19a fragment was inserted into Chr14a ([166]Figure 3C), resulting in a species with 11 chromosomes ([167]Figure 3E). Sodium houttuyfonate biosynthesis in H. cordata Houttuynin, also known as decanoyl acetaldehyde, is an important compound found in H. cordata that is responsible for its distinct fishy odor ([168]Laldinsangi, 2022). The water-soluble derivative, SH, is highly stable and exhibits important medicinal properties ([169]Liu et al., 2021b). As a result, it is widely used in various medical applications. An initial ultra-high performance liquid chromatography with quadrupole time-of-flight mass spectrometry (UPLC-Q-TOF-MS) analysis successfully detected the presence of SH in both the rhizome ([170]Supplemental Figure 18) and leaves ([171]Supplemental Figure 19) of H. cordata. This was determined based on retention time (11.89) and mass spectral data, which included a parent ion at m/z 302.30493 and fragment ions at m/z 295.28714, 297.28287, and so forth. These results were consistent with those of the SH standard sample ([172]Supplemental Figure 20). A noteworthy finding was a significantly higher (p < 0.05) concentration of SH in rhizomes (79.59 ppm) compared with leaves (2.40 ppm) ([173]Figure 4A). This disparity aligns with the stronger fishy odor evident in the rhizomes, suggesting a potential biosynthetic pathway for SH within the plant. Figure 4. [174]Figure 4 [175]Open in a new tab Proposed sodium houttuyfonate biosynthetic pathway in H. cordata. (A) Analysis of sodium houttuyfonate (SH) in different tissues using high-performance liquid chromatography. The retention time of the peak associated with SH is indicated by a white arrow. The accompanying bar graph provides a quantitative representation of the SH content in various tissues. The letters a and b indicate significant differences between groups based on ANOVA analysis, followed by Tukey’s HSD test for post hoc analysis. (B) Proposed biosynthetic pathway of SH. Houttuynin originates from fatty acid metabolism, specifically palmitic acid (C[16]H[32]O[2], molecular weight [MW] 256.42). Palmitic acid is converted to palmitoyl-CoA (C16) by acetyl-CoA synthetase, followed by mitochondrial fatty acid β-oxidation. This cycle continues until decanoyl-CoA is formed, which is then converted to decanoic acid (C[10]H[20]O[2], MW 172.26) by thioesterase. Decanoic acid is further oxidized to decanoylacetaldehyde (houttuynin, C[12]H[22]O[2], MW 198.16). Subsequently, the aldehyde group is transformed into a hydroxyl group, reacting with sodium sulfate to form water-soluble sodium houttuyfonate (SH, C[12]H[23]NaO[5]S, MW 302.36). In addition, houttuynin is prone to instability and can readily transform into 2-undecanone (methyl nonyl ketone, C[11]H[22]O) during production. Dashed lines represent potential intermediate steps, and question marks indicate uncertain enzyme involvement. Metabolites identified by UPLC-Q-TOF-MS in this study are marked with black boxes, and those reported in previous studies are enclosed in cyan boxes. (C) Clustered heatmap of differentially expressed genes (DEGs) in rhizome vs. leaf tissues (Rz vs. Lf) of H. cordata. Values from three biological replicates are shown. The DEGs are categorized into seven functional classes based on their roles in the fatty acid metabolic pathway. These classes include fatty acid synthases (FAS), fatty acid activation enzymes (FAAE), fatty acid oxidoreductases (FAOR), fatty acid transferases (FAT), fatty acid esterases (FAE), lipid enzymes (LE), and fatty acid dehydrogenases (FAD). Gene-expression differences are quantified as log[2] fold changes, with statistical significance set at p < 0.001. “Up” indicates a log[2] fold change > 1 (upregulation in rhizome relative to leaf), and “Down” indicates a log[2] fold change < −1 (downregulation in rhizome relative to leaf). Houttuynin differs from typical secondary metabolites such as terpenes with isoprene units, flavonoids with phenolic structures, or alkaloids with nitrogen atoms. Its unique characteristics, including the presence of an aldehyde group and decanoyl group, suggest that it is derived from fatty acid oxidation. Therefore, we propose that houttuynin is produced through the β-oxidation pathway of fatty acid metabolism. In this pathway, fatty acids abundant in H. cordata, such as palmitic acid (C16), are first converted into palmitoyl-coenzyme A (CoA) through acyl-CoA synthesis. This compound then undergoes sequential removal of two-carbon units through β-oxidation. After three cycles of β-oxidation, decanoyl-CoA (C10-CoA) is formed. We hypothesize that thioesterase enzyme converts decanoyl-CoA into decanoic acid (C[10]H[20]O[2]), which is further transformed into decanal (C[10]H[20]O). Subsequently, aldehyde dehydrogenase likely catalyzes the conversion to decanoyl acetaldehyde (houttuynin, C[12]H[22]O[2]), ultimately leading to the synthesis of SH (C[12]H[23]NaO[4]S). Houttuynin is prone to instability and can readily transform into another compound known as 2-undecanone (methyl nonyl ketone, C[11]H[22]O) during production ([176]Chen et al., 2014) ([177]Figure 4B). Our hypothesis is supported by the identification of decanoic acid and SH in this study and palmitic acid, decanal, and 2-undecanone in previous research through UPLC-Q-TOF-MS analysis ([178]Persson et al., 2010). To gain further understanding of the differences in houttuynin synthesis between rhizome and leaf tissues, we performed transcriptome profiling of enzyme-coding candidate genes involved in the fatty acid biosynthetic pathway in H. cordata. By comparing rhizome and leaf tissues (Rz vs. Lf), we identified 49 differentially expressed genes (DEGs) that may play a role in this process ([179]Figure 4C). These genes can be classified into seven categories based on their functions within the fatty acid pathway: fatty acid synthases, fatty acid activation enzymes, fatty acid oxidoreductases, fatty acid transferases, fatty acid esterases, lipid enzymes, and fatty acid dehydrogenases. Seventeen of these genes were upregulated, and 32 were downregulated ([180]Supplemental Table 14 and [181]15). This comparative analysis sheds light on the possible enzymatic processes involved in houttuynin biosynthesis, providing valuable insights into the unique fatty acid pathway for SH accumulation in H. cordata. Flavonoid biosynthesis and quercetin metabolism in H. cordata H. cordata is widely recognized for its remarkable flavonoid production and diverse array of beneficial properties ([182]Ekalu and Habila, 2020). This makes it an ideal subject for studying the metabolic pathways involved in flavonoid biosynthesis, which can be greatly influenced by the unique characteristics of the source organism. We carried out a thorough analysis of total flavonoids in various tissues of H. cordata, including rhizomes (Rz), leaves (Lf), bracts (Br), and the inflorescence (If). We found that rhizomes had the lowest total flavonoid content (2.38 mg/g), whereas leaves (5.80 mg/g), the inflorescence (5.59 mg/g), and bracts (9.49 mg/g) exhibited higher levels ([183]Figure 5A). To further investigate the distribution of flavonoids within H. cordata, we used a comprehensive targeted metabolomics approach, which identified 212 enriched flavonoid metabolites ([184]Figure 5B). Among these compounds, flavonols and flavones were the most prominent, making up 32% and 31% of the total identified flavonoids, respectively ([185]Figure 5B). This accumulation pattern highlights the diverse range of flavonoids present in H. cordata and emphasizes the notable presence of flavonols and flavones in this medicinal plant. Figure 5. [186]Figure 5 [187]Open in a new tab Expression analysis of enzyme-coding genes involved in flavonoid biosynthesis in H. cordata. (A) Analysis of total flavonoid content in different tissues. Lf, leaves; Rz, rhizomes; Br, bracts; If, inflorescences. The letters represent significant differences between groups determined by ANOVA, followed by Tukey’s HSD test for post hoc analysis. (B) Number of flavonoid metabolites in H. cordata. (C) Heatmap illustrating the expression levels of enzyme-coding genes involved in flavonoid biosynthesis across tissues of H. cordata. (D) Simplified flavonoid biosynthetic pathway in H. cordata. Colored circles represent specific metabolites: cyan for naringenin, pink for flavones, blue for isoflavone, green for flavonol, and gray for anthocyanidins. To gain insight into flavonoid biosynthesis in H. cordata, we performed transcriptomic profiling of various tissues. This comprehensive analysis revealed a set of 76 conserved enzyme-coding genes that are closely associated with the flavonoid biosynthetic pathway. These identified genes play a crucial role in orchestrating critical steps in the pathway and encode key enzymes such as ammonia-lyase (PAL), 4-coumarate CoA ligase (4CL), cinnamate 4-hydroxylase (C4H), chalcone isomerase (CHI), CHS, flavonol synthase (FLS), flavanone 3-hydroxylase (F3H), anthocyanidin synthase (ANS), dihydroflavonol 4-reductase (DFR), isoflavone synthase (IFS), and flavone synthase (FNS) ([188]Figure 5C and [189]Supplemental Table 16). These findings provide a comprehensive understanding of the genetic basis underlying flavonoid biosynthesis in H. cordata. To further enhance our candidate selection process, we performed a comprehensive analysis of the relationships among gene-expression levels and the top five flavonoid metabolites ([190]Figure 5D). These metabolites included flavanone (naringenin, cyan), quercitrin (flavonols, green), luteolin (flavones, pink), delphinidin-3-O-rutinoside (anthocyanidins, gray), and glycitein (isoflavones, blue). Our findings revealed significant correlations (∣r∣ ≥ 0.6, p < 0.05) between 44 out of the 76 enzyme-coding candidate genes and at least one of the tested metabolites ([191]Figure 5D). For example, three PAL genes (PAL_10/7/8) were significantly correlated with all flavonoid types except for flavonols. PAL_5 showed notable correlations with flavonols (green) and anthocyanidins (gray), whereas PAL_3 exhibited significant correlations with naringenin (cyan) and isoflavones (blue). Overall, our analysis identified key genes, including four PAL (PAL_10/7/8/3), five 4CL (4CL_4/2/7/9/6), three CHS (CHS_1/2/3), and three CHI (CHI_4/1/2) genes that were all significantly correlated with naringenin (cyan), highlighting their crucial roles in naringenin biosynthesis ([192]Figure 5D). We also discovered a distinct gene set associated with flavonol biosynthesis (green), comprising one PAL (PAL_5), seven 4CL (4CL_4/2/7/13/11/12/10), one CHI (CHI_3), one CHS (CHS_3), one FNS (FNS_13), three F3H (F3H_8/6/2/1), and four FLS (FLS_4/3/6/1) genes. We observed a specific correlation with isoflavone (blue) biosynthesis involving four PAL (PAL_10/7/8/3), five 4CL (4CL_9/6), three CHS (CHS_1/2), three CHI (CHI_4/1/2), three FNS (FNS_4/11/16), three IFS (IFS_5/2), two F3H (F3H_8/1), four DFR (DFR_2/3/1/4), and one ANS gene. Similarly, we found a significant correlation with flavone (pink) biosynthesis that involved three PAL (PAL_10/7/8), five 4CL (4CL_4/2/7/9/6), three CHS (CHS_1/2), three CHI (CHI_4/1/2), four FNS (FNS_13/4/11/16), three IFS (IFS_5/2/4), four F3H (F3H_8/6/2/1), four DFR (DFR_2/3/1/4), and one ANS gene. Finally, we observed a significant correlation with anthocyanidins (gray) for four PAL (PAL_10/7/8/5), three 4CL (4CL_4/2/7), three CHS (CHS_1/2), three CHI (CHI_4/1/2), four FNS (FNS_13/4/11/16), two IFS (IFS_5/2), three F3H (F3H_8/6/2), four DFR (DFR_2/3/1/4), and one ANS gene ([193]Figure 5D and [194]Supplemental Table 16). These findings highlight specific gene-expression networks involved in flavonoid biosynthesis within H. cordata. Quercetin, one of the most well-known flavonoids, is known for its powerful antioxidant properties and other health benefits in traditional Chinese medicine ([195]Wang et al., 2016). Our study revealed significant variations (p < 0.05) in antioxidant capacity among different tissues of H. cordata. Notably, leaves (73.9 μmol/g) and inflorescences (67.3 μmol/g) exhibited considerably higher (p < 0.05) antioxidant capacity compared with rhizomes (25 μmol/g) and bracts (20 μmol/g) ([196]Figure 6A). Furthermore, glycosylated forms of quercetin, such as isoquercitrin, rutin, hyperoside, baimaside, and quercitrin, which play a crucial role in in vivo antioxidant activity, were markedly enriched (p < 0.05) in leaves, inflorescences, and bracts. These tissues displayed high concentrations (indicated by red color) of glycosylated quercetin forms in the UPLC-Q-TOF-MS metabolomics analysis, whereas rhizomes showed the lowest concentrations (indicated by blue color) ([197]Figure 6B). This distribution pattern (with the exception of bracts) is consistent with the observed antioxidant capacities, highlighting the distinct roles of different plant parts in accumulating quercetin glycosides. Figure 6. [198]Figure 6 [199]Open in a new tab Quercetin metabolic pathway in H. cordata. (A) Total antioxidant capacity in various tissues of H. cordata. Lf, leaves; Rz, rhizomes; Br, bracts; If, inflorescences. The letters a and b represent significant differences between groups determined by ANOVA, followed by Tukey’s HSD test for post hoc analysis. (B) Abundance of quercetin glycosides in different H. cordata tissues. (C) Overview of the quercetin metabolic pathway. Heatmaps accompanying the text display the expression level (TPM) of enzyme-coding genes involved in this pathway. Values from three biological replicates are shown. Further analysis focused on the expression levels of homologous glucosyltransferase (UGT) enzyme genes (F3RT, F3GT, F3GGT, and F3GRT) that catalyze the formation of these five quercetin glycosides in H. cordata. The expression trends of these UGT genes, i.e., three F3RT (F3RT_2/3/6), five F3GT (F3GT_7/9/3/2/6), four F3GGT (F3GGT_4/5/2/6), and three F3GRT (F3GRT_5/1/7) genes, mirrored the observed patterns of quercetin metabolite accumulation, with lower expression in the rhizome than in the other organs ([200]Figure 6C). This finding suggests a notable relationship between gene expression and quercetin metabolism, providing valuable insight into the regulation of quercetin biosynthesis and its distribution across various plant tissues. Gene co-expression analysis of flavonoid biosynthesis genes in H. cordata To gain a deeper understanding of the regulatory mechanisms that control flavonoid biosynthesis in H. cordata, we used WGCNA (weighted gene co-expression network analysis), a powerful computational tool that helped to reveal gene relationships and regulatory patterns within our dataset. Through this analysis, we organized all genes into 32 modules, distinguished by different colors such as MEblack, MEblue, and MEbrown, based on their similar expression patterns across samples ([201]Figure 7A and [202]Supplemental Table 17). Among these modules, the MEblue module stood out, exhibiting notably strong and statistically significant positive correlations (r > 0.8, p < 0.01) with flavones, isoflavones, anthocyanidins, and naringenin (excluding flavonols, [203]Supplemental Table 17). This finding suggests that the genes in the MEblue module play a crucial role in biosynthesis of these specific flavonoid compounds. Figure 7. [204]Figure 7 [205]Open in a new tab Gene co-expression network analysis in the flavonoid biosynthesis pathway. (A) Weighted gene co-expression network analysis between all genes and five metabolites in H. cordata. The network diagram visually depicts the co-expression patterns among these genes and metabolites. The heatmap provides a comprehensive overview of metabolite profiles in different tissues of H. cordata during the blooming stage, with lines indicating correlations. (B) Positive regulatory network illustrating the interactions between transcription factors (TFs) and pathway genes. Enzyme-coding genes are represented as hexagons, and TFs are depicted as circles. The lines indicate positive correlations between the TFs and enzyme-coding genes. (C) Identification of TF binding sites in the 2-kb upstream sequences of seven candidate genes and their expression levels in various tissues. (D) Expression levels (TPM) of 25 TFs in different tissues. (E) Subcellular localization of two H. cordata TFs fused to the GFP tag and expressed in Nicotiana tabacum cells. Nuclei were stained with DAPI. The merged images show co-localization of GFP and DAPI signals within the nucleus, confirming the nuclear localization of the TFs. Scale bars, 20 μm. (F) RT–qPCR analysis of two TFs in H. cordata. Real-time qPCR analysis of two TFs across various tissues of H. cordata, with expression levels normalized to those of the housekeeping genes GAPDH and ACTIN. Data are presented as mean ± SEM, and each bar represents the average expression from three biological replicates. Statistical significance was determined using ANOVA, with different letters indicating significant differences (p < 0.05) between tissues. Tissues examined included rhizomes, leaves, bracts, and inflorescences during the blooming period. Delving further into the MEblue module, we focused on predominantly significant correlations (r > 0.91, p < 1e−200) which had an extremely low probability of occurring by random chance, leading us to identify seven enzyme-coding candidate genes ([206]Supplemental Figure 21). These included one 4CL (Hocor15aG0005700), one FNS (Hocor16aG0024200), two DFRs (Hocor12aG0138000, Hocor01aG0150600), one ANS (Hocor14aG0093300), and two CHIs (Hocor10aG0148000, Hocor16aG0092100), which shared similar expression patterns across samples ([207]Figure 7B). We performed RT–qPCR to validate the expression of these seven candidate genes identified from the MEblue module. They were notably highly expressed in inflorescences, corroborating the expression patterns observed in the heatmaps ([208]Supplemental Figure 22). The strong positive correlations allowed us to perform an extensive analysis of species-specific networks within the blue module. This analysis led to the identification of key transcription factors (TFs) involved in regulation of the seven candidate genes associated with flavonoid biosynthesis ([209]Figure 7B). These 25 TFs belonged to 12 families, including well-known families such as bZIP, MYB, ARF, AP2, and GRAS, which have been extensively studied for their critical roles in secondary metabolism ([210]Yao et al., 2018; [211]Cao et al., 2020; [212]Li et al., 2020; [213]Han et al., 2023) ([214]Supplemental Table 18). To further support the hypothesis of co-regulation, we performed TF binding site predictions, revealing the presence of known cis elements in the upstream regions of the candidate genes. Specifically, we identified G-box binding sites for bZIP TFs, MYB binding sites for MYB TFs, TGA elements and the AUXRR core for ARF TFs, HD-ZIP binding sites for HD-ZIP TFs, and the W-box for WRKY TFs ([215]Figure 7C and [216]Supplemental Table 19). These findings provide compelling evidence for the potential hierarchical regulatory relationships between the TFs and the enzyme-coding genes involved in flavonoid biosynthesis. Importantly, our analysis also revealed higher expression levels of these TFs in inflorescences, consistent with the expression patterns of the enzyme-coding genes and the contents of flavonoid metabolites ([217]Figure 5A and 5C). We performed subcellular localization ([218]Figure 7E) and RT–qPCR assays ([219]Figure 7F) for two of the TFs, MYB_2 and AP2_2, confirming their nuclear localization and significant expression (p < 0.05) in inflorescences. These observations suggests that MYB_2 and AP2_2 may have specific roles in the regulation of flavonoid biosynthesis during inflorescence development. Their higher expression levels in inflorescences emphasize the importance of these TFs for shaping the flavonoid profiles of H. cordata ([220]Figure 7D and [221]Supplemental Table 20). Together, our analyses identify co-expressed genes involved in flavonoid biosynthesis in H. cordata and highlight the potential applications of TF-mediated regulation of flavonoid biosynthesis in H. cordata in future studies. Discussion The Saururaceae family consists of four genera: Anemopsis, Gymnotheca, Houttuynia, and Saururus. Members of this family typically have heart-shaped leaves and spike-like flowers ([222]Smith and Stockey, 2007). Among these genera, only the genome of S. chinensis has been reported previously. This genome is 537 Mb and diploid with 11 chromosome pairs (2n = 22) ([223]Xue et al., 2023). H. cordata, a species from the Surrogacies family of the magnoliid group, has been a topic of debate regarding its evolutionary position within the magnoliid branch and the relationships among its species ([224]Datta and Dasgupta 1977; [225]Soltis et al., 2009). In the present study, we expanded our sequencing efforts to include H. cordata and investigated the evolutionary relationships within the Saururaceae family. In addition, we identified candidate genes involved in the production of bioactive compounds in H. cordata. To overcome the challenges posed by the large genome size and high heterozygosity of H. cordata, we produced a chromosome-level and partially telomere-to-telomere high-precision genome assembly. We identified telomeres in 72 out of 76 chromosomes, as well as candidate centromere regions, with minimal gaps, providing insight into the organization and stability of these genomic elements. Previous studies have identified independent WGD events in species such as Liriodendron chinense and P. nigrum, whereas the Aristolochia species have not experienced such events ([226]Qin et al., 2021). Our phylogenetic analysis revealed a close evolutionary relationship between H. cordata and other magnoliid species such as S. chinensis, P. nigrum, A. fimbriata, and A. contorta. H. cordata diverged from the Aristolochia genus approximately 108.4 MYA, from P. nigrum 61.9 MYA, and from S. chinensis more recently, about 33.4 MYA. Our data support the occurrence of a WGD event in the common ancestor of H. cordata and P. nigrum following their divergence from Aristolochia. Subsequently, H. cordata and P. nigrum underwent species differentiation. The ancestral species of H. cordata experienced two additional WGD events, resulting in the formation of a homologous tetraploid. In the case of P. nigrum, its ancestral species underwent two independent WGD events and chromosome fusions, leading to the formation of the contemporary P. nigrum genome. The presence of a WGD before the divergence of H. cordata and P. nigrum suggests a shared evolutionary event between these two species. These findings align with previous observations of a single WGD event in S. chinensis ([227]Xue et al., 2023), indicating a shared WGD event in the Saururus genome after the divergence of Saururus and Liriodendron. Chromosome duplications and structural changes resulting from WGDs have influenced their genetic makeup and have potentially contributed to their divergence and speciation. The observation of fusion events in the common ancestor of H. cordata and P. nigrum, leading to a reduction in the chromosome basis, highlights the significance of chromosomal rearrangements during their evolutionary history. These findings provide important insights into the evolution of H. cordata and its species within the Saururaceae, shedding light on the genetic mechanisms that underlie their adaptation and diversification. The prevalence of gene family expansion suggests that it played a vital role in the adaptive evolution of H. cordata, potentially facilitating its adaptation to specific environmental conditions or ecological niches. By contrast, the contraction of gene families implies possible gene loss or reduction in certain functional categories. This pattern suggests that rapidly expanding gene families are particularly crucial for defense mechanisms, growth, and developmental processes in H. cordata. H. cordata, which emits a distinct fish-like odor when its leaves are crushed, has valuable medical applications ([228]Luo et al., 2022). One specific component of H. cordata, houttuynin (decanoyl acetaldehyde, C[12]H[22]O[2]), is responsible for causing “fishy burps.” However, houttuynin is unstable and easily converts into 2-undecanone (methyl nonyl ketone, C[11]H[22]O) during production ([229]Chen et al., 2014). To address this instability, SH (C[12]H[23]O[5]SNa), the water-stable form of houttuynin, which has pharmaceutical activity similar to that of houttuynin, is synthesized ([230]Chen et al., 2014). Currently, SH is obtained through a synthetic process starting from lauric acid, acetic acid, and slaked lime. However, this process has limitations in terms of environmental impact, yield, and purity ([231]Liu et al., 2021b). Therefore, it is important to understand the in vivo biosynthetic pathway of SH in H. cordata. In this study, high accumulation of SH was found in the rhizome of H. cordata, and a biosynthetic pathway for SH production in H. cordata was proposed. This pathway is related to the fatty acid oxidation pathway. Unlike other plant secondary metabolites, houttuynin is structurally similar to primary metabolic products derived from the fatty acid oxidation pathway. It is likely formed through the breakdown of long-chain fatty acids into acetyl-CoA units. Finally, houttuynin is converted to SH, possibly by aldehyde dehydrogenase. The H. cordata genome provides an opportunity to further study fatty acid metabolism and better understand the in vivo production of SH, as well as providing support for in vitro synthetic biology research. H. cordata is well known for its exceptional accumulation of flavonoids, beneficial compounds found in many traditional Chinese medicinal plants ([232]Bai et al., 2019; [233]Zhu et al., 2021). However, the biosynthetic pathways of these compounds in H. cordata are not fully understood, limiting advances in its molecular breeding. By performing transcriptomic and metabolic profiling analysis, we have identified 76 conserved enzyme-coding genes that play a crucial role in flavonoid biosynthesis in H. cordata. These genes, which are differentially expressed across various tissues, contribute to our understanding of the molecular mechanisms that underlie flavonoid biosynthesis in medicinal plants. Quercetin, a type of flavanol, possesses a wide range of biological activities, including antioxidant, anti-inflammatory, and anticancer effects ([234]Di Petrillo et al., 2022; [235]Wang et al., 2022a). We found that quercetin and quercetin glycosides (isoquercitrin, rutin, hyperoside, baimaside, and quercitrin) were present at high levels in the leaves, inflorescences, and bracts of H. cordata. The process of glycosylation, which involves the addition of sugar moieties to flavonoid compounds, is facilitated by an enzyme called glucosyltransferase (UGT). This enzymatic activity results in the formation of various flavonoid glycosides, including flavonol aglycones that are known for their high bioavailability and bioactivity ([236]Ross et al., 2001). Our current study identified several UGT genes that are differentially expressed in different tissues of H. cordata, suggesting their involvement in quercetin glycoside formation. Further investigation of their functions will provide valuable insights into the quercitin biosynthetic pathway in plants and can support future synthetic biology. In summary, we assembled a near-complete genome of H. cordata with 76 chromosomes (4n = 76) and identified key genes involved in the biosynthesis of houttuynin and flavonoids. Our findings establish a genomic foundation that enhances our understanding of H. cordata’s medicinal properties and supports future research and genetic improvements to maximize its therapeutic potential. Methods Sample preparation and genome sequencing In June 2022, we obtained botanical samples of H. cordata from Kaili (107°58′ E, 26°05′–27°38′ N) in Guizhou Province, China. To ensure sample quality, we carefully selected a limited number of healthy and pathogen-free plants. High-molecular-weight genomic DNA for genome sequencing was isolated from young leaves using a standard CTAB (hexadecyltrimethylammonium bromide) protocol, and DNA libraries were constructed for the PacBio HiFi reads. The young leaves were used for cell culture and crosslinking of chromatin to construct the Hi-C library. Genome survey and de novo genome assembly Data-quality assessment and pre-processing were carried out using fastp v0.20.1 ([237]Chen, 2023b) with default options. We used clean data for k-mer analysis with jellyfish 1.1.6 ([238]Marçais and Kingsford, 2011) using a k-mer size of 17 bp. The k-mers from sequencing reads were visualized for Smudgeplot (v0.2.5) ([239]Ranallo-Benavidez et al., 2020), inferring the reference genome as autotetraploid (4n; [240]Supplemental Figure 1B). We then used genomescope 2.0 ([241]Ranallo-Benavidez et al., 2020) to estimate the genome size, heterozygosity, and repeat content in a polyploid model. The H. cordata genome was assembled using HiFi and Hi-C sequencing data. SMRTbell libraries were sequenced on a PacBio Sequel II system, and consensus reads (HiFi reads) were generated using CCS software ([242]https://github.com/pacificbiosciences/ccs) with default parameters. Hifiasm v0.166-r375 was used to construct the primary genome and haplotype draft contig genomes from the long and highly accurate HiFi and Hi-C reads ([243]Cheng et al., 2021a; [244]Han et al., 2022). The Hi-C reads were quality controlled and mapped to the H. cordata contig assembly using Juicer ([245]Durand et al., 2016). The 3D-DNA pipeline (v180419) was then used to generate a candidate chromosome-level assembly, correct mis-joins, and order and orient the contigs. Manual inspection and refinement of the draft assembly were performed using Juicebox Assembly Tools ([246]Seppey et al., 2019). Gap filling was carried out using LR_Gapcloser software (v1.0) with HiFi reads. The chloroplast and mitochondrial genomes were separately assembled using GetOrganelle software (v1.7.5) using the PacBio CCS data ([247]Xu et al., 2019; [248]Baeza et al., 2023). Two rounds of polishing were performed on the short-read data using nextpolish ([249]Chang et al., 2023). The completeness of the chromosome-level genome was evaluated using BUSCO software (v4.0.6) ([250]Vanholme et al., 2019) with the embryophyta_odb10 ortholog set. Genome annotation To identify transposable elements in the H. cordata genome, we used the EDTA pipeline v1.8 ([251]Ou et al., 2019). This pipeline enabled us to identify LTR, TIR, and non-TIR elements. To further analyze the genome and annotate the TEs, we used RepeatMasker (v4.1.6). In addition, the MAKER2 pipeline ([252]Stanke et al., 2008) was used to predict the structures of coding genes. This pipeline incorporated three main approaches: ab initio predictions, protein homology, and transcriptome data. Prior to gene prediction, the assembled genome underwent hard and soft masking using RepeatMasker (v4.1.6). Ab initio gene prediction was performed using Augustus v3.1 ([253]Stanke et al., 2006), and homology-based gene prediction was carried out using Exonerate v2.2.2 ([254]Slater and Birney, 2005). To predict gene models, EVidenceModeler v2.1.0 was used to integrate the results from all three prediction methods ([255]Haas et al., 2008). The functions of protein-coding genes were determined using three methods. First, eggNOG-mapper v2.1.12 ([256]Huerta-Cepas et al., 2017) annotation was used to compare genes with the eggNOG homologous gene database, enabling the annotation of gene functions, including GO and KEGG annotations. Second, a sequence similarity search was performed using Diamond v2.1.3 ([257]Buchfink et al., 2015) to compare protein sequences with protein databases such as Swiss-Prot and NR, enabling the characterization of protein sequences. Finally, a structural domain similarity search was performed using InterProScan 5.68-100.0 to compare subdatabases in InterPro, thereby obtaining conserved amino acid sequences, motifs, and domains of the proteins ([258]Jones et al., 2014). tRNAScan-SE 2.0 ([259]Lowe and Eddy, 1997) was used with default parameters to identify tRNA genes, and Barrnap 0.8 ([260]Lagesen et al., 2007; [261]Nawrocki and Eddy, 2013) was used to predict rRNA sequences. Infernal 2.5.2 ([262]Nawrocki and Eddy, 2013) was used to search for non-coding RNAs in the Rfam database. The original sequence data of the H. cordata genome produced in this study have been deposited in the Genome Sequence Archive (Genomics, Proteomics and Bioinformatics, 2021) of the National Center for Biological Information of China/National Genomics Data Center of Beijing Institute of Genomics, Chinese Academy of Sciences (Nucleic Acid Research Center, 2022) (GSA: CRA010237). Open access to the data is available at [263]https://ngdc.cncb.ac.cn/gsa. Haplotype comparison We constructed pairwise alignments of the sequences of the four haplotypes using minimap2 (2.24-r1122) ([264]Li, 2021) with the -asm5 option, then used SyRI (1.6.3) ([265]Goel et al., 2019) to convert the corresponding BAM files into variant information files. We then used plotsr (0.5.4) ([266]Goel and Schneeberger, 2022) to visualize all variant information files. The variant information files between haplotypes were also uploaded to cloud storage: [267]https://drive.google.com/drive/folders/1seLh1gEZZyf2E2KrVr3hkZK3fZ 6BAcZI?usp=sharing. Phylogenetic analysis Gene-evolution analysis was performed by identifying paralogs and orthologs among 21 plant species. The orthologous gene groups were clustered using OrthoFinder2 with default parameters ([268]Emms and Kelly, 2019). The amino acid sequences were aligned using MAFFT v7 ([269]Katoh and Standley, 2013) and trimmed with trimAI v2.0 ([270]Capella-Gutiérrez et al., 2009). A maximum-likelihood phylogenetic tree was constructed using RAxML v7.2.8 with 1000 bootstrap replicates based on the PROTGAMMAJTT model ([271]Stamatakis, 2014). Amborella trichopoda was used as the outgroup. The species tree was then used to estimate divergence times using the MCMCTree program in the PAML package v4.10.6 ([272]Yang, 2007). Multiple fossil times were obtained from TimeTree ([273]http://www.timetree.org/) and used for time calibrations. CAFE5 ([274]Mendes et al., 2021) was used to infer the expansion and contraction of gene families on the basis of the chronogram of 21 plant species mentioned above. Synteny analysis and gene-duplication identification Syntenic blocks within and between species were identified using MCScanX-h v1.1.11 ([275]Wang et al., 2012) with homologous gene sets and BLASTP. To predict WGD events in one species and estimate the divergence time between two species, we analyzed the distribution of synonymous substitution rates (K[s]) ([276]Huang et al., 2023). Gene families were subjected to GO and KEGG analyses using the R package clusterProfiler v3.19 for further investigation ([277]Wu et al., 2021). Synteny network analyses and phylogenetic profiling Twenty-one plant species were analyzed, including seven Magnoliaceae, five monocots, six eudicots, and two basal angiosperms. Protein sequences and GFF/GFF3 attachments from fully sequenced genomes were obtained from various databases, such as Ensembl ([278]https://plants.ensembl.org/index.html), National Omics Center ([279]https://www.nstda.or.th/noc/), CGDB ([280]http://bio2db.com/), Phytozome ([281]https://phytozome-next.jgi.doe.gov/), NCBI ([282]https://www.ncbi.nlm.nih.gov/), and CPBD ([283]http://citrus.hzau.edu.cn/download.php). Multiple protein sequence alignments were created using MUSCLE v3.8.1551 ([284]Edgar, 2004), and phylogenetic trees were built using the maximum-likelihood technique with IQ-TREE v2.1.4 (-m MFP -bb 1000) ([285]Nguyen et al., 2015). The resulting trees were annotated and visualized using iTOL v6 ([286]Letunic and Bork, 2016) and were grouped on the basis of gene functions and species relatedness. To identify collinear genes, BLASTP v2.10.0 was used to compare protein sequences from each of the 21 plant genomes, with an E-value cutoff of 1e−5. Genomic collinearity was detected and a library dataset of all possible syntenic gene pairs among the 21 plant genomes was constructed using MCScanX ([287]Wang et al., 2012) with default parameters. Protein sequences were compared using MUSCLE v5.1 ([288]Edgar, 2004) and then transformed into codon comparisons using PAL2NAL ([289]Suyama et al., 2006). Finally, the K[a] and K[s] values between homologous gene pairs were calculated using the YN model ([290]Yang and Nielsen, 2000). Karyotype analysis of H. cordata We used karyotype analysis to identify and analyze the chromosomes of H. cordata ([291]Wang et al., 2022b). Root tips from H. cordata plants were collected and treated with a colchicine solution, then fixed using Kanoy solution to ensure that the root-tip cells were in metaphase. The fixed root-tip samples were then hydrolyzed and stained with Giemsa dye. Finally, the root-tip samples were prepared using the extrusion method. The chromosomes of Houttuynia were captured and measured using a high-resolution optical microscope. The chromosomes were classified on the basis of their size, banding pattern, and centromeric location. To clarify chromosomal alterations in H. cordata, we used the core eudicot karyotype (ACEK; n = 21) ([292]Wang et al., 2022) as a reference to perform a comparative genomic analysis with Houttuynia and A. contorta using WGDI (version 0.6.2) ([293]Sun et al., 2022). RNA extraction, library construction, and sequencing We collected samples from four different tissue segments during the blooming period: leaves, rhizomes, inflorescences, and bracts. To maintain sterility and preserve the samples, we followed a rigorous protocol. Total RNA was extracted using TRIzol reagent according to the manufacturer’s instructions ([294]Yang et al., 2020). RNA libraries were prepared using the NEBNext Ultra RNA Library Prep Kit for Illumina and sequenced on the Illumina NovaSeq 6000 platform to capture comprehensive transcriptome profiles ([295]Mildrum et al., 2020). The quality of the cDNA libraries was assessed using an Agilent Bioanalyzer 2100 instrument, and three replicates were performed for each sample ([296]Grissom et al., 2005). Identification of genes involved in flavonoid biosynthesis We performed transcriptomic and metabolic profiling of H. cordata leaves, rhizomes, bracts, and inflorescences, focusing on flavonoid biosynthetic genes in 11 gene families, namely PAL, 4CL, C4H, CHI, CHS, FLS, F3H, ANS, DFR, IFS, and FNS. The data were visualized using phylogenetic trees, sequence alignments, gene structure diagrams, and heatmaps ([297]Chen et al., 2023a). These graph panels were combined or arranged using TBtools v2.102 ([298]Chen et al., 2023a). Transcriptomic analysis To ensure data integrity, our data pre-processing included rigorous quality assessment and pre-processing using FastQC v0.12.1 and fastp v0.20.1 ([299]Zarour et al., 2021) with default options. These stringent filtering steps were designed to generate a high-quality dataset for subsequent analysis, and metrics such as Q20, Q30, and GC content were calculated. The filtered reads were aligned to the reference genome using HISAT2 v2.2.1 ([300]Kim et al., 2019) in paired-end mode. The results were processed with SAMtools v1.10 ([301]Li et al., 2009) to sort and index the alignment files. Read counting was performed using featureCounts v2.0.1 ([302]Liao et al., 2014) to quantify gene counts, and the gene counts were normalized into transcripts per million (TPM) expression levels. To identify DEGs among different plant tissues, stringent criteria were applied ([303]Shi and Gu, 2020). A log[2] fold change of ≥1 and a significance level (p[adj]) of ≤0.05 were required. Transcriptional regulation of flavonoid biosynthesis We performed a series of gene-expression and co-expression network analyses using the entire transcriptome dataset to uncover transcriptional regulatory networks involving the flavonoid biosynthetic genes and TFs. Genes with TPM < 1 in all tissues were filtered out, and the remaining genes were used to construct a co-expression network using WGCNA ([304]Chen and Ma, 2021). The co-expression network modules were obtained using the blockwise modules function with the following parameters: soft threshold power = 14; TOMtype = signed; merge-CutHeight = 0.25; and minModuleSize = 50. PlantTFDB ([305]Guo et al., 2008) was then used with default parameters to identify TFs in the H. cordata genome. Finally, the networks between the genes and TFs were visualized in Cytoscape v3.7.1 ([306]Chen et al., 2023c). Metabolomic analysis Initial preparation of plant samples for metabolomic analysis included lyophilization and pulverization of individual plant parts, i.e. leaves, rhizomes, inflorescences, and bracts. Metabolites were then extracted from 100 mg of each pulverized sample using a well-established method ([307]Meng et al., 2022). We performed liquid chromatography–mass spectrometry (LC–MS) using the high-sensitivity SCIEX QTRAP 6500+ system for targeted metabolomics studies. This platform enabled us to perform a class-targeted metabolomics analysis, providing detailed insights into the metabolite profiles relevant to our study. Metabolites that met specific criteria, including a VIP (variable importance in projection) score of ≥1 and fold change of ≥2 or ≤0.5, were classified as differentially abundant metabolites. The analysis was performed rigorously, with three independent replicates for each plant tissue. To determine variations in metabolite composition among different components of H. cordata, advanced statistical methodologies were used. The identified metabolites were then annotated using the KEGG compound database and mapped to the KEGG pathway database ([308]Altman et al., 2013). This systematic approach enabled the identification of significantly enriched metabolic pathways by assessing the statistical significance of the overlap between identified metabolites and known metabolic pathways using a hypergeometric test. Statistical analysis of metabolite content and antioxidant capacity Metabolite contents and antioxidant capacity were quantitatively analyzed using standard curves of known concentrations. All statistical analyses were performed using R software (v4.0.3) with the stats package (v4.0.3) for ANOVA and the agricolae package (v1.3–3) for Tukey’s HSD test. Determination of sodium houttuyfonate content To precisely quantify SH in H. cordata, various plant parts were heat blanched at 105°C for 2–10 min, then dried to a constant weight at 60°C. The dried samples were ground and sieved (40–60 mesh). For the extraction, an ultrasonic extraction method was used ([309]Cui et al., 2022). A mixture of 0.02 M K[2]HPO[4], 0.02 M KH[2]PO[4], and methanol was prepared in a 10:10:80 ratio, with the pH adjusted to 9. The extraction was carried out at 60°C for 120 min. After extraction, centrifugation was performed at 25°C at 2000 rpm for 10 min. The supernatant was filtered through a 0.22-μm microporous filter for chromatographic analysis. The chromatographic analysis used a C18 column (250 cm × 4.6 mm) with the same mobile phase. An isocratic elution mode was used at a flow rate of 0.5 ml/min, and the column temperature was maintained at room temperature ([310]Schellinger and Carr, 2006). The detection wavelength was set to 286 nm, and the total detection duration was 18 min. Mass spectrometric identification of sodium houttuyfonate We used an Agilent 1290 ultra-high performance LC system coupled with a Thermo Fisher Q-Exactive Orbitrap Plus high-resolution mass spectrometer to accurately determine the concentration of SH ([311]Liu et al., 2019). Chromatographic separation was achieved using a Waters T3 column (21 mm × 50 mm, 1.8 μm), with precise temperature control at 50°C. The mobile phases consisted of 0.1% formic acid in water (phase A) and acetonitrile (phase B), delivered at a flow rate of 0.3 ml/min. We carefully designed a gradient program, starting with 2% B for the first minute, followed by a linear increase to 100% B over 1–18 min, maintenance at 100% B until 22 min, then a rapid reduction to 2% B over 0.1 min and maintenance at 2% B for the final 2.9 min. The injection volume was 5 μl. For MS analysis, we set dynamic data acquisition to cover a range of 100–1000 m/z. We fine-tuned the ESI source parameters, including an auxiliary gas flow rate of 16 Arb, full MS resolution at 70 000, and a collision energy (NCE) of 25 eV for MS/MS mode with a resolution of 17 500. The spray voltage was set to −3.0 kV for negative mode and 3.0 kV for positive mode, with a dynamic exclusion duration of 4 s and energy steps in NCE of 10, 30, and 50. Data acquisition ranged from m/z 100 to 1000. We performed data interpretation and analysis using Xcalibur 4.3 and MS-DIAL 5.0.3 workstations ([312]Tsugawa et al., 2015). To validate the identification of compounds, we cross-referenced them with the standard SH (CAS 83766-73-8), searched through local databases, and confirmed diagnostic ions. In addition, we analyzed primary quasi-molecular ions and secondary MS fragmentation patterns to ensure a high level of accuracy in compound identification. Plasmid construction To generate the constitutive expression constructs, YFP sequences were initially amplified using KOD-Plus DNA polymerase (TOYOBO, [313]www.toyobo-global.com). The amplified sequences were then cloned into the pSK34 vector ([314]Banno et al., 2001) using the SpeI and NotI restriction sites to create the pSKY36 vector. Sequences encoding full-length AP2_2 (Hocor06aG0037700) and MYB_2 (Hocor18aG0011700) (amino acids 1–356) were amplified and inserted into pSKY36 at the AscI and SpeI sites as described in [315]Yang et al. (2014b), yielding the constructs 35S::AP2_2-YFP and 35S::MYB_2-YFP. The details of all primers used are provided in [316]Supplemental Table 22. Protein subcellular localization and gene-expression analysis Subcellular localization of AP2_2 and MYB_2 proteins was assessed by transient expression experiments in tobacco (Nicotiana benthamiana) as detailed in [317]Yang et al. (2014a). Various constructs were transformed into Agrobacterium tumefaciens strain GV3101, which was subsequently suspended in infiltration medium consisting of 1 mM MES (2-(N-morpholino)ethanesulfonic acid; pH 5.6), 10 mM magnesium chloride, and 200 μM acetosyringone. The bacterial cultures were adjusted to an OD[600] of approximately 0.6. The constructs were transiently expressed in tobacco leaf epidermal cells via agroinfiltration. Fluorescence was observed 3 days post infiltration using a Zeiss LSM 710 confocal microscope ([318]www.zeiss.com). For nuclear staining, leaf samples were stained with 1 μg/ml DAPI (4′,6-diamidino-2-phenylindole; Sigma-Aldrich) for 10–30 min. Gene-expression analysis was performed using total RNA extracted from treated leaves. RT–qPCR and RT–PCR were performed as described previously ([319]Liu et al., 2007; [320]Yang et al., 2023). Two micrograms of total RNA was used for first-strand cDNA synthesis using M-MLV reverse transcriptase (Invitrogen, [321]www.invitrogen.com) following the manufacturer’s protocol. Actin and GAPDH served as internal controls for normalization in the RT–qPCR analyses. Data and code availability We have uploaded the original sequence data for RNA sequencing and whole-genome assembly of H. cordata to NCBI and the Genome Sequence Archive at the China National Genomics Data Center ([322]https://ngdc.cncb.ac.cn). The BioProject numbers are PRJNA940964 and PRJCA015615. Funding This project was funded by the National Natural Science Foundation of China, China (grant number 32360074), the Guizhou Provincial Natural Science Foundation of Department of Education,China ([2022]077), The Karst Mountain Ecological Security Engineering Research Center, China (KY [2021]007), and the Joint Fund of the National Natural Science Foundation of China and the Karst Science Research Center of Guizhou Province, China (U1812401). Acknowledgments