Abstract

   Houttuynia cordata, also known as Yuxingcao in Chinese, is a perennial
   herb in the Saururaceae family. It is highly regarded for its medicinal
   properties, particularly in treating respiratory infections and
   inflammatory conditions, as well as boosting the human immune system.
   However, a lack of genomic information has hindered research on the
   functional genomics and potential improvements of H. cordata. In this
   study, we present a near-complete assembly of H. cordata genome and
   investigate the biosynthetic pathway of flavonoids, specifically
   quercetin, using genomics, transcriptomics, and metabolomics analyses.
   The genome of H. cordata diverged from that of Saururus chinensis
   around 33.4 million years ago; it consists of 2.24 Gb with 76
   chromosomes (4n = 76) and has undergone three whole-genome duplication
   (WGD) events. These WGDs played a crucial role in shaping the H.
   cordata genome and influencing the gene families associated with its
   medicinal properties. Through metabolomics and transcriptomics
   analyses, we identified key genes involved in the β-oxidation process
   for biosynthesis of houttuynin, one of the volatile oils responsible
   for the plant’s fishy smell. In addition, using the reference genome,
   we identified genes involved in flavonoid biosynthesis, particularly
   quercetin metabolism, in H. cordata. This discovery has important
   implications for understanding the regulatory mechanisms that underlie
   production of active pharmaceutical ingredients in traditional Chinese
   medicine. Overall, the high-quality genome assembly of H. cordata
   serves as a valuable resource for future functional genomics research
   and provides a solid foundation for genetic improvement of H. cordata
   for the benefit of human health.

   Key words: Houttuynia cordata, flavonoid biosynthesis, genome assembly,
   houttuynin, quercetin, whole-genome duplication
     __________________________________________________________________

   This study reports a chromosome-level genome assembly for Houttuynia
   cordata and reveals its evolution and whole-genome duplication history.
   A combination of genome, transcriptome, and metabolome analyses provide
   insights into the regulatory mechanism of flavonoid biosynthesis and
   elucidate possible enzymatic processes involved in houttuynin
   biosynthesis in H. cordata.

Introduction

   Houttuynia cordata, commonly known as Yuxingcao or fishy-smelling herb
   in China, is a perennial herbaceous plant from the Saururaceae family
   ([43]Bahadur Gurung et al., 2021; [44]Yin et al., 2023). It is the only
   species in the Houttuynia genus of the Saururaceae family ([45]Han,
   1995) and is found in various regions, including China, Japan, Korea,
   northeastern India, and Southeast Asia ([46]Xu et al., 2021; [47]Luo
   et al., 2022; [48]Wei et al., 2024). It typically grows in moist to
   soggy soils on shady hillsides, waysides, and ridges at altitudes
   ranging from 300 to 2600 m ([49]Luo et al., 2022). H. cordata has a
   unique reproductive method, relying primarily on underground rhizome
   formation and parthenogenesis instead of traditional sexual
   reproduction ([50]Laldinsangi, 2022). It holds great value as both a
   culinary ingredient and a medicinal herb, with numerous studies showing
   its notable antiviral activity ([51]Yuan et al., 2022; [52]Jiu et al.,
   2023).

   The medicinal properties of H. cordata are largely attributed to its
   volatile oils and flavonoid compounds, which have been extensively
   studied for their antibacterial, anti-inflammatory, antiviral, and
   antioxidant effects ([53]Laldinsangi, 2022; [54]Rafiq et al., 2022;
   [55]Pradhan et al., 2023; [56]Wei et al., 2024). The primary components
   of the volatile oil in H. cordata include houttuynin, β-pinene,
   2-undecanone (methyl nonyl ketone), ethyl caprylate, and α-pinene
   ([57]Yang et al., 2019; [58]Lin et al., 2022). Houttuynin is
   particularly noteworthy for its unique fishy smell and potent
   antibacterial properties. Its derivative, sodium houttuyfonate (SH), is
   water stable and exhibits a broad range of antibacterial,
   anti-inflammatory, antioxidant, and antitumor activities against
   various pathogens ([59]Liu et al., 2021b; [60]Shen et al., 2021;
   [61]Cheng et al., 2023; [62]He et al., 2023). In addition, SH has been
   found to modulate the immune system and intestinal flora ([63]Zhang
   et al., 2020). During the process of steam distillation, SH undergoes a
   transformation into 2-undecanone, which further enhances its
   therapeutic potential by activating the Nrf2/HO-1/NQO-1 signaling
   pathway ([64]Lou et al., 2019). Despite the significant medical
   benefits of SH, its biosynthetic pathway remains unknown, leading to
   the need for artificial synthesis.

   Flavonoids are vital plant secondary metabolites synthesized via the
   phenylpropanoid pathway by enzymes such as chalcone synthase (CHS) and
   flavonoid 3-hydroxylase (F3H) ([65]Nabavi et al., 2020; [66]Dong and
   Lin, 2021). These compounds, which include flavonols, flavones, and
   anthocyanins, exhibit antioxidant, antibacterial, and antiviral
   properties ([67]Dias et al., 2021; [68]Ekalu and Habila, 2020; [69]Liu
   et al., 2021a). Research has demonstrated that flavonoids from H.
   cordata can alleviate lung inflammation and mitigate H1N1-induced lung
   injury in mice by inhibiting influenza virus and Toll-like receptor
   signaling ([70]Hung et al., 2015; [71]Lee et al., 2015; [72]Zhou
   et al., 2022). Quercetin, a flavonol found in H. cordata, is
   efficiently absorbed in the human digestive system, particularly in the
   small intestine and stomach ([73]Michala and Pritsa, 2022). This
   compound forms glycosides such as quercitrin, isoquercitrin, baimaside,
   hyperoside, rutin, and isohyperoside, which have demonstrated
   remarkable medicinal properties. Recent studies have highlighted
   quercetin’s protective effects against UVB-induced skin damage and lung
   and liver injuries, as well as its potential to inhibit tumor growth
   ([74]Mapoung et al., 2021; [75]Wang et al., 2022a; [76]Pradhan et al.,
   2023).

   Advances in sequencing technology have made it possible and affordable
   to sequence the genomes of medicinal plants with large genome sizes.
   This breakthrough has greatly enhanced research on medicinal plants at
   the molecular and genetic levels ([77]Cheng et al., 2021b). In line
   with this progress, the “1K Medicinal Plant Genome Project” has been
   proposed, aiming to complete the collection and sequencing of 1000
   important medicinal plant genomes within 3–5 years ([78]Su et al.,
   2022). The 1K Medicinal Plant Genome Database (1K-MPGD) is now
   available at [79]http://www.herbgenome.com/. Within the Saururaceae
   family are six species spread across four relictual genera: Saururus,
   Gymnotheca, Anemopsis, and Houttuynia ([80]Han, 1995). Many of these
   species are herbaceous plants. However, only the diploid genome of
   Saururus chinensis (2n = 22) has been reported to date ([81]Xue et al.,
   2023). H. cordata is a perennial polyploid herb, existing in diploid or
   tetraploid form, and features a diverse chromosome count, ranging from
   18 to 108 ([82]Luo et al., 2022). Despite its broad genetic diversity,
   the precise chromosome number and ploidy level of H. cordata remain
   undefined at the genomic level. This absence of detailed genomic data
   limits the effective utilization of this medicinal plant. Therefore, it
   is crucial to decode the complete genome of H. cordata to gain a
   comprehensive understanding of its genome structure. This information
   will also facilitate the dissection of biosynthetic pathways for
   bioactive compounds, aiding genomics-assisted breeding and herbal
   synthetic biology.

   In the current study, we sequenced and assembled a high-quality genome
   of H. cordata, achieving an exceptional chromosome-level assembly in
   its field. We then examined the phylogenetic relationships and
   evolution of H. cordata, explored whole-genome duplications (WGDs) in
   this species, and identified crucial components of the flavonoid
   biosynthetic pathway. Leveraging the assembled genome, together with
   transcriptomic and metabolomic analyses, we identified key genes
   responsible for houttuynin biosynthesis and quercetin metabolism in H.
   cordata. This study not only broadens our understanding of the genetic
   and metabolic complexity of H. cordata but also establishes a
   foundational genome resource that can catalyze future functional
   genomics studies and enhance the pharmacological exploitation of this
   important herbaceous plant.

Results

Sequencing, assembly, and annotation of the genome

   H. cordata consists of underground white rhizomes, ovate-shaped leaves
   above ground, four white bracts, and a pale-yellow spike inflorescence
   without petals ([83]Figure 1A). According to the genome survey, the
   identification of multiple coverage peaks in the k-mer depth
   distribution indicated the potential polyploidy of H. cordata
   ([84]Supplemental Figure 1A). Moreover, the coverage of the AAAB (0.32)
   type was greater than that of the AABB (0.07) type in Smudgeplot,
   providing strong evidence for autotetraploidy ([85]Supplemental
   Figure 1B). The haplotype size (n) was estimated as approximately 600
   Mb by k-mer analysis ([86]Supplemental Table 1), which was consistent
   with the total genome size (4n) estimated by flow cytometry (1.9–2.5
   Gb) ([87]Supplemental Figure 2).

Figure 1.

   [88]Figure 1
   [89]Open in a new tab

   Morphological and genomic features of H. cordata.

   (A and B) (A) Overview of the entire H. cordata plant, with enlarged
   images of the leaf, inflorescences, bract, and rhizome. Scale bar,
   10 cm. (B) Features of the assembled H. cordata genome. The outer to
   inner circles represent chromosome-scale pseudochromosomes
   (chr01–19[a–d]), class I transposable element (TE) density, class II TE
   density, coding gene density, GC content, collinear blocks, and window
   size. Each linking line in the center of the plot represents a pair of
   homologous genes. TE density refers to the density of TEs in a specific
   genomic region, and coding gene density refers to the density of
   protein-coding genes. The proportion of tandemly repeated DNA sequences
   in a given genomic region is represented by the tandem repeat
   proportion. GC content indicates the proportion of guanine (G) and
   cytosine (C) nucleotides in a specific genomic region. Collinear
   blocks, with a minimum length of 100 kb, represent collinear genomic
   segments that share homologous DNA sequences. The genomic region being
   analyzed is divided into windows of 500 kb each, represented by the
   window size. The four sets of haplotype chromosomes are denoted by a–d.

   (C) Locations of telomere and centromere regions in selected
   chromosomes. The purple line in each subgraph represents the
   distribution of “TTTAGGG” copies on each chromosome, with the positions
   of telomeres marked by purple triangles. The gray line in each subgraph
   represents the copy number of tandem repeats (50–1000 bp monomer) on
   each chromosome, with possible centromere positions marked by green
   rectangles. TRF, Telomere Repeats Finder.

   Using a multi-platform approach, we sequenced and de novo assembled the
   genome of H. cordata ([90]Figure 1B). The selected plant was identified
   as an autotetraploid with a predicted heterozygosity rate exceeding
   2% and an estimated genome size of 2.4 Gb ([91]Supplemental Figure 1),
   consistent with the range of 1.9–2.5 Gb estimated by flow cytometry
   ([92]Supplemental Figure 2). We obtained 65 Gb of PacBio high-fidelity
   (HiFi) reads by optimizing circular consensus sequencing (CCS)
   technology, which enabled us to generate highly accurate (99.94%) long
   HiFi reads with an average length of 16 kb. These assembled HiFi reads
   were then grouped, ordered, and oriented into pseudomolecules
   (hereafter referred to as Hi-C pseudomolecules) This initial assembly
   yielded a draft genome of 2.24 Gb, consisting of 86 contigs with a
   contig N50 length of 28.73 Mb. Subsequently, we used 111 Gb (∼50×) of
   chromosome conformation capture (Hi-C) data to scaffold these contigs
   into 78 pseudomolecules, representing 76 chromosomes, one mitochondrial
   genome, and one chloroplast genome ([93]Supplemental Table 1 and
   [94]Figure 1B). The resulting scaffolded assembly was found to be
   consistent with the karyotype analysis ([95]Supplemental Figure 3),
   confirming the presence of chr01 [abcd] to chr19 [abcd] (4n = 76)
   ([96]Supplemental Figures 4–6). At this stage, eight gaps were
   identified within the pseudomolecules. To enhance the assembly quality,
   we polished the pseudochromosome sequences using Illumina short reads
   and attempted gap filling using HiFi long reads. Ultimately, this
   process produced a 2.24-Gb genome of H. cordata, consisting of 78
   pseudomolecules with a pseudomolecule N50 length of 29.19 Mb and a GC
   content of 39.57% ([97]Supplemental Table 1). Comparison of the
   haplotype sequences of the autotetraploid revealed a high level of
   comparability among the homologous chromosomes, with an alignment rate
   of 84%–86%, and abundant structural variations such as inversions,
   translocation, and duplications were also detected ([98]Supplemental
   Table 1 and [99]Supplemental Figure 7).

   To predict gene models, we used a combination of de novo prediction,
   homology-based searches, and RNA-sequencing data, resulting in the
   prediction of 126 864 protein-coding genes in H. cordata. The primary
   gene models exhibited a mean length of 1629 bp, with an average exon
   length of 293 bp and an average intron length of 589 bp
   ([100]Supplemental Table 1). On the basis of a Benchmarking Universal
   Single-Copy Orthologs (BUSCO) assessment, the completeness of the gene
   annotations was approximately 99.1% ([101]Supplemental Table 2). We
   functionally annotated the protein-coding genes using various
   databases, including Gene Ontology (GO), Kyoto Encyclopedia of Genes
   and Genomes (KEGG), Swiss-Prot, and others, an over 98.66% of the genes
   could be mapped to corresponding functional items, with only 0.82%
   remaining unannotated ([102]Supplemental Table 3). These analyses
   indicate a relatively comprehensive structural annotation of the
   reference genome. RepeatMasker was used to identify repetitive
   sequences, which accounted for 53.60% of the genome. Long terminal
   repeat (LTR) retrotransposons were the predominant type of repetitive
   element, comprising 33.93% of the genome, including 4.12% Copia, 22.64%
   Gypsy, and 7.18% unknown. The LTR assembly index (LAI) was estimated as
   15.49, consistent with the criteria for a reference genome ([103]Ou
   et al., 2018) ([104]Supplemental Table 4).

   Telomeres and centromeres play essential roles in maintaining genome
   stability. Telomeres, located at the ends of chromosomes, consist of
   repeated sequences (TTTAGGG) and help to protect the integrity of
   chromosomes. We identified telomere sequences at both ends of 72 out of
   the 76 chromosomes. This enabled us to assemble the four sets of
   haplotype genomes, providing further evidence for the presence of
   intact telomeres in the genome ([105]Figure 1C, [106]Supplemental
   Figure 8, and [107]Supplemental Table 5). However, four
   chromosomes—chr7b, chr7c, chr7d, and chr8d—displayed telomere sequences
   at only one terminus. This unusual pattern suggests potential
   structural variations, such as telomere loss or rearrangements. This
   finding emphasizes the need for further investigation of the genomic
   stability and integrity of these specific regions ([108]Supplemental
   Figure 8 and [109]Supplemental Table 5). Centromeres consist of tandem
   repeat sequences of various lengths that can vary among closely related
   species. Using a tool called Telomere Repeats Finder (TRF), we
   identified potential centromere positions based on the distribution of
   tandem repeat sequences. We identified candidate centromere regions in
   35 of the 76 chromosomes ([110]Figure 1C, [111]Supplemental Figure 8,
   and [112]Supplemental Table 5). Notably, some chromosomes exhibited
   distinctive patterns of highly enriched tandem repeats, whereas others
   did not show clear clusters of enriched sequences ([113]Figure 1C). The
   centromere regions in H. cordata were relatively short, ranging from 2
   to 71 kb in length. In addition, there were variations in the lengths
   of the repeating monomers within these regions, which ranged from 66 to
   754 bp ([114]Supplemental Table 5). For instance, the centromeres of
   chr18 [a, c, d] were the shortest, consisting of a repeating monomer
   length of 94 bp that was repeated approximately 29.5 times
   ([115]Supplemental Figure 8 and [116]Supplemental Table 5). By
   contrast, the centromere of chr14c had a larger monomer length of 754
   bp that was repeated 45.9 times ([117]Supplemental Figure 8 and
   [118]Supplemental Table 5). These variations suggest potential
   differences in the organization and structure of centromeres among
   different chromosomes. The identification and characterization of
   centromere regions in H. cordata provides valuable insights into the
   genome dynamics and stability of this plant species.

   To assess the completeness of the genome assembly, we examined the
   presence of ribosomal RNA (rRNA) sequences, specifically the 18S and 5S
   rRNA genes. These sequences are highly repetitive and are useful
   indicators of genome quality. We identified the distribution of the 18S
   rRNA gene primarily at the ends of chr06 [a–d] and chr07 [a–d],
   confirming the presence of these rRNA sequences in the assembled genome
   ([119]Supplemental Figure 9A). The 5S rRNA gene arrays were
   predominantly located on chr09 [a–d] ([120]Supplemental Figure 9B). The
   presence of these rRNA sequences further supports the completeness and
   reliability of the assembled genome.

Evolution and whole-genome duplication of the H. cordata genome

   The genome of H. cordata provides a valuable resource for studying the
   evolution of the magnoliid branch ([121]Soltis et al., 2009). To
   examine the evolutionary position of H. cordata, we performed a
   comparative analysis of 21 angiosperm species, including
   representatives from Magnoliaceae, eudicots, monocots, and basal
   angiosperms such as Amborella trichopoda ([122]Amborella Genome
   Project, 2013) and Nymphaea colorata ([123]Dong et al., 2018). We
   identified a total of 19 455 gene families, 25 of which contained 86
   genes specific to H. cordata ([124]Supplemental Table 6). To establish
   phylogenetic relationships and estimate divergence times, we selected
   150 single-copy orthologous gene families to construct a phylogenetic
   tree ([125]Figure 2A). H. cordata clustered together with other
   magnoliid species, including Aristolochia fimbriata, Aristolochia
   contorta, Piper nigrum, and Saururus chinensis. The divergence between
   H. cordata and Aristolochia species occurred approximately 106.8
   million years ago (MYA), whereas the divergence between H. cordata and
   S. chinensis took place approximately 33.7 MYA ([126]Figure 2A and
   [127]Supplemental Table 7).

Figure 2.

   [128]Figure 2
   [129]Open in a new tab

   Phylogenetic analysis and whole-genome duplications of the H. cordata
   genome.

   (A) Phylogenetic tree depicting the relationships among H. cordata and
   20 other angiosperm species. The pie charts show the proportions of
   gene families with expansions (pink) and contractions (light blue) at
   each node or species. The estimated divergence time (in million years
   ago, MYA) is indicated at each node, with bars representing 95%
   confidence. Nodes that have undergone whole-genome duplications (WGDs)
   are marked with a star. Purple stars indicate well-known ancient WGD
   events, and red stars represent WGD events specific to certain species.

   (B and C) Synonymous substitution rate (K[s]) distributions. The K[s]
   distributions of syntenic blocks for orthologous genes (B) and
   paralogous genes (C) are shown for H. cordata, haplotype A (HapA) of H.
   cordata, Amborella trichopoda, Aristolochia contorta, Piper nigrum, and
   Schisandra chinensis.

   (D) Collinear relationships among A. contorta, S. chinensis, and H.
   cordata chromosomes. The gray lines in the background indicate syntenic
   blocks between the genomes that span more than 15 genes, with some
   representative blocks highlighted in red.

   (E) Phylogenetic tree of homologous genes from S. chinensis, P. nigrum,
   and H. cordata. The genes from H. cordata, S. chinensis, and P. nigrum
   are highlighted in green, orange, and purple, respectively. The branch
   color indicates the bootstrap support value, decreasing from red to
   green.

   We next analyzed gene family expansion and contraction across the
   selected species. We found that 74.0% of the gene families (4974) had
   experienced expansion in H. cordata, and 26.0% (1752) of the gene
   families had undergone contraction ([130]Supplemental Table 8). The
   expanded gene families were enriched in GO terms associated with
   transcriptional regulation and the cell cycle ([131]Supplemental
   Table 9), indicating their involvement in gene expression and cell
   division. In the KEGG pathway enrichment analysis, the expanded gene
   families were primarily associated with plant hormone signal
   transduction and thiamine, cysteine, and methionine metabolism
   ([132]Supplemental Table 10). Among the expanded gene families,
   particular emphasis should be placed on those rapidly expanded gene
   families that clustered in the GO enrichment analysis categories of
   DNA-binding transcription factor activity, secondary
   metabolic processes, defense response to other organisms, and
   UDP-glucosyltransferase activity ([133]Supplemental Table 11). In
   the KEGG enrichment analysis, these rapidly expanded gene families were
   involved in phenylpropanoid and tryptophan metabolism, glucosinolate
   biosynthesis, flavonoid biosynthesis, and diterpenoid biosynthesis
   ([134]Supplemental Table 12).

   WGDs and the amplification of repetitive sequences are recognized as
   key drivers of species evolution. To explore the potential occurrence
   of WGDs in H. cordata, we analyzed the synonymous substitution rates
   (K[s]) of orthologous gene pairs between H. cordata and four other
   species. The K[s] peak distributions between H. cordata and A.
   trichopoda, between H. cordata and A. fimbriata, and between H. cordata
   and S. chinensis generally aligned with their predicted evolutionary
   relationships ([135]Figure 2B). We identified WGDs in P. nigrum, S.
   chinensis, and H. cordata by analyzing the density distribution of K[s]
   of paralogous gene pairs ([136]Figure 3C). In the H. cordata genome, we
   detected three distinct K[s] peaks of 1.22, 0.16, and 0.01
   ([137]Figure 2C), indicating the occurrence of three WGDs in H.
   cordata. The most recent WGD, WGD3, as a polyploidization event,
   resulted in the formation of a tetraploid species of H. cordata from a
   diploid species ([138]Figure 2D). However, the collinear gene pairs
   between haplotype A of H. cordata (HcoHapA) and A. contorta, which did
   not experience independent WGD events, exhibited a 4:1 ratio
   ([139]Supplemental Figure 10A and 10B), reflecting the doubling of WGD1
   and WGD2. The detected K[s] peak was approximately 1.12 in S.
   chinensis, and a notable number of gene pairs exhibited K[s] values
   concentrated below 0.25, suggesting that S. chinensis has undergone
   only one WGD, consistent with previous findings ([140]Xue et al., 2023)
   ([141]Figure 2C). The syntenic depth between S. chinensis and HcoHapA
   showed a 2:4 pattern ([142]Supplemental Figure 11A and 11B), confirming
   an additional WGD event in H. cordata. By contrast, two distinct peaks,
   measuring around 0.88 and 0.16, were observed in the P. nigrum genome
   ([143]Figure 2C). In the collinearity analysis between HcoHapA and P.
   nigrum, the syntenic depths showed a 4:2 pattern. However, some of the
   HcoHapA blocks exhibited alignments with five to eight corresponding P.
   nigrum blocks ([144]Supplemental Figure 12A and 12B); for example,
   chr01a, chr02a, chr04a, and chr11a aligned to Pn1, Pn3, Pn5, Pn6, Pn7,
   Pn13, Pn15, and Pn24 ([145]Supplemental Figure 12A and 12B). This
   indicates that P. nigrum has undergone at least two WGDs. However,
   relying solely on evaluation of the density distribution of K[s], it is
   challenging to determine whether P. nigrum, S. chinensis, and H.
   cordata have experienced a common WGD. Alternatively, these data
   suggest a shared WGD1 event between H. cordata and P. nigrum. Our
   analysis of paralogous genes further supports this observation, as we
   detected paralogous genes in P. nigrum that originated from WGD1 (block
   K[s] ≈ 1.0–2.0; [146]Supplemental Figures 13–15). On the basis of genes
   within the identified blocks and orthologous genes of H. cordata, we
   deduced the evolutionary relationships ([147]Figure 2D,
   [148]Supplemental Figure 16, and [149]Supplemental Table 13). Notably,
   orthologous gene pairs tended to cluster together, rather than
   paralogous gene pairs ([150]Figure 2A and 2E), further supporting the
   occurrence of WGD1 in the ancestor of H. cordata and P. nigrum.

Figure 3.

   [151]Figure 3
   [152]Open in a new tab

   Inference of karyotype evolution in H. cordata.

   (A) The dot plot displays collinear blocks within haplotype A (HapA) of
   H. cordata. Chromosomes have been sorted and colored on the basis of
   homology, with pairs of similar colors indicating highly homologous
   chromosomes resulting from whole-genome duplication (WGD).

   (B) Speculative processes of chromosome duplications and three
   end-to-end joining (EEJ) events following WGD2 and WGD3.

   (C) Dot plot of collinear blocks among nine chromosomes in H. cordata,
   representing the ancestral karyotype prior to WGD2. Chromosome colors
   correspond to those in (A) and (B).

   (D) Dot plot of collinear blocks among nine chromosomes of H. cordata
   and A. contorta. Chromosomes are color-coded according to A. contorta.

   (E) Speculative models for chromosomal changes in the WGD1 event.

Inference of karyotype evolution in the H. cordata genome

   Advances in genome sequencing in multiple organisms have opened up new
   possibilities for understanding the evolution of ancestral karyotypes
   ([153]Murat et al., 2017). After multiple WGD events, the H. cordata
   genome still exhibits traces of duplication, suggesting the
   preservation of ancient genomic information. By analyzing conserved
   blocks in HcoHapA, we identified duplications and inferred chromosome
   rearrangement events following WGD2 ([154]Figure 3A). Notably, Chr1a
   and Chr11a displayed high homology, indicating their duplication
   through WGD2. Three evident chromosome rearrangement events occurred in
   H. cordata after WGD2. Prominent homology between the lower half of
   Chr9a and Chr5a, as well as the upper half of Chr9a and Chr15a,
   suggested end-to-end joining (EEJ) following duplications from two
   ancestral chromosomes ([155]Figure 3A and 3B). Similarly, Chr19a
   exhibited high homology with the lower half of Chr15a and Chr8a,
   indicating its gradual evolution through EEJ from two ancestral
   chromosomes ([156]Figure 3A and 3B). On the basis of this inference,
   the chromosome count initially started at n = 11. It then doubled
   through WGD2, resulting in a count of n = 22. Subsequently, three EEJ
   events occurred, leading to a final count of n = 19. Thereafter, the H.
   cordata species underwent WGD3 (polyploidization), leading to the
   formation of a tetraploid genome. (4n = 19 × 4) ([157]Figure 3B).

   To detect duplications caused by WGD1, we selected nine chromosomes
   (Chr1a/2a/3a/6a/7a/9a/10a/14a/19a) to avoid interference from
   duplications that resulted from WGD2. Collinearity analysis revealed
   clear duplications between Chr1a and Chr2a as well as between Chr6a and
   Chr7a ([158]Figure 3C). In addition, Chr9a was formed through EEJ of an
   ancestral chromosome following duplication ([159]Figure 3B). Incomplete
   reciprocal homologous fragments were also detected among Chr10a,
   Chr14a, Chr19a, and Chr3a ([160]Figure 3C), indicating subsequent
   changes to WGD1. Insertion of a fragment from Chr19a into the middle of
   Chr14a ([161]Figure 3C) resulted in chromosomal reduction, and fragment
   exchanges between Chr10a and Chr3a were observed. On the basis of these
   findings, the chromosome count of the common ancestor species of H.
   cordata and P. nigrum was estimated to be 6 before WGD1.

   A. contorta (n = 7), an outgroup species of H. cordata and P. nigrum,
   did not undergo independent WGDs. This suggests that chromosome fusions
   occurred in the ancestral species, resulting in a change in chromosome
   number from 7 to 6. Collinearity analysis between H. cordata and A.
   contorta revealed that the collinearity of almost all chromosomes was
   not well maintained, likely owing to the distant evolutionary
   relationship between A. contorta and H. cordata ([162]Figure 3D).
   However, it is evident that Chr6a and Chr7a in H. cordata evolved
   through the fusion of Ac3 and Ac5 in A. contorta ([163]Figure 3D). In
   addition, the fusion of Ac3 and Ac5 was also detected in the
   collinearity between A. contorta and P. nigrum ([164]Supplemental
   Figure 17A and 17B). Overall, a fusion event occurred in the ancestral
   species of H. cordata and P. nigrum, resulting in a shift from 7 to 6
   chromosomes ([165]Figure 3E). Following WGD1 in the ancestral species
   (n = 6), a replica of a Chr19a fragment was inserted into Chr14a
   ([166]Figure 3C), resulting in a species with 11 chromosomes
   ([167]Figure 3E).

Sodium houttuyfonate biosynthesis in H. cordata

   Houttuynin, also known as decanoyl acetaldehyde, is an
   important compound found in H. cordata that is responsible for its
   distinct fishy odor ([168]Laldinsangi, 2022). The water-soluble
   derivative, SH, is highly stable and exhibits important medicinal
   properties ([169]Liu et al., 2021b). As a result, it is widely used in
   various medical applications. An initial ultra-high performance liquid
   chromatography with quadrupole time-of-flight mass spectrometry
   (UPLC-Q-TOF-MS) analysis successfully detected the presence of SH in
   both the rhizome ([170]Supplemental Figure 18) and
   leaves ([171]Supplemental Figure 19) of H. cordata. This was determined
   based on retention time (11.89) and mass spectral data, which included
   a parent ion at m/z 302.30493 and fragment ions at m/z 295.28714,
   297.28287, and so forth. These results were consistent with those of
   the SH standard sample ([172]Supplemental Figure 20). A noteworthy
   finding was a significantly higher (p < 0.05) concentration of SH in
   rhizomes (79.59 ppm) compared with leaves (2.40 ppm) ([173]Figure 4A).
   This disparity aligns with the stronger fishy odor evident in the
   rhizomes, suggesting a potential biosynthetic pathway for SH within the
   plant.

Figure 4.

   [174]Figure 4
   [175]Open in a new tab

   Proposed sodium houttuyfonate biosynthetic pathway in H. cordata.

   (A) Analysis of sodium houttuyfonate (SH) in different tissues using
   high-performance liquid chromatography. The retention time of the peak
   associated with SH is indicated by a white arrow. The accompanying bar
   graph provides a quantitative representation of the SH content in
   various tissues. The letters a and b indicate significant differences
   between groups based on ANOVA analysis, followed by Tukey’s HSD test
   for post hoc analysis.

   (B) Proposed biosynthetic pathway of SH. Houttuynin originates from
   fatty acid metabolism, specifically palmitic acid (C[16]H[32]O[2],
   molecular weight [MW] 256.42). Palmitic acid is converted to
   palmitoyl-CoA (C16) by acetyl-CoA synthetase, followed by mitochondrial
   fatty acid β-oxidation. This cycle continues until decanoyl-CoA is
   formed, which is then converted to decanoic acid (C[10]H[20]O[2], MW
   172.26) by thioesterase. Decanoic acid is further oxidized to
   decanoylacetaldehyde (houttuynin, C[12]H[22]O[2], MW 198.16).
   Subsequently, the aldehyde group is transformed into a hydroxyl group,
   reacting with sodium sulfate to form water-soluble sodium houttuyfonate
   (SH, C[12]H[23]NaO[5]S, MW 302.36). In addition, houttuynin is prone to
   instability and can readily transform into 2-undecanone (methyl nonyl
   ketone, C[11]H[22]O) during production. Dashed lines represent
   potential intermediate steps, and question marks indicate uncertain
   enzyme involvement. Metabolites identified by UPLC-Q-TOF-MS in this
   study are marked with black boxes, and those reported in previous
   studies are enclosed in cyan boxes.

   (C) Clustered heatmap of differentially expressed genes (DEGs) in
   rhizome vs. leaf tissues (Rz vs. Lf) of H. cordata. Values from three
   biological replicates are shown. The DEGs are categorized into seven
   functional classes based on their roles in the fatty acid metabolic
   pathway. These classes include fatty acid synthases (FAS), fatty acid
   activation enzymes (FAAE), fatty acid oxidoreductases (FAOR), fatty
   acid transferases (FAT), fatty acid esterases (FAE), lipid enzymes
   (LE), and fatty acid dehydrogenases (FAD). Gene-expression differences
   are quantified as log[2] fold changes, with statistical significance
   set at p < 0.001. “Up” indicates a log[2] fold change > 1 (upregulation
   in rhizome relative to leaf), and “Down” indicates a log[2] fold
   change < −1 (downregulation in rhizome relative to leaf).

   Houttuynin differs from typical secondary metabolites such as terpenes
   with isoprene units, flavonoids with phenolic structures, or alkaloids
   with nitrogen atoms. Its unique characteristics, including the presence
   of an aldehyde group and decanoyl group, suggest that it is derived
   from fatty acid oxidation. Therefore, we propose that houttuynin is
   produced through the β-oxidation pathway of fatty acid metabolism. In
   this pathway, fatty acids abundant in H. cordata, such as palmitic acid
   (C16), are first converted into palmitoyl-coenzyme A (CoA) through
   acyl-CoA synthesis. This compound then undergoes sequential removal of
   two-carbon units through β-oxidation. After three cycles of
   β-oxidation, decanoyl-CoA (C10-CoA) is formed. We hypothesize that
   thioesterase enzyme converts decanoyl-CoA into decanoic acid
   (C[10]H[20]O[2]), which is further transformed into decanal
   (C[10]H[20]O). Subsequently, aldehyde dehydrogenase likely catalyzes
   the conversion to decanoyl acetaldehyde (houttuynin, C[12]H[22]O[2]),
   ultimately leading to the synthesis of SH (C[12]H[23]NaO[4]S).
   Houttuynin is prone to instability and can readily transform into
   another compound known as 2-undecanone (methyl nonyl ketone,
   C[11]H[22]O) during production ([176]Chen et al., 2014)
   ([177]Figure 4B). Our hypothesis is supported by the identification of
   decanoic acid and SH in this study and palmitic acid, decanal, and
   2-undecanone in previous research through UPLC-Q-TOF-MS analysis
   ([178]Persson et al., 2010).

   To gain further understanding of the differences in houttuynin
   synthesis between rhizome and leaf tissues, we performed transcriptome
   profiling of enzyme-coding candidate genes involved in the fatty acid
   biosynthetic pathway in H. cordata. By comparing rhizome and leaf
   tissues (Rz vs. Lf), we identified 49 differentially expressed genes
   (DEGs) that may play a role in this process ([179]Figure 4C). These
   genes can be classified into seven categories based on their functions
   within the fatty acid pathway: fatty acid synthases, fatty acid
   activation enzymes, fatty acid oxidoreductases, fatty acid
   transferases, fatty acid esterases, lipid enzymes, and fatty acid
   dehydrogenases. Seventeen of these genes were upregulated, and 32 were
   downregulated ([180]Supplemental Table 14 and [181]15). This
   comparative analysis sheds light on the possible enzymatic processes
   involved in houttuynin biosynthesis, providing valuable insights into
   the unique fatty acid pathway for SH accumulation in H. cordata.

Flavonoid biosynthesis and quercetin metabolism in H. cordata

   H. cordata is widely recognized for its remarkable flavonoid production
   and diverse array of beneficial properties ([182]Ekalu and Habila,
   2020). This makes it an ideal subject for studying the metabolic
   pathways involved in flavonoid biosynthesis, which can be greatly
   influenced by the unique characteristics of the source organism. We
   carried out a thorough analysis of total flavonoids in various tissues
   of H. cordata, including rhizomes (Rz), leaves (Lf), bracts (Br), and
   the inflorescence (If). We found that rhizomes had the lowest total
   flavonoid content (2.38 mg/g), whereas leaves (5.80 mg/g), the
   inflorescence (5.59 mg/g), and bracts (9.49 mg/g) exhibited higher
   levels ([183]Figure 5A). To further investigate the distribution of
   flavonoids within H. cordata, we used a comprehensive targeted
   metabolomics approach, which identified 212 enriched flavonoid
   metabolites ([184]Figure 5B). Among these compounds, flavonols and
   flavones were the most prominent, making up 32% and 31% of the total
   identified flavonoids, respectively ([185]Figure 5B). This accumulation
   pattern highlights the diverse range of flavonoids present in H.
   cordata and emphasizes the notable presence of flavonols and flavones
   in this medicinal plant.

Figure 5.

   [186]Figure 5
   [187]Open in a new tab

   Expression analysis of enzyme-coding genes involved in flavonoid
   biosynthesis in H. cordata.

   (A) Analysis of total flavonoid content in different tissues. Lf,
   leaves; Rz, rhizomes; Br, bracts; If, inflorescences. The letters
   represent significant differences between groups determined by ANOVA,
   followed by Tukey’s HSD test for post hoc analysis.

   (B) Number of flavonoid metabolites in H. cordata.

   (C) Heatmap illustrating the expression levels of enzyme-coding genes
   involved in flavonoid biosynthesis across tissues of H. cordata.

   (D) Simplified flavonoid biosynthetic pathway in H. cordata. Colored
   circles represent specific metabolites: cyan for naringenin, pink for
   flavones, blue for isoflavone, green for flavonol, and gray for
   anthocyanidins.

   To gain insight into flavonoid biosynthesis in H. cordata, we performed
   transcriptomic profiling of various tissues. This comprehensive
   analysis revealed a set of 76 conserved enzyme-coding genes that are
   closely associated with the flavonoid biosynthetic pathway. These
   identified genes play a crucial role in orchestrating critical steps in
   the pathway and encode key enzymes such as ammonia-lyase (PAL),
   4-coumarate CoA ligase (4CL), cinnamate 4-hydroxylase (C4H), chalcone
   isomerase (CHI), CHS, flavonol synthase (FLS), flavanone 3-hydroxylase
   (F3H), anthocyanidin synthase (ANS), dihydroflavonol 4-reductase (DFR),
   isoflavone synthase (IFS), and flavone synthase (FNS) ([188]Figure 5C
   and [189]Supplemental Table 16). These findings provide a comprehensive
   understanding of the genetic basis underlying flavonoid biosynthesis in
   H. cordata.

   To further enhance our candidate selection process, we performed a
   comprehensive analysis of the relationships among gene-expression
   levels and the top five flavonoid metabolites ([190]Figure 5D). These
   metabolites included flavanone (naringenin, cyan), quercitrin
   (flavonols, green), luteolin (flavones, pink),
   delphinidin-3-O-rutinoside (anthocyanidins, gray), and glycitein
   (isoflavones, blue). Our findings revealed significant correlations
   (∣r∣ ≥ 0.6, p < 0.05) between 44 out of the 76 enzyme-coding candidate
   genes and at least one of the tested metabolites ([191]Figure 5D). For
   example, three PAL genes (PAL_10/7/8) were significantly correlated
   with all flavonoid types except for flavonols. PAL_5 showed notable
   correlations with flavonols (green) and anthocyanidins (gray), whereas
   PAL_3 exhibited significant correlations with naringenin (cyan) and
   isoflavones (blue). Overall, our analysis identified key genes,
   including four PAL (PAL_10/7/8/3), five 4CL (4CL_4/2/7/9/6), three CHS
   (CHS_1/2/3), and three CHI (CHI_4/1/2) genes that were all
   significantly correlated with naringenin (cyan), highlighting their
   crucial roles in naringenin biosynthesis ([192]Figure 5D). We also
   discovered a distinct gene set associated with flavonol biosynthesis
   (green), comprising one PAL (PAL_5), seven 4CL (4CL_4/2/7/13/11/12/10),
   one CHI (CHI_3), one CHS (CHS_3), one FNS (FNS_13), three F3H
   (F3H_8/6/2/1), and four FLS (FLS_4/3/6/1) genes. We observed a specific
   correlation with isoflavone (blue) biosynthesis involving four PAL
   (PAL_10/7/8/3), five 4CL (4CL_9/6), three CHS (CHS_1/2), three CHI
   (CHI_4/1/2), three FNS (FNS_4/11/16), three IFS (IFS_5/2), two F3H
   (F3H_8/1), four DFR (DFR_2/3/1/4), and one ANS gene. Similarly, we
   found a significant correlation with flavone (pink) biosynthesis that
   involved three PAL (PAL_10/7/8), five 4CL (4CL_4/2/7/9/6), three CHS
   (CHS_1/2), three CHI (CHI_4/1/2), four FNS (FNS_13/4/11/16), three IFS
   (IFS_5/2/4), four F3H (F3H_8/6/2/1), four DFR (DFR_2/3/1/4), and one
   ANS gene. Finally, we observed a significant correlation with
   anthocyanidins (gray) for four PAL (PAL_10/7/8/5), three 4CL
   (4CL_4/2/7), three CHS (CHS_1/2), three CHI (CHI_4/1/2), four FNS
   (FNS_13/4/11/16), two IFS (IFS_5/2), three F3H (F3H_8/6/2), four DFR
   (DFR_2/3/1/4), and one ANS gene ([193]Figure 5D and [194]Supplemental
   Table 16). These findings highlight specific gene-expression networks
   involved in flavonoid biosynthesis within H. cordata.

   Quercetin, one of the most well-known flavonoids, is known for its
   powerful antioxidant properties and other health benefits in
   traditional Chinese medicine ([195]Wang et al., 2016). Our study
   revealed significant variations (p < 0.05) in antioxidant capacity
   among different tissues of H. cordata. Notably, leaves (73.9 μmol/g)
   and inflorescences (67.3 μmol/g) exhibited considerably higher (p <
   0.05) antioxidant capacity compared with rhizomes (25 μmol/g) and
   bracts (20 μmol/g) ([196]Figure 6A). Furthermore, glycosylated forms of
   quercetin, such as isoquercitrin, rutin, hyperoside, baimaside, and
   quercitrin, which play a crucial role in in vivo antioxidant activity,
   were markedly enriched (p < 0.05) in leaves, inflorescences, and
   bracts. These tissues displayed high concentrations (indicated by red
   color) of glycosylated quercetin forms in the UPLC-Q-TOF-MS
   metabolomics analysis, whereas rhizomes showed the lowest
   concentrations (indicated by blue color) ([197]Figure 6B). This
   distribution pattern (with the exception of bracts) is consistent with
   the observed antioxidant capacities, highlighting the distinct roles of
   different plant parts in accumulating quercetin glycosides.

Figure 6.

   [198]Figure 6
   [199]Open in a new tab

   Quercetin metabolic pathway in H. cordata.

   (A) Total antioxidant capacity in various tissues of H. cordata. Lf,
   leaves; Rz, rhizomes; Br, bracts; If, inflorescences. The letters a and
   b represent significant differences between groups determined by ANOVA,
   followed by Tukey’s HSD test for post hoc analysis.

   (B) Abundance of quercetin glycosides in different H. cordata tissues.

   (C) Overview of the quercetin metabolic pathway. Heatmaps accompanying
   the text display the expression level (TPM) of enzyme-coding genes
   involved in this pathway. Values from three biological replicates are
   shown.

   Further analysis focused on the expression levels of homologous
   glucosyltransferase (UGT) enzyme genes (F3RT, F3GT, F3GGT, and
   F3GRT) that catalyze the formation of these five quercetin glycosides
   in H. cordata. The expression trends of these UGT genes, i.e., three
   F3RT (F3RT_2/3/6), five F3GT (F3GT_7/9/3/2/6), four F3GGT
   (F3GGT_4/5/2/6), and three F3GRT (F3GRT_5/1/7) genes, mirrored the
   observed patterns of quercetin metabolite accumulation, with lower
   expression in the rhizome than in the other organs ([200]Figure 6C).
   This finding suggests a notable relationship between gene expression
   and quercetin metabolism, providing valuable insight into the
   regulation of quercetin biosynthesis and its distribution across
   various plant tissues.

Gene co-expression analysis of flavonoid biosynthesis genes in H. cordata

   To gain a deeper understanding of the regulatory mechanisms that
   control flavonoid biosynthesis in H. cordata, we used WGCNA (weighted
   gene co-expression network analysis), a powerful computational tool
   that helped to reveal gene relationships and regulatory patterns within
   our dataset. Through this analysis, we organized all genes into 32
   modules, distinguished by different colors such as MEblack, MEblue, and
   MEbrown, based on their similar expression patterns across samples
   ([201]Figure 7A and [202]Supplemental Table 17). Among these modules,
   the MEblue module stood out, exhibiting notably strong and
   statistically significant positive correlations (r > 0.8, p < 0.01)
   with flavones, isoflavones, anthocyanidins, and naringenin (excluding
   flavonols, [203]Supplemental Table 17). This finding suggests that the
   genes in the MEblue module play a crucial role in biosynthesis of these
   specific flavonoid compounds.

Figure 7.

   [204]Figure 7
   [205]Open in a new tab

   Gene co-expression network analysis in the flavonoid biosynthesis
   pathway.

   (A) Weighted gene co-expression network analysis between all genes and
   five metabolites in H. cordata. The network diagram visually depicts
   the co-expression patterns among these genes and metabolites. The
   heatmap provides a comprehensive overview of metabolite profiles in
   different tissues of H. cordata during the blooming stage, with lines
   indicating correlations.

   (B) Positive regulatory network illustrating the interactions between
   transcription factors (TFs) and pathway genes. Enzyme-coding genes are
   represented as hexagons, and TFs are depicted as circles. The lines
   indicate positive correlations between the TFs and enzyme-coding genes.

   (C) Identification of TF binding sites in the 2-kb upstream sequences
   of seven candidate genes and their expression levels in various
   tissues.

   (D) Expression levels (TPM) of 25 TFs in different tissues.

   (E) Subcellular localization of two H. cordata TFs fused to the GFP tag
   and expressed in Nicotiana tabacum cells. Nuclei were stained with
   DAPI. The merged images show co-localization of GFP and DAPI signals
   within the nucleus, confirming the nuclear localization of the TFs.
   Scale bars, 20 μm.

   (F) RT–qPCR analysis of two TFs in H. cordata. Real-time qPCR analysis
   of two TFs across various tissues of H. cordata, with expression levels
   normalized to those of the housekeeping genes GAPDH and ACTIN. Data are
   presented as mean ± SEM, and each bar represents the average expression
   from three biological replicates. Statistical significance was
   determined using ANOVA, with different letters indicating significant
   differences (p < 0.05) between tissues. Tissues examined included
   rhizomes, leaves, bracts, and inflorescences during the blooming
   period.

   Delving further into the MEblue module, we focused on predominantly
   significant correlations (r > 0.91, p < 1e−200) which had an extremely
   low probability of occurring by random chance, leading us to identify
   seven enzyme-coding candidate genes ([206]Supplemental Figure 21).
   These included one 4CL (Hocor15aG0005700), one FNS (Hocor16aG0024200),
   two DFRs (Hocor12aG0138000, Hocor01aG0150600), one ANS
   (Hocor14aG0093300), and two CHIs (Hocor10aG0148000, Hocor16aG0092100),
   which shared similar expression patterns across samples
   ([207]Figure 7B). We performed RT–qPCR to validate the expression of
   these seven candidate genes identified from the MEblue module. They
   were notably highly expressed in inflorescences, corroborating the
   expression patterns observed in the heatmaps ([208]Supplemental
   Figure 22). The strong positive correlations allowed us to perform an
   extensive analysis of species-specific networks within the blue module.
   This analysis led to the identification of key transcription factors
   (TFs) involved in regulation of the seven candidate genes associated
   with flavonoid biosynthesis ([209]Figure 7B). These 25 TFs belonged to
   12 families, including well-known families such as bZIP, MYB, ARF, AP2,
   and GRAS, which have been extensively studied for their critical roles
   in secondary metabolism ([210]Yao et al., 2018; [211]Cao et al., 2020;
   [212]Li et al., 2020; [213]Han et al., 2023) ([214]Supplemental
   Table 18).

   To further support the hypothesis of co-regulation, we performed TF
   binding site predictions, revealing the presence of known cis elements
   in the upstream regions of the candidate genes. Specifically, we
   identified G-box binding sites for bZIP TFs, MYB binding sites for MYB
   TFs, TGA elements and the AUXRR core for ARF TFs, HD-ZIP binding sites
   for HD-ZIP TFs, and the W-box for WRKY TFs ([215]Figure 7C and
   [216]Supplemental Table 19). These findings provide compelling evidence
   for the potential hierarchical regulatory relationships between the TFs
   and the enzyme-coding genes involved in flavonoid biosynthesis.

   Importantly, our analysis also revealed higher expression levels of
   these TFs in inflorescences, consistent with the expression patterns of
   the enzyme-coding genes and the contents of flavonoid metabolites
   ([217]Figure 5A and 5C). We performed subcellular localization
   ([218]Figure 7E) and RT–qPCR assays ([219]Figure 7F) for two of the
   TFs, MYB_2 and AP2_2, confirming their nuclear localization and
   significant expression (p < 0.05) in inflorescences. These observations
   suggests that MYB_2 and AP2_2 may have specific roles in the regulation
   of flavonoid biosynthesis during inflorescence development. Their
   higher expression levels in inflorescences emphasize the importance of
   these TFs for shaping the flavonoid profiles of H. cordata
   ([220]Figure 7D and [221]Supplemental Table 20).

   Together, our analyses identify co-expressed genes involved in
   flavonoid biosynthesis in H. cordata and highlight the potential
   applications of TF-mediated regulation of flavonoid biosynthesis in H.
   cordata in future studies.

Discussion

   The Saururaceae family consists of four genera: Anemopsis, Gymnotheca,
   Houttuynia, and Saururus. Members of this family typically have
   heart-shaped leaves and spike-like flowers ([222]Smith and Stockey,
   2007). Among these genera, only the genome of S. chinensis has been
   reported previously. This genome is 537 Mb and diploid with 11
   chromosome pairs (2n = 22) ([223]Xue et al., 2023). H. cordata, a
   species from the Surrogacies family of the magnoliid group, has been a
   topic of debate regarding its evolutionary position within the
   magnoliid branch and the relationships among its species ([224]Datta
   and Dasgupta 1977; [225]Soltis et al., 2009). In the present study, we
   expanded our sequencing efforts to include H. cordata and investigated
   the evolutionary relationships within the Saururaceae family. In
   addition, we identified candidate genes involved in the production of
   bioactive compounds in H. cordata.

   To overcome the challenges posed by the large genome size and high
   heterozygosity of H. cordata, we produced a chromosome-level and
   partially telomere-to-telomere high-precision genome assembly. We
   identified telomeres in 72 out of 76 chromosomes, as well as candidate
   centromere regions, with minimal gaps, providing insight into the
   organization and stability of these genomic elements. Previous studies
   have identified independent WGD events in species such as Liriodendron
   chinense and P. nigrum, whereas the Aristolochia species have not
   experienced such events ([226]Qin et al., 2021). Our phylogenetic
   analysis revealed a close evolutionary relationship between H. cordata
   and other magnoliid species such as S. chinensis, P. nigrum, A.
   fimbriata, and A. contorta. H. cordata diverged from the Aristolochia
   genus approximately 108.4 MYA, from P. nigrum 61.9 MYA, and from S.
   chinensis more recently, about 33.4 MYA. Our data support the
   occurrence of a WGD event in the common ancestor of H. cordata and P.
   nigrum following their divergence from Aristolochia. Subsequently, H.
   cordata and P. nigrum underwent species differentiation. The ancestral
   species of H. cordata experienced two additional WGD events, resulting
   in the formation of a homologous tetraploid. In the case of P. nigrum,
   its ancestral species underwent two independent WGD events and
   chromosome fusions, leading to the formation of the contemporary P.
   nigrum genome. The presence of a WGD before the divergence of H.
   cordata and P. nigrum suggests a shared evolutionary event between
   these two species. These findings align with previous observations of a
   single WGD event in S. chinensis ([227]Xue et al., 2023), indicating a
   shared WGD event in the Saururus genome after the divergence of
   Saururus and Liriodendron. Chromosome duplications and structural
   changes resulting from WGDs have influenced their genetic makeup and
   have potentially contributed to their divergence and speciation. The
   observation of fusion events in the common ancestor of H. cordata and
   P. nigrum, leading to a reduction in the chromosome basis, highlights
   the significance of chromosomal rearrangements during their
   evolutionary history. These findings provide important insights into
   the evolution of H. cordata and its species within the Saururaceae,
   shedding light on the genetic mechanisms that underlie their adaptation
   and diversification.

   The prevalence of gene family expansion suggests that it played a vital
   role in the adaptive evolution of H. cordata, potentially facilitating
   its adaptation to specific environmental conditions or ecological
   niches. By contrast, the contraction of gene families implies possible
   gene loss or reduction in certain functional categories. This pattern
   suggests that rapidly expanding gene families are particularly crucial
   for defense mechanisms, growth, and developmental processes in H.
   cordata.

   H. cordata, which emits a distinct fish-like odor when its leaves are
   crushed, has valuable medical applications ([228]Luo et al., 2022). One
   specific component of H. cordata, houttuynin (decanoyl acetaldehyde,
   C[12]H[22]O[2]), is responsible for causing “fishy burps.” However,
   houttuynin is unstable and easily converts into 2-undecanone (methyl
   nonyl ketone, C[11]H[22]O) during production ([229]Chen et al., 2014).
   To address this instability, SH (C[12]H[23]O[5]SNa), the water-stable
   form of houttuynin, which has pharmaceutical activity similar to that
   of houttuynin, is synthesized ([230]Chen et al., 2014). Currently, SH
   is obtained through a synthetic process starting from lauric acid,
   acetic acid, and slaked lime. However, this process has limitations in
   terms of environmental impact, yield, and purity ([231]Liu et al.,
   2021b). Therefore, it is important to understand the in vivo
   biosynthetic pathway of SH in H. cordata. In this study, high
   accumulation of SH was found in the rhizome of H. cordata, and a
   biosynthetic pathway for SH production in H. cordata was proposed. This
   pathway is related to the fatty acid oxidation pathway. Unlike other
   plant secondary metabolites, houttuynin is structurally similar to
   primary metabolic products derived from the fatty acid oxidation
   pathway. It is likely formed through the breakdown of long-chain fatty
   acids into acetyl-CoA units. Finally, houttuynin is converted to SH,
   possibly by aldehyde dehydrogenase. The H. cordata genome provides an
   opportunity to further study fatty acid metabolism and better
   understand the in vivo production of SH, as well as providing support
   for in vitro synthetic biology research.

   H. cordata is well known for its exceptional accumulation of
   flavonoids, beneficial compounds found in many traditional Chinese
   medicinal plants ([232]Bai et al., 2019; [233]Zhu et al., 2021).
   However, the biosynthetic pathways of these compounds in H. cordata are
   not fully understood, limiting advances in its molecular breeding. By
   performing transcriptomic and metabolic profiling analysis, we have
   identified 76 conserved enzyme-coding genes that play a crucial role in
   flavonoid biosynthesis in H. cordata. These genes, which are
   differentially expressed across various tissues, contribute to our
   understanding of the molecular mechanisms that underlie flavonoid
   biosynthesis in medicinal plants.

   Quercetin, a type of flavanol, possesses a wide range of biological
   activities, including antioxidant, anti-inflammatory, and anticancer
   effects ([234]Di Petrillo et al., 2022; [235]Wang et al., 2022a). We
   found that quercetin and quercetin glycosides (isoquercitrin, rutin,
   hyperoside, baimaside, and quercitrin) were present at high levels in
   the leaves, inflorescences, and bracts of H. cordata. The process of
   glycosylation, which involves the addition of sugar moieties to
   flavonoid compounds, is facilitated by an enzyme called
   glucosyltransferase (UGT). This enzymatic activity results in the
   formation of various flavonoid glycosides, including flavonol aglycones
   that are known for their high bioavailability and bioactivity
   ([236]Ross et al., 2001). Our current study identified several UGT
   genes that are differentially expressed in different tissues of H.
   cordata, suggesting their involvement in quercetin glycoside formation.
   Further investigation of their functions will provide valuable insights
   into the quercitin biosynthetic pathway in plants and can support
   future synthetic biology.

   In summary, we assembled a near-complete genome of H. cordata with 76
   chromosomes (4n = 76) and identified key genes involved in the
   biosynthesis of houttuynin and flavonoids. Our findings establish a
   genomic foundation that enhances our understanding of H. cordata’s
   medicinal properties and supports future research and genetic
   improvements to maximize its therapeutic potential.

Methods

Sample preparation and genome sequencing

   In June 2022, we obtained botanical samples of H. cordata from Kaili
   (107°58′ E, 26°05′–27°38′ N) in Guizhou Province, China. To ensure
   sample quality, we carefully selected a limited number of healthy and
   pathogen-free plants. High-molecular-weight genomic DNA for genome
   sequencing was isolated from young leaves using a standard CTAB
   (hexadecyltrimethylammonium bromide) protocol, and DNA libraries were
   constructed for the PacBio HiFi reads. The young leaves were used for
   cell culture and crosslinking of chromatin to construct the Hi-C
   library.

Genome survey and de novo genome assembly

   Data-quality assessment and pre-processing were carried out using fastp
   v0.20.1 ([237]Chen, 2023b) with default options. We used clean data for
   k-mer analysis with jellyfish 1.1.6 ([238]Marçais and Kingsford, 2011)
   using a k-mer size of 17 bp. The k-mers from sequencing reads were
   visualized for Smudgeplot (v0.2.5) ([239]Ranallo-Benavidez et al.,
   2020), inferring the reference genome as autotetraploid (4n;
   [240]Supplemental Figure 1B). We then used genomescope 2.0
   ([241]Ranallo-Benavidez et al., 2020) to estimate the genome size,
   heterozygosity, and repeat content in a polyploid model.

   The H. cordata genome was assembled using HiFi and Hi-C sequencing
   data. SMRTbell libraries were sequenced on a PacBio Sequel II system,
   and consensus reads (HiFi reads) were generated using CCS software
   ([242]https://github.com/pacificbiosciences/ccs) with default
   parameters. Hifiasm v0.166-r375 was used to construct the primary
   genome and haplotype draft contig genomes from the long and highly
   accurate HiFi and Hi-C reads ([243]Cheng et al., 2021a; [244]Han
   et al., 2022). The Hi-C reads were quality controlled and mapped to the
   H. cordata contig assembly using Juicer ([245]Durand et al., 2016). The
   3D-DNA pipeline (v180419) was then used to generate a candidate
   chromosome-level assembly, correct mis-joins, and order and orient the
   contigs. Manual inspection and refinement of the draft assembly were
   performed using Juicebox Assembly Tools ([246]Seppey et al., 2019). Gap
   filling was carried out using LR_Gapcloser software (v1.0) with HiFi
   reads. The chloroplast and mitochondrial genomes were separately
   assembled using GetOrganelle software (v1.7.5) using the PacBio CCS
   data ([247]Xu et al., 2019; [248]Baeza et al., 2023). Two rounds of
   polishing were performed on the short-read data using nextpolish
   ([249]Chang et al., 2023). The completeness of the chromosome-level
   genome was evaluated using BUSCO software (v4.0.6) ([250]Vanholme
   et al., 2019) with the embryophyta_odb10 ortholog set.

Genome annotation

   To identify transposable elements in the H. cordata genome, we used the
   EDTA pipeline v1.8 ([251]Ou et al., 2019). This pipeline enabled us to
   identify LTR, TIR, and non-TIR elements. To further analyze the genome
   and annotate the TEs, we used RepeatMasker (v4.1.6). In addition, the
   MAKER2 pipeline ([252]Stanke et al., 2008) was used to predict the
   structures of coding genes. This pipeline incorporated three main
   approaches: ab initio predictions, protein homology, and transcriptome
   data. Prior to gene prediction, the assembled genome underwent hard and
   soft masking using RepeatMasker (v4.1.6). Ab initio gene prediction was
   performed using Augustus v3.1 ([253]Stanke et al., 2006), and
   homology-based gene prediction was carried out using Exonerate v2.2.2
   ([254]Slater and Birney, 2005). To predict gene models, EVidenceModeler
   v2.1.0 was used to integrate the results from all three prediction
   methods ([255]Haas et al., 2008).

   The functions of protein-coding genes were determined using three
   methods. First, eggNOG-mapper v2.1.12 ([256]Huerta-Cepas et al., 2017)
   annotation was used to compare genes with the eggNOG homologous gene
   database, enabling the annotation of gene functions, including GO and
   KEGG annotations. Second, a sequence similarity search was performed
   using Diamond v2.1.3 ([257]Buchfink et al., 2015) to compare protein
   sequences with protein databases such as Swiss-Prot and NR, enabling
   the characterization of protein sequences. Finally, a structural domain
   similarity search was performed using InterProScan 5.68-100.0 to
   compare subdatabases in InterPro, thereby obtaining conserved amino
   acid sequences, motifs, and domains of the proteins ([258]Jones et al.,
   2014). tRNAScan-SE 2.0 ([259]Lowe and Eddy, 1997) was used with default
   parameters to identify tRNA genes, and Barrnap 0.8 ([260]Lagesen
   et al., 2007; [261]Nawrocki and Eddy, 2013) was used to predict rRNA
   sequences. Infernal 2.5.2 ([262]Nawrocki and Eddy, 2013) was used to
   search for non-coding RNAs in the Rfam database.

   The original sequence data of the H. cordata genome produced in this
   study have been deposited in the Genome Sequence Archive (Genomics,
   Proteomics and Bioinformatics, 2021) of the National Center for
   Biological Information of China/National Genomics Data Center of
   Beijing Institute of Genomics, Chinese Academy of Sciences (Nucleic
   Acid Research Center, 2022) (GSA: CRA010237). Open access to the data
   is available at [263]https://ngdc.cncb.ac.cn/gsa.

Haplotype comparison

   We constructed pairwise alignments of the sequences of the four
   haplotypes using minimap2 (2.24-r1122) ([264]Li, 2021) with the -asm5
   option, then used SyRI (1.6.3) ([265]Goel et al., 2019) to convert the
   corresponding BAM files into variant information files. We then used
   plotsr (0.5.4) ([266]Goel and Schneeberger, 2022) to visualize all
   variant information files. The variant information files between
   haplotypes were also uploaded to cloud storage:
   [267]https://drive.google.com/drive/folders/1seLh1gEZZyf2E2KrVr3hkZK3fZ
   6BAcZI?usp=sharing.

Phylogenetic analysis

   Gene-evolution analysis was performed by identifying paralogs and
   orthologs among 21 plant species. The orthologous gene groups were
   clustered using OrthoFinder2 with default parameters ([268]Emms and
   Kelly, 2019). The amino acid sequences were aligned using MAFFT v7
   ([269]Katoh and Standley, 2013) and trimmed with trimAI v2.0
   ([270]Capella-Gutiérrez et al., 2009). A maximum-likelihood
   phylogenetic tree was constructed using RAxML v7.2.8 with 1000
   bootstrap replicates based on the PROTGAMMAJTT model ([271]Stamatakis,
   2014). Amborella trichopoda was used as the outgroup. The species tree
   was then used to estimate divergence times using the MCMCTree program
   in the PAML package v4.10.6 ([272]Yang, 2007). Multiple fossil times
   were obtained from TimeTree ([273]http://www.timetree.org/) and used
   for time calibrations. CAFE5 ([274]Mendes et al., 2021) was used to
   infer the expansion and contraction of gene families on the basis of
   the chronogram of 21 plant species mentioned above.

Synteny analysis and gene-duplication identification

   Syntenic blocks within and between species were identified using
   MCScanX-h v1.1.11 ([275]Wang et al., 2012) with homologous gene sets
   and BLASTP. To predict WGD events in one species and estimate the
   divergence time between two species, we analyzed the distribution of
   synonymous substitution rates (K[s]) ([276]Huang et al., 2023). Gene
   families were subjected to GO and KEGG analyses using the R package
   clusterProfiler v3.19 for further investigation ([277]Wu et al., 2021).

Synteny network analyses and phylogenetic profiling

   Twenty-one plant species were analyzed, including seven Magnoliaceae,
   five monocots, six eudicots, and two basal angiosperms. Protein
   sequences and GFF/GFF3 attachments from fully sequenced genomes were
   obtained from various databases, such as Ensembl
   ([278]https://plants.ensembl.org/index.html), National Omics Center
   ([279]https://www.nstda.or.th/noc/), CGDB ([280]http://bio2db.com/),
   Phytozome ([281]https://phytozome-next.jgi.doe.gov/), NCBI
   ([282]https://www.ncbi.nlm.nih.gov/), and CPBD
   ([283]http://citrus.hzau.edu.cn/download.php). Multiple protein
   sequence alignments were created using MUSCLE v3.8.1551 ([284]Edgar,
   2004), and phylogenetic trees were built using the maximum-likelihood
   technique with IQ-TREE v2.1.4 (-m MFP -bb 1000) ([285]Nguyen et al.,
   2015). The resulting trees were annotated and visualized using iTOL v6
   ([286]Letunic and Bork, 2016) and were grouped on the basis of gene
   functions and species relatedness. To identify collinear genes, BLASTP
   v2.10.0 was used to compare protein sequences from each of the 21 plant
   genomes, with an E-value cutoff of 1e−5. Genomic collinearity was
   detected and a library dataset of all possible syntenic gene pairs
   among the 21 plant genomes was constructed using MCScanX ([287]Wang
   et al., 2012) with default parameters. Protein sequences were compared
   using MUSCLE v5.1 ([288]Edgar, 2004) and then transformed into codon
   comparisons using PAL2NAL ([289]Suyama et al., 2006). Finally, the K[a]
   and K[s] values between homologous gene pairs were calculated using the
   YN model ([290]Yang and Nielsen, 2000).

Karyotype analysis of H. cordata

   We used karyotype analysis to identify and analyze the chromosomes of
   H. cordata ([291]Wang et al., 2022b). Root tips from H. cordata plants
   were collected and treated with a colchicine solution, then fixed using
   Kanoy solution to ensure that the root-tip cells were in metaphase. The
   fixed root-tip samples were then hydrolyzed and stained with Giemsa
   dye. Finally, the root-tip samples were prepared using the extrusion
   method. The chromosomes of Houttuynia were captured and measured using
   a high-resolution optical microscope. The chromosomes were classified
   on the basis of their size, banding pattern, and centromeric location.

   To clarify chromosomal alterations in H. cordata, we used the core
   eudicot karyotype (ACEK; n = 21) ([292]Wang et al., 2022) as a
   reference to perform a comparative genomic analysis with Houttuynia and
   A. contorta using WGDI (version 0.6.2) ([293]Sun et al., 2022).

RNA extraction, library construction, and sequencing

   We collected samples from four different tissue segments during the
   blooming period: leaves, rhizomes, inflorescences, and bracts. To
   maintain sterility and preserve the samples, we followed a rigorous
   protocol. Total RNA was extracted using TRIzol reagent according to the
   manufacturer’s instructions ([294]Yang et al., 2020). RNA libraries
   were prepared using the NEBNext Ultra RNA Library Prep Kit for Illumina
   and sequenced on the Illumina NovaSeq 6000 platform to capture
   comprehensive transcriptome profiles ([295]Mildrum et al., 2020). The
   quality of the cDNA libraries was assessed using an Agilent Bioanalyzer
   2100 instrument, and three replicates were performed for each sample
   ([296]Grissom et al., 2005).

Identification of genes involved in flavonoid biosynthesis

   We performed transcriptomic and metabolic profiling of H. cordata
   leaves, rhizomes, bracts, and inflorescences, focusing on flavonoid
   biosynthetic genes in 11 gene families, namely PAL, 4CL, C4H, CHI, CHS,
   FLS, F3H, ANS, DFR, IFS, and FNS. The data were visualized using
   phylogenetic trees, sequence alignments, gene structure diagrams, and
   heatmaps ([297]Chen et al., 2023a). These graph panels were combined or
   arranged using TBtools v2.102 ([298]Chen et al., 2023a).

Transcriptomic analysis

   To ensure data integrity, our data pre-processing included rigorous
   quality assessment and pre-processing using FastQC v0.12.1 and fastp
   v0.20.1 ([299]Zarour et al., 2021) with default options. These
   stringent filtering steps were designed to generate a high-quality
   dataset for subsequent analysis, and metrics such as Q20, Q30, and GC
   content were calculated. The filtered reads were aligned to the
   reference genome using HISAT2 v2.2.1 ([300]Kim et al., 2019) in
   paired-end mode. The results were processed with SAMtools v1.10
   ([301]Li et al., 2009) to sort and index the alignment files. Read
   counting was performed using featureCounts v2.0.1 ([302]Liao et al.,
   2014) to quantify gene counts, and the gene counts were normalized into
   transcripts per million (TPM) expression levels. To identify DEGs among
   different plant tissues, stringent criteria were applied ([303]Shi and
   Gu, 2020). A log[2] fold change of ≥1 and a significance level (p[adj])
   of ≤0.05 were required.

Transcriptional regulation of flavonoid biosynthesis

   We performed a series of gene-expression and co-expression network
   analyses using the entire transcriptome dataset to uncover
   transcriptional regulatory networks involving the flavonoid
   biosynthetic genes and TFs. Genes with TPM < 1 in all tissues were
   filtered out, and the remaining genes were used to construct a
   co-expression network using WGCNA ([304]Chen and Ma, 2021). The
   co-expression network modules were obtained using the blockwise modules
   function with the following parameters: soft threshold power = 14;
   TOMtype = signed; merge-CutHeight = 0.25; and minModuleSize = 50.
   PlantTFDB ([305]Guo et al., 2008) was then used with default parameters
   to identify TFs in the H. cordata genome. Finally, the networks between
   the genes and TFs were visualized in Cytoscape v3.7.1 ([306]Chen
   et al., 2023c).

Metabolomic analysis

   Initial preparation of plant samples for metabolomic analysis included
   lyophilization and pulverization of individual plant parts, i.e.
   leaves, rhizomes, inflorescences, and bracts. Metabolites were then
   extracted from 100 mg of each pulverized sample using a
   well-established method ([307]Meng et al., 2022). We performed liquid
   chromatography–mass spectrometry (LC–MS) using the high-sensitivity
   SCIEX QTRAP 6500+ system for targeted metabolomics studies. This
   platform enabled us to perform a class-targeted metabolomics analysis,
   providing detailed insights into the metabolite profiles relevant to
   our study. Metabolites that met specific criteria, including a VIP
   (variable importance in projection) score of ≥1 and fold change of ≥2
   or ≤0.5, were classified as differentially abundant metabolites. The
   analysis was performed rigorously, with three independent replicates
   for each plant tissue. To determine variations in metabolite
   composition among different components of H. cordata, advanced
   statistical methodologies were used. The identified metabolites were
   then annotated using the KEGG compound database and mapped to the KEGG
   pathway database ([308]Altman et al., 2013). This systematic approach
   enabled the identification of significantly enriched metabolic pathways
   by assessing the statistical significance of the overlap between
   identified metabolites and known metabolic pathways using a
   hypergeometric test.

Statistical analysis of metabolite content and antioxidant capacity

   Metabolite contents and antioxidant capacity were quantitatively
   analyzed using standard curves of known concentrations. All statistical
   analyses were performed using R software (v4.0.3) with the
   stats package (v4.0.3) for ANOVA and the agricolae package (v1.3–3) for
   Tukey’s HSD test.

Determination of sodium houttuyfonate content

   To precisely quantify SH in H. cordata, various plant parts were heat
   blanched at 105°C for 2–10 min, then dried to a constant weight at
   60°C. The dried samples were ground and sieved (40–60 mesh). For the
   extraction, an ultrasonic extraction method was used ([309]Cui et al.,
   2022). A mixture of 0.02 M K[2]HPO[4], 0.02 M KH[2]PO[4], and methanol
   was prepared in a 10:10:80 ratio, with the pH adjusted to 9. The
   extraction was carried out at 60°C for 120 min. After extraction,
   centrifugation was performed at 25°C at 2000 rpm for 10 min. The
   supernatant was filtered through a 0.22-μm microporous filter for
   chromatographic analysis. The chromatographic analysis used a C18
   column (250 cm × 4.6 mm) with the same mobile phase. An isocratic
   elution mode was used at a flow rate of 0.5 ml/min, and the column
   temperature was maintained at room temperature ([310]Schellinger and
   Carr, 2006). The detection wavelength was set to 286 nm, and the total
   detection duration was 18 min.

Mass spectrometric identification of sodium houttuyfonate

   We used an Agilent 1290 ultra-high performance LC system coupled with a
   Thermo Fisher Q-Exactive Orbitrap Plus high-resolution mass
   spectrometer to accurately determine the concentration of SH ([311]Liu
   et al., 2019). Chromatographic separation was achieved using a Waters
   T3 column (21 mm × 50 mm, 1.8 μm), with precise temperature control at
   50°C. The mobile phases consisted of 0.1% formic acid in water (phase
   A) and acetonitrile (phase B), delivered at a flow rate of 0.3 ml/min.
   We carefully designed a gradient program, starting with 2% B for the
   first minute, followed by a linear increase to 100% B over 1–18 min,
   maintenance at 100% B until 22 min, then a rapid reduction to 2% B over
   0.1 min and maintenance at 2% B for the final 2.9 min. The injection
   volume was 5 μl. For MS analysis, we set dynamic data acquisition to
   cover a range of 100–1000 m/z. We fine-tuned the ESI source parameters,
   including an auxiliary gas flow rate of 16 Arb, full MS resolution at
   70 000, and a collision energy (NCE) of 25 eV for MS/MS mode with a
   resolution of 17 500. The spray voltage was set to −3.0 kV for negative
   mode and 3.0 kV for positive mode, with a dynamic exclusion duration of
   4 s and energy steps in NCE of 10, 30, and 50. Data acquisition ranged
   from m/z 100 to 1000. We performed data interpretation and analysis
   using Xcalibur 4.3 and MS-DIAL 5.0.3 workstations ([312]Tsugawa et al.,
   2015). To validate the identification of compounds, we cross-referenced
   them with the standard SH (CAS 83766-73-8), searched through local
   databases, and confirmed diagnostic ions. In addition, we analyzed
   primary quasi-molecular ions and secondary MS fragmentation patterns to
   ensure a high level of accuracy in compound identification.

Plasmid construction

   To generate the constitutive expression constructs, YFP sequences were
   initially amplified using KOD-Plus DNA polymerase (TOYOBO,
   [313]www.toyobo-global.com). The amplified sequences were then cloned
   into the pSK34 vector ([314]Banno et al., 2001) using the SpeI and NotI
   restriction sites to create the pSKY36 vector. Sequences encoding
   full-length AP2_2 (Hocor06aG0037700) and MYB_2 (Hocor18aG0011700)
   (amino acids 1–356) were amplified and inserted into pSKY36 at the AscI
   and SpeI sites as described in [315]Yang et al. (2014b), yielding the
   constructs 35S::AP2_2-YFP and 35S::MYB_2-YFP. The details of all
   primers used are provided in [316]Supplemental Table 22.

Protein subcellular localization and gene-expression analysis

   Subcellular localization of AP2_2 and MYB_2 proteins was assessed by
   transient expression experiments in tobacco (Nicotiana benthamiana) as
   detailed in [317]Yang et al. (2014a). Various constructs were
   transformed into Agrobacterium tumefaciens strain GV3101, which was
   subsequently suspended in infiltration medium consisting of 1 mM MES
   (2-(N-morpholino)ethanesulfonic acid; pH 5.6), 10 mM magnesium
   chloride, and 200 μM acetosyringone. The bacterial cultures were
   adjusted to an OD[600] of approximately 0.6. The constructs were
   transiently expressed in tobacco leaf epidermal cells via
   agroinfiltration. Fluorescence was observed 3 days post infiltration
   using a Zeiss LSM 710 confocal microscope ([318]www.zeiss.com). For
   nuclear staining, leaf samples were stained with 1 μg/ml DAPI
   (4′,6-diamidino-2-phenylindole; Sigma-Aldrich) for 10–30 min.
   Gene-expression analysis was performed using total RNA extracted from
   treated leaves. RT–qPCR and RT–PCR were performed as described
   previously ([319]Liu et al., 2007; [320]Yang et al., 2023). Two
   micrograms of total RNA was used for first-strand cDNA synthesis using
   M-MLV reverse transcriptase (Invitrogen, [321]www.invitrogen.com)
   following the manufacturer’s protocol. Actin and GAPDH served as
   internal controls for normalization in the RT–qPCR analyses.

Data and code availability

   We have uploaded the original sequence data for RNA sequencing and
   whole-genome assembly of H. cordata to NCBI and the Genome Sequence
   Archive at the China National Genomics Data Center
   ([322]https://ngdc.cncb.ac.cn). The BioProject numbers are PRJNA940964
   and PRJCA015615.

Funding

   This project was funded by the National Natural Science Foundation of
   China, China (grant number 32360074), the Guizhou Provincial Natural
   Science Foundation of Department of Education,China ([2022]077), The
   Karst Mountain Ecological Security Engineering Research Center, China
   (KY [2021]007), and the Joint Fund of the National Natural Science
   Foundation of China and the Karst Science Research Center of Guizhou
   Province, China (U1812401).

Acknowledgments