Abstract Simple sequence repeats (SSRs) are found in nonrandom distributions in genomes and are thought to impact gene expression. The distribution patterns of 48 295 SSRs of Paphiopedilum malipoense are mined and characterized based on the first full‐length transcriptome and comprehensive transcriptome dataset from 12 organs. Statistical genomics analyses are used to investigate how SSRs in transcripts affect gene expression. The results demonstrate the correlations between SSR distributions, characteristics, and expression level. Nine expression‐modulating motifs (expMotifs) are identified and a model is proposed to explain the effect of their key features, potency, and gene function on an intra‐transcribed region scale. The expMotif‐transcribed region combination is the most predominant contributor to the expression‐modulating effect of SSRs, and some intra‐transcribed regions are critical for this effect. Genes containing the same type of expMotif‐SSR elements in the same transcribed region are likely linked in function, regulation, or evolution aspects. This study offers novel evidence to understand how SSRs regulate gene expression and provides potential regulatory elements for plant genetic engineering. Keywords: full‐length transcriptomes, gene expression, motif types, Paphiopedilum malipoense, simple sequence repeats (SSRs) __________________________________________________________________ Statistical genomics analyses reveal a high‐density SSR landscape of slipper orchid P. malipoense transcriptome. Transcribed SSRs can alter the gene expression potential and tissue‐specificity where they are located. The expression regulation effects of expMotifs are finely delineated, hotspots of expMotifs exist within the transcribed regions, and the mechanisms are illustrated by a model. graphic file with name ADVS-11-2304848-g003.jpg 1. Introduction Many repetitive elements are widespread throughout prokaryotic and eukaryotic genomes, and some of them were confirmed to impact gene expression.^[ [36]^1 , [37]^2 , [38]^3 , [39]^4 ^] Simple sequence repeats (SSRs), consisting of 1−6 bp units of short, tandemly repeated motifs, are considered an important source of rapid generation of heritable variation and driving the expression evolution and environmental adaptation of species.^[ [40]^1 , [41]^5 ^] The variation in SSRs distributed in genes, transposons, and centromeres can stimulate genomic diversification and promote plant diversity generation and environmental adaptation.^[ [42]^6 ^] SSRs within genes could act as “tuning knobs” for regulating gene expression and modulating physiological metabolic processes.^[ [43]^6 , [44]^7 , [45]^8 ^] Thus, the SSRs were responsible for some of the phenotype changes in plants, including morphogenesis,^[ [46]^9 ^] tolerance,^[ [47]^10 ^] and reproductive phenology.^[ [48]^11 , [49]^12 ^] SSRs within different transcribed regions can modulate gene expression and functions through various mechanisms. SSR expansion and/or contraction in the CDS region affects the protein product of the corresponding gene and modifies the phenotype directly.^[ [50]^6 , [51]^13 ^] SSR variation in 5’‐UTRs could impact transcription and translation and ultimately modulate gene expression.^[ [52]^13 ^] SSR expansion in 3’‐UTRs can cause transcription slippage and lead to gene silencing or a loss of function.^[ [53]^13 , [54]^14 ^] Moreover, SSRs are involved in intra‐ and interspecies gene expression divergence through varieties of motif types, copy numbers, and locations.^[ [55]^15 , [56]^16 , [57]^17 ^] As indicated by Genome‐wide studies, massive variations of gene expression‐modulating SSRs (eSSRs) within genomes account for a substantial portion of the heritability of complex traits and exhibit evolutionary significance in multiple species.^[ [58]^1 , [59]^18 , [60]^19 ^] In sunflowers, many functional eSSRs have been identified under selection and facilitating local adaptation across natural populations.^[ [61]^16 , [62]^20 ^] The repeat copy number of SSR in the Saccharomyces cerevisiae SDT1 promoter evolved to the optimal number that achieves the highest gene expression in the experimental evolution setting.^[ [63]^21 ^] Understanding the molecular mechanisms underlying SSR‐mediated expression regulation could enhance future research on complex traits and the design of universal regulatory elements for genetic engineering. However, the abundance, hypervariability, and complexity of SSRs in genomes pose challenges in measuring their effect on gene expression systematically based on omic data. Comprising ≈10% of flowering plant species, orchids (Orchidaceae) possess extraordinarily diverse morphologies and lifestyles and have colonized almost every habitat on Earth.^[ [64]^22 , [65]^23 ^] These features exert their remarkable charisma and make them important models for evolutionary biology and ecology. However, little is known about the composition, distribution, and mechanisms underlying eSSRs in the orchid genome. Orchidaceae has a large and the most variable genome size among angiosperm families (0.33–55.4 pg C^−1).^[ [66]^24 ^] With a genome size range of 4.1–43.1 pg C^−1, slipper orchids (Cypripedioideae) have some of the largest genomes in orchids, and Paphiopedilum exhibits a large genome size of this subfamily (16.5–39.5 pg C^−1).^[ [67]^24 , [68]^25 ^] Substantial amounts of repetitive sequences play essential roles in genome dynamics, differentiation, and evolution, contributing to the nuclear genome sizes of Cypripedioideae.^[ [69]^24 , [70]^25 ^] We used P. malipoense as the research subject to investigate the eSSRs in a large orchid genome with low rates of repetitive DNA removal.^[ [71]^26 ^] Herein, we developed a genomic statistics framework to evaluate the expression‐modulating effect of SSRs from omic‐scale data. We adopted joint long‐read SMRT sequencing and short‐read RNA‐Seq, which can resolve reliable repetitive regions, to generate a full‐length reference transcriptome of P. malipoense. Further, we identified and characterized the SSRs based on these data. We confirmed the associations between SSR features (including motif types, location, abundance, density, length, etc.) and gene expression levels, identified nine motifs significantly affecting gene expression (abbreviated as expMotif), and finely described the key features, potency, and gene function involving gene expression regulation on an intra‐transcribed region scale. This system‐level study provides new insight into the role of repetitive sequences in the adaptation and evolution of orchids. The novel discovery of expMotifs informed the understanding of the mechanisms underlying the eSSRs, which were valuable potential regulatory elements for plant genetic engineering. 2. Results 2.1. Full‐Length Transcriptome of P. Malipoense Generated by Hybrid Sequencing In this study, RNA‐seq and PacBio Iso‐Seq were combined to obtain the full‐length transcriptome of P. malipoense. A fine sampling of seven flower parts (labellum, petals, dorsal sepals, synsepals, stamen, staminodes, and gynoecia) was employed for RNA‐seq on the Illumina HiSeq 4000 platform. After quality filtering, a total of 57.26 Gb clean data were generated with an average of 25 119 394 clean reads per sample (Data [72]S1, Supporting Information). RNA samples from roots, stems, leaves, petals, sepals, gynandrium, and peduncles were mixed and sequenced using the PacBio Sequel platform. A total of 13.25 Gb of raw data and 5 611 491 subreads were generated, and the average length was 2362 bp (Table [73]S1, Supporting Information). The SMRTlink v5.1 pipeline was used to process the raw data and obtained 360 416 circular consensus sequences (CCS), including 60 627 non‐full length reads and 283 475 full length non‐chimeric reads. After iterative clustering for error correction (ICE) and hybrid error correction using RNA‐seq data, 126 513 high‐quality full‐length transcripts with a mean length of 2792 bp were generated. After removing redundant sequences using CD‐HIT, a total of 63 940 nonredundant transcripts (unigenes) were finally obtained. All unigenes were functionally annotated by searching seven public databases, including NR, SwissProt, eggNOG, KOG, GO, KEGG, and Pfam. A total of 55 116 (86.20%) unigenes were annotated, and 24 457 (38.25%) unigenes were annotated in all databases (Table [74]S2, Supporting Information). 2.2. Identification of SSRs in the P. Malipoense Transcriptome The MISA program was used to detect potential SSR loci in the P. malipoense transcriptome. A total of 48 295 perfect SSRs were identified from 18 770 unigenes, with 7476 unigenes containing more than one SSR. The frequency of SSRs was 75.53%, and the mean density (SSR counts per Mb) was 248.11 (Table [75]1 ). A total of 575 SSRs present in compound formation were composed of 1,179 SSRs, and each compound SSR consisted of 2–4 tandem single SSRs. Mono‐nucleotides (33 103, 68.56%) accounted for the vast majority of SSRs in the P. malipoense transcriptome, followed by di‐nucleotides (5958, 12.34%), tri‐nucleotides (3715, 7.69%), penta‐nucleotides (2452, 5.08%), hexa‐nucleotides (2173, 4.50%), and tetra‐nucleotides (894, 1.85%) (Table [76]1). The SSR tract length ranged from 15 to 217 bp, with an average of 19.51 bp. The SSR tract length was significantly different between the motif sizes (Kruskal‒Wallis rank sum test, χ ^2 = 9722.9, df = 5, p value < 2.2e−16), and slightly declined with increasing motif size (Kendall rank correlation coefficient tau = −0.1969, p value < 2.2e−16) (Figure [77]S1, Supporting Information). The repeat number of motifs varied from 3 to 217 with an average of 15.65 (Figure [78]1a). Table 1. Statistical results of SSRs in the P. malipoense transcriptome. Item Number The total number of sequences examined 63 940 The total size of examined sequences [bp] 194 653 584 Total number of identified SSRs 48 295 Number of SSR‐containing sequences 18 770 Number of sequences containing more than one SSR 7476 Number of SSRs present in compound formation 575 Mono‐nucleotide 33 103 Di‐nucleotide 5958 Tri‐nucleotide 3715 Tetra‐nucleotide 894 Penta‐nucleotide 2452 Hexa‐nucleotide 2173 [79]Open in a new tab Figure 1. Figure 1 [80]Open in a new tab Statistics of SSRs in the P. malipoense transcriptome. a) Distribution of tandem repeat number of SSRs in the P. malipoense transcriptome. b) GC contents of different components in the P. malipoense transcriptome. The error bars show the standard error of the mean. ^*** indicates p < 0.001 (Wilcoxon rank sum test or Dunn's pairwise test). c) Distribution of SSR motif types. d) Counts of six SSR motif sizes in three transcribed regions. e–g) Density (SSR counts per Mb) of six SSR motif sizes in the 5’‐UTR (e), CDS (f), and 3’‐UTR (g). A total of 313 motif types were detected in the 48 295 SSRs. A/T, with a frequency of 66.33%, was the most dominant among all the motifs. AG/CT (3125, 6.47%) and AT/AT (2465, 5.10%) were the top two dinucleotide motifs. The most abundant motif types in the trinucleotide were ATC/ATG (680, 1.41%) and AAG/CTT (594, 1.23%). For tetranucleotides and pentanucleotides, AAAT/ATTT (452, 0.94%) and AAAAT/ATTTT (268, 0.55%) were the most frequent motif types, respectively. Moreover, AAGTGC/ACTTGC (268, 0.55%) and ACCTGC/AGGTGC (260, 0.54%) comprised 24.30% of all hexa‐nucleotides (Figure [81]1c). 2.3. SSR Distribution in Different Transcribed Regions We utilized ANGEL 3.0 ([82]https://github.com/PacificBiosciences/ANGEL) to predict ORFs, and 15 836 authentic ORFs were verified by BLAST searches against the Nr and SwissProt databases and corrected manually. Noncoding RNAs were identified by employing five analysis methods, including Pfam, CPC2, CNCI, CPAT, and PLEK. After filtering transcripts < 200 bp in length, a total of 8128 ncRNAs were confirmed as lncRNAs by at least four methods. We compared the SSR distribution characteristics among different transcriptome components. Of the 48 295 SSRs, 935, 2430, 9755, and 6281 were located in the CDS, 5’‐UTR, 3’‐UTR, and lncRNA, respectively (Table [83]2 ). The highest density of SSRs was observed in the 3’‐UTR, with an average of 1159.56 per Mb, followed by the 5’‐UTR (394.57/Mb) and lncRNAs (265.34/Mb). CDS had the lowest density of SSRs (36.70/Mb). After excluding 28 895 SSRs located in the boundary of CDS and UTRs or lacking solid localization information, 19 401 SSRs were used for subsequent analysis. Table 2. SSR distribution among different transcriptome components. Region Number of sequences Total number of base pairs [bp] Number of SSR Density of SSRs [/Mbp] CDS 15 836 25 475 917 935 36.70 UTRs 31 458 14 571 206 12 185 836.24 5'UTR 15 654 6 158 541 2430 394.57 3'UTR 15 804 8 412 665 9755 1159.56 5'UTR‐CDS boundary – – 21 – CDS‐3'UTR boundary – – 1 – lncRNA 8 128 23 671 941 6281 265.34 Other transcripts 23 474 76 716 792 28 872 376.35 [84]Open in a new tab SSR counts in different transcribed regions exhibited significant differences across the motif sizes (Pearson's chi‐square test, χ ^2 = 10058, df = 10, p value < 2.2e−16) (Figure [85]1d). CDS had the lowest SSR frequency and was dominated by trinucleotides (70.91%) (Figure [86]1f). SSRs in the 5’‐UTR were mostly mono‐nucleotide (42.14%) and di‐nucleotide (37.12%) (Figure [87]1e). The motif sizes of SSRs were most heterogeneous in the 3’‐UTR with mono‐nucleotide predominance (92.94%), whose SSR density was 31‐fold higher than that of CDS (Figure [88]1g). Furthermore, the SSR tract length showed a significant difference among genic regions (Kruskal‒Wallis rank sum test, χ ^2 = 668.56, df = 2, p value < 2.2e−16). The average length of SSR tracts located in CDS (17.52 bp) was shorter than that in the UTRs, and SSRs in the 3’UTR (19.87 bp) were significantly longer than those in the 5’‐UTR (19.55 bp) (Figure [89]S2, Supporting Information). The Kruskal‒Wallis rank sum test confirmed that the interaction between regions and motif sizes affected the length of SSR tracts (H = 38.09, df = 10, p value = 3.66e−05), especially for mononucleotides, dinucleotides, and hexanucleotides (Figure [90]S3, Supporting Information). We also calculated the GC content of the SSRs and their background sequences. The GC% of SSRs in all of the transcriptome components showed significant differences from the sequences in which they were located (Wilcoxon rank sum test, W = 9971507, p value < 2.2e−16) and was lower than the GC% of their background sequences excluding CDS. CDS had the highest GC% of SSRs at 61.78 ± 29.14%, followed by 5’‐UTR (31.06 ± 31.06%) and lncRNA (2.55 ± 12.82%), and 3’‐UTR had the lowest GC% of SSRs. The post hoc test indicated significant differences among all the transcribed regions (Dunn's pairwise test, p value < 2.2e−16) (Figure [91]1b). Adjusted standardized residuals from the chi‐squared test revealed preferences of motif types among transcribed regions. A/T repeats were