Abstract

   Simple sequence repeats (SSRs) are found in nonrandom distributions in
   genomes and are thought to impact gene expression. The distribution
   patterns of 48 295 SSRs of Paphiopedilum malipoense are mined and
   characterized based on the first full‐length transcriptome and
   comprehensive transcriptome dataset from 12 organs. Statistical
   genomics analyses are used to investigate how SSRs in transcripts
   affect gene expression. The results demonstrate the correlations
   between SSR distributions, characteristics, and expression level. Nine
   expression‐modulating motifs (expMotifs) are identified and a model is
   proposed to explain the effect of their key features, potency, and gene
   function on an intra‐transcribed region scale. The expMotif‐transcribed
   region combination is the most predominant contributor to the
   expression‐modulating effect of SSRs, and some intra‐transcribed
   regions are critical for this effect. Genes containing the same type of
   expMotif‐SSR elements in the same transcribed region are likely linked
   in function, regulation, or evolution aspects. This study offers novel
   evidence to understand how SSRs regulate gene expression and provides
   potential regulatory elements for plant genetic engineering.

   Keywords: full‐length transcriptomes, gene expression, motif types,
   Paphiopedilum malipoense, simple sequence repeats (SSRs)
     __________________________________________________________________

   Statistical genomics analyses reveal a high‐density SSR landscape of
   slipper orchid P. malipoense transcriptome. Transcribed SSRs can alter
   the gene expression potential and tissue‐specificity where they are
   located. The expression regulation effects of expMotifs are finely
   delineated, hotspots of expMotifs exist within the transcribed regions,
   and the mechanisms are illustrated by a model.

   graphic file with name ADVS-11-2304848-g003.jpg

1. Introduction

   Many repetitive elements are widespread throughout prokaryotic and
   eukaryotic genomes, and some of them were confirmed to impact gene
   expression.^[ [36]^1 , [37]^2 , [38]^3 , [39]^4 ^] Simple sequence
   repeats (SSRs), consisting of 1−6 bp units of short, tandemly repeated
   motifs, are considered an important source of rapid generation of
   heritable variation and driving the expression evolution and
   environmental adaptation of species.^[ [40]^1 , [41]^5 ^] The variation
   in SSRs distributed in genes, transposons, and centromeres can
   stimulate genomic diversification and promote plant diversity
   generation and environmental adaptation.^[ [42]^6 ^] SSRs within genes
   could act as “tuning knobs” for regulating gene expression and
   modulating physiological metabolic processes.^[ [43]^6 , [44]^7 ,
   [45]^8 ^] Thus, the SSRs were responsible for some of the phenotype
   changes in plants, including morphogenesis,^[ [46]^9 ^] tolerance,^[
   [47]^10 ^] and reproductive phenology.^[ [48]^11 , [49]^12 ^] SSRs
   within different transcribed regions can modulate gene expression and
   functions through various mechanisms. SSR expansion and/or contraction
   in the CDS region affects the protein product of the corresponding gene
   and modifies the phenotype directly.^[ [50]^6 , [51]^13 ^] SSR
   variation in 5’‐UTRs could impact transcription and translation and
   ultimately modulate gene expression.^[ [52]^13 ^] SSR expansion in
   3’‐UTRs can cause transcription slippage and lead to gene silencing or
   a loss of function.^[ [53]^13 , [54]^14 ^] Moreover, SSRs are involved
   in intra‐ and interspecies gene expression divergence through varieties
   of motif types, copy numbers, and locations.^[ [55]^15 , [56]^16 ,
   [57]^17 ^]

   As indicated by Genome‐wide studies, massive variations of gene
   expression‐modulating SSRs (eSSRs) within genomes account for a
   substantial portion of the heritability of complex traits and exhibit
   evolutionary significance in multiple species.^[ [58]^1 , [59]^18 ,
   [60]^19 ^] In sunflowers, many functional eSSRs have been identified
   under selection and facilitating local adaptation across natural
   populations.^[ [61]^16 , [62]^20 ^] The repeat copy number of SSR in
   the Saccharomyces cerevisiae SDT1 promoter evolved to the optimal
   number that achieves the highest gene expression in the experimental
   evolution setting.^[ [63]^21 ^] Understanding the molecular mechanisms
   underlying SSR‐mediated expression regulation could enhance future
   research on complex traits and the design of universal regulatory
   elements for genetic engineering. However, the abundance,
   hypervariability, and complexity of SSRs in genomes pose challenges in
   measuring their effect on gene expression systematically based on omic
   data.

   Comprising ≈10% of flowering plant species, orchids (Orchidaceae)
   possess extraordinarily diverse morphologies and lifestyles and have
   colonized almost every habitat on Earth.^[ [64]^22 , [65]^23 ^] These
   features exert their remarkable charisma and make them important models
   for evolutionary biology and ecology. However, little is known about
   the composition, distribution, and mechanisms underlying eSSRs in the
   orchid genome. Orchidaceae has a large and the most variable genome
   size among angiosperm families (0.33–55.4 pg C^−1).^[ [66]^24 ^] With a
   genome size range of 4.1–43.1 pg C^−1, slipper orchids
   (Cypripedioideae) have some of the largest genomes in orchids, and
   Paphiopedilum exhibits a large genome size of this subfamily
   (16.5–39.5 pg C^−1).^[ [67]^24 , [68]^25 ^] Substantial amounts of
   repetitive sequences play essential roles in genome dynamics,
   differentiation, and evolution, contributing to the nuclear genome
   sizes of Cypripedioideae.^[ [69]^24 , [70]^25 ^] We used P. malipoense
   as the research subject to investigate the eSSRs in a large orchid
   genome with low rates of repetitive DNA removal.^[ [71]^26 ^]

   Herein, we developed a genomic statistics framework to evaluate the
   expression‐modulating effect of SSRs from omic‐scale data. We adopted
   joint long‐read SMRT sequencing and short‐read RNA‐Seq, which can
   resolve reliable repetitive regions, to generate a full‐length
   reference transcriptome of P. malipoense. Further, we identified and
   characterized the SSRs based on these data. We confirmed the
   associations between SSR features (including motif types, location,
   abundance, density, length, etc.) and gene expression levels,
   identified nine motifs significantly affecting gene expression
   (abbreviated as expMotif), and finely described the key features,
   potency, and gene function involving gene expression regulation on an
   intra‐transcribed region scale. This system‐level study provides new
   insight into the role of repetitive sequences in the adaptation and
   evolution of orchids. The novel discovery of expMotifs informed the
   understanding of the mechanisms underlying the eSSRs, which were
   valuable potential regulatory elements for plant genetic engineering.

2. Results

2.1. Full‐Length Transcriptome of P. Malipoense Generated by Hybrid
Sequencing

   In this study, RNA‐seq and PacBio Iso‐Seq were combined to obtain the
   full‐length transcriptome of P. malipoense. A fine sampling of seven
   flower parts (labellum, petals, dorsal sepals, synsepals, stamen,
   staminodes, and gynoecia) was employed for RNA‐seq on the Illumina
   HiSeq 4000 platform. After quality filtering, a total of 57.26 Gb clean
   data were generated with an average of 25 119 394 clean reads per
   sample (Data [72]S1, Supporting Information). RNA samples from roots,
   stems, leaves, petals, sepals, gynandrium, and peduncles were mixed and
   sequenced using the PacBio Sequel platform. A total of 13.25 Gb of raw
   data and 5 611 491 subreads were generated, and the average length was
   2362 bp (Table [73]S1, Supporting Information). The SMRTlink v5.1
   pipeline was used to process the raw data and obtained 360 416 circular
   consensus sequences (CCS), including 60 627 non‐full length reads and
   283 475 full length non‐chimeric reads. After iterative clustering for
   error correction (ICE) and hybrid error correction using RNA‐seq data,
   126 513 high‐quality full‐length transcripts with a mean length of 2792
   bp were generated. After removing redundant sequences using CD‐HIT, a
   total of 63 940 nonredundant transcripts (unigenes) were finally
   obtained. All unigenes were functionally annotated by searching seven
   public databases, including NR, SwissProt, eggNOG, KOG, GO, KEGG, and
   Pfam. A total of 55 116 (86.20%) unigenes were annotated, and 24 457
   (38.25%) unigenes were annotated in all databases (Table [74]S2,
   Supporting Information).

2.2. Identification of SSRs in the P. Malipoense Transcriptome

   The MISA program was used to detect potential SSR loci in the P.
   malipoense transcriptome. A total of 48 295 perfect SSRs were
   identified from 18 770 unigenes, with 7476 unigenes containing more
   than one SSR. The frequency of SSRs was 75.53%, and the mean density
   (SSR counts per Mb) was 248.11 (Table [75]1 ). A total of 575 SSRs
   present in compound formation were composed of 1,179 SSRs, and each
   compound SSR consisted of 2–4 tandem single SSRs. Mono‐nucleotides
   (33 103, 68.56%) accounted for the vast majority of SSRs in the P.
   malipoense transcriptome, followed by di‐nucleotides (5958, 12.34%),
   tri‐nucleotides (3715, 7.69%), penta‐nucleotides (2452, 5.08%),
   hexa‐nucleotides (2173, 4.50%), and tetra‐nucleotides (894, 1.85%)
   (Table [76]1). The SSR tract length ranged from 15 to 217 bp, with an
   average of 19.51 bp. The SSR tract length was significantly different
   between the motif sizes (Kruskal‒Wallis rank sum test, χ ^2 = 9722.9,
   df = 5, p value < 2.2e−16), and slightly declined with increasing motif
   size (Kendall rank correlation coefficient tau = −0.1969, p value <
   2.2e−16) (Figure [77]S1, Supporting Information). The repeat number of
   motifs varied from 3 to 217 with an average of 15.65 (Figure [78]1a).

Table 1.

   Statistical results of SSRs in the P. malipoense transcriptome.
   Item                                               Number
   The total number of sequences examined             63 940
   The total size of examined sequences [bp]        194 653 584
   Total number of identified SSRs                    48 295
   Number of SSR‐containing sequences                 18 770
   Number of sequences containing more than one SSR    7476
   Number of SSRs present in compound formation         575
   Mono‐nucleotide                                    33 103
   Di‐nucleotide                                       5958
   Tri‐nucleotide                                      3715
   Tetra‐nucleotide                                     894
   Penta‐nucleotide                                    2452
   Hexa‐nucleotide                                     2173
   [79]Open in a new tab

Figure 1.

   Figure 1
   [80]Open in a new tab

   Statistics of SSRs in the P. malipoense transcriptome. a) Distribution
   of tandem repeat number of SSRs in the P. malipoense transcriptome. b)
   GC contents of different components in the P. malipoense transcriptome.
   The error bars show the standard error of the mean. ^*** indicates p <
   0.001 (Wilcoxon rank sum test or Dunn's pairwise test). c) Distribution
   of SSR motif types. d) Counts of six SSR motif sizes in three
   transcribed regions. e–g) Density (SSR counts per Mb) of six SSR motif
   sizes in the 5’‐UTR (e), CDS (f), and 3’‐UTR (g).

   A total of 313 motif types were detected in the 48 295 SSRs. A/T, with
   a frequency of 66.33%, was the most dominant among all the motifs.
   AG/CT (3125, 6.47%) and AT/AT (2465, 5.10%) were the top two
   dinucleotide motifs. The most abundant motif types in the trinucleotide
   were ATC/ATG (680, 1.41%) and AAG/CTT (594, 1.23%). For
   tetranucleotides and pentanucleotides, AAAT/ATTT (452, 0.94%) and
   AAAAT/ATTTT (268, 0.55%) were the most frequent motif types,
   respectively. Moreover, AAGTGC/ACTTGC (268, 0.55%) and ACCTGC/AGGTGC
   (260, 0.54%) comprised 24.30% of all hexa‐nucleotides (Figure [81]1c).

2.3. SSR Distribution in Different Transcribed Regions

   We utilized ANGEL 3.0 ([82]https://github.com/PacificBiosciences/ANGEL)
   to predict ORFs, and 15 836 authentic ORFs were verified by BLAST
   searches against the Nr and SwissProt databases and corrected manually.
   Noncoding RNAs were identified by employing five analysis methods,
   including Pfam, CPC2, CNCI, CPAT, and PLEK. After filtering transcripts
   < 200 bp in length, a total of 8128 ncRNAs were confirmed as lncRNAs by
   at least four methods. We compared the SSR distribution characteristics
   among different transcriptome components. Of the 48 295 SSRs, 935,
   2430, 9755, and 6281 were located in the CDS, 5’‐UTR, 3’‐UTR, and
   lncRNA, respectively (Table [83]2 ). The highest density of SSRs was
   observed in the 3’‐UTR, with an average of 1159.56 per Mb, followed by
   the 5’‐UTR (394.57/Mb) and lncRNAs (265.34/Mb). CDS had the lowest
   density of SSRs (36.70/Mb). After excluding 28 895 SSRs located in the
   boundary of CDS and UTRs or lacking solid localization information,
   19 401 SSRs were used for subsequent analysis.

Table 2.

   SSR distribution among different transcriptome components.
   Region Number of sequences Total number of base pairs [bp] Number of
   SSR Density of SSRs [/Mbp]
   CDS 15 836 25 475 917 935 36.70
   UTRs 31 458 14 571 206 12 185 836.24
   5'UTR 15 654 6 158 541 2430 394.57
   3'UTR 15 804 8 412 665 9755 1159.56
   5'UTR‐CDS boundary – – 21 –
   CDS‐3'UTR boundary – – 1 –
   lncRNA 8 128 23 671 941 6281 265.34
   Other transcripts 23 474 76 716 792 28 872 376.35
   [84]Open in a new tab

   SSR counts in different transcribed regions exhibited significant
   differences across the motif sizes (Pearson's chi‐square test, χ ^2 =
   10058, df = 10, p value < 2.2e−16) (Figure [85]1d). CDS had the lowest
   SSR frequency and was dominated by trinucleotides (70.91%)
   (Figure [86]1f). SSRs in the 5’‐UTR were mostly mono‐nucleotide
   (42.14%) and di‐nucleotide (37.12%) (Figure [87]1e). The motif sizes of
   SSRs were most heterogeneous in the 3’‐UTR with mono‐nucleotide
   predominance (92.94%), whose SSR density was 31‐fold higher than that
   of CDS (Figure [88]1g). Furthermore, the SSR tract length showed a
   significant difference among genic regions (Kruskal‒Wallis rank sum
   test, χ ^2 = 668.56, df = 2, p value < 2.2e−16). The average length of
   SSR tracts located in CDS (17.52 bp) was shorter than that in the UTRs,
   and SSRs in the 3’UTR (19.87 bp) were significantly longer than those
   in the 5’‐UTR (19.55 bp) (Figure [89]S2, Supporting Information). The
   Kruskal‒Wallis rank sum test confirmed that the interaction between
   regions and motif sizes affected the length of SSR tracts (H = 38.09,
   df = 10, p value = 3.66e−05), especially for mononucleotides,
   dinucleotides, and hexanucleotides (Figure [90]S3, Supporting
   Information).

   We also calculated the GC content of the SSRs and their background
   sequences. The GC% of SSRs in all of the transcriptome components
   showed significant differences from the sequences in which they were
   located (Wilcoxon rank sum test, W = 9971507, p value < 2.2e−16) and
   was lower than the GC% of their background sequences excluding CDS. CDS
   had the highest GC% of SSRs at 61.78 ± 29.14%, followed by 5’‐UTR
   (31.06 ± 31.06%) and lncRNA (2.55 ± 12.82%), and 3’‐UTR had the lowest
   GC% of SSRs. The post hoc test indicated significant differences among
   all the transcribed regions (Dunn's pairwise test, p value < 2.2e−16)
   (Figure [91]1b).

   Adjusted standardized residuals from the chi‐squared test revealed
   preferences of motif types among transcribed regions. A/T repeats were