Abstract Pseudotaxus chienii (Taxaceae) is an endangered conifer species endemic to China. However, a lack of suitable molecular markers hinders the genomic and genetic studies on this species. Here, we characterized and developed the microsatellite markers from a newly sequenced P. chienii transcriptome. A total of 21,835 microsatellite loci were detected from 161,131 non-redundant unigene sequences, and the frequency of SSRs was 13.55%, with an average of one SSR loci per 9.18 kb. Mono-nucleotide, di-nucleotide, and tri-nucleotide were the dominant repeat types, accounting for 50.06, 13.49, and 29.39% of the total SSRs, respectively. In terms of distribution location, the coding regions (CDS) with few microsatellites and mainly consisted of tri-nucleotides. There were significant differences in the length of microsatellite among genic regions and motif types. Functional annotation showed that the unigenes containing microsatellites had a wide range of biological functions, most of which were related to basic metabolism, and a few might be involved in expression regulation of gene and response to environmental stress. In addition, 375 primer pairs were randomly selected and synthesized for the amplification and validation of microsatellite markers. Seventy-seven primer pairs were successfully amplified and 40 primer pairs were found to be polymorphic. Finally, 20 pairs of primers with high polymorphism were selected to assess the genetic diversity in four P. chienii populations. In addition, the newly developed microsatellite markers exhibited high transferability (70%) in Amentotaxus argotaenia. Our study could enable further genetic diversity analysis and functional gene mining on Taxaceae. Keywords: Pseudotaxus chienii, transcriptome, microsatellite markers, genetic diversity, transferability Introduction Pseudotaxus chienii, belonging to monotypic genus Pseudotaxus W. C. Cheng (Taxaceae) endemic to China, is narrowly distributed Zhejiang, Jiangxi, Hunan and Guangdong Province, and the Guangxi Zhuang Autonomous Region ([29]Fu et al., 1999). The species is an evergreen shrub or small tree plants, up to 4 meters, characterized by seeds of a white fleshy aril ([30]Fu et al., 1999). Most of its habitats are in the subtropical evergreen broad-leaved forest or deciduous broad-leaved forest at an elevation of about 800–1500 m ([31]Fu et al., 1999). However, the distribution range and population size of P. chienii have been continuously declining due to the low seed-setting rate, the difficult natural reproduction, the strict habitat requirements, and the impact of human activities ([32]Fu and Jin, 1992; [33]Yang et al., 2005). Currently, P. chienii has been listed as a second-class nationally protected plant by the Chinese Red Data Book ([34]Fu and Jin, 1992). Thus, it is imperative to develop effective strategies to protect this species. The first step is a comprehensive knowledge of its genetic diversity, which requires a powerful marker resource. However, due to the lack of genomic resources, the knowledge about functional gene analysis and molecular marker development of this species is limited. Microsatellites, also known as simple sequence repeats (SSRs) or short tandem repeats (STRs), consist of short tandem repeated motifs of 1–6 bp units, which are ubiquitous in the genomes of prokaryotic and eukaryotic organisms ([35]Tautz and Renz, 1984; [36]Toth et al., 2000). Microsatellite markers are one of the most efficient types of molecular markers because of their high polymorphism, wide genome coverage, co-dominant inheritance, and good reproducibility ([37]Powell et al., 1996; [38]Gupta and Varshney, 2000; [39]Varshney et al., 2005). Hence, they have been widely used in genetic linkage map, quantitative trait loci (QTL) mapping, marker-assisted selection (MAS) breeding, evolutionary studies, and genetic diversity analysis ([40]Nishio et al., 2008; [41]Garvin et al., 2010; [42]Ahmadi and Fotokian, 2011; [43]Wang et al., 2015; [44]Zhao et al., 2019). Traditional methods for developing microsatellite markers are time-consuming, laborious, and expensive ([45]Zane et al., 2002). With the development of next-generation sequencing technology (NGS), transcriptome sequencing has gradually become a convenient and efficient method for the development of large-scale SSR markers, especially for non-model species without an available reference genome ([46]Taheri et al., 2018). SSRs can be divided into genomic SSRs (g-SSRs) and expressed sequence tag SSRs (EST-SSRs) ([47]Eujayl et al., 2004). EST-SSRs are derived from gene transcription regions and are more conserved than g-SSRs, therefore, are more likely to be transferable across related species ([48]Dutta et al., 2011). To date, there have been few investigations of EST-SSR markers in P. chienii, especially based on transcriptome sequencing. The distribution of microsatellites in genomic regions is non-random ([49]Li et al., 2002). For example, in many species, protein-coding regions contain fewer di-nucleotide and tetra-nucleotide SSRs, but more tri-nucleotide and hexa-nucleotide SSRs ([50]Li et al., 2004). According to a study on SSRs in human and Arabidopsis thaliana, the number of SSRs in untranslated regions (UTRs) was more abundant than that in coding regions (CDS) ([51]Zhang et al., 2004). In general, the occurrence frequency of SSRs in coding regions is relatively low, which may be explained by its high mutation rate affecting gene expression ([52]Vieira et al., 2016). Besides, similar distribution patterns were observed in microsatellites located in intergenic regions and introns ([53]Hancock, 1995). The function of microsatellites is associated with their location in the genome. Compared with other genic regions, microsatellites in coding regions are the most conserved ([54]Xin et al., 2012), which may be attributed to selection pressure ([55]Li et al., 2004). Microsatellite variations in coding regions can directly affect the protein synthesis process ([56]Gao et al., 2013). SSR variations in 5′-UTRs can be involved in gene expression regulation by affecting transcription and translation ([57]Li et al., 2004; [58]Vieira et al., 2016). SSR expansions in 3′-UTRs can lead to transcription slippage and mRNA sequence expansion, which further disrupt mRNA splicing and other cell functions ([59]Li et al., 2004; [60]Kalia et al., 2011). Intronic SSRs may play a role in regulating gene transcription, mRNA splicing, or export to cytoplasm ([61]Li et al., 2004). SSR variations in intergenic regions may have an impact on gene function by changing the secondary structure of DNA or the source of small RNAs ([62]Gao et al., 2013). However, there has been no report on the function of microsatellites in P. chienii, which is not conducive to the discovery of functional genes and the identification of phenotypic variation. In this study, we mined microsatellite loci from a newly sequenced transcriptome of P. chienii using Illumina sequencing technology. The aims of this study were to (i) characterize the frequency, distribution, function, and evolution of microsatellites in P. chienii transcriptome; (ii) develop polymorphic microsatellite markers and verified their polymorphism level for P. chienii; (iii) test cross-species transferability of the novel microsatellite markers among Taxaceae species. The results of this study will provide new insights into the function and evolution of microsatellites in P. chienii transcriptome. Additionally, the newly developed microsatellite markers may facilitate the conservation and management of P. chienii genetic resources. Materials and Methods Sample Collection, RNA and DNA Extraction The plant materials (root, stem, leaf, and strobilus) used for RNA isolation and transcriptome sequencing were collected from a single Pseudotaxus chienii individual growing in Bijiashan, Jiangxi Province, China (26°30′37′′N, 114°09′42′′E, 1290 m asl). Each tissue sample was immediately stored in RNA Protective Additive after collection and removal of dust. The collected samples were first stored overnight at 4°C and then stored at −20°C the next day. Total RNA was separately extracted from each tissue using the RNAprep pure Plant Kit (Tiangen, Beijing, China) according to the manufacturer’s instructions. RNA degradation and contamination were detected using 1% agarose gels. RNA quality assessments for purity, integrity, and concentration were separately performed using the NanoPhotometer spectrophotometer (IMPLEN, CA, United States), Qubit 2.0 Fluorometer (Life Technologies, CA, United States), and Agilent Bioanalyzer 2100 system (Agilent Technologies, CA, United States). For SSR characterization, the young leaves of 110 individuals were collected from four geographic localities across China: 31 from Maoshan, Zhejiang (MS), 30 from Yinshan park, Guangxi (YS), 30 from Bijiashan, Jiangxi (BJS), and 19 from Zhangjiajie, Hunan (ZJJ) ([63]Table 1). Total genomic DNA was extracted using the modified cetyltrimethyl ammonium bromide (CTAB) method ([64]Su et al., 2005). The quality and quantity of DNA were determined by NanoDrop 2000 spectrophotometer (ThermoFisher Scientific, Waltham, MA, United States) and 0.8% agarose gel electrophoresis. Finally, all DNA samples were diluted to 50 ng/μL and stored at −20°C until use. TABLE 1. Sampling location information of four P. chienii populations. Population Location Sample Size Latitude (N) Longitude (E) Altitude (m a.s.l.) MS Maoshan, Zhejiang 31 118°58′23′′ 28°06′08′′ 1120 YS Yinshan park, Guangxi 30 110°14′53′′ 24°09′15′′ 1050 BJS Bijiashan, Jiangxi 30 114°09′41′′ 26°30′35′′ 1340 ZJJ Zhangjiajie, Hunan 19 110°28′56′′ 29°23′12′′ 1055 [65]Open in a new tab Transcriptome Sequencing and de novo Assembly The qualified RNA samples from four tissues were used for cDNA library construction and transcriptome sequencing. Four cDNA libraries were constructed using NEBNext Ultra RNA Library Prep Kit for Illumina (NEB, United States) according to the manufacturer’s instructions. Transcriptome sequencing was performed using Illumina HiSeq 2500 platform (Illumina, San Diego, CA, United States). After sequencing, raw reads were filtered to remove the adaptor sequences and low-quality reads. The remaining high-quality clean reads were de novo assembled using Trinity software ([66]Grabherr et al., 2011) with min_kmer_cov set to 2 and other parameters set to default. Assembled transcripts were clustered, and removal of the redundant sequences to generate non-redundant unigenes by CD-HIT software ([67]Fu et al., 2012). Microsatellite Identification and Characterization The MISA software (Microsatellite searching tool^[68]1) was used to identify microsatellite loci in the assembled transcriptome of P. chienii. The parameters were set as follows: mono-, di-, tri-, tetra-, penta-, and hexa-nucleotides with minimum repeat numbers of 10, 6, 5, 5, 5, and 5, respectively. For compound SSRs, the maximum interruption between two SSRs was set as 100 bases. In order to investigate the distribution of microsatellites in P. chienii transcriptome, the coding sequence (CDS) regions of unigenes were predicted based on the BLAST results against the Swiss-Prot database (E-value < 10^–5), and then CDS regions that showed no hits in BLAST were predicted using ESTScan ([69]Iseli et al., 1999). The relative position of microsatellites (CDS, 5′-UTR, or 3′-UTR) was inferred from the positions of start and stop codons of the CDS. Chi-square analysis was used to test whether there were significant differences in different genic regions (CDS, 5′-UTR, and 3′-UTR) for the density of microsatellites. Kruskal–Wallis rank sum test ([70]Kruskal and Wallis, 1952) was used to test the influence of genic region, repeat motif size, and repeat motif type on microsatellite length. Function Annotation of Unigenes Containing Microsatellites To identify the possible function of microsatellites, all unigene sequences containing microsatellites were aligned to the public databases by BLASTX program (E-value < 10^–5), including Nr (NCBI non-redundant protein sequences), Nt (NCBI non-redundant nucleotide sequences), Swiss-Prot (A manually annotated and reviewed protein sequence database), KOG (Clusters of eukaryotic Orthologous Groups), and Pfam (Protein family, assigned using the HMMER3.0 package). Besides, for all unigene sequences containing microsatellites, the GO (Gene Ontology) annotation was performed by the Blast2GO software ([71]Conesa et al., 2005) (E-value < 10^–6). The metabolic pathway analysis was predicted by KAAS (KEGG Automatic Annotation Server) (E-value < 10^–10). To determine whether microsatellites were significantly enriched in some GO terms or KEGG pathways, GO and KEGG enrichment analysis were performed with the topGO package ([72]Alexa and Rahnenfuhrer, 2010) and clusterProfiler package ([73]Yu et al., 2012), respectively. Over-represented GO terms/KEGG pathways were estimated by Fisher’s exact test with multiple testing correction of FDR (Q-value < 0.05) ([74]Benjamini and Hochberg, 1995). Amplification and Validation of Microsatellite Markers Microsatellite primers were designed using Primer Premier 5.0 ([75]Lalitha, 2000) based on the following criteria: (1) primer length of 18–25 bp, with an optimum length of 20 bp; (2) annealing temperature of 50–62°C with a maximum of 5°C difference in annealing temperature between forward and reverse primers; (3) CG content from 40 to 60%; (4) the expected PCR product sizes of 100–500 bp; (5) avoidance of primer dimers and hairpin structures. PCR amplifications were conducted in a volume of 25 μL containing 50 ng of template DNA, 2.5 μL of 10 × PCR buffer (2.0 mM Mg^2+), 1.6 μL of dNTPs (10 mM), 0.5 μL of each primer (10 μM), 1 U of Taq polymerase, and 18.7 μL of double-distilled water. The PCR conditions were as follows: initial denaturation at 94°C for 5 min, followed by 35 cycles of 40 s at 94°C, 40 s at annealing temperature, and 30 s at 72°C, and a final extension at 72°C for 10 min. The amplified products were detected by 1.5% (w/v) agarose gel electrophoresis. Then, the successfully amplified primers were selected for polymorphic screening using 6% denaturing polyacrylamide gel electrophoresis. Finally, microsatellite primers with high polymorphism and good repeatability were selected for genotyping on an ABI 3730XL DNA Analyzer (Applied Biosystems, Foster City, CA, United States) with GeneScan LIZ 500 as an internal reference (Applied Biosystems). Allele sizes were detected with the software GeneMapper v4.0 (Applied Biosystems). The genetic parameters of microsatellite loci were estimated by GenAlEx v6.5 software ([76]Peakall and Smouse, 2012), including the number of alleles (Na), the number of effective alleles (Ne), observed heterozygosity (Ho), and expected heterozygosity (He). The polymorphic information content (PIC) of each microsatellite marker was calculated using PIC_CALC v0.6 ([77]Nagy et al., 2012). The null allele frequency for each marker was estimated using MICRO-CHECKER 2.2.3 ([78]Van Oosterhout et al., 2004). Tests for deviations from linkage disequilibrium (LD) and Hardy-Weinberg equilibrium (HWE) were carried out by GENEPOP v4.2 ([79]Raymond and Rousset, 1995). The software BOTTLENECK v1.2.02 ([80]Piry et al., 1999) was used to detect recent population bottlenecks under three different models of microsatellite evolution (Infinite allele model, IAM; Stepwise mutation model, SMM; Two-phased model of mutation, TPM). Cross-Species Amplification and Transferability Analysis For cross-species transferability analysis, the related species of the family Taxaceae, Amentotaxus argotaenia (with 58 individuals from four different populations, [81]Supplementary Table 1), was chosen to evaluate the transferability of these newly developed microsatellite markers. Genomic DNA isolation and PCR amplification were performed as described above. PCR products were separated by capillary electrophoresis using an ABI 3730XL DNA Analyzer (Applied Biosystems, Foster City, CA, United States), and the sizes of fragments were determined using GeneMapper v4.0 (Applied Biosystems). Results Illumina Sequencing and de novo Assembly In this study, the Illumina HiSeq 2500 platform was used to sequence the transcriptome of P. chienii. A total of 52,633,598, 57,112,536, 62,840,720, and 42,617,842 raw reads were generated from root, stem, leaf, and strobilus, respectively. After trimming adaptors and removing low-quality reads, 51,073,864, 56,419,624, 61,110,258, 41,448,574 high-quality clean reads were obtained for root, stem, leaf, and strobilus ([82]Table 2). The total nucleotide number of each tissue was greater than 6G. The Q20 and Q30 ratio of each tissue were over 95 and 91%, respectively. The GC content of each tissue was approximately 45% ([83]Table 2). Thus, the sequencing data was of high quality and met the requirement for subsequent analyses. Using the Trinity program ([84]Grabherr et al., 2011), the high-quality clean reads from four tissue were assembled into 156,747, 202,908, 107,675, and 201,124 transcripts, respectively. The assembled transcripts of these tissues were clustered into 66,126, 79,842, 52,207, and 60,391 unigenes ([85]Table 2). After removing the redundant sequences, a total of 161,131 non-redundant unigenes were generated. TABLE 2. Summary of de novo transcriptome assembly of four tissues from P. chienii. Categories Items Root Stem Leaf Strobilus Raw reads Total raw reads 52,633,598 57,112,536 62,840,720 42,617,842 Clean reads Total clean reads 51,073,864 56,419,624 61,110,258 41,448,574 Total clean bases (Gb) 7.66 8.46 9.17 6.22 Q20 (%) 97.16 98.03 96.9 97.28 Q30 (%) 92.16 94.03 91.55 92.34 GC content (%) 44.25 46.26 45.57 45.23 Transcripts Total number of transcripts 156,747 202,908 107,675 201,124 Total length of transcripts (bp) 113,815,935 136,683,283 89,744,115 114,523,821 Maximum length of transcripts (bp) 13,579 12,558 11,562 12,275 Minimum length of transcripts (bp) 201 201 201 201 Mean length of transcripts (bp) 726 674 833 569 N50 (bp) 1,395 1,224 1,621 1,011 Unigenes Total number of unigenes 66,126 79,842 52,207 60,391 Total length of unigenes (bp) 87,790,725 101,369,156 74,169,328 76,205,289 Maximum length of unigenes (bp) 13,579 12,558 11,562 12,275 Minimum length of unigenes (bp) 201 201 201 201 Mean length of unigenes (bp) 1,328 1,270 1,421 1,262 N50 (bp) 1,843 1,728 1,931 1,754 [86]Open in a new tab Characterization of Microsatellite in P. chienii Transcriptome Using the MISA program, a total of 21,835 potential microsatellite loci were identified from 161,131 non-redundant unigenes, with 2,969 unigenes containing more than one microsatellite loci. The frequency of microsatellite in P. chienii transcriptome was 13.55%; an average of one microsatellite loci was found every 9.18 kb ([87]Table 3). The most abundant repeat motif types were mono-nucleotide (10,930, 50.06%), which accounted for about half of total SSRs, followed by tri-nucleotide (6,418, 29.39%), di-nucleotide (2,946, 13.49%), hexa-nucleotide (844, 3.87%), tetra-nucleotide (354, 1.62%), and penta-nucleotide (343, 1.57%) ([88]Table 3). The number of tandem repeats of microsatellite motifs ranged from five to 107. Microsatellite with ten tandem repeats (5,141, 23.54%) were the most common, followed by five tandem repeats (4,393, 20.12%), six tandem repeats (2,913, 13.34%), and 11 tandem repeats (2,357, 10.79%). Microsatellite motifs with > 24 tandem repeats only accounted for 2.13% ([89]Figure 1). TABLE 3. Analysis results of microsatellite based on the P. chienii transcriptome. Items Number Total number of sequences examined 161,131 Total size of examined sequences (bp) 200,495,148 Total number of identified microsatellite loci 21,835 Number of microsatellites containing sequences 17,974 Number of sequences containing more than 1 microsatellite loci 2,969 Number of microsatellites present in compound formation 1,513 Frequency of microsatellite loci 13.55 Distribution density of microsatellite loci (kb) 9.18 Mono-nucleotide 10,930 Di-nucleotide 2,946 Tri-nucleotide 6,418 Tetra-nucleotide 354 Penta-nucleotide 343 Hexa-nucleotide 844 [90]Open in a new tab FIGURE 1. [91]FIGURE 1 [92]Open in a new tab Distribution of microsatellite motif types and tandem repeat numbers in P. chienii transcriptome. Of the 21,835 microsatellite loci, 239 different repeat motifs were detected. For the mono-nucleotide motif, A/T was the most abundant type with a frequency of 49.13%. For the di-nucleotide repeat motifs, AT/AT (6.31%) and AG/CT (5.32%) were the dominant types. The two most frequent types in the tri-nucleotide were AAG/CTT (6.38%) and AGG/CCT (6.16%). Moreover, a low percentage (7.06%) of tetra-, penta- and hexa-nucleotide repeat motifs were observed in all identified microsatellite motifs ([93]Figure 2). FIGURE 2. [94]FIGURE 2 [95]Open in a new tab Distribution of microsatellite repeat motifs in P. chienii transcriptome. Microsatellite length ranged from 10 to 107 bp, with an average length of 16.23 bp. In general, the length of microsatellites differed significantly between the motif size classes (Kruskal–Wallis rank sum test, P < 2.2e^–16), and their average length decreased with the increase of the motif size (Nemenyi test, P < 2.2e^–16). However, there was no significant difference in length between tetra- and penta-nucleotide according to the pairwise length comparisons by the Nemenyi test ([96]Supplementary Figure 1). Microsatellite Distribution in Different Genic Regions We explored the distribution characteristics of microsatellite loci in P. chienii transcriptome. Of the 21,835 microsatellite loci, 2,007 and 8,958 were located in the coding sequence regions (CDS) and untranslated regions (UTRs), respectively ([97]Figure 3). The remaining 10,870 microsatellites were excluded from the analysis because there was insufficient information to determine their distribution. Microsatellites in different genic regions (CDS, 5′-UTRs, and 3′-UTRs) showed distinct patterns of distribution (χ^2 = 2993.2, P < 2.2e^–16). CDS contained fewer microsatellites and was dominated by tri-nucleotides (1,481, 73.79%), whereas the mono- and di-nucleotide microsatellites in the UTRs (4,869, 54.35%) were abundant. Smaller motif types (mono-, di-, and tri-nucleotide microsatellites) were also observed to be more common in the transcriptome ([98]Figure 3). Furthermore, significant differences in the length of microsatellite among three regions (CDS, 5′-UTR, and 3′-UTR) (Kruskal–Wallis rank sum test, P < 2.2e^–16, [99]Supplementary Figure 2). The average length of microsatellites located in the CDS regions (17.84 bp) was slightly longer than in the UTRs (16.43 bp). The length of the microsatellite was also affected by the interaction of the regions and motif types according to the Kruskal–Wallis rank sum test (P < 2.2e^–16, [100]Supplementary Figure 3). FIGURE 3. [101]FIGURE 3 [102]Open in a new tab Distribution of six microsatellite repeat types in different genic regions of P. chienii. Functional Annotation and Classification of Unigenes Containing Microsatellites To predict the potential functions of microsatellites, all microsatellite-containing unigenes were blasted against the public databases using the BLASTX program. As a result, 12,652 (70.39%) unigenes were annotated in Nr, 6,660 (37.05%) in Nt, 10,622 (59.10%) in Swiss-Prot, 4,274 (23.78%) in KOG, 11,267 (62.68%) in Pfam, 11,334 (63.06%) in GO, and 5,417 (30.14%) in KEGG. A total of 2,045 (11.38%) unigenes were annotated in all seven databases, and 14,264 (79.36%) unigenes were annotated in at least one database ([103]Table 4). TABLE 4. Summary of functional annotation results of unigenes containing microsatellite in P. chienii transcriptome. Annotated databases Number of unigenes Percentage (%) Nr 12,652 70.39 Nt 6,660 37.05 Swiss-Prot 10,622 59.10 KOG 4,274 23.78 Pfam 11,267 62.68 GO 11,334 63.06 KEGG 5,417 30.14 Annotated in all databases 2,045 11.38 Annotated in at least one database 14,264 79.36 Total 17,974 100 [104]Open in a new tab Using the Blast2GO Program, the microsatellite-containing unigenes were divided into three categories: biological process (BP), cellular component (CC), and molecular function (MF), including 48 sub-categories ([105]Figure 4). In the BP category, “cellular process” (5,464, 48.21%) and “metabolic process” (4,879, 43.05%) were represented prominently. Within the CC category, “cell” (1,975, 17.43%) and “cell part” (2,642, 23.31%) were the most abundant terms, whereas only a few unigenes were assigned to “extracellular region part” (3), “cell junction” (3), and “extracellular matrix” (2). Among eight different MF categories, “catalytic activity” (3,813, 33.64%) and “binding” (3,078, 27.16%) were the two most frequent classes. These results indicated that the SSRs-containing unigenes were mainly involved in the basal metabolism of cells. FIGURE 4. [106]FIGURE 4 [107]Open in a new tab GO functional classification of unigenes containing microsatellite in P. chienii transcriptome. Furthermore, KEGG analysis was used to determine the biological pathways and functions of the unigenes containing microsatellites. In total, 5,125 (28.51%) unigenes were mapped in 124 different pathways and classified into 5 main categories. The largest category was “metabolism” (3,086, 56.97%), followed by “genetic information processing” (1,326, 24.48%), “cellular processes” (284, 5.24%), “environmental information processing” (228, 4.21%), and “organismal systems” (201, 3.71%) ([108]Figure 5). Among these 124 pathways, the most represented pathways including “spliceosome,” “ribosome,” “carbon metabolism,” “plant–pathogen interaction,” “plant hormone signal transduction,” “biosynthesis of amino acids,” “RNA transport,” “protein processing in endoplasmic reticulum,” “starch and sucrose metabolism,” and “ubiquitin mediated proteolysis” ([109]Supplementary Table 2). FIGURE 5. [110]FIGURE 5 [111]Open in a new tab KEGG pathway classification of unigenes containing microsatellite in P. chienii transcriptome. Finally, all microsatellite-containing unigenes were searched against the KOG database to classify and functionally annotation the orthologous gene products. The results showed that a total of 4,274 unigenes were classified into 24 functional groups. Among them, the three largest groups were “posttranslational modification, protein turnover, chaperones” (642, 15.02%), “general function prediction only” (511, 11.96%), and “translation, ribosomal structure and biogenesis” (484, 11.32%), respectively. However, the smallest categories were “extracellular structures” and “nuclear structure,” with two and 18 genes annotated, respectively ([112]Figure 6). These results reflected that the unigenes containing microsatellites had extensive basic biological functions. FIGURE 6. [113]FIGURE 6 [114]Open in a new tab KOG functional classification of unigenes containing microsatellite in P. chienii transcriptome. GO and KEGG Enrichment Analysis of the Microsatellite-Containing Unigenes In order to further identify molecular and biological functions of the unigenes containing microsatellites, Fisher’s exact test was used for GO and KEGG enrichment analysis, respectively. There was significant enrichment of microsatellites in a total of 211 GO terms and most were mapped in transcription, regulation of transcription, and binding (Q-value < 0.05, [115]Supplementary Table 3). KEGG pathway enrichment analysis indicated that the microsatellite-containing unigenes showed significantly enriched in 13 pathways (Q-value < 0.05), and Nine of them are related to basal metabolism and biosynthesis of cells. Notably, the remaining four pathways involved in gene expression regulation, and might play important roles in the response to stress ([116]Figure 7). FIGURE 7. [117]FIGURE 7 [118]Open in a new tab Significant enrichment KEGG pathways of unigenes containing microsatellite. Development and Validation of Microsatellite Markers Based on the SSR-containing unigene sequences, 375 primer pairs were randomly selected and synthesized for the amplification and validation of microsatellite markers. After screening, 77 primer pairs (20.59%) successfully amplified and produced clear bands with expected size, 40 pairs of primers were identified as polymorphic, and 37 pairs were monomorphic ([119]Supplementary Table 4). Finally, 20 primers with high polymorphism were selected and used for genetic diversity analysis ([120]Table 5). In total, 128 alleles were detected at the 20 polymorphic microsatellite loci. The number of amplified alleles per locus (Na) ranged from 2 (P. chienii-198) to 16 (P. chienii-152), with an average of 6.4 alleles per locus. The observed (Ho) and expected (He) heterozygosity ranged from 0.000 to 1.000 and from 0.129 to 0.887, with an average of 0.349 and 0.527, respectively. The PIC values ranged from 0.125 to 0.877, with an average of 0.483. Besides, a high frequency of null alleles was detected at six loci (P. chienii-17, P. chienii-25, P. chienii-86, P. chienii-162, P. chienii-267, and P. chienii-288), and one pair of loci (P. chienii-28 and P. chienii-162) showed LD. HWE test revealed that 13 of 20 microsatellite loci showed significant deviations from the equilibrium after Bonferroni correction (P < 0.05) ([121]Table 5). TABLE 5. Characteristics of 20 newly developed polymorphic microsatellite markers in P. chienii. Locus Primer sequence (5′-3′) Repeat motif Size range (bp) Annealing temperature (°C) Na Ho He PIC* Null GenBank accession no. P. chienii-17 F: GGTCGAGTACGTGGTGGTTT R: GCCTGCGCTGTCATAAACTG (AGG)5 158–161 57 3.000 0.145 0.257 0.237** 0.2703 (Y) [122]MT563347 P. chienii-20 F: TGTGCCAGTACTGCTACTGC R: TGAATGCGTGCGGAAACAAG (ACC)6 183–195 57 5.000 0.209 0.585 0.517 −0.0267 (N) [123]MT563350 P. chienii-25 F: GATGCCGCTGGTTTCAATCC R: GCCGTACCGATTGGGATCAT (GGA)8 198–214 57 5.000 0.309 0.397 0.366* 0.315 (Y) [124]MT563351 P. chienii-28 F: GAGTGGGAGACGAAGAGTGC R: CGAAGTGGGCTGCAACAATG (CTC)8 261–266 57 6.000 0.600 0.670 0.626 0.0196 (N) [125]MT563352 P. chienii-29 F: AGCTGCAAGGCTACACAGAG R: CAATCCCGGGCCTGTTAGAA (GAA)5 235–239 57 5.000 0.318 0.563 0.500** −0.1413 (N) [126]MT563353 P. chienii-36 F: GGGCCATCCTCTTCCTCAAC R: CTCGACACTGCTCCACATCT (TCC)8 240–247 57 3.000 0.055 0.129 0.125** 0.0299 (N) [127]MT563358 P. chienii-75 F: CGCTCCAACGAATCCAACC R: TAATGCCATCCGCACAACC (CAGAAG)5 241–277 57 14.000 0.743 0.844 0.826 −0.1586 (N) [128]MT563368 P. chienii-86 F: GAATTTGAAGCACGGCCTCA R: GAGTGCCCTGCTTTCTGGAT (GGCACC)5 242–277 57 7.000 0.156 0.595 0.549** 0.3533 (Y) [129]MT563369 P. chienii-134 F: ACGCCACGTTAGGACACAAT R: CCTAGATCAAGAGCGGCCTG (CTT)6 238–319 57 6.000 0.227 0.389 0.364 0.0983 (N) [130]MT563374 P. chienii-141 F: CTGTCAACAAGCGGCTTTCC R: AGAGCCGGGGGAAAATTGAG (CGG)7 232–319 57 8.000 0.473 0.694 0.640** 0.0296 (N) [131]MT563377 P. chienii-152 F: CCCATCTGAACCCACGCTAA R: AAAGCGCTCATGCCCAAAAC (GGC)7 239–300 57 16.000 1.000 0.887 0.877** −0.2134 (N) [132]MT563378 P. chienii-162 F: ACCTATCACCTCCTCGACCC R: CCGTTCCATCACTGTGGACA (CCACCG)6 203–229 55 13.000 0.573 0.807 0.784** 0.3871 (Y) [133]MT563379 P. chienii-198 F: GAGGGATACAGAAGCACAG R: TATGACAAACCCAAACGAG (ATA)5 276–288 56 2.000 0.591 0.416 0.330** 0 (N) [134]MT563329 P. chienii-214 F: GACAACGGCAAAGGAGGAAT R: GCGATAGCCACCAAAGACAT (ATA)6 319–322 58 3.000 0.000 0.329 0.290** 0 (N) [135]MT563334 P. chienii-216 F: TGCGGTTCAGTAACAGTCCTTC R: TCCCCCACCTCTTCCCAG (CTCCTG)5 439–452 58 6.000 0.657 0.513 0.408 −0.314 (N) [136]MT563335 P. chienii-233 F: TGTGTGAAAGGACAAGGCGT R: GCACCCTATTCACCCGAGAT (ATGCAG)7 225–259 56 5.000 0.318 0.474 0.431 −0.0267 (N) [137]MT563337 P. chienii-267 F: CCCCTCATTGACAGGTTC R: AAGATAGTCGGGACACCAAG (CTG)5 309–320 56 6.000 0.036 0.336 0.306** 0.3232 (Y) [138]MT563381 P. chienii-288 F: CACGCCCACCATAGTTGT R: GGAGGAAGATGTCGTTGAAG (AAGG)5 263–270 58 3.000 0.045 0.523 0.449** 0.2884 (Y) [139]MT563386 P. chienii-341 F: GACCTCTTACCAGCTGCGAG R: ACCACCGGTTTCAGTTTCGT (CCT)11 208–227 62 9.000 0.200 0.765 0.726** 0 (N) [140]MT563390 P. chienii-358 F: TAAGTGGCTGCTGCATCACA R: TACAGCAGCAGCAGAGCTTT (TCC)5 249-251 58 3.000 0.318 0.360 0.302 0.1828 (N) [141]MT563392 [142]Open in a new tab Locus, locus name; Na, number of alleles; Ho, observed heterozygosity; He, expected heterozygosity; PIC, Polymorphism information content (*P < 0.05, significantly departures from Hardy–Weinberg equilibrium; **P < 0.01, extremely significantly departures from Hardy–Weinberg equilibrium); Null, the frequency of null allele (Y represents that the SSR locus may have a null allele, N represents that the SSR locus may not have null allele). At the population level, MS population had the highest level of genetic diversity (Na = 3.450, Ne = 1.997, Ho = 0.384, and He = 0.389), whereas the lowest diversity levels were in ZJJ population (Na = 2.850, Ne = 1.759, Ho = 0.272, and He = 0.340). All of the inbreeding coefficient (F[IS]) values were positive for four populations, which indicated that there was a certain degree of inbreeding in P. chienii populations ([143]Table 6). Besides, bottleneck analysis showed that YS and BJS populations showed heterozygous excess and might experience a significant genetic bottleneck under the SMM models (P < 0.05) ([144]Supplementary Table 5). TABLE 6. Genetic diversity parameters of four populations of P. chienii. Population N Na Ne I Ho He PPB (%) F[IS] MS 31 3.450 1.997 0.702 0.384 0.389 90 0.031^ns YS 30 3.500 1.918 0.670 0.360 0.357 85 0.008^ns BJS 30 3.650 2.033 0.716 0.350 0.381 100 0.099** ZJJ 19 2.850 1.759 0.602 0.272 0.340 85 0.227** Species level 110 6.400 2.814 1.071 0.349 0.527 100 0.342 [145]Open in a new tab N, numbers of individuals; Na, number of alleles; Ne, number of effective alleles; Ho, observed heterozygosity; He, expected heterozygosity; I, Shannon’s information index; PPB, percentage of polymorphic loci; F[IS], inbreeding coefficient. ^nsP > 0.05, **P < 0.01. Transferability of Microsatellite Markers To evaluate the transferability of the microsatellite markers from the P. chienii transcriptome, we performed cross-species amplification analysis for these 20 novel microsatellite markers in A. argotaenia, 14 primers were amplified products. Therefore, the overall transferability rates were 70%. Among the 14 primer pairs, 12 showed polymorphisms in 58 A. argotaenia samples. For the 12 polymorphic loci, there were 2–7 alleles at each locus, with a total of 50 alleles. The Ho, He, I, and PIC per locus ranged from 0 to 1.000, 0.133 to 0.741, 0.257 to 1.539, and 0.155 to 0.664, respectively ([146]Supplementary Table 6). The microsatellite loci were moderately polymorphic. Discussion Characterization of Microsatellites in P. chienii Transcriptome Recently, next-generation sequencing (NGS) technologies, such as transcriptome sequencing, have made it possible to develop large-scale microsatellite markers ([147]Taheri et al., 2018). In this study, a total of 21,835 microsatellites were identified from 161,131 unigenes of the transcriptome. The frequency of microsatellites was 13.55%, and the distribution density was one SSR per 9.18 kb, which was much higher than previous reports in Pinus dabeshanensis (1/23.08 kb) ([148]Xiang et al., 2015) and Larix principis-rupprechtii Mayr (1/26.8 kb) ([149]Dong et al., 2018), but lower than Cephalotaxus hainanensis (1/1.652 kb) ([150]Qiao et al., 2014) and Glyptostrobus pensilis (1/7.59 kb) ([151]Li et al., 2019). This result showed that SSR sites were abundant in P. chienii transcriptome. The frequency and density of SSRs varied from species to species, which could be attributed to the size of unigene assembly dataset, the search criteria, mining tools for SSR locus, and genome organization ([152]Morgante et al., 2002; [153]Varshney et al., 2005; [154]Taheri et al., 2019). In our study, six different repeat motifs were identified. Excluding the mono-nucleotide repeats, tri-nucleotide (29.39%) repeats were the most abundant type, followed by di-nucleotide repeats (13.49%). The AAG/CTT (1,382, 6.38%) motif was the most abundant of the tri-nucleotide repeats, followed by AGG/CCT (1,346, 6.16%). This result might be common in conifer species, including Glyptostrobus pensilis ([155]Li et al., 2019), Larix principis-rupprechtii Mayr ([156]Dong et al., 2018), and Torreya grandis ([157]Zeng et al., 2018). Among the di-nucleotide repeats, AT/AT (1377, 6.31%) was the most dominant motif, which was consistent with studies in Cephalotaxus hainanensis ([158]Qiao et al., 2014) and Pinus koraiensis ([159]Li et al., 2020), but not in Pennisetum purpureum ([160]Zhou et al., 2018), Liquidambar formosana ([161]Chen et al., 2020), and Rhododendron arboretum ([162]Sharma et al., 2020). By contrast, the lowest frequency of CG/CG motifs (18, 0.08%) of the di-nucleotide repeats, which might be explained by cytosine methylation inhibited gene transcription ([163]Chen et al., 2015; [164]Xing et al., 2017). The length of the microsatellite was one of the main factors affecting its polymorphism. According to [165]Temnykh et al. (2001), SSR polymorphism can be considered low, medium or high if length < 12 bp, 12 ≤ length < 20 bp, and length > 20 bp. In this study, SSR length between 12 and 20 bp (10638, 48.72%) accounted for the largest proportion. Therefore, it can be speculated that the microsatellites in P. chienii transcriptome may have a moderate level of polymorphism. In addition, the mutation rate of microsatellite is positively correlated with its length ([166]Calabrese and Sainudiin, 2005). In P. chienii, six microsatellite motifs were found to have significant differences in length (Kruskal–Wallis rank sum test, P < 2.2e^–16), and decreased with increasing motif size (Nemenyi test, P < 2.2e^–16). Longer microsatellites were expected to have a high mutation rate because of more chances of replication slippage ([167]Calabrese and Sainudiin, 2005). Therefore, tetra-, penta-, and hexa-nucleotide microsatellites might have higher mutation rates than those of the mono-, di-, and tri-nucleotide microsatellites. Microsatellite Distribution in Different Genic Regions Studies have shown that microsatellites are non-randomly distributed across protein-coding regions, UTRs, and introns ([168]Li et al., 2004). In this study, the number of microsatellites in UTRs was much higher than that in CDS regions. One possible reason was that microsatellites located in UTRs were subject to less selection pressure and evolutionary constraints, thus, the microsatellites located in this region was more prone to expansion. In contrast, due to the high selection pressure of the microsatellite in coding regions, the structure and function of the gene will be seriously damaged if “indel” mutation occurs in this region ([169]Levinson and Gutman, 1987; [170]Li et al., 2004). We also found that the microsatellites located in CDS regions were dominated by tri-nucleotides, while other SSR motifs accounted for a small proportion. This was common in many plants, such as Phoenix dactylifera L. ([171]Zhao et al., 2013), Tetraena mongolica ([172]Dang et al., 2020), and Paeonia lactiflora ([173]Wan et al., 2020). Compared to other repeat types, tri-nucleotide is less likely to frame-shift mutations in coding regions ([174]Metzgar et al., 2000). Although some studies have shown that the microsatellites in coding regions were subject to higher evolutionary constraints and thus exhibited a shorter length. For example, [175]Liu et al. (2016) found in a study on Sargassum thunbergii that microsatellites in UTR regions were much longer than those in CDS regions. However, our results showed that the microsatellites located in UTR regions had relatively short average length, which could be attributed to the large number of mono- and di-nucleotide microsatellites located in these regions. Potential Function of Unigenes Containing Microsatellites Microsatellites derived from the transcribed sequence might be directly related to gene function and thus might play an adaptive and evolutionary role in affecting gene products, inducing phenotypic changes, and regulating gene expression ([176]Li et al., 2004). In order to reveal the potential functions of microsatellites-containing unigene sequences, functional annotation and classification of these unigenes were carried out. GO functional annotations showed that a large number of unigenes containing microsatellites were assigned to “cellular process,” “metabolic process,” “cell,” “cell part,” “catalytic activity,” and “binding” terms, suggesting they might be related to the basal metabolism and life activities of P. chienii. Similarly, KEGG and KOG analysis indicated that SSRs-containing unigenes had a wide range of biological functions and were involved in various aspects of P. chienii growth and development. In addition, GO enrichment analysis showed that unigenes containing microsatellites were significantly enriched in terms related to “transcription,” similar to the reports in Onobrychis viciifolia ([177]Shen et al., 2019) and Siberian wildrye ([178]Zhou et al., 2016). Additionally, we found that some unigenes containing microsatellites were associated with transcription factors. TFs (Transcription Factors) regulate a lot of biological processes, including plant defense, response to a range of abiotic and biotic stresses, and tolerance to disease and environmental stress ([179]Singh et al., 2002). As an ancient relic plant, P. chienii might be disturbed by various adverse factors during the long evolutionary process. However, it still survived to the present day and has strong adaptability to the environment. From the point of view of molecular biology, the microsatellites identified in this study that are related to transcription factors might serve as important regulatory buttons for gene expression in the adaptive evolution of P. chienii. Likewise, KEGG enrichment analysis revealed that four pathways that were associated with gene expression regulation (ko03022, ko04140, ko04070, and ko04712) showed significant enrichment, indicating that they played an important role in the response to environmental stress. Additionally, there may be some differences in the functions of microsatellites in different tissues. However, we only analyzed the distribution and function of microsatellites in P. chienii transcriptome as a whole. [180]Ince et al. (2010b) studied microsatellite-containing gene expression profiles in 16 different tissues and at different developmental stages in pepper (Capsicum annuum L.), and found that tissue-specific genes contained more dinucleotide microsatellites, and housekeeping genes contained more trinucleotide microsatellites. Therefore, we can further compare the expression patterns of microsatellite-containing genes in four tissues of P. chienii in the future, so as to deepen the understanding of tissue-specific expression of microsatellites. Validation of Microsatellite Markers The 375 pairs of primers were randomly selected for validating the amplification quality and polymorphisms of the microsatellites in the present study. Among of them, only 77 (20.53%) produced clear bands with expected size. This low amplification success rate (20.53%) was comparable to that reported in some conifer species, such as Abies alba (24%) ([181]Postolache et al., 2014), Torreya grandis (29.25%) ([182]Zeng et al., 2018), and Pinus koraiensis (32%) ([183]Li et al., 2020). There were two main reasons for the low success rate of amplification in P. chienii. On the one hand, there were a large number of pseudogenes in conifer genomes, which may hinder the amplification of microsatellite markers. On the other hand, the highly repetitive sequences in conifer genomes, such as transposons, might affect the amplification efficiency of microsatellites ([184]Kovach et al., 2010). Besides, the presence of large introns or indels, assembly errors in the unigenes, and low specificity might cause the primer amplification failure ([185]Bazzo et al., 2018; [186]Niu et al., 2019). Genetic Diversity of P. chienii Populations Genetic diversity is the product of long-term evolution of a species or population and is a necessary condition for the survival and sustainable development of populations ([187]Zhang and Zhou, 2013). We validated the polymorphism of the 20 polymorphic microsatellite markers using four populations of P. chienii. A total of 128 alleles were detected at the 20 loci. Among of them, loci P. chienii-152 (with 16 alleles) and P. chienii-75 (with 14 alleles) showed a high level of polymorphism. Loci P. chienii-152 was highly homologous to an unknown protein of Picea sitchensis. Loci P. chienii-75 was associated with a gene encoding R2R3-MYB transcription factor, therefore, played an important regulatory role in P. chienii growth and development, as well as in response to abiotic and biotic stress. At the species level, there was a moderate level of genetic diversity (Na = 6.4, Ho = 0.349, He = 0.527, I = 1.071) for P. chienii, which was similar to the previous reports of [188]Kou et al. (2020) using nuclear loci (H[d] = 0.5400, π = 0.00265), but higher than that of [189]Su et al. (2009) by ISSR markers (h = 0.2118, I = 0.2390). In addition, compared with other conifer species such as Larix principis-rupprechtii Mayr (Na = 3.850, Ho = 0.487, He = 0.490) ([190]Dong et al., 2018) and Pinus koraiensis (Na = 6.45, Ho = 0.299, He = 0.311) ([191]Li et al., 2020), P. chienii has a slightly higher level of genetic diversity (Na = 6.4, Ho = 0.349, He = 0.527). The abundance of genetic variation at the species level can be attributed to internal and historical factors. As an ancient tertiary relic plant, P. chienii has accumulated a large number of genetic variations during the long-term evolution, so it is expected to maintain a high level of genetic diversity. Besides, the Central and Nanling regions of China were less affected by the Quaternary glacial and became the possible refugia of many relic plants ([192]Chen et al., 2011). We speculated that these refugia might be beneficial to the preservation of genetic variation in P. chienii. At the population level, all the four populations of P. chienii showed a low level of genetic diversity, which could be explained by its population size and distribution characteristics. The wild populations of P. chienii were few and scattered distributions, which limited the gene flow among populations and reduced genetic diversity. In addition, the small population size might result in the bias of genetic diversity in this study. We will sample more individuals of P. chienii to further explore its genetic variation in the future. Transferability of Microsatellite Markers In general, microsatellite markers developed from transcriptome data/ESTs showed a higher level of transferability in related species ([193]Ince et al., 2010a, [194]2014; [195]Yan et al., 2017). In the present study, 14 microsatellite markers could successfully amplify in A. argotaenia, and 12 of them were polymorphic in all A. argotaenia individuals. The transferability ratio was 70%, which was higher than the reported in Capsicum (32.11%) ([196]Ince et al., 2010c), and Elymus (49.11%) ([197]Zhou et al., 2016), indicating that the markers developed in the present study had good potential for cross-species amplification and could be used in future genetic studies of other Taxaceae species. Conclusion In this study, we identified and characterized the frequency, distribution, and function of microsatellites based on transcriptome data of P. chienii. The unigenes containing microsatellites had a wide range of biological functions, most of which were associated with basic metabolism, and a few might be involved in expression regulation of gene and response to environmental stress. These results provided new insights into the role of microsatellites in the transcriptome. In addition, twenty microsatellite markers were developed for P. chienii with high polymorphism and good transferability, which will facilitate the genetic study in P. chienii and other related species. Data Availability Statement The datasets presented in this study can be found in online repositories. The names of the repository/repositories and accession number(s) can be found below: [198]https://www. ncbi.nlm.nih.gov/genbank/, SRR11715801; [199]https://www.ncbi. nlm.nih.gov/genbank/, SRR11715800; [200]https://www.ncbi.nlm.nih.gov/genbank/, SRR11715799; [201]https://www.ncbi.nlm.nih.gov/genbank/, [202]SRR11715798. The EST sequences in this study have been submitted to GenBank with accession numbers [203]MT563329–[204]MT563392. Author Contributions YS and TW conceived and designed the study and revised the manuscript. RX and ZW performed the experiments, analyzed the data, and wrote the manuscript. All the authors contributed to the article and approved the submitted version. Conflict of Interest The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest. Acknowledgments