Abstract

   Background: Short tandem repeats (STRs) are highly variable elements
   that play a pivotal role in multiple genetic diseases and the
   regulation of gene expression. Long-read sequencing (LRS) offers a
   potential solution to genome-wide STR analysis. However, characterizing
   STRs in human genomes using LRS on a large population scale has not
   been reported.

   Methods: We conducted a large-scale LRS-based STR analysis in 193
   unrelated samples from the Chinese population and performed genome-wide
   profiling of STR variation in the human genome. The repeat dynamic
   index (RDI) was introduced to evaluate the variability of STRs. We
   sourced expression data from the Genotype-Tissue Expression (GTEx)
   project to explore the tissue specificity of genes related to highly
   variable STRs. Enrichment analyses were also conducted to identify
   potential functional roles of the highly variable STRs.

   Results: This study reports a large-scale analysis of human STR
   variation by LRS and offers a reference STR database based on the LRS
   dataset. We found that disease-associated STRs (dSTRs) and STRs
   associated with the expression of nearby genes (eSTRs) were highly
   variable in the general population. Moreover, tissue-specific
   expression analysis showed that genes related to highly variable STRs
   had their highest expression levels in brain tissues, and pathway
   enrichment analysis found that these STRs are involved in synaptic
   function-related pathways.

   Conclusion: Our study profiled the genome-wide landscape of STRs using
   LRS and highlighted the highly variable STRs in the human genome,
   providing a valuable resource for studying the role of STRs in human
   disease and complex traits.

   Keywords: short tandem repeats, long-read sequencing, highly variable
   STRs, TRcards, database, brain tissue, synaptic function

Introduction

   Short tandem repeats (STRs) are abundant repetitive elements composed
   of recurring DNA motifs of two to six bases. Due to their repetitive
   nature, STRs have the highest mutation rate in the genome and are
   typically polymorphic. They are often used in forensics and population
   genetics and are also the underlying cause of many genetic diseases
   ([78]Gymrek, 2017; [79]Hannan, 2018).
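
   The definition above (a motif of two to six bases repeated in tandem)
   can be illustrated with a minimal sketch. This is not the genotyping
   method used in this study; the function name find_strs and the
   min_copies threshold are hypothetical choices made only for
   illustration.

```python
import re


def find_strs(seq, min_copies=3):
    """Return (start, motif, copies) for each perfect tandem run of a 2-6 bp motif.

    min_copies is an illustrative threshold, not a value taken from the study.
    """
    # Lazy {2,6}? prefers the shortest motif; \1{n,} requires n further copies.
    pattern = re.compile(r"([ACGT]{2,6}?)\1{%d,}" % (min_copies - 1))
    hits = []
    for match in pattern.finditer(seq.upper()):
        motif = match.group(1)
        # Skip runs of a single base (e.g. "AA"), which are 1 bp repeats.
        if len(set(motif)) == 1:
            continue
        copies = len(match.group(0)) // len(motif)
        hits.append((match.start(), motif, copies))
    return hits


# Example input: five CAG copies flanked by non-repetitive sequence.
example = "TTGAC" + "CAG" * 5 + "TTGA"
print(find_strs(example))
```

   The sketch only finds perfect repeats. Real STR loci often contain
   interruptions and sequencing errors, which is why dedicated genotyping
   tools are used in practice.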

   STR expansions in the coding or non-coding regions are linked to more
   than 50 known disorders ([80]Depienne and Mandel, 2021). Many of these
   conditions affect the nervous system. Well-known examples of STR
   expansion diseases in protein-coding regions are the “polyglutamine”
   (PolyQ) diseases (e.g., Huntington disease and spinocerebellar ataxia),
   caused by variable stretches of the repeated trinucleotide CAG.
   Non-coding repeat expansions are even more diverse and can occur in
   the 5′ UTRs, introns, or 3′ UTRs of genes. Their impact strongly
   depends on the type, length, and location of the repeat motif within
   genes. Examples of these repeat disorders include fragile X syndrome
   (FXS), caused by CGG repeats, and myotonic dystrophy type 1 (DM1),
   caused by CTG
   repeats ([81]Tang et al., 2017; [82]Trost et al., 2020; [83]Depienne
   and Mandel, 2021).

   Recently, by leveraging deep whole-genome sequencing (WGS) and gene
   expression data collected by the Genotype-Tissue Expression (GTEx)
   Project, more than 28,000 STRs for which the number of repeats was
   associated with the expression of nearby genes, termed expression STRs
   (eSTRs), were identified in 17 tissues. These eSTRs were then ranked
   with a statistical fine-mapping framework to prioritize potentially
   causal eSTRs, 5% of which were referred to as fine-mapped eSTRs
   (FM-eSTRs) ([84]Fotsing et al., 2019).
   It is becoming increasingly clear that STRs across the genome are
   likely to have widespread contributions to complex polygenic traits. In
   these cases, smaller expansions or contractions may subtly increase or
   decrease the risk for a trait and work together to modulate an
   individual’s disease risk ([85]Gymrek et al., 2016; [86]Fotsing et al.,
   2019; [87]Jakubosky et al., 2020).

   Genome-wide surveys of STRs in individual genomes have become feasible
   due to the development of high-throughput sequencing technologies. Most
   studies used whole-genome sequence data based on short-read sequencing
   (SRS) to genotype STRs ([88]Willems et al., 2014; [89]Tang et al.,
   2017; [90]Mousavi et al., 2019; [91]Trost et al., 2020; [92]Mitra et
   al., 2021). However, the intrinsic limitations of SRS prevent the
   comprehensive characterization of all STRs and the discovery of novel
   disease-relevant repeat expansions that are longer than the read
   length ([93]Gymrek, 2017; [94]Liu et al., 2020).

   Long-read sequencing (LRS) technologies offer a good solution to
   genome-wide STR analysis. Current LRS technologies, such as Pacific
   Biosciences sequencing and Oxford Nanopore Technologies (ONT)
   sequencing, have achieved average read lengths above 10 kb, so reads
   are likely to span whole tandem repeats together with their flanking
   unique sequences ([95]Pollard et al., 2018; [96]Midha et al., 2019;
   [97]Amarasinghe et al., 2020; [98]Logsdon et al., 2020). LRS has
   recently been applied to genotype long and complex repeats, such as the
   C9orf72 GGGGCC expansion implicated in frontotemporal lobar
   degeneration and a complex pentamer repeat in SAMD12 implicated in
   myoclonus epilepsy ([99]Zeng et al., 2019; [100]Mitsuhashi and
   Matsumoto, 2020; [101]DeJesus-Hernandez et al., 2021). Additional
   human diseases caused by STR expansions have also been reported in
   recent studies using LRS ([102]Sone et al., 2019;
   [103]Tian et al., 2019; [104]Zeng et al., 2019; [105]Deng et al.,
   2020).
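
   The contrast between short and long reads drawn above comes down to
   whether a single read can cover a repeat together with unique sequence
   on both sides. A minimal sketch of that check follows; the function
   name, the half-open coordinate convention, and the 50 bp flank are
   assumptions chosen for illustration and are not taken from this study.

```python
def read_spans_repeat(read_start, read_end, repeat_start, repeat_end, flank=50):
    """Return True if an aligned read covers the repeat plus `flank` bp on each side.

    Coordinates are reference positions in half-open form [start, end).
    The default flank of 50 bp is a hypothetical value.
    """
    return (read_start <= repeat_start - flank
            and read_end >= repeat_end + flank)


# A 3 kb expansion at positions 10,000-13,000 on the reference.
repeat = (10_000, 13_000)

# A 150 bp short read lying inside the repeat cannot span it.
print(read_spans_repeat(10_100, 10_250, *repeat))

# A 10 kb long read starting 2 kb upstream spans the repeat and both flanks.
print(read_spans_repeat(8_000, 18_000, *repeat))
```

   Only reads that pass such a check carry the full repeat length, which
   is why read length limits the expansions that can be sized directly.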

   The normal ranges of different STRs may vary significantly in the
   general population. Thus, knowledge of the normal repeat ranges of
   STRs is critically important for determining the pathogenicity of
   observed repeats in known STRs and for discovering novel
   disease-relevant repeat expansions ([106]Liu et al., 2020). To the best
   of our knowledge, although studies have detected and characterized
   STRs in human genomes using LRS on selected small datasets, analysis at
   a large population scale has not been reported ([107]Liu et al., 2020).

   Herein, we conducted a large-scale analysis of human STR variation by
   LRS in the Chinese population and developed a reference STR database,
   named TRcards, based on LRS data from 193 individuals. In addition, we
   performed genome-wide profiling of STR variation in the human genome
   with LRS data, evaluated the variability of STRs, and characterized
   the highly variable STRs.

Materials and Methods

Participants

   A total of 193 unrelated Chinese individuals were included in our
   study for ONT sequencing. Of these, 102 (52.85%) were male and 91
   (47.15%) were female. The ages ranged from 26 to 85 years, with a
   median age of 50 years. This study was approved by the Ethics Committee
   of Xiangya Hospital, Central South University. All participants gave
   informed consent.

Long-Read Whole-Genome Sequencing

   DNA samples were isolated from whole blood and sequenced on a
   PromethION sequencer (Oxford Nanopore Technologies). Library
   preparation was carried out using a 1D Genomic DNA ligation kit
   (SQK-LSK109) according to the manufacturer’s protocol. For each
   individual, one PRO-002 (R9.4.1) flow cell was used. Base-calling of
   the PromethION data was performed using Guppy v3.3.0 (Oxford Nanopore
   Technologies), and only pass reads (Qscore ≥7) were used for
   subsequent analysis ([108]Sun et al., 2020).
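
   The pass-read criterion (Qscore ≥7) can be sketched as follows. The
   sketch assumes that the read-level Q-score is derived from the mean
   per-base error probability of a Phred+33 quality string, which is how
   it is commonly computed for ONT reads; the exact calculation inside
   the base-caller may differ, and the function names are hypothetical.

```python
import math


def mean_qscore(quality_string, offset=33):
    """Read-level Q-score from a FASTQ quality string (Phred+33 assumed).

    Per-base qualities are converted to error probabilities, averaged,
    and converted back to the Phred scale.
    """
    if not quality_string:
        raise ValueError("empty quality string")
    probs = [10 ** (-(ord(ch) - offset) / 10) for ch in quality_string]
    return -10 * math.log10(sum(probs) / len(probs))


def is_pass_read(quality_string, min_qscore=7.0):
    """Apply the Qscore >= 7 threshold stated in the text."""
    return mean_qscore(quality_string) >= min_qscore


# "5" encodes Q20 and "#" encodes Q2 in Phred+33.
print(is_pass_read("5555"))   # uniformly high-quality read
print(is_pass_read("####"))   # uniformly low-quality read
```

   Averaging error probabilities rather than Q values means that a few
   very poor bases lower the read-level score more than a simple mean of
   the quality characters would suggest.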

   Sample LNT00178 was also sequenced with the PacBio Sequel II platform.
   High molecular weight (HMW) DNA was extracted, and HiFi libraries were
   constructed using the SMRTbell Express Template Prep Kit v2 and
   SMRTbell Enzyme Clean Up Kit (PacBio) ([109]Du et al., 2021). Size
   selection was performed with SageELF and 15 kb fragments were chosen
   for sequencing with the Sequel II platform using 30 h movies. Then, the
   resulting raw subreads were converted to circular consensus sequencing
   (CCS) reads using the CCS v4.2 algorithm with --minPasses 3
   --minPredictedAccuracy 0.99. Furthermore, ONT data for HG002 and the
   corresponding PacBio CCS data were downloaded from
   [110]https://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/data/Ashk