Abstract

   RNA secondary structures play essential roles in the formation of the
   tertiary structure and function of a transcript. Recent genome-wide
   studies highlight significant potential for RNA structures in the
   mammalian genome. However, a major challenge is assigning functional
   roles to these structured RNAs. In this study, we conduct a
   guilt-by-association analysis of clusters of computationally predicted
   conserved RNA structure (CRSs) in human untranslated regions (UTRs) to
   associate them with gene functions. We filtered a broad pool of ∼500
   000 human CRSs for UTR overlap, resulting in 4734 and 24 754 CRSs from
   the 5′ and 3′ UTR of protein-coding genes, respectively. We separately
   clustered these CRSs for both sets using RNAscClust, obtaining 793 and
   2403 clusters, each containing an average of five CRSs per cluster. We
   identified overrepresented binding sites for 60 and 43 RNA-binding
   proteins co-localizing with the clustered CRSs. Furthermore, 104 and
   441 clusters from the 5′ and 3′ UTRs, respectively, showed enrichment
   for various Gene Ontologies, including biological processes such as
   ‘signal transduction’, ‘nervous system development’, molecular
   functions like ‘transferase activity’ and the cellular components such
   as ‘synapse’ among others. Our study shows that significant functional
   insights can be gained by clustering RNA structures based on their
   structural characteristics.

Introduction

   RNA secondary structures are integral to the maturation, regulation and
   function of all transcripts including mRNA. Numerous regulatory
   structure elements are located in the 5′ and 3′ UTRs of mRNAs ([37]1).
   Some of the well known examples include: the iron response elements
   (IRE) involved in maintaining the cellular iron content ([38]2), the
   gamma interferon inhibitor of translation (GAIT) element that is
   involved in limiting the cellular immune response ([39]3), Histone 3′
   UTR stem-loop required for cell cycle regulation of histone gene
   expression ([40]4), and the internal ribosome entry sites (IRES) for
   the translation initiation in cap-independent manner ([41]5). A typical
   feature of many such functionally important structural elements is that
   they are evolutionarily conserved, wherein the course of evolution
   brings the compensatory mutations in the primary sequences that still
   support the base-pairs preserving the underlying functional RNA
   structure.

   To date, different computational methods have been applied for
   genome-wide prediction of evolutionary conserved secondary structure
   (CRS) on sequence-based multiple alignments, for examples RNAz ([42]6)
   and EvoFold ([43]7). Although these approaches are efficient for the
   genome-wide screens, their design on using pre-aligned sequences as
   input with predefined fixed window sizes makes them sensitive to both
   misalignments and structure predicted on incomplete sequence depending
   on the chosen window size. Given that, an RNA structure can be more
   conserved than its primary sequence it becomes desirable to
   simultaneously predict an alignment and a CRS of the RNA sequences. The
   first such approach to simultaneous sequence and secondary structure
   alignment was proposed by David Sankoff ([44]8).

   The Sankoff algorithm has been implemented in programs like FOLDALIGN
   ([45]9) and LocARNA ([46]10). A more optimal strategy, but also more
   computationally expensive, for the genome-wide prediction of CRSs is to
   use the multiple genome-wide alignments as an indication of similarity.
   Subsequently, re-align in both sequence and structure simultaneously
   ([47]11,[48]12). One way to do this, is to build a structural alignment
   iteratively by building a revised model every time a sequence was
   matched to the previous model. This is realized in the expectation
   maximization (EM)-based approach in CMfinder ([49]13). CMfinder is not
   constrained by the initial multiple sequence or genome alignment or by
   predefined window sizes, instead it uses the initial alignment to infer
   the putatively orthologous sequences and perform the local structural
   alignment while discarding apparently irrelevant ones. However,
   CMfinder has its own caveats as well by providing mainly shorter motifs
   ([50]14).

   Here, we focus on the previous CMfinder screen in multiple genomes of
   the 17 vertebrates that was subsequently extended to the 100-species
   tree by aligning to the original models. This resulted in a vast
   compendium in ∼500 000 CRS regions (genomic regions of overlapping
   CRSs) collectively corresponding to ∼2% of the human input sequence
   taken from the MULTIZ alignments ([51]14). That study revealed
   enrichment location of CRSs and range annotations, including UTRs,
   transcriptional regulatory regions, intron and RNA-binding protein
   (RBP) binding sites. However, it has not been studied if some of the
   CRS located in genes with same biotypes also make up the same
   structural motif. Here, we address precisely this challenge.

   To our knowledge, only a few studies have made attempts to report the
   clusters of evolutionary conserved RNA structure families using the
   genome-wide structure prediction sets. Most of them implemented EvoFold
   or RNAz based prediction of conserved RNA structures, followed by the
   classication of the structures considering their genomic location or
   distribution across different gene biotypes ([52]7,[53]12), or by
   performing hierarchical clustering based on the all-vs-all pairwise
   distances using the pairwise alignment tool LocaRNA ([54]10). The
   limiting factors for these studies are that they are either dependent
   upon the initial sequence alignments, prefixed window sizes, or the
   hierarchical clustering does not have efficient run-time performance
   for clustering the genome-wide scale of predicted structures. To
   overcome the limitations of the run-time performance, a heuristic
   pairwise sequence alignment algorithm called DotAligner ([55]15) was
   introduced that requires the pre-computed RNA dot plots to perform the
   alignments. Using this tool one could perform the clustering of large
   set of single sequences, but not the evolutionary conserved structures,
   as it leverages the diversity of suboptimal structures from a partition
   function of RNA alignments to identify an optimal sequence-structure
   alignment of two RNAs.

   The alternative approaches to these methods include the use of CMFinder
   ([56]13) for motif discovery of RNA structures among the set of
   unaligned sequences, it has been previously applied across the
   orthologous sequences of bacterial homologous genes ([57]16,[58]17).
   Another method called NoFold ([59]18) clusters single query sequences
   based on constructing a distance function to compare against the
   empirical models to map RNA sequences to a structural feature space.
   GraphClust ([60]19,[61]20), is another alignment-free approach that
   decomposes RNA structures into graph-encoded features to identify the
   common structure motifs, however the pipeline works on single sequences
   and clusters paralogs. Extending over single sequence clustering, the
   EvoFam comparative method introduced the clustering of the EvoFold
   ([62]7) predictions, however the EvoFold predictions were itself based
   on sequence alignments with limited degree of sequence variation.

   In this contribution, we use our CMFinder screen and focus specifically
   on the CRSs in the UTRs of the protein-coding genes and perform their
   clustering using a RNAscClust ([63]21), an extension of GraphClust that
   utilizes the evolutionary signatures of RNA structures as an additional
   classification feature for clustering. The advantage of using
   RNAscClust over the other clustering methods is that it makes it
   possible to also search for the paralog CRSs including those that have
   less sequentially conserved structured RNAs and may not be captured in
   the initial structural alignments. Here, we present the clusters of
   CRSs obtained from RNAscClust, and associate the clustered CRSs with
   common structural features to their potential functional roles based on
   the functional enrichment analysis by taking advantage of the host gene
   associated Gene Ontology (GO) terms into account. Additionally, we
   present the results from the cluster specific enrichment for the
   binding sites for RNA-binding proteins (RBPs). We also propose a
   prefiltering step for the input set of CRSs from a genome-wide screen,
   as well as an iterative post-cluster processing step (Figure [64]1) to
   obtain better clusters from the RNAscClust.

Figure 1.

   [65]Figure 1.
   [66]Open in a new tab

   Overview of the CRS clustering workow. We implemented the following
   main tools: PETfold (3,6); RNAlib - ViennaRNA package (4,6); RNAscClust
   (5); LocaRNA (6).

Materials and methods

Workflow

   To identify recurring and functionally significant structural elements
   in the UTRs of protein-coding genes, we employed our CMfinder based
   predicted catalog of 773 850 CRS alignments. These alignments cover 515
   506 CRS regions in the human MULTIZ alignments of 16 vertebrates
   relative to the human genome (accessible at
   [67]http://hgdownload.cse.ucsc.edu/goldenPath/hg18/multiz17way/). To
   generate this extensive catalog of CRS, Seemann et al. ([68]14)
   extended the predictions to the 100 species from the multiple
   alignments of 99 vertebrate genomes with human
   ([69]http://hgdownload.soe.ucsc.edu/goldenPath/hg38/multiz100way/) by
   mapping human hg38 coordinates to orthologous regions in each of the
   other 99 vertebrate genomes first by using liftOver ([70]22) and
   subsequently searching these sequences (including the original 17
   species) for hits using CRS covariance models with CMsearch ([71]23).
   For the purpose of clustering, the 100-species CRS alignments were
   again reduced to those sequences derived from the species present in
   the hg18 17-species alignment. This reduction was implemented to
   include a diverse set of genomes with phylogenetic variation in the
   clustering process, while still utilizing the most up-to-date genome
   assemblies. To illustrate, within the 100-species alignment, there are
   12 primate sequences that are expected to be highly similar to each
   other. As a result, these sequences contribute minimal co-variation
   information to the RNAscClust based clustering process. The resultant
   alignment encompassed various species, including human, mouse, and
   zebrafish. The workflow (Figure [72]1) utilized for the clustering of
   CRS is outlined below.

Selection of CRSs overlapping 5′ and 3′ UTRs

   We specifically chose CRSs that exhibited a minimum overlap of at least
   50% of their length with the annotated UTRs obtained from GENCODE v33
   ([73]24) in the human genome. In most cases, the CRSs were considerably
   shorter than the corresponding UTRs and were completely contained
   within the UTR regions. The shortest CRSs we selected had to encompass
   at least 40 nucleotides (nt) of human sequence in the alignment,
   ensuring they were long enough to form stable hairpin structures. To
   extract UTR annotations from the GTF format annotation file and
   calculate the overlap between the CRSs and UTRs, we utilized in-house
   scripts that implemented functions from the R packages GenomicFeatures
   and data.table

CRSs filter based on FDR and G+C content

   We further applied a filter to the CRSs, requiring them to have a G+C
   content ranging from 25% to 65%. This range has been identified as
   optimal in our original screen for achieving a mean false discovery
   rate (FDR) of < 20% ([74]14). The resulting set of CRSs, referred to as
   Input-set1, was used as input for clustering using the RNAscClust
   pipeline. As a first-hand approach, the CRSs in Input-set1 were used as
   is in their current orientation, and no reverse complement of CRSs was
   applied. This pipeline internally utilizes PETfold ([75]25), which
   considers evolutionary and thermodynamic information to predict
   consensus secondary structures. The identified conserved base pairs
   from the alignments are then used as constraints for predicting the
   secondary structure of the human sequence using RNAfold ([76]26),
   enabling the projection of conserved base pairs onto the sequence.

PETfold structure prediction (outside RNAscClust pipeline)

   We used the CRS alignments filtered based on the FDR and G+C content
   directly as the input to the PETfold based consensus secondary
   structure prediction. For this, we employed the default value of -p
   0.90 for the PETfold reliability cutoff, but allowed relaxed -g 0.50
   cutoff for the maximum percent of gaps in alignment column. The
   predicted consensus structure is next adjusted according to the human
   sequences by removing the columns with all gaps using an in-house R
   script, and stockholm format alignment files containing the adjusted
   consensus structures were generated to be used as input for RNAscClust.
   Doing this outside the RNAscClust pipeline gives us an opportunity to
   compare the overlapping consensus structures in the CRS region, and
   select the non-redundant representative CRSs from each CRS region.

Selection of representative CRSs

   For the selection of representative CRSs with sufficiently dissimilar
   structure from each CRS region, we first sorted the CRSs in descending
   order of total base-pairs (bp) and then by ascending order of their
   lengths. Next, we selected the CRS top in the order and compared its
   structural similarity with each of the CRS down in the order
   iteratively. To determine if the two CRSs in comparison are
   structurally similar to each other, we calculated a similarity score
   (S[A] and S[B] as shown in the below expression). For example, given
   CRSs A (top in the order) and B (next in order) from the same CRS
   region, their similarity scores are calculated as
   graphic file with name M0001.gif

   where, ‘# Identical bp’ refers to the count of identical base pairs
   present in the CRSs, while ‘# bp’ represent the total base pairs in A
   and B. If both S[A] and S[B] are sufficiently low it indicates that the
   CRSs are sufficiently dissimilar. Here we employ a cut-off of 40% for
   adding CRS B to the selection list. In the subsequent iterations, CRS A
   and B are compared to CRS C and so on, continuing until the last CRS in
   the CRS region is reached. If any CRS has a higher similarity score, it
   is discarded, ensuring that the selected CRSs accurately represent the
   CRS region without redundancy. This method is implemented using a
   Python script that utilizes functions from RNAlib-2.4.18 ([77]26) to
   determine identical base pairs. The chosen CRSs serve as input
   (Input-set2) for RNAscClust, utilizing the -structure-is-given option
   to prevent recomputation of the consensus structure and constraint
   folding of the sequences (as discussed in section, ‘CRSs filter based
   on FDR and G+C content’). Again, we did not consider the reverse
   complementary version of the CRSs. Note that a CRS region often consist
   of multiple individual CRSs on both strands. Obtaining the reverse
   complements require careful consideration of e.g. wobble pairs and full
   scale statistical analysis, which is beyond the scope of this work. The
   percentage of sequence identity, GC content fraction and the number of
   sequences in the Input-set1 and Input-set2 CRSs can be seen in
   [78]Supplementary Figure S1.

CRS clustering based on RNAscClust

   We installed and configured RNAscClust 1.1.1, including all the
   necessary software dependencies (please refer to [79]Supplementary
   Table S3), on a local system. In order to run RNAscClust on a computing
   cluster, we made specific modifications to certain pipeline scripts,
   adapting them to work with the Slurm Workload Manager, which was
   originally designed for the Sun Grid Engine queuing system. The default
   parameter settings were used for all the required tools. For the
   GraphClust module as well, we maintained the default parameter values,
   but we increased the number of iterative clustering rounds to 45. This
   adjustment aimed to assign as many input CRSs as possible to clusters.
   The minimum cluster size parameter was set to 3 (configuration file
   containing these settings is available in [80]Supplementary File
   10,[81]11). Clustering was carried out separately after both step 2 and
   step 4 of the workflow described in the study (Figure [82]1). The
   parameter settings remained consistent in both cases, with the only
   difference being that in the latter, the input alignments were limited
   to the selected representative CRSs. These alignments were in Stockholm
   file format that included pre-computed PETfold predicted consensus
   structures, which were used for constraint folding within the
   RNAscClust pipeline. The commands used for executing the RNAscClust
   pipeline is shown in [83]Supplementary Note.

Post-cluster processing

   After running RNAscClust, we obtained clusters of paralog sequences
   that had been globally aligned using LocARNA ([84]10). To analyze these
   clusters further, we utilized PETfold to predict secondary structures
   based on the LocARNA alignments. We collected cluster-specific
   statistics, such as the number of sequences in the alignment, the
   fraction of each sequence that was paired or contained gaps, and the
   median sequence length per cluster. This information was gathered using
   a custom Python script we developed in-house. To measure the spread of
   sequences within each cluster, we plotted the standard deviation of the
   sequence length. In order to eliminate any outlier sequences that were
   either too long or too short compared to the median sequence length, we
   referred to the distribution of standard deviation of the length of CRS
   sequences per cluster (Figure [85]3 C, D), followed by manual
   inspection of the outlier clusters. Based on this observation we set a
   threshold for the range of sequence length (median ± 10 nt). Sequences
   with lengths outside of this range were removed from the cluster. The
   remaining CRSs (Clustered RNA sequences) were then subjected to another
   round of alignment using LocARNA, followed by secondary structure
   prediction using PETfold. We once again collected statistics on the
   aligned sequences, focusing on the fraction of the sequence that was
   paired or contained gaps. For this, we again referred to the
   distribution of fraction of paired nucleotides and the fraction of
   ungapped sequence per cluster (Figure [86]3A, [87]B), followed by the
   manual inspection of the outlier clusters. Based on the observation
   made we set a threshold on the fraction of unpaired nucleotides to at
   least 50%. We selected these thresholds as it will exclude only the
   outlier CRS in the cluster while retaining the large fraction of the
   data. We repeated the process of realigning the CRSs and predicting
   consensus structures after excluding sequences with more than 50%
   unpaired nucleotides in the alignment. Finally, considering the
   distribution of the fraction of paired nucleotide in the consensus
   structure ([88]Supplementary Figure S4), we selected only those
   clusters that contained at least 20% paired nucleotides and a minimum
   of three CRSs, ensuring the presence of RNA structure for further
   functional enrichment analysis.

Figure 3.

   [89]Figure 3.
   [90]Open in a new tab

   Quality of clustered CRSs from (A) 5′ UTRs and (B) 3′ UTRs. The x-axis
   represents the fraction of the ungapped clustered CRS sequence that is
   assigned to at least one cluster. The y-axis represent the fraction of
   paired nucleotide in the clustered CRS sequence. Each dot in the
   scatter plot represent a CRS. The color of the dots represent the
   length of the clustered CRS aignment block. The pie chart shows the
   fraction of CRSs that belong to an alignment of certain length range.
   The table shows the number of clusters and the total CRSs that belong
   to these clusters. In panel (C) and (D), we see the distribution of the
   standard deviation of the length of the CRS sequence per cluster from
   the 5′ and 3′ UTR respectively.

Recovery of known structure families from Rfam and EvoFam

   In order to determine whether we could identify the known Rfam
   structure families within the clusters of CRSs derived from UTRs, we
   obtained the Rfam 14.4 data, which provides genomic coordinates and
   family annotations for the structure elements. This data was downloaded
   from the Rfam ftp site:
   [91]https://ftp.ebi.ac.uk/pub/databases/Rfam/14.4/genome_browser_hub/ho
   mo_sapiens/ ([92]Supplementary Note). To establish a connection between
   the human genomic coordinates of the structure elements and our input
   set of CRSs for clustering, we made an intersection, requiring that at
   least 50% of the Rfam structures overlap the CRSs. Furthermore, we
   filtered the families based on their evidence record in the Rfam
   database, with a focus on those that were recognized as specific 5′ and
   3′ UTR structure elements ([93]Supplementary Tables S1 and [94]S2). If
   multiple structure elements from the same Rfam family overlapped with
   multiple CRSs, we anticipated that these Rfam families would be
   detected within our clustered set of CRSs.

   For EvoFam annotations, we sourced them from the following location:
   [95]http://moma.ki.au.dk/prj/mammals/all_UTR_with_paralog_30012010_thre
   sh1.0/. Initially, the genomic coordinates for EvoFam families were
   based on the hg19 build. To align them with our CRSs, we converted the
   coordinates to hg38 using liftOver ([96]27). Subsequently, we
   intersected these updated coordinates with our CRSs, considering them
   overlapping if at least 50% of the EvoFam structures coincided with the
   CRS. Similar to Rfam, if multiple structure elements from the same
   EvoFam family overlapped with multiple CRSs, we expected them to be
   detected in our clustered set of CRSs.

Gene Ontology (GO) and Pathway overrepresentation analysis

   In order to classify the clusters based on their functional
   characteristics, we conducted a functional enrichment analysis
   utilizing the GO and Pathway annotations of protein-coding genes that
   host the CRSs in their UTR region. We obtained the GO annotations from
   the GO Consortium ([97]25) through ensembl (Ensembl v106) biomaRt web
   services ([98]28). To evaluate the cluster-specific overrepresentation
   of GO terms, we compared the proportion of genes in a CRS cluster of
   interest that were annotated with a specific GO term (foreground), to
   the proportion of all other protein-coding genes that contained CRSs
   assigned to at least one cluster and were annotated with the
   corresponding GO term (background). Explicitly, we constructed a 2 × 2
   contingency table for this analysis. The table included the number of
   genes in the CRS cluster of interest that were annotated with a
   specific GO term of interest in row 1 and column 1 (foreground). Row 2,
   column 1 represented the total count of genes annotated with the same
   GO term and were assigned to at least one cluster (background). Column
   2, row 1 represented the total count of genes within the foreground
   cluster of interest that had at least one GO annotation, excluding the
   GO term of interest. Finally, Column 2, row 2 represented the total
   count of all genes that belonged to at least one CRS cluster and had a
   GO annotation, excluding the GO term of interest. Only clusters with
   three or more distinct genes were included in this analysis. We applied
   one-sided Fisher’s exact test to compute the P-values. To account for
   multiple comparisons within each cluster, the P-values were adjusted
   (P[adj]) using the Benjamini and Hochberg (BH) method, implemented in
   an in-house R script. Additionally, we computed the Fold Change (FC)
   difference by dividing the foreground proportion by the background
   proportion. A GO term was considered overrepresented in a cluster if
   there were at least 3 genes annotated with that term, and if the P[adj]
   was < 0.05 and the FC was > 1.5.

   For pathway enrichment analysis, we downloaded pathway annotation data
   from the Reactome (v80) ([99]29) and KEGG (v102.0) ([100]30,[101]31)
   pathway databases using biomaRt ([102]32) and their respective data
   download links ([103]https://reactome.org/download-data;
   [104]https://rest.kegg.jp/link/hsa/pathway). We assessed the
   overrepresentation of pathways in each cluster using the one-sided
   Fisher’s exact test once again. Explicitly, we again constructed a
   2 × 2 contingency table for this analysis. Where the table included the
   number of genes in the CRS cluster of interest that were annotated with
   a specific pathway of interest in row 1 and column 1 (foreground). Row
   2, column 1 represented the total count of genes annotated with the
   same pathway and were assigned to at least one cluster (background).
   Column 2, row 1 represented the total count of genes within the
   foreground cluster of interest that had at least one pathway
   annotation, excluding the pathway of interest. Finally, Column 2, row 2
   represented the total count of all genes that belonged to at least one
   CRS cluster and had a pathway annotation, excluding the pathway of
   interest. A pathway was considered overrepresented in a cluster if
   there were at least 3 genes annotated with that pathway, and if the
   P[adj] was <0.05 and the FC was > 1.5.

RBP binding site coverage enrichment and overrepresentation analysis

   RNA structures are well known to interact with proteins ([105]33), to
   investigate if our set of clustered CRSs are the binding sites for
   specific RBPs, we acquired the data on RBP binding sites from the
   ENCODE project phase III, which offers a comprehensive map of human
   RBPs’ binding and functional characteristics ([106]34). This dataset
   incorporates information from various assays, including the enhanced
   CLIP (eCLIP) assay, which examines the in vivo binding activity of 150
   RBPs. We focused on the peaks that were consistently identified in both
   biological replicates for our analysis. To retrieve the eCLIP RBP
   binding site data from ENCODE, we utilized a custom R script that
   leverages functions from the ENCODExplorer package and performed a
   batch download using the eCLIP experiment accession IDs. The downloaded
   data were in BED6+4 format, also known as the ENCODE narrowPeak bed
   format, which includes the identified peaks of signal enrichment based
   on pooled and normalized data (link for more information:
   [107]http://genome.ucsc.edu/FAQ/FAQformat.html#format12).

   To determine the overlap between the RBP binding sites and our set of
   CRSs, we employed the bedtools software ([108]35). We required that the
   entire binding site to overlap with the CRSs. Next, we conducted a
   nucleotide-level coverage enrichment analysis in three different
   sets: (i) all CRSs, (ii) CRSs overlapping UTRs and (iii) the clustered
   CRSs from UTRs. For each set, we compared the proportion of nucleotides
   covered by RBP binding sites with the proportion of nucleotides covered
   by RBP binding sites in the UTRs. To make these comparisons, we
   employed the two-proportion one-sided Z-Test that returns the value of
   Pearson’s chi-squared test statistic and a P-value which is useful to
   infer whether the coverage of RBP binding sites were significantly
   higher in these three sets compared to entire UTRs.

   Furthermore, we examined the cluster-specific overrepresentation of RBP
   binding sites by assessing comparison between the proportions of CRSs
   within a cluster that co-localize with the binding site of a given RBP
   (foreground), and all CRSs in our input set used for clustering that
   co-localize with the binding site of that RBP (background). Explicitly,
   we constructed a 2 × 2 contingency table for this analysis. The table
   included the number of CRSs in the CRS cluster of interest that
   co-localize with the binding site of a RBP of interest in row 1 and
   column 1 (foreground). Row 2, column 1 represented the total count of
   CRSs in the input set that co-localize with the binding site of that
   RBP. Column 2, row 1 represented the total count of CRSs in the
   foreground cluster of interest that co-localize with the binding site
   of at least one RBP, excluding the binding site of the RBP of interest.
   Finally, Column 2, row 2 represented the total count of all CRSs in the
   input set that co-localize with the binding site of at least one RBP,
   excluding the binding site of a RBP of interest. We applied one-sided
   Fisher’s exact test to compute the P-values. To account for multiple
   comparisons within each cluster we adjusted the P-values using the BH
   method. The FC value was computed by dividing the foreground proportion
   by the background proportion. The statistical tests were performed
   using an in-house R script.

Results

Pre-processing the input CRSs

   To extract CRSs for the clustering, we first selected CRSs overlapping
   human UTRs with relatively relaxed FDR threshold (<20%), to retain a
   large pool of CRSs for clustering. Subsequently, we excluded clusters
   with average FDR >15%. This approach enabled us to include instances
   with a low FDR, which could contribute to an overall cluster. In total,
   we acquired 6285 and 35 573 CRSs (Input-set1) from the 5′ and 3′ UTRs,
   respectively. The difference in the total number of CRSs in the 5′ and
   3′ UTRs is primarily attributed to two factors: (i) the discrepancy in
   their lengths, as the 5′ UTRs are generally shorter than the 3′ UTRs
   ([109]36) and (ii) the variation in G+C content between the 5′ and 3′
   UTR sequences, with the 5′ UTRs having a higher G+C content ([110]1),
   resulting in a higher FDR in CRSs from the 5′ UTR.

   The set of CRSs we have chosen may consist of multiple instances that
   overlap with each other by at least 1 nucleotide. These overlapping
   instances are referred to as CRS regions (genomic regions of
   overlapping CRSs) ([111]14). In fact, the aforementioned selected set
   of CRSs corresponds to a total of 4061 CRS regions in the 5′ UTRs and
   20 667 CRS regions in the 3′ UTRs. If there is large overlap among
   multiple CRSs, it indicates the presence of multiple potential foldings
   within that region, as predicted by CMfinder (demonstrated in
   Figure [112]2A, [113]B). The overlapping CRSs can extend both upstream
   and downstream of the overlapping region. To prevent clustering of CRSs
   originating from the same CRS region, which can occur because the
   shared structure element also contribute as the immediate neighboring
   subgraphs in the RNAscClust pipeline, a pre-filtering step is performed
   on the input set of CRSs. This pre-filtering ensures that each CRS
   element within a region is unique, as explained in the ‘Selection of
   CRSs overlapping 5′ and 3′ UTRs’ section of the methods. Following the
   filtering process for the overlapping CRSs, a total of 4734 and 24 754
   representative CRSs were obtained from the 5′ and 3′ UTR, respectively,
   that were used as the Input-set2 (Table [114]1). Note that we obtain
   more CRSs than regions in the 5′ and 3′ UTRs respectively. This is due
   to the filtering scheme which allow for structurally dissimilar CRSs
   within the same region.

Figure 2.

   [115]Figure 2.
   [116]Open in a new tab

   CRS region in UTRs with multiple overlapping CRSs. (A) An example of
   CRS region with multiple overlapping CRSs. A UCSC browser graphic
   showing 3′ UTR region in the RASD1 gene locus (top). The CRS track
   (available from
   [117]https://rth.dk/resources/rnannotator/crs/vert/v2.1/) shows four
   CRSs that overlap each other, representing a CRS region. (B) The
   overlapping CRSs in this region largely share the common structure
   elements. We see, the CRS M1725894 overlap completely with M1725895 and
   share the common secondary structure highlighted with blue outline.
   Similarly, the CRSs M1725895 and M1725899 share the green outlined
   structure element, and the CRSs M1725899 and M1725896 has the common
   structure element highlighted in red. Here, M1725895 and M1725896 alone
   are the representative CRSs for the region with non-redundant structure
   information.

Table 1.

   Summary of the CRS clustering Input-set1: Total CRSs after filtering
   for the FDR and G+C content (Materials and methods); input-set2: It is
   subset of Input-set1, and contains only the selected representative set
   of CRSs; UCRS: unique CRS region; SCI: structure conservation index
                                     5′ UTR                3′ UTR
                              Input-set1 Input-set2 Input-set1 Input-set2
   Total CRSs                 6285       4734       35 573     24 754
   Avg. CRS length (in nt)    83.48      82.57      85.50      84.37
   CRSs assigned to a cluster 6164       4635       24 109     17 776
   Total Clusters             1200       975        3604       2994
   Avg. CRSs in cluster       5.22       4.76       6.72       5.73
   Avg. genes in cluster      4.59       4.73       5.31       5.67
   Clusters with UCRS         372        523        532        1491
   [118]Open in a new tab

Clustering CRSs located in human 5′ and 3′ UTRs respectively

   Using RNAscClust we clustered separately the CRSs from the 5′ and 3′
   UTRs in Input-set1 and Input-set2, respectively such that the clusters
   contained a minimum of 3 CRSs (see Materials and methods, ‘CRS
   clustering based on RNAscClust’ for details). For the 5′ UTR, we
   obtained 1200 and 975 clusters, and for the 3′ UTR, we obtained 3604
   and 2994 clusters, respectively, using Input-set1 and Input-set2. Since
   Input-set2 contained non-redundant structure elements, we anticipated
   that clustering would group more similar structure elements from
   different genomic loci compared to Input-set1. Indeed, by using the
   filtered set of CRSs (Input-set2), we found twice as many unique
   CRS-containing clusters, with 523 and 1491 from the 5′ and 3′ UTRs
   respectively, compared to Input-set1 where we found 372 and 532
   clusters from the 5′ and 3′ UTRs respectively (Table [119]1). Hence, we
   focus on the clustering results from Input-set2 in rest of the
   manuscript.

Quality of the clusters

   To assess the quality of the clusters, we analyzed the alignment
   length, fraction of ungapped sequences, and fraction of paired
   nucleotides. The distribution of clustered CRS sequences based on
   paired and ungapped nucleotides and alignment block length is depicted
   in Figure [120]3(A, B). Most of the clustered paralog CRS sequence
   alignment lengths ranged from 50 to 100 columns, comprising around 70%
   of the total. Around 15% of the CRS sequences belonged to the alignment
   block length of 30–50 columns, which was shorter than the average input
   CRS sequence length (∼82–84nt) due to RNAscClust decomposing the CRS
   into substructures during clustering ([121]21). The remaining ∼15%
   belonged to clusters with an alignment block >100 columns, resulting
   from the inclusion of gaps in the clustered sequence alignment relative
   to the consensus structure of the CRSs in the cluster, which is
   possible when uneven CRS sequence lengths are clustered together by
   RNAscClust.

   The standard deviation (SD) of the CRS sequence length per cluster
   (Figure [122]3 C, D) shows that there are at least ∼10% (5′ UTR: 110;
   3′ UTR: 318) of the total clusters that had the SD >10. We manually
   inspected the clusters with higher SD (>10) in sequence length, and
   found in many of the cases there were the inclusion of either a longer
   sequence or a minority of smaller sequences in the cluster. This
   affects the overall alignment and the consensus structure due to the
   insertion of many gaps or missing/incompatible base pairs. Although,
   the RNAscClust is benchmarked using the Rfam-ome data to achieve a high
   recall and precision scores for identifying similar structures (initial
   candidate clusters) based on the dot products of sparse feature vectors
   induced in the pipeline ([123]21), there is always a small chance for
   wrongly assigning a sequences to a cluster during the iterative
   clustering step that invokes GraphClust methodology which could yield
   the clusters with higher SD in the sequence length.

Post processing of the clusters

   To enhance the quality of clusters, we adopted an iterative approach.
   In a first step, we eliminated outlier CRSs within each cluster by
   applying a length filter. Specifically, we removed sequences that
   deviated from the median length of CRSs within the cluster by either
   being 10 nucleotides longer or shorter (refer to the Methods section,
   ‘Post-cluster processing’ for more information). Following this
   filtering step, we removed 691 CRSs from the 5′ UTR clusters and 2320
   CRSs from the 3′ UTR clusters. Next, we focused on clusters that
   contained a minimum of three CRSs. These clusters were realigned, and
   consensus structures were predicted. By performing this step, we
   improved the consistency of the sequences and quality of the clustered
   CRS alignment, resulting in a higher fraction of ungapped sequences
   (Figure [124]4A, B)

Figure 4.

   [125]Figure 4.
   [126]Open in a new tab

   Quality of clustered CRSs after post-cluster processing from (A) 5′
   UTRs and (B) 3′ UTRs. The x-axis represents the fraction of the
   ungapped CRS sequence (UGS) that is clustered in a cluster. The y-axis
   represent the fraction of paired nucleotide in the CRS (PNT). Each dot
   in the scatter plot represent a CRS that is clustered in a cluster. The
   color of the dots represent the length of the clustered CRS alignment
   block. The pie chart shows the fraction of CRSs that belong to an
   alignment of certain length range. The dotted grey lines shows the 50%
   thresold on the fraction of ungapped sequence and paired nucleotide.
   The first two rows in the tables show the total clusters and the CRSs
   that were selected for the realignment after the removal of CRSs based
   on the their length threshold. The CRSs falling in the range: PNT <50;
   UGS > 50 were removed from the cluster and the remaining CRSs were
   realigned. A cluster is chosen for downstream analysis, if it contained
   at least 3 selected CRS and has at least 20% of paired nucleotides in
   the alignment.

   In the second step, we removed clustered CRSs that had more than 50%
   unpaired nucleotides (refer Methods, ‘Post-cluster processing’)
   compared to the consensus structure. This resulted in only 46 and 205
   CRSs being removed from the 5′ and 3′ UTR clusters, respectively. We
   then selected only those clusters left with at least 3 CRSs, realigned
   them, and predicted the consensus structure again. In total, 57 and 287
   CRSs were removed from the 5′ and 3′ UTR clusters, respectively,
   because the cluster size was < 3.

   In total, combining the first and second steps and disregarding
   clusters with < 3 CRSs, we only removed 794 of 4647 and 2812 of 16 972
   clustered CRSs from the 5′ and 3′ UTR clusters, respectively. In the
   subsequent step, we further selected the clusters by filtering them for
   at least 20% paired nucleotides in the consensus structure, ensuring
   the presence of RNA structure. We removed 61 clusters corresponding to
   298 CRSs from the 5′ UTRs and 252 clusters corresponding to 1469 CRSs
   from the 3′ UTRs, respectively.

   The resulting 3522 and 12 570 CRSs corresponding to 793 and 2403
   clusters from 5′ and 3′ UTR were of high quality, with a average median
   FDR of approximately 13% (Figure [127]4; [128]Supplementary File 1).
   Most of the CRSs with higher FDR were eliminated during the
   post-cluster processing steps, indicating that the retained CRSs were
   of high quality. The distribution of the FDRs of the CRSs for the top
   20 clusters ordered based on the median FDR values per cluster is shown
   in [129]Supplementary Figure S2. In total 59 of 793 (7%) clusters in 5′
   UTR and 355 of 2403 (15%) clusters in 3′ UTR contained > 50% of CRSs
   with <10% FDR. Additionally, 489 of 793 (62%) clusters in 5′ UTR UTR
   and 1582 of 2403 (66%) clusters in 3′ UTR exhibited statistically
   significant covariation (R-scape v0.3.2 E-values <0.05; see
   [130]Supplementary File 1, Sheets 3–4; Supplementary Data,
   Supplementary Note, [131]Supplementary Figure S3). In line with the
   purpose of RNAscClust to cluster the paralog genes, we obtained 12
   clusters in the 5′ UTRs and 41 clusters in the 3′ UTRs (with sequence
   identity ranging from 25% to 75% and GC content ranging from 25% to
   65%) that comprise 24 and 102 paralogous genes, respectively. These
   findings provide additional support for our clustering process we
   introduce, and the functionality of the RNAscClust pipeline (see
   [132]Supplementary File 1, Sheets 5–6).

   We further analyzed dimethyl sulfate (DMS) signals, which react with
   unpaired adenine and cytosine residues, from DMS-seq experiments
   conducted on human K562 cells ([133]51). The strand-specific DMS-seq
   experiments included denatured, in vivo and in vitro samples, with the
   denatured sample serving as an ‘unstructured’ control. First, we
   intersected all positions with DMS signal scores with gene annotations
   from GENCODE v33. We found that over 99% of the positions intersected
   on the same strand as the annotations, while about 11% of the DMS
   signals are mapped to the opposite strand of gene annotations. These
   numbers are as expected as DMS-seq experiments probe the structure of
   transcribed RNA. Next, we intersected the UTR-specific Rfam families
   with the DMS signals. Out of 258 Rfam annotations spanning to 42 Rfam
   families ([134]Supplementary File 2, Sheets 5, 6), 51 instances (20%)
   overlapped with positions with DMS signals, which represented 21 (50%)
   of the Rfam families ([135]Supplementary File 6, Sheets 1–3). All
   overlapping DMS signals were from the same strand as the Rfam
   annotations, and none overlapped with the opposite strand. We found
   that the DMS signals for positions overlapping with Rfam were
   significantly lower (Wilcoxon signed-rank test, P[adj] <0.05) compared
   to UTRs with no Rfam overlap in general, but also among the paired
   compared to the unpaired nucleotides ([136]Supplementary Figure S5),
   confirming the structured nature of the Rfam instances.

   Since we did not include the reverse complementary versions of the CRSs
   in our clustering analysis, we may have missed clustering CRSs on the
   reverse complementary strand of those currently used in our input set,
   which might still be supported by DMS signals. Therefore, we
   intersected the clustered CRSs with the DMS signal scores on both the
   same strand and the opposite strand of the predicted CRSs. We found
   that 516 (15%) of 3522 clustered CRSs in the 5′ UTR and 3735 (30%) of
   12 570 clustered CRSs in the 3′ UTR overlap with DMS signals. In terms
   of clusters, at least one CRS overlaps with a DMS signal in 379 (48%)
   of 793 clusters in the 5′ UTR and 1869 (78%) of 2403 clusters in the 3′
   UTR ([137]Supplementary File 6). We show that paired nucleotides in the
   clustered CRSs have significantly lower DMS signal scores (Wilcoxon
   signed-rank test, P[adj] < 0.05) compared to unpaired nucleotides and
   the rest of the UTR without Rfam or CRS overlap ([138]Supplementary
   Figure S5), in both in vivo and in vitro DMS-seq experiments. As
   anticipated, this signal was far less prominent in the control
   denatured DMS-seq experiment. This corresponded to 503 and 3674 CRSs
   representing 370 and 1859 clusters in the 5′ and 3′ UTRs, respectively
   ([139]Supplementary File 6). Of these, slightly more than 50% of CRSs,
   255 in the 5′ UTR and 1975 in the 3′ UTR, overlapped the DMS signals on
   the same strand. We also compared the mean DMS signal scores from in
   vitro and in vivo experiments for the paired and unpaired nucleotides
   in the consensus structure per cluster, focusing on signals that
   overlap on the same strand as the clustered CRS. In terms of fold
   change (mean DMS score for unpaired/mean DMS score for paired
   positions), we found that 87 CRSs from 84 clusters (22% of 379
   clusters) in the 5′ UTR and 702 CRSs from 590 clusters (32% of 1869
   clusters) in the 3′ UTR exhibited a 1.5-fold lower DMS signal at paired
   positions compared to unpaired positions ([140]Supplementary File 7).

   In total, 7.3% (32 516 of 447 502) of all paired positions in the
   clustered CRSs, corresponding to the overall consensus structures from
   the 5′ and 3′ UTRs, overlapped with the DMS signals. We evaluated the
   precision and recall of the predicted base pairs in the consensus
   structure by examining the DMS signals that overlap with the base pairs
   projected onto the clustered CRSs on the same strand. We defined True
   Positives (TP) as base pairs that overlapped with low DMS signal scores
   (<0.4, <0.3 and <0.2), false negatives (FN) were identified as low DMS
   signal scores overlapping with unpaired bases, and false positives (FP)
   as high DMS signal scores overlapping with paired bases (>0.6, >0.7 and
   >0.8). We found that the overall precision for the predicted base pairs
   ranged from 0.76 to 0.84, and the recall ranged from 0.47 to 0.50
   across different thresholds of the DMS signal score in the DMS-seq in
   vivo and in vitro experiments ([141]Supplementary File 8; Sheet 1). We
   also evaluated the precision and recall values for the predicted pairs
   per cluster, considering the mean paired and unpaired positions of the
   clustered CRSs corresponding to the consensus structure. The mean
   precision and recall for the predicted base pairs across clusters from
   the 5′ and 3′ UTRs ranged between 0.77 to 0.80 and 0.49 to 0.50,
   respectively. The complete list of clusters with estimated precision
   and recall using different thresholds of DMS is provided in
   [142]Supplementary File 9. In conclusion, the probing data supports the
   structured nature of our CRS clusters.

Recovery of known regulatory RNA elements from Rfam and EvoFam

   To assess whether the clusters captured known structure families, we
   compared the genomic coordinates of clustered CRSs with the annotated
   regions from Rfam ([143]37) and EvoFam ([144]38) databases. In total,
   we identify 26 Rfam (v14.4) families in the 5′ UTRs and 19 families in
   the 3′ UTRs that exhibit evidence of being UTR-specific regulatory
   elements (please refer to the methods section for the definition and
   [145]Supplementary Tables S1 and [146]S2). None of these 26 Rfam
   families from the 5′ UTR overlap with at least two CRSs from different
   genomic loci. Hence we did not expect any of these to be recovered in
   the clusters. Similarly, we identify 19 families in the 3′ UTRs. Among
   the 19 families in the 3′ UTRs, only two Rfam families namely Histone3
   and Selenocysteine insertion sequence (SECIS) 1 stem-loops, overlap
   with at least two CRSs from different genomic loci. We successfully
   identify the Histone3 Rfam family in cluster #1650 containing 7 CRSs,
   with 4 of them overlapping distinct histone genes. Two of these
   clustered CRSs from different genomic loci exhibit complete overlap
   with the Rfam Histone 3′ UTR stem-loop RNA structure motifs, covering
   them 100% (Figure [147]5). Based on our GO enrichment analysis, we also
   find that this cluster is significantly associated with the GO term
   ‘nucleosome assembly’, which further links it to the Histone3 RNA
   motif.

Figure 5.

   [148]Figure 5.
   [149]Open in a new tab

   Histone 3′ UTR stem–loop Rfam motif identified by RNAscClust. (A)
   Consensus secondary structure based on the sequence-structure alignment
   of the clustered CRS paralog sequences from human genome. The stem–loop
   structure is characteristic for the histone 3′ UTR stem-loop motif
   (Rfam [150]https://rfam.xfam.org/family/RF00032). (B) and (C) UCSC
   browser graphic showing two CRSs (M1300789, M2304799) from the cluster
   #1650 that overlap with the Histone3 Rfam motif highlighted in the blue
   shaded region.

   In the case of the SECIS 1 Rfam family, which only has 8 motifs
   identified in the human genome (see [151]Supplementary File 2, Sheet
   5), we observed three instances where these motifs overlapped with
   distinct CRSs, with 100% coverage by either of them. Upon closer
   examination, we noticed that these CRSs covered different lengths of
   human sequences in their alignments: 179nt (M1416351), 63nt
   (M2009617) and 51nt (M2061896), respectively (see [152]Supplementary
   File 2, Sheet 1). Further investigation revealed that these CRSs had
   different secondary structures, showing no local similarity. While the
   CRSs M1416351 and M2009617 fully cover the Rfam SECIS 1 motif, the CRS
   M2061896 covers the SECIS 1 motif by only 71%, and notably, is located
   on the opposite strand as the annotated motif. Given this difference in
   strand orientation, we didn’t expect this CRS to cluster with others
   overlapping the SECIS 1 motif. This is because the CRS predicted by
   CMfinder on the reverse complementary strand may not share any
   similarity with the Rfam annotation.

   Of the four CRSs (M2304799, M1300789, M1012301, M1012291) spanning the
   3′ UTR of histone genes in cluster #1650 (see [153]Supplementary File
   1), the CRSs M1012301 and M1012291 are located on the same strand as
   the annotated histone Rfam motifs whereas CRSs M1300789 and M2304799
   are located on the opposite strand ([154]Supplementary File 2, Sheet
   1). The former CRSs consist of a short hairpin with the loop sequences
   ‘UUUA’ and ‘UUUC’, and the latter CRSs also exhibited a short hairpin
   but with the reverse complementary loop sequence ‘GAAA’ in both cases.
   Despite being on the opposite strand, the CRSs M1300789 and M2304799
   have almost the identical secondary structure as Histone3.
   Consequently, we were able to cluster these CRSs in one cluster using
   RNAscClust and identify the Rfam annotations on the opposite strand. In
   contrast, SECIS 1 forms a longer stem with two interior loops. As a
   result, the reverse complementary structure does not correspond to the
   colocalized CRS predicted on the opposite strand. The complete list of
   Rfam families overlapping with the CRSs in 5′ and 3′ UTR is provided in
   [155]Supplementary File 2, which also include the total annotated
   instances for each Rfam family in Sheets 5 and 6.

   In our analysis of EvoFam families, we identified three instances
   overlapping with the 5′ UTR, one of which intersected with at least two
   CRSs from different genomic locations. We successfully recovered this
   EvoFam family (see [156]Supplementary File 2, Sheet 4) in the clusters,
   exhibiting complete overlap between its hits and the clustered CRSs,
   achieving 100% coverage. Similarly, we detected 13 EvoFam families
   overlapping with the 3′ UTR, seven of which intersected with at least
   two CRSs from distinct genomic loci. Among these, we managed to
   retrieve two families within the clusters, with their hits completely
   overlapping (100% coverage) with the clustered CRSs (see
   [157]Supplementary File 2, Sheet 3). Notably, this includes cluster
   #1650 containing histone genes.

   Upon further investigation of the remaining five EvoFam families not
   retrieved in the clusters, we observed that CRSs exhibited greater
   length compared to the EvoFam motifs, with approximately 30% of CRS
   length overlapping 100% of the EvoFam motifs. Furthermore, many of
   these CRSs were situated on the opposite strand relative to the EvoFam
   motifs (see [158]Supplementary File 2, Sheet 3). Given the longer CRSs,
   which are also reverse complementary to the EvoFam motifs on the
   opposite strand, it is unlikely that the reverse complementary
   structure necessarily corresponds to the EvoFam motif. Consequently, we
   do not anticipate these CRSs to cluster together.

Overrepresented GO categories and Pathways in clusters

   To categorize the clusters according to their functional attributes, we
   performed a functional enrichment analysis by utilizing the GO and
   Pathway annotations associated with protein-coding genes that harbor
   the CRSs within their UTR regions. Our hypothesis was that if CRSs
   within clusters share common functionally significant structural
   elements, their corresponding host genes would also exhibit shared
   functional attributes.

   To test this hypothesis, we first performed a GO overrepresentation
   analysis on each cluster, employing the GO annotation provided by the
   GO Consortium ([159]25) (refer to the Methods section, ‘GO and Pathway
   overrepresentation analysis’ for more information). As a result, we
   identified 104 of the 793 (13%) clusters in 5′ UTR and 441 of 2403
   (18%) clusters in 3′ UTR that demonstrated statistically significant
   overrepresentation of at least one GO term (P[adj] < 0.05; at least
   three genes in the cluster with the annotation; FC enrichment > 1.5).
   In Figure [160]6, we present the top 10 GO terms across the three
   domains: ‘biological processes’, ‘molecular functions’, and ‘cellular
   components’. These terms have been ranked based on the number of
   clusters in which they are overrepresented, after eliminating any
   redundant terms that essentially are subsets of higher-level GO terms
   in the ontology hierarchy.

Figure 6.

   [161]Figure 6.
   [162]Open in a new tab

   Clusters overrepresenting GO terms. The top 10 overrepresented
   (adjusted P-value <0.05) GO terms from all three GO classes (y-axis),
   among the clusters (x-axis) from (A) 5′ UTR and (B) 3′ UTR are shown.
   The redundant GO terms were excluded.

   Several notable GO terms that were overrepresented in the clusters were
   uncovered. In the domain of biological processes, terms such as
   ‘regulation of transcription by RNA polymerase II’, ‘cell
   differentiation’, ‘nervous system development’ and ‘signal
   transduction’ stood out. Under molecular functions, terms like ‘DNA
   binding’, ‘nucleotide binding’, ‘identical protein binding’, ‘metal ion
   binding’, ‘transferase activity’ and ‘kinase activity’ were prominent.
   In case of cellular components, significant terms included ‘nucleus’,
   ‘cytoplasm’, ‘plasma membrane’, ‘extracellular region’, ‘synapse’,
   ‘cell projection’ and ‘cell junction’, among others. The complete list
   of overrepresented GO terms and associated genes for each cluster are
   available in [163]Supplementary File 3. Many of these GO terms are
   directly related to neuron development, which aligns with our previous
   study highlighting structured elements with enriched expression in
   mouse brains ([164]39).

   Our analysis further highlight certain functional activities that are
   associated with either 5′ UTRs or the 3′ UTRs based on the GO terms
   within the clusters of CRSs. Clusters containing 3′ UTR CRSs
   demonstrate the overrepresentation of following GO terms related to
   biological processes—‘mRNA processing’, ‘RNA splicing’, ‘protein
   phosphorylation’, ‘protein transport’, ‘regulation of gene expression’
   and ‘RNA splicing’. These terms highlight the crucial roles played by
   3′ UTRs in processes such as alternative splicing, mRNA stability,
   translation efficiency, and localization that are well known ([165]40).
   Additionally, molecular functions within these clusters exhibit GO
   terms such as ‘protein kinase binding’, ‘phospho-protein phosphatase
   activity’, ‘ubiquitin protein ligase activity’ and ‘enzyme binding’.
   The cellular components of these clusters are characterized by GO terms
   like ‘protein-containing complex’, ‘apical plasma membrane’,
   ‘postsynaptic membrane’ and ‘neuronal cell body’.

   In contrast, clusters associated with 5′ UTR CRSs show the following
   overrepresented GO terms. The biological processes within these
   clusters include terms like ‘G protein-coupled receptor signaling
   pathway’, ‘skeletal system development’, ‘blood vessel development’ and
   ‘collagen fibril organization’. Molecular functions in these clusters
   involve GO terms such as ‘receptor antagonist activity’, ‘SMAD
   binding’ and ‘extracellular matrix structural constituent conferring
   tensile strength’. Moreover, the cellular components in these clusters
   are characterized by GO terms such as ‘mitochondrial matrix’ and
   ‘collagen trimer’ ([166]Supplementary File 3; Figure [167]6). Of
   particular significance is the presence of SMAD (Suppressor of Mothers
   against Decapentaplegic) binding in relation to 5′ UTRs. SMAD protein
   family are part of the TGF-β signaling pathway and negatively regulate
   the growth of epithelial cells. There are known SMAD proteins that are
   reported to bind to 5′ UTR and regulate the expression of genes
   ([168]41). Overall, these findings emphasize the distinct functional
   roles played by 3′ UTRs and 5′ UTRs, as evidenced by the
   overrepresentation of GO terms associated with various biological
   processes, molecular functions, and cellular components.

   Subsequently, we conducted a Pathway overrepresentation analysis on
   each cluster by leveraging the annotations provided by the Kegg and
   Reactome pathway databases. As a result, we identified 15 of the 793
   (2%) clusters 5′ UTR and 87 of the 2403 (4%) clusters in 3′ UTR that
   demonstrated statistically significant overrepresentation of at least
   one pathway (P[adj] <0.05; at least 3 genes in the cluster with the
   annotation; FC > 1.5).

   The clusters comprising CRSs from both 5′ and 3′ UTRs demonstrate
   prominent overrepresented pathways, including ‘Signal Transduction’,
   ‘Gene expression (Transcription)’, ‘Metabolism of proteins’, ‘Immune
   System’ and ‘Post-translational protein modification’, among others.
   Here we show the top 15 pathways in Figure [169]7 that were selected
   based on their frequency of overrepresentation within the clusters.

Figure 7.

   [170]Figure 7.
   [171]Open in a new tab

   Clusters over-representing Pathways. The top 15 overrepresented
   (adjusted P-value < 0.05) pathways (y-axis) among the clusters (x-axis)
   from (A) 5′ UTR and (B) 3′ UTR are shown.

   Furthermore, there are distinct pathways uniquely overrepresented in
   clusters containing either 3′ UTR or 5′ UTR CRSs. For clusters with 3′
   UTR CRSs, these pathways include ‘Gene Silencing by RNA’, ‘Membrane
   Trafficking’ and ‘Nervous system development’. On the other hand,
   clusters with 5′ UTR CRSs exhibit unique overrepresented pathways such
   as ‘Collagen biosynthesis and modifying enzymes’, ‘Collagen
   degradation’, ‘Extracellular matrix organization’ and ‘Binding and
   Uptake of Ligands by Scavenger Receptors’.

   The overrepresented pathways within the clusters directly align with
   the overrepresented GO terms we observed previously (Figure [172]6).
   For a comprehensive list of overrepresented pathways in clusters
   containing either 5′ or 3′ UTR CRSs, along with the associated gene
   lists, please refer to [173]Supplementary File 4. Overall, our findings
   provide valuable insights into the functional implications of the
   clusters, warranting further investigation and validation.

Enrichment of clustered CRSs for the binding sites of RNA-binding proteins

   RNA structures are well known to interact with proteins ([174]42), to
   investigate if our set of clustered CRSs are the binding sites for
   specific RBPs, we gathered a comprehensive catalog of 150 RBPs and
   their binding sites from the ENCODE project phase III ([175]43). We
   then investigated if these RBPs were disproportionately present in the
   clusters located in the 5′ and 3′ UTRs. Our approach involved comparing
   the extent of nucleotide coverage by each RBP binding site within our
   clustered CRSs against the background, which included all CRSs in the
   UTR regions used for clustering. Our findings revealed that out of the
   150 RBPs, the binding sites for 60 and 43 RBPs showed significant
   overrepresentation (P[adj] <0.05) in terms of coverage across the CRS
   clusters from the 5′ and 3′ UTRs, respectively. Among these, 6 RBP
   (DKC1, HNRNPC, HNRNPK, KHDRBS1, NIP7, U2AF1) binding sites showed
   common enrichment for coverage among the CRSs or clustered CRSs from
   both the 5′ and 3′ UTRs with respect to the UTR. (Figure [176]8;
   [177]Supplementary File 5). The top 15 RBPs with overrepresented
   binding sites, based on the proportion of CRSs within a cluster
   co-localizing with the binding sites (P[adj] < 0.05; FC > 1.5, and with
   at least two binding sites; see methods), in 5′ and 3′ UTRs
   respectively, are shown in Figure [178]9. These RBPs are ordered by the
   number of clusters in which they are overrepresented. The complete list
   of overrepresented RBPs across the 5′ and 3′ UTR CRS clusters can be
   found in [179]Supplementary File 5 (Sheets 3-4)

Figure 8.

   [180]Figure 8.
   [181]Open in a new tab

   Enrichment of CRSs for RBP binding sites. The enrichment for 150 RBP
   binding sites of the clustered CRSs (foreground set of CRSs, labeled on
   top) located in UTRs is shown. It has been computed by comparing it
   against a background defined as the fraction of nucleotides (nt)
   covered by each RBP in the total RBP covered UTR region (nt). The
   comparison was done using the two-sample Z-test for proportions, and
   the P-values were adjusted for multiple testing using the BH method.

Figure 9.

   [182]Figure 9.
   [183]Open in a new tab

   Clusters overrepresenting RBPs. The top 11 and 15 RBPs, respectively,
   (y-axes) with their binding sites overrepresented (adjusted P-value
   < 0.05) among the clusters (x-axes) from (A) 5′ UTR and (B) 3′ UTR are
   shown.

   We observed a particular overrepresentation of binding sites for DDX3X
   (DEAD-Box Helicase 3 X-Linked) (Figure [184]8), a human RNA helicase
   implicated in many important cellular processes and PRPF8
   (Pre-mRNA-processing-splicing factor 8) protein which is one of the
   largest and most highly conserved nuclear proteins occupying a central
   position in the catalytic core of the spliceosome in the clusters from
   the 5′ UTR. These RBPs have previously been associated with selective
   modulation of translation rates based on the 5′ UTR structures
   ([185]44). Additionally, PUM (Pumilio) proteins are known to bind the
   Pumilio recognition/response element (PRE) typically found in the 3′
   UTR of target mRNAs ([186]45). Notably, we discovered an
   overrepresentation of PUM2 binding sites colocalizing with 6 clusters
   from the 3′ UTR (Figure [187]8,[188]9; [189]Supplementary File 5, Sheet
   3).

   Furthermore, one of the clusters from the 3′ UTR (cluster #12)
   exhibited an overrepresentation of binding sites for DKC1 (Dyskerin
   pseudouridine synthase 1) protein (Figure [190]8; [191]Supplementary
   Files 1,5). DKC1 is known for its involvement in telomerase
   stabilization, maintenance, and recognition of snoRNAs with H/ACA-box
   motifs ([192]46) (RefSeq, January 2014). The sequences within this
   cluster, as well as cluster #2697 where at least one CRS of seven
   colocalizes with a DKC1 binding site, displayed consensus secondary
   structures indicative of snoRNAs (Figure [193]10). Upon closer
   examination, we found that three out of four sequences from cluster #12
   and at least one sequence from cluster #2697 overlapped with snoRNA
   annotations in GENCODE within the same region of the 3′ UTRs.
   Furthermore, these overlapping regions were also annotated as snoRNAs
   with conserved H/ACA-box motifs in Rfam. However, the remaining one
   sequence from cluster #12 and 7 sequences from cluster #2697 currently
   lack associations with any snoRNA annotations. This indicates that our
   set of clusters can serve as a valuable reference for identifying new
   RNA structure motifs.

Figure 10.

   [194]Figure 10.
   [195]Open in a new tab

   Example of Clusters The CRS cluster alignment and the consensus
   structure are shown for (A) cluster #12, where the CRS M2543650 and
   M1251899 overlap with the SNORA56 (RF00417), M2009856 overlaps SNORA77
   (RF00599) and (B) cluster #2697, where the CRS M1538378 overlap with
   SNORA18 (RF00425). All the overlapping snoRNAs belong to H/ACA box
   class.

Discussion

   Here, we studied CRSs colocalized with UTRs to explore the function of
   the corresponding coding genes in the clustering to assign potential
   regulatory roles for the clustered CRSs. Although the number of known
   Rfam families in this CRS-UTR overlap is limited, and it would be
   desirable to extend the analysis over CRS-Rfam, this would require a
   different strategy and is therefore beyond the scope in this study. We
   adopted a methodology that involved clustering CRSs based on shared
   structural features, specifically focusing on conserved base pairs.
   Next, we utilized the clustered CRSs and their associated host genes to
   explore potential overrepresentation of specific biological pathways
   and gene ontology terms. We further analyzed the coverage of RBP sites
   within the clusters, which provided valuable insights into their
   functional characteristics.

   To accomplish our objectives, we employed the RNAscClust pipeline to
   cluster CRSs from the UTRs in the human genome. We demonstrated the
   sensitivity of RNAscClust to the input set of CRSs used for clustering.
   To enhance the overall clustering performance of RNAscClust, we
   introduced two simple approaches, one at the pre-clustering stage and
   the other at the post-clustering stage. At the pre-clustering step, we
   implemented a method to select a representative subset of CRSs for
   clustering, thereby removing overlapping CRSs with common structures
   that would otherwise be assigned to the same cluster. We show that
   without this prefilter the overall effectiveness of finding clusters of
   CRSs from different genomic regions is reduced. At the post-clustering
   stage, we employed an iterative processing approach. Initially, we
   removed outlier sequences from the obtained clusters based on a
   sequence length threshold. We then realigned and predicted consensus
   structures. Subsequently, we eliminated sequences from the clusters
   with >50% unpaired nucleotides. This was followed by realignment of the
   remaining CRSs in the cluster and the prediction of consensus
   structures. Overall, we only removed approximately 14% of the CRSs
   assigned to a cluster, resulting in clusters with enhanced alignment
   quality and consensus structures. We also utilized structure probing
   data generated from DMS-seq experiments conducted on human K562 cells.
   Our analysis showed that 4703 (30%) of the 16 092 CRS clusters in our
   list overlapped with DMS-seq signals. Base pairs in the clustered
   individual CRSs corresponding to the consensus structure exhibited
   significantly lower DMS signal scores compared to unpaired nucleotides
   and the rest of the UTR without CRS overlap. This finding supports the
   structured nature of the majority of our CRS clusters. It should be
   noted that individual CRSs in the cluster may have additional base
   pairs not present in the consensus structure that could be supported by
   DMS signal scores. Despite this, we still observed a statistically
   significant difference between the paired and unpaired positions
   according to the consensus. Additionally, since we used DMS-seq data
   from a single cell line, incorporating data from more samples and other
   cell lines could potentially provide better DMS-seq coverage.

   By functionally classifying the clusters by GO terms of the host gene,
   we were able to recover known RNA structure families and discover
   potential biological processes and molecular functions associated with
   the grouped set of CRSs. Additionally, we found clusters where some
   CRSs could be linked to established Rfam functional structures,
   indicating that the remaining CRSs within those clusters may hold
   promise as novel motifs. These motifs could be further investigated
   using various structure probing technologies ([196]47), along with
   experimental and functional validation, to gain insights into RNA
   structure-function relationships.

   As CMfinder predictions might also suggest a conserved RNA structure on
   the reverse complementary strand, the clustering search space can be
   increased by adding the reverse-complemented versions of CRSs to the
   input. The structures of CRSs and their reverse complement may be
   different why their inclusion will broaden the coverage of important
   structured regulatory elements. Additionally, to expand the clustering
   of CRSs to the next stage, one could leverage the cmbuild and
   cmcalibrate programs from the Infernal toolsuite to build and calibrate
   covariance models (CMs) using the obtained cluster of CRSs.
   Subsequently, the cmpress and cmscan programs from Infernal could be
   used to scan the entire genome and identify genomic loci with secondary
   structures strongly resembling the clusters of CRSs. Moreover, the
   methodology outlined in this study can be extended to cluster CRSs
   derived from various non-coding RNAs, including long non-coding RNAs
   (lncRNAs) known for their tendency to possess intricate secondary and
   tertiary structures. These structural features in the lncRNA gene
   biotype frequently play a vital role in determining their
   functionality. In many cases, it is the conservation of these
   structures, rather than the conservation of the primary sequence, that
   governs their functional significance. An additional interesting area
   for investigation involves the exploration of G-quadruplex structures.
   These structures consist of G-quartets connected by loop nucleotides
   within Guanine (G)-rich sequences in nucleic acids. G-quadruplexes are
   recognized for their pivotal regulatory roles in various biological
   processes, including, but not limited to, DNA replication,
   transcription, and translation ([197]48). Since the guanine in
   G-quadruplexes interacts with two other guanines, they are not
   described by the RNA secondary structures that we are clustering here.
   However, examining the intersection between known G-quadruplex
   structures ([198]49,[199]50) and our list of CRS and clusters presented
   in this study could aid in the identification and categorization of
   evolutionarily conserved G-quadruplex motifs.

Conclusion

   In this study, we present a comprehensive catalog consisting of 793
   clusters from the 5′ UTR and 2403 clusters from the 3′ UTR,
   representing the largest collection of CRS clusters reported for these
   regions in the human genome. Additionally, we perform functional
   characterization of these clusters by examining the overrepresentation
   of RBP sites and conducting functional enrichment analysis using GO and
   pathway annotations.

   Our findings reveal clusters in the 5′ and 3′ UTRs that exhibit
   significant enrichment of binding sites for specific groups of RBPs.
   These clusters serve as valuable references for discovering novel RNA