Abstract

   MicroRNAs (miRNAs) are small regulatory molecules that repress the
   translational processes of their target genes by binding to their 3′
   untranslated regions (3′ UTRs). Because the target genes are
   predominantly determined by their sequence complementarity to the miRNA
   seed regions (nucleotides 2–7) which are evolutionarily conserved, it
   is inferred that the target relationships and functions of the miRNA
   family members are conserved across many species. Therefore, detecting
   the relevant miRNA families with confidence would help to clarify the
   conserved miRNA functions, and elucidate miRNA-mediated biological
   processes. We present a mixture model of position weight matrices for
   constructing miRNA functional families. This model systematically finds
   not only evolutionarily conserved miRNA family members but also
   functionally related miRNAs, as it simultaneously generates position
   weight matrices representing the conserved sequences. Using mammalian
   miRNA sequences, in our experiments, we identified potential miRNA
   groups characterized by similar sequence patterns that have common
   functions. We validated our results using score measures and by the
   analysis of the conserved targets. Our method would provide a way to
   comprehensively identify conserved miRNA functions.

   Keywords: Mixture model, MicroRNA, EM algorithm, Machine learning,
   Position weight matrix, Sequence analysis

Introduction

   MicroRNAs (miRNAs) are a kind of small noncoding RNA which mediate a
   wide variety of biological processes, including development,
   differentiation, and metabolism ([28]Ambros, 2004; [29]Bartel, 2004;
   [30]Bushati & Cohen, 2007). The molecules regulate gene expression by
   repressing the translation of the target mRNAs or by directly cleaving
   them, with the RNA-induced silencing complex (RISC). In this regulatory
   process, nucleotide positions 2–7 of the miRNAs play an important role
   in the selection of target gene, and are known as seed regions
   ([31]Bartel, 2009).

   The conservation of miRNAs and their targets has been reported
   previously ([32]Altuvia et al., 2005). For example, the relationship
   between let-7 miRNA and lin-41, one of its targets, is broadly
   conserved in many species, and let-7 and lin-4 miRNAs also bind to a
   conserved target gene, lin-28, in mammals and nematodes
   ([33]Pasquinelli et al., 2000; [34]Moss & Tang, 2003). Several studies
   have also shown that a number of miRNA-target relationships are
   conserved in plants ([35]Floyd & Bowman, 2004; [36]Axtell & Bartel,
   2005). Because the target genes are identified based on the binding of
   the conserved seed sequences of the miRNAs, the target mRNAs and their
   functions are considered to be evolutionarily conserved among the
   seed-sharing miRNAs ([37]Lewis, Burge & Bartel, 2005; [38]Friedman et
   al., 2009).

   One approach to defining the miRNA functions is to search for miRNAs
   that are differentially expressed under specific conditions. Various
   analyses have successfully identified the miRNAs associated with a
   particular biological state using miRNA expression profiles ([39]Chen
   et al., 2006; [40]Liu et al., 2012; [41]Zhang et al., 2012; [42]Tzur et
   al., 2008). Recently, a study based on an analysis of miRNA profiles
   with a deep sequencing approach showed that the let-7 family,
   well-conserved miRNA sequences, is highly expressed in human embryonic
   stem cells and shares downstream targets ([43]Koh et al., 2010).
   However, many researchers have demonstrated that the expression
   patterns of conserved miRNAs are not strictly conserved ([44]Ason et
   al., 2006; [45]Bak et al., 2008). Moreover, changes in their expression
   levels might be not an origin necessarily causing the particular
   biological changes but a result from specific circumstances or
   environmental conditions.

   Another way of investigating miRNA functions is to use a low-level
   animal model or other species. Using transgenic or knock-out
   experiments, several studies have identified the biological roles of
   miRNAs ([46]Costinean et al., 2006; [47]Park, Choi & McManus, 2010).
   Lindow et al. studied orthologues of the targets of conserved miRNAs to
   identify conserved regulatory interactions between miRNAs and their
   targets in different species ([48]Lindow et al., 2007). Recently, an
   experiment was designed to study the regulatory functions of miRNAs for
   a pathway by identifying conserved miRNAs and their targets in humans
   and Drosophila ([49]Hyun et al., 2009).

   Although the functions of several miRNAs have been reported in previous
   studies, the functions of the majority of miRNAs remain poorly
   understood. To effectively advance the research into miRNA functions
   using conservation information, a careful construction of miRNA
   functional families is essential. Here, we describe a method for
   identifying conserved sets of miRNAs to clarify their conserved
   functions. In this paper, we define the miRNA functional family as a
   set of miRNAs which share common functions. These miRNA families are
   represented by their sequences and share their target relationships.
   Therefore, a study of the miRNA functional family should extend our
   understanding of the shared functions of miRNAs across species.

   We develop a method of identifying miRNAs which perform similar
   function based on a mixture model of position weight matrices (PWMs)
   derived from miRNA sequences. The PWM is a commonly used representation
   model in biological sequence analyses, constructed by calculating the
   frequency of each specific base (A, T, G and C) at each nucleotide
   position in the motif sequence sets. The PWM model has been
   successfully applied to diverse problems in DNA and protein sequence
   analysis, and has demonstrated its usefulness in identifying functional
   sequence elements ([50]Bailey & Elkan, 1995; [51]Hannenhalli & Wang,
   2005; [52]Orenstein, Linhart & Shamir, 2012).

   In this work, we systematically generated the PWMs from whole mammalian
   miRNA sequences using a mixture model. The mixture model is a type of
   probabilistic model for density estimation based on underlying data
   with mixed distributions. The mixture model has been known to capture
   the dominant patterns in samples by component distributions
   ([53]Houseman et al., 2008; [54]McNicholas & Murphy, 2010; [55]Costa,
   Boccignone & Ferraro, 2012; [56]Melnykov & Maitra, 2010). Therefore, it
   is appropriate to use a mixture model to identify the sequence patterns
   and to group similar miRNAs from whole miRNA sequences that contain
   different bases. Given a set of miRNAs, our approach finds the miRNA
   sequence profiles in subclasses and constructs the PWMs by estimating
   the mixing probabilities. Our experimental results show the
   characteristics of the consensus sequences, with accurate modeling of
   the base distributions at each position in the PWM. The results confirm
   that our method can help to identify miRNA functions by collecting
   similar sequences, and provides an overview of the evolutionarily and
   functionally related miRNA groups for future functional analyses.

Materials & Methods

MicroRNA sequences and position weight matrix

   We collected the mature miRNA sequences from miRBase (release 14)
   ([57]Griffiths-Jones et al., 2008; [58]Kozomara & Griffiths-Jones,
   2011) and extracted the mammalian miRNAs from these. We selected for
   the analysis of the 6-mers at nucleotide positions 2–7, the seed
   regions, in the mature sequences. Then, we constructed the PWMs, which
   are scoring matrices weighted according to a specific position in the
   given seed sequences, with a computational learning approach. The PWMs
   for our experiments were initialized by randomly and repeatedly
   extracting sequences of six nucleotides from human genome sequences.

Mixture model of PWMs and the expectation–maximization (EM) algorithm

   We developed a mixture model of position weight matrices to construct
   miRNA families, and estimated the model parameters with an EM algorithm
   to maximize the likelihood of the model.

   Suppose that X = {X[1], X[2], …, X[i], …, X[N]} is a dataset of N miRNA
   sequences, and each sequence length |X[i]| is L. When the model is set
   as having k PWMs, the matrices are denoted by W = {W[1], W[2], …,
   W[N]}, which is a set of 4 × L position weight matrices derived from
   miRNA seed sequences. The probability of sequence data X is then
   represented as:
   [MATH: <mstyle
   displaystyle="true"><mi>P</mi><mo>(</mo><mi>X</mi><mo>|</mo><mi>W</mi><
   mo>)</mo><mo>=</mo><munderover><mrow><mo>∑</mo></mrow><mrow><mi>i</mi><
   mo>=</mo><mn>1</mn></mrow><mrow><mi>k</mi></mrow></munderover><msub><mr
   ow><mi>λ</mi></mrow><mrow><mi>i</mi></mrow></msub><mi>P</mi><mo>(</mo><
   mi>X</mi><mo>|</mo><msub><mrow><mi>λ</mi></mrow><mrow><mi>i</mi></mrow>
   </msub><mo>,</mo><msub><mrow><mi>W</mi></mrow><mrow><mi>i</mi></mrow></
   msub><mo>)</mo><mo>=</mo><munderover><mrow><mo>∑</mo></mrow><mrow><mi>i
   </mi><mo>=</mo><mn>1</mn></mrow><mrow><mi>k</mi></mrow></munderover><ms
   ub><mrow><mi>λ</mi></mrow><mrow><mi>i</mi></mrow></msub><munderover><mr
   ow><mo>∏</mo></mrow><mrow><mi>m</mi><mo>=</mo><mn>1</mn></mrow><mrow><m
   i>N</mi></mrow></munderover><munderover><mrow><mo>∏</mo></mrow><mrow><m
   i>v</mi><mo>=</mo><mn>1</mn></mrow><mrow><mi>L</mi></mrow></munderover>
   <msub><mrow><mi>W</mi></mrow><mrow><mi>i</mi></mrow></msub><mo>[</mo><m
   i>u</mi><mo>=</mo><msub><mrow><mi>X</mi></mrow><mrow><mi>m</mi><mi>v</m
   i></mrow></msub><mo>,</mo><mi>v</mi><mo>]</mo><mo>,</mo></mstyle>
   :MATH]

   where λ[i] is a weight value for the i-th PWM, and X[mv] is the v-th
   base symbol in the m-th miRNA sequence. W[i][u, v] is the value of
   index (u, v) in the i-th PWM, and W[i][u = X[mv], v] is the
   probabilistic value matched to the base symbol of X[mv] in the index
   (u, v) of the matrix W[i]. The sum of λs should be 1, i.e.,:
   [MATH: <mstyle
   displaystyle="true"><munderover><mrow><mo>∑</mo></mrow><mrow><mi>i</mi>
   <mo>=</mo><mn>1</mn></mrow><mrow><mi>k</mi></mrow></munderover><msub><m
   row><mi>λ</mi></mrow><mrow><mi>i</mi></mrow></msub><mo>=</mo><mn>1</mn>
   <mspace width="1em"
   class="quad"></mspace><mo>(</mo><msub><mrow><mi>λ</mi></mrow><mrow><mi>
   i</mi></mrow></msub><mo>≥</mo><mn>0</mn><mo>)</mo><mo>.</mo></mstyle>
   :MATH]

   To show assignment of the data samples to the model, we introduce a
   hidden variable z[mi] indicating the probability that an input sequence
   X[m] is represented by a PWM W[i]. Given these data, the expected
   log-likelihood to be maximized by the learning process in the
   PWM-mixture model is given by:
   [MATH: <mstyle
   displaystyle="true"><mo>log</mo><mi>L</mi><mo>(</mo><mi>W</mi><mo>,</mo
   ><mi>λ</mi><mo>|</mo><mi>X</mi><mo>)</mo><mo>=</mo><munderover><mrow><m
   o>∑</mo></mrow><mrow><mi>m</mi><mo>=</mo><mn>1</mn></mrow><mrow><mi>N</
   mi></mrow></munderover><munderover><mrow><mo>∑</mo></mrow><mrow><mi>i</
   mi><mo>=</mo><mn>1</mn></mrow><mrow><mi>k</mi></mrow></munderover><mi>E
   </mi><mo>(</mo><msub><mrow><mi>z</mi></mrow><mrow><mi>m</mi><mi>i</mi><
   /mrow></msub><mo>)</mo><mo>log</mo><mo>(</mo><mi>P</mi><mo>(</mo><msub>
   <mrow><mi>X</mi></mrow><mrow><mi>m</mi></mrow></msub><mo>|</mo><msub><m
   row><mi>W</mi></mrow><mrow><mi>i</mi></mrow></msub><mo>)</mo><msub><mro
   w><mi>λ</mi></mrow><mrow><mi>i</mi></mrow></msub><mo>)</mo><mo>,</mo></
   mstyle> :MATH]

   L(⋅) is a likelihood function in the model, and E(⋅) is an expected
   value. The parameters of the model are estimated with the EM algorithm.
   The EM algorithm searches for the maximum value of the likelihood by
   iteratively repeating the E-step and M-step until convergence is
   achieved.

   The E-step calculates the expected value of the hidden variable z[mi]
   as follows:
   [MATH: <mstyle
   displaystyle="true"><mi>E</mi><mo>(</mo><msub><mrow><mi>z</mi></mrow><m
   row><mi>m</mi><mi>i</mi></mrow></msub><mo>)</mo><mo>=</mo><mi>P</mi><mo
   >(</mo><msub><mrow><mi>z</mi></mrow><mrow><mi>m</mi><mi>i</mi></mrow></
   msub><mo>=</mo><mn>1</mn><mo>|</mo><msub><mrow><mi>X</mi></mrow><mrow><
   mi>m</mi></mrow></msub><mo>,</mo><msub><mrow><mi>W</mi></mrow><mrow><mi
   >i</mi></mrow></msub><mo>,</mo><msub><mrow><mi>λ</mi></mrow><mrow><mi>i
   </mi></mrow></msub><mo>)</mo><mo>=</mo><mfrac><mrow><mi>P</mi><mo>(</mo
   ><msub><mrow><mi>X</mi></mrow><mrow><mi>m</mi></mrow></msub><mo>|</mo><
   msub><mrow><mi>W</mi></mrow><mrow><mi>i</mi></mrow></msub><mo>)</mo><ms
   ub><mrow><mi>λ</mi></mrow><mrow><mi>i</mi></mrow></msub></mrow><mrow><m
   underover><mrow><mo>∑</mo></mrow><mrow><mi>j</mi><mo>=</mo><mn>1</mn></
   mrow><mrow><mi>k</mi></mrow></munderover><mi>P</mi><mo>(</mo><msub><mro
   w><mi>X</mi></mrow><mrow><mi>m</mi></mrow></msub><mo>|</mo><msub><mrow>
   <mi>W</mi></mrow><mrow><mi>j</mi></mrow></msub><mo>)</mo><msub><mrow><m
   i>λ</mi></mrow><mrow><mi>j</mi></mrow></msub></mrow></mfrac><mo>.</mo><
   /mstyle> :MATH]

   E(z[mi]) is the posterior probability that a miRNA seed sequence X[m]
   is fitted in the position weight matrix W[i].

   The M-step computes the parameters, W and λs, to maximize the
   log-likelihood by
   [MATH: <mstyle displaystyle="true"><mover
   accent="true"><mrow><msub><mrow><mi>λ</mi></mrow><mrow><mi>i</mi></mrow
   ></msub></mrow><mrow><mo>ˆ</mo></mrow></mover><mo>=</mo><mfrac><mrow><m
   n>1</mn></mrow><mrow><mi>N</mi></mrow></mfrac><munderover><mrow><mo>∑</
   mo></mrow><mrow><mi>m</mi><mo>=</mo><mn>1</mn></mrow><mrow><mi>N</mi></
   mrow></munderover><mi>E</mi><mo>(</mo><msub><mrow><mi>z</mi></mrow><mro
   w><mi>m</mi><mi>i</mi></mrow></msub><mo>)</mo><mo>,</mo></mstyle>
   :MATH]
   [MATH: <mstyle
   displaystyle="true"><msub><mrow><mi>W</mi></mrow><mrow><mi>i</mi></mrow
   ></msub><mo>[</mo><mover
   accent="true"><mrow><mi>u</mi></mrow><mrow><mo>ˆ</mo></mrow></mover><mo
   >,</mo><mi>v</mi><mo>]</mo><mo>=</mo><mfrac><mrow><munderover><mrow><mo
   >∑</mo></mrow><mrow><mi>m</mi><mo>=</mo><mn>1</mn></mrow><mrow><mi>N</m
   i></mrow></munderover><msub><mrow><mi>W</mi></mrow><mrow><mi>i</mi></mr
   ow></msub><mo>[</mo><mi>u</mi><mo>=</mo><msub><mrow><mi>X</mi></mrow><m
   row><mi>m</mi><mi>v</mi></mrow></msub><mo>,</mo><mi>v</mi><mo>]</mo><mi
   >E</mi><mo>(</mo><msub><mrow><mi>z</mi></mrow><mrow><mi>m</mi><mi>i</mi
   ></mrow></msub><mo>)</mo></mrow><mrow><munder><mrow><mo>∑</mo></mrow><m
   row><mi>b</mi><mo>∈</mo><mo>{</mo><mi>A</mi><mo>,</mo><mi>U</mi><mo>,</
   mo><mi>G</mi><mo>,</mo><mi>C</mi><mo>}</mo></mrow></munder><munderover>
   <mrow><mo>∑</mo></mrow><mrow><mi>m</mi><mo>=</mo><mn>1</mn></mrow><mrow
   ><mi>N</mi></mrow></munderover><msub><mrow><mi>W</mi></mrow><mrow><mi>i
   </mi></mrow></msub><mo>[</mo><mi>u</mi><mo>=</mo><mi>b</mi><mo>,</mo><m
   i>v</mi><mo>]</mo><mi>E</mi><mo>(</mo><msub><mrow><mi>z</mi></mrow><mro
   w><mi>m</mi><mi>i</mi></mrow></msub><mo>)</mo></mrow></mfrac><mo>.</mo>
   </mstyle> :MATH]

   Here,
   [MATH: <mover
   accent="true"><mrow><mo>⋅</mo></mrow><mrow><mo>ˆ</mo></mrow></mover>
   :MATH]
   is the estimate of the parameters, and N is the sample size.

   The algorithm is then run iteratively until the marginal likelihood is
   maximized. When the learning is finished, a miRNA sequence X[m] is
   assigned to the i-th group among the k groups, according to the hidden
   variable z[mi] that has the maximum value.

Score functions for validation

   To interpret and validate our grouping results, we used two approaches
   to score the miRNA sequence sets in each cluster. The first scoring
   function, the match score, was originally developed to search
   transcription factor binding sites in DNA sequences ([59]Kel et al.,
   2003). The function Score[M] is calculated as follows:
   [MATH: <mstyle displaystyle="true"><msub><mrow><mi
   mathvariant="italic">Score</mi></mrow><mrow><mi>M</mi></mrow></msub><mo
   >(</mo><msub><mrow><mi>X</mi></mrow><mrow><mi>M</mi></mrow></msub><mo>)
   </mo><mo>=</mo><mfrac><mrow><munderover><mrow><mo>∑</mo></mrow><mrow><m
   i>v</mi><mo>=</mo><mn>1</mn></mrow><mrow><mi>L</mi></mrow></munderover>
   <mi>I</mi><mo>(</mo><mi>v</mi><mo>)</mo><msub><mrow><mi>f</mi></mrow><m
   row><mi>v</mi><mo>,</mo><mi>b</mi></mrow></msub><mo>−</mo><munderover><
   mrow><mo>∑</mo></mrow><mrow><mi>v</mi><mo>=</mo><mn>1</mn></mrow><mrow>
   <mi>L</mi></mrow></munderover><mi>I</mi><mo>(</mo><mi>v</mi><mo>)</mo><
   munderover><mrow><mi>f</mi></mrow><mrow><mi>v</mi></mrow><mrow><mi
   mathvariant="normal">min</mi></mrow></munderover></mrow><mrow><munderov
   er><mrow><mo>∑</mo></mrow><mrow><mi>v</mi><mo>=</mo><mn>1</mn></mrow><m
   row><mi>L</mi></mrow></munderover><mi>I</mi><mo>(</mo><mi>v</mi><mo>)</
   mo><munderover><mrow><mi>f</mi></mrow><mrow><mi>v</mi></mrow><mrow><mi
   mathvariant="normal">max</mi></mrow></munderover><mo>−</mo><munderover>
   <mrow><mo>∑</mo></mrow><mrow><mi>v</mi><mo>=</mo><mn>1</mn></mrow><mrow
   ><mi>L</mi></mrow></munderover><mi>I</mi><mo>(</mo><mi>v</mi><mo>)</mo>
   <munderover><mrow><mi>f</mi></mrow><mrow><mi>v</mi></mrow><mrow><mi
   mathvariant="normal">min</mi></mrow></munderover></mrow></mfrac><mo>,</
   mo></mstyle> :MATH]

   where f[v,b] is the frequency of nucleotide b (b∈{A, T, G, C}) at
   position v, and
   [MATH: <msubsup><mrow><mi>f</mi></mrow><mrow><mi>v</mi></mrow><mrow><mi
   mathvariant="normal">min</mi></mrow></msubsup> :MATH]
   is the frequency of the nucleotide that is the lowest frequency at
   position v, and
   [MATH: <msubsup><mrow><mi>f</mi></mrow><mrow><mi>v</mi></mrow><mrow><mi
   mathvariant="normal">max</mi></mrow></msubsup> :MATH]
   is the highest at the position. I(v) represents the conservation of
   position v in a matrix, defined as:
   [MATH: <mstyle
   displaystyle="true"><mi>I</mi><mo>(</mo><mi>v</mi><mo>)</mo><mo>=</mo><
   munder><mrow><mo>∑</mo></mrow><mrow><mi>b</mi><mo>∈</mo><mo>{</mo><mi>A
   </mi><mo>,</mo><mi>U</mi><mo>,</mo><mi>G</mi><mo>,</mo><mi>C</mi><mo>}<
   /mo></mrow></munder><msub><mrow><mi>f</mi></mrow><mrow><mi>v</mi><mo>,<
   /mo><mi>b</mi></mrow></msub><mo>log</mo><mo>(</mo><mn>4</mn><msub><mrow
   ><mi>f</mi></mrow><mrow><mi>v</mi><mo>,</mo><mi>b</mi></mrow></msub><mo
   >)</mo><mo>.</mo></mstyle> :MATH]

   The information value I(⋅) makes that mismatches in highly conserved
   regions are suppressed, whereas mismatches in less conserved regions
   are relatively acceptable.

   The other score function is the silhouette measure, first described by
   [60]Rousseeuw (1987). The score, Score[S], can be used to determine how
   tightly grouped all the datasets in the cluster are. The score function
   is defined as:
   [MATH: <mstyle displaystyle="true"><msub><mrow><mi
   mathvariant="italic">Score</mi></mrow><mrow><mi>S</mi></mrow></msub><mo
   >(</mo><msub><mrow><mi>X</mi></mrow><mrow><mi>m</mi></mrow></msub><mo>)
   </mo><mo>=</mo><mfrac><mrow><mi>β</mi><mo>(</mo><msub><mrow><mi>X</mi><
   /mrow><mrow><mi>m</mi></mrow></msub><mo>)</mo><mo>−</mo><mi>α</mi><mo>(
   </mo><msub><mrow><mi>X</mi></mrow><mrow><mi>m</mi></mrow></msub><mo>)</
   mo></mrow><mrow><mo>max</mo><mo>{</mo><mi>α</mi><mo>(</mo><munder><mrow
   ><mi>X</mi></mrow><mrow><mi>m</mi></mrow></munder><mo>)</mo><mo>,</mo><
   mi>β</mi><mo>(</mo><msub><mrow><mi>X</mi></mrow><mrow><mi>m</mi></mrow>
   </msub><mo>)</mo><mo>}</mo></mrow></mfrac><mo>,</mo></mstyle> :MATH]

   where α(X[m]) is the average distance of X[m] to all the other
   sequences in the same cluster, and β(X[m]) is the minimum value for the
   average distance of X[m] to every other single cluster. The score value
   varies between −1 and 1. If Score[S](X[m]) is close to 1, it means that
   the sequence X[m] is well-matched to its own cluster, and dissimilar to
   the other neighboring clusters. The dissimilarity between two sequences
   is calculated as the Hamming distance.

Analysis of the functional relationships in each group

   To evaluate the biological meaning of our results, it is necessary to
   analyze the functional relationships among the miRNAs in the same
   group. We predicted the target genes of the miRNAs within each group
   using microCosm Targets release version 5. We chose for further
   analyses the target genes that were bound by more than half the miRNAs
   within each group.

   We investigated the biological process and molecular function
   categories of Gene Ontology (GO), and the entries for the KEGG pathways
   enriched with the target genes in each group, to verify the functional
   relationships among the human miRNAs assigned to the same group. These
   analyses were conducted using the DAVID Bioinformatics Resources
   ([61]Huang, Sherman & Lempicki, 2009).

   Next, we looked for the shared functions of the conserved miRNAs and
   their targets across species. We extracted information for homologues
   between human and mouse genes from the Ensembl database ([62]Flicek et
   al., 2013), and then determined whether the miRNA members in the same
   group shared the conserved targets and their biological roles.

Results

miRNA functional family construction

   We ran our algorithm independently multiple times with random starts on
   the same datasets, and with a varying number of clusters (5–100). For
   each number of clusters, we repeated the algorithm 10 times, because of
   the possible existence of many local maxima for the mixture model, and
   we selected the one that maximized the likelihood value. With the
   highest likelihood value, the chosen number of clusters was 81. All the
   grouping results are shown in [63]Table S1.

   We computed two score measures, Score[S] (silhouette score) and
   Score[M](match score), to validate our grouping results (see Methods
   section). We calculated the average values for each group, and then
   compared them with the random and hierarchical clustering results. To
   confirm that our approach identified relevant miRNA families, we also
   compared the results with the family information in miRBase
   ([64]Griffiths-Jones et al., 2008; [65]Kozomara & Griffiths-Jones,
   2011). In miRBase, the miRNA families have already been defined, and
   several of them are known to be sequentially and functionally
   conserved. We compared our family construction results with the miRBase
   family information, random groupings and hierarchical clustering
   results using the average Score[S] and Score[M] values. [66]Table 1
   shows that the Score[S] and Score[M] values for our results are high.
   The average Score[M] for our results is similar to that for the miRBase
   families, and the Score[S] is much better in our results. The match
   score, Score[M] simply evaluates how well the sequences are represented
   by the PWM in their own group, and does not explain how well the
   similar data are collected together by dividing the incompatible
   instances into several groups, because the score does not measure the
   differences with samples in other groups. Therefore, the scores for our
   results indicate that our approach can be used as a way to construct a
   sequence-based family, by assigning similar miRNA sequences to an
   identical group well, while assigning dissimilar sequences to different
   sets.

Table 1. Comparison of the average scores for all miRNA sequences.

                           Score[S] Score[M]
   Random                  −0.065   0.592
   miRBase                 0.092    0.919
   Hierarchical clustering 0.126    0.892
   Our approach            0.144    0.892
   [67]Open in a new tab

   [68]Table 2 shows that the well-known miRNA family members defined in
   miRBase are grouped together in our experiment. For example, all the
   let-7 miRNA sequences, members of one of the most well-known families,
   are collected into the same group, cluster 10. The let-7 miRNA family
   members are known to be highly conserved across species in both their
   sequences and functions, and the members play roles in tumor
   suppression, and cell differentiation, proliferation and development
   ([69]Roush & Slack, 2008). The mir-181 family is grouped into cluster
   44 in our experiment. Its members are considered to be oncogenic miRNAs
   that down-regulate the Tcl1 overexpressed in mature B-cell lymphomas,
   and Hox protein, a repressor of differentiation processes in mammals
   ([70]Pekarsky et al., 2006; [71]Naguibneva et al., 2006). The members
   of the miRNA families that show corresponding results, including let-7,
   mir-15, mir-181, mir-196, and so on, share similar sequences, and the
   sequence conservation in our groupings can also be detected with
   WebLogo ([72]Fig. S1) ([73]Crooks et al., 2004). Our results confirm
   that our approach effectively groups previously known miRNA family
   sequences together.

Table 2. Examples of the enrichments of miRBase family members in our groups.

   AC number   miRNA family ID Group ID Enrichments
   MIPF0000002 let-7           c10      86/86
   MIPF0000006 mir-15          c63      68/68
   MIPF0000007 mir-181         c44      52/52
   MIPF0000011 mir-19          c13      37/37
   MIPF0000013 mir-25          c62      42/42
   MIPF0000014 mir-9           c27      16/16
   MIPF0000022 mir-7           c35      19/19
   MIPF0000024 mir-103         c65      34/35
   MIPF0000025 mir-99          c12      38/38
   MIPF0000026 mir-218         c48      18/18
   MIPF0000027 mir-23          c67      30/31
   MIPF0000028 mir-135         c81      25/27
   MIPF0000031 mir-196         c36      27/27
   MIPF0000034 mir-130         c80      41/44
   MIPF0000042 mir-204         c25      27/27
   MIPF0000042 mir-26          c20      23/23
   MIPF0000046 mir-101         c74      20/20
   MIPF0000048 mir-128         c27      16/16
   MIPF0000050 mir-153         c14      14/14
   MIPF0000051 mir-221         c45      25/25
   MIPF0000054 mir-216         c17      22/22
   MIPF0000055 mir-194         c39      13/13
   MIPF0000058 mir-205         c33      16/16
   MIPF0000059 mir-184         c43      13/13
   MIPF0000062 mir-214         c51      17/17
   MIPF0000063 mir-192         c15      23/23
   MIPF0000066 mir-183         c66      15/15
   MIPF0000074 mir-105         c28      17/20
   [74]Open in a new tab

   Although many of our groupings are similar to previously constructed
   miRNA families as shown in [75]Table 2, several results differ. For
   example, we cannot find sufficiently similar sequence patterns in the
   sequences of miRBase MIPF0000018 family (mir-154) ([76]Fig. 1A,
   [77]Table 3). In our experiment, the miRNAs in MIPF0000018 were
   allocated to several different groups, including clusters 3, 37, and
   52, and the similarities between the sequences in our groups were shown
   more clearly. For further confirmation, we calculated the silhouette
   measure (Score[S]) and match score (Score[M]) for MIPF0000018
   ([78]Table 3). The Score[S] and Score[M] for the miRBase family were
   markedly lower than the results for our groups.

Figure 1. Sequence similarity of groups involving mir-381 in miRBase family
and in our results using WebLogo.

   [79]Figure 1
   [80]Open in a new tab

   (A) MIPF0000018 in miRBase, (B) group 52 in our results. The x-axis
   shows the position numbers of the miRNA sequences.

Table 3. Allocation of members within the miRBase family MIPF0000018 in our
results.

   Group ID    # miRNAs       Score[S] Score[M]
   MIPF0000018 142            −0.669   0.707
   c3          82 (42[81]^*)  −0.087   0.886
   c27         432 (12[82]^*) −0.366   0.773
   c37         58 (11[83]^*)  −0.146   0.835
   c52         19 (10[84]^*)  0.767    0.990
   c71         72 (12[85]^*)  0.069    0.894
   c75         79 (11[86]^*)  −0.215   0.876
   [87]Open in a new tab

   Notes.
   ^*

   is the number of miRNAs in miRBase family MIPF0000018.

   In more detail, the miR-381 sequences are included in the MIPF0000018
   family with the mir-154 miRNAs. However, in our analysis, the mir-381
   sequences are grouped with the mir-466-3p sequences in cluster 52. In
   this group, the average Score[S] is 0.767 and the average Score[M] is
   0.990, which means that the sequences in the group are highly
   conserved. These scores show that the mir-381 sequences are
   sufficiently similar to those mir-466-3p, as expected. The sequence
   similarities are clearly shown using WebLogo in [88]Fig. 1B. The first
   five bases of the seed sequences are identical, AUACA in the sequences.
   From this analysis, it might be supposed that mir-381 was inherited
   from the same ancestral miRNA gene as mir-466-3p, but not the same as
   that of mir-154. The inference is reasonable because the mature form of
   the mir-381 miRNAs is also a 3′-donor sequence in their secondary
   structure, like mir-466-3p. The property that the miRNA members within
   one miRBase family are split into several groups in our results is also
   found in other miRNAs including mir-188 and mir-506 ([89]Table S1).
   Conversely, MIPF0000013 (mir-25) and MIPF0000069 (mir-32) in miRBase
   are merged into one group, cluster 62, in our results. Actually, the
   seed sequences of mir-32 are completely identical to those of mir-25,
   AUUGCA. Our grouping of cluster 62 is strongly supported by the
   observation that mir-32 and mir-25 have similar roles. For instance,
   these miRNAs can lead to cancer by inhibiting apoptosis, because mir-32
   suppresses the expression of Bim protein, a pro-apoptotic factor, as
   mir-25 does in cancer cells ([90]Petrocca et al., 2008; [91]Ambs et
   al., 2008). Therefore, we established that our method might identify
   functionally related families.

Functional analysis for the constructed family

   To verify the biological functions of our miRNA groups more
   comprehensively, we conducted the Gene Ontology (GO) and KEGG pathway
   enrichment analysis of the human target genes of the miRNAs within each
   group. The functions of miRNAs are strongly related to the biological
   roles of their target genes, because the miRNAs recognize the target
   mRNAs and inhibit their expression. We used the target information
   produced in microCosm Target version 5. The enrichment results for the
   target genes in each group are assessed in [92]Tables S2 and [93]S3. We
   show the most significant part of the results of the GO enrichment
   analysis using biological process terms in [94]Table 4, and for the
   KEGG pathways in [95]Table 5. In most of the groups, the target genes
   are significantly involved in several GO categories. Similarly, we
   checked the biological relationships among the miRNAs in our groups by
   the KEGG pathway enrichment analysis, and found frequently and
   statistically enriched pathways based on the target genes of the
   members in each group. These results for the GO annotation and KEGG
   pathway enrichment analyses are consistent with the biological
   functions of the miRNAs previously reported in the literature. For
   example, the target genes of cluster 63 were related to the cell cycle
   in our enrichment analyses. In fact, it is already known that the
   mir-16 gene family in cluster 63 regulates cell-cycle progression
   ([96]Linsley et al., 2007; [97]Xia et al., 2009).

Table 4. GO Biological Process enrichment of the target genes of the miRNAs
in each group.

   Group ID GO accession GO term p-value
   c64 GO:0009987 Cellular process 4.75E−06
   c52 GO:0016070 RNA metabolic process 5.84E−05
   c64 GO:0006139 Nucleobase, nucleoside, nucleotide and
   nucleic acid metabolic process 6.09E−05
   c52 GO:0016071 mRNA metabolic process 8.88E−05
   c55 GO:0046489 Phosphoinositide biosynthetic process 9.26E−05
   c56 GO:0006281 DNA repair 9.29E−05
   c52 GO:0006396 RNA processing 9.86E−05
   c52 GO:0006397 mRNA processing 1.11E−04
   c64 GO:0034641 Cellular nitrogen compound metabolic process 1.36E−04
   c52 GO:0000087 M phase of mitotic cell cycle 1.56E−04
   [98]Open in a new tab

Table 5. KEGG pathway enrichment of the target genes of the miRNAs in each
group.

Group ID Entry    Pathway name                                           p-value
c52      hsa04630 Jak-STAT signaling pathway                             0.0022
c64      hsa00230 Purine metabolism                                      0.0026
c52      hsa04060 Cytokine-cytokine receptor interaction                 0.0029
c55      hsa00563 Glycosylphosphatidylinositol (GPI)-anchor biosynthesis
   0.0034
c6       hsa00340 Histidine metabolism                                   0.0051
   [99]Open in a new tab

   Finally, we checked the conserved target information for the human and
   mouse miRNAs categorized into the same group. [100]Figure 2A shows an
   example of the conserved target analyses in the two different species.
   The human and mouse miRNAs within cluster 63 share orthologous target
   genes. These analyses help to clarify the functions of the miRNAs in
   each group because the target relationships are conserved across most
   species. Using our miRNA grouping results, we can speculate about their
   functions based on our knowledge of other miRNAs, with already-known
   functions, in the same group. As an example, it has previously been
   shown that the expression of Bcl2 protein is negatively regulated by
   miR-15 and miR-16, which are assigned to the same group, cluster 63
   ([101]Cimmino et al., 2005). Because the Bcl2 protein inhibits cell
   apoptosis, its overexpression leads to leukemias or other cancers
   ([102]Cory & Adams, 2002). In our experiment, there were many other
   miRNAs within the cluster 63, such as mir-322, mir-424, mir-497, and so
   on, grouped with the miR-15 family members, whose functions are not yet
   clearly known. From our results, we can infer that these miRNAs may
   induce cancers by a similar mechanism to that of miR-15 and miR-16
   ([103]Fig. 2B). In practice, microCosm predicts that mir-503 binds to
   the Bcl2 gene transcript. Therefore, our grouping analyses, together
   with the conservation information, provide clues to the biological
   effects of functionally unknown miRNAs. Furthermore, the example
   demonstrates that, as well as identifying the miRNA functions, our
   approach can help to discover miRNA-mRNA modules in the complex gene
   regulatory networks and to understand the combinatorial effects of
   miRNAs in cellular processes.

Figure 2. Conserved targets and functional relationships of miRNAs in cluster
63.

   [104]Figure 2
   [105]Open in a new tab

   (A) The miRNA family members in cluster 63 target orthologous genes
   across species, and this observation implies that functions of the
   family members are similar. (B) Highly expressed mir-15 and mir-16
   induce cell death by targeting the Bcl2 gene, a repressor of apoptosis,
   but the misregulation of these miRNAs causes cancer. The other miRNAs
   in cluster 63 might be supposed to have similar regulatory functions.

Discussion

   Understanding gene regulation is still challenging, and action of
   miRNAs, in particular, may cause the processes to be even more complex
   and harder to interpret. Much research had been directed towards
   understanding the gene regulation by miRNAs and identifying their
   functions, but there have been not many studies that comprehensively
   and systematically examine conserved information across various species
   although it is known that miRNA genes are evolutionarily conserved
   ([106]Shi, Gao & Wang, 2012; [107]Berezikov, 2011; [108]Borenstein &
   Ruppin, 2006; [109]Li et al., 2010). Furthermore, computational target
   prediction methods have mainly focused on one-to-one interactions, and
   the experimental identification and validation of miRNA functions
   remain time consuming and technically limited. In this study, we have
   undertaken a fundamental task of the miRNA research in identifying
   evolutionarily conserved miRNAs, to extend the functional studies of
   miRNAs.

   The previous classification of miRNA family in miRBase was based on the
   Rfam database ([110]Griffiths-Jones et al., 2008; [111]Kozomara &
   Griffiths-Jones, 2011; [112]Gardner et al., 2009; [113]Burge et al.,
   2013). However, it might need to use another definition for functional
   analysis or target prediction. For example, [114]Friedman et al. (2009)
   defined 87 miRNA families and the definition was used for target
   prediction. Also, there has been lots of other research to precisely
   detect miRNA families using evolutionary information
   ([115]Guerra-Assunção & Enright, 2012; [116]Gerlach et al., 2009). We
   constructed miRNA functional families from miRNA seed sequences based
   on a mixture model of position weight matrices. Our method starts with
   random PWMs extracted from human genome sequences. Through iterative
   learning to maximize the likelihood value using the EM algorithm, this
   method assigns the miRNA sequences to each group and builds PWMs that
   represent the corresponding conserved sequences in each group, by
   adjusting the parameters in the mixture model. We have presented
   results for all the mammalian miRNAs, demonstrating that our approach
   constructs biologically relevant miRNA families, using score measures
   and functional analyses of their target genes. We have also shown that
   these results can facilitate the identification of conserved and
   biologically related subsets or modules of miRNAs and mRNAs by
   analyzing the conserved target information in our groups.

   Usually, a model selection criterion such as Bayesian information
   criterion (BIC), is adapted to reduce model complexity when using the
   maximum likelihood estimation ([117]Schwarz, 1978; [118]Jones, 2011).
   However, the scheme could not be applied in our experiment. BIC always
   penalizes multiple clusters since our approach has a relatively huge
   number of parameters unlike other general models. Moreover, it is
   difficult to interpret biologically.

   Previous works have assumed that each miRNA sequence is contained in
   only one miRNA family. Unfortunately, it is not possible to clearly
   know what the true ancestor of each miRNA is and how the sequence has
   evolved. The imprecise assumptions may limit the study of branched
   miRNAs. However, our approach has the potential to overcome these
   restrictions because it assigns the miRNAs to families with a
   probabilistic value. Although we selected a group of miRNAs by choosing
   the one with the maximum probabilistic value among hidden variables in
   our experiment, it is feasible to accommodate overlapping occurrences
   of miRNAs by modifying the group assignment scheme and to flexibly
   assign a sequence to various number of clusters. Moreover, by
   diversifying the number of clusters, it may be likely to find more
   broadly conserved miRNA groups or more specialized families.

   Many miRNAs remain to be identified, and it is not easy to identify the
   conserved sequence patterns of several miRNAs when the number of
   corresponding family members is limited. However, with the increasing
   availability of miRNA sequences, our method can substantially improve
   the grouping properties with greater precision. Furthermore, the PWMs
   generated by our method might also be used to search for novel miRNAs
   in the genome, as in the identification of transcription factor binding
   sites.

   In conclusion, because a large fraction of protein-coding genes is
   regulated by miRNAs, the systematic and comprehensive search for
   conserved miRNAs may be a useful way to understand gene regulatory
   processes and to elucidate the biological functions of miRNAs. Our
   method should provide a basis for the functional annotation of miRNAs
   and fundamental insight into the widespread impact of miRNAs.

Supplemental Information

   Figure S1. Sequence similarities represented by Weblogo in (A) c10, (B)
   c63, (C) c44 and (D) c36.
   [119]Click here for additional data file.^ (72.3KB, png)
   DOI: 10.7717/peerj.199/supp-1
   Table S1. microRNA lists in each group.
   [120]Click here for additional data file.^ (166.7KB, pdf)
   DOI: 10.7717/peerj.199/supp-2
   Table S2. GO (Molecular Function) enrichment analyses for target genes
   in each group.
   [121]Click here for additional data file.^ (226KB, pdf)
   DOI: 10.7717/peerj.199/supp-3
   Table S3. GO (Biological Process) enrichment analyses for target genes
   in each group.
   [122]Click here for additional data file.^ (593.7KB, pdf)
   DOI: 10.7717/peerj.199/supp-4

Funding Statement

   This work was supported by the National Research Foundation (NRF) grant
   funded in part by the Korean government (MISP) (2010-0017734,
   2013M3B5A2035921, 2012M3A9D1054622). S-YS was supported by the Basic
   Science Research Program (2012R1A1A2002804) through NRF funded by the
   MISP. The funders had no role in study design, data collection and
   analysis, decision to publish, or preparation of the manuscript.

Additional Information and Declarations

Competing Interests

   The authors declare they have no competing interests.

Author Contributions

   Je-Keun Rhee conceived and designed the experiments, performed the
   experiments, analyzed the data, contributed reagents/materials/analysis
   tools, wrote the paper.

   Soo-Yong Shin analyzed the data, wrote the paper.

   Byoung-Tak Zhang conceived and designed the experiments, wrote the
   paper.

References