Abstract

Background

   MicroRNA regulation is fundamentally responsible for fine-tuning the
   whole gene network in human and has been implicated in most
   physiological and pathological conditions. Studying regulatory impact
   of microRNA on various cellular and disease processes has resulted in
   numerous computational tools that investigate microRNA-mRNA
   interactions through the prediction of static binding site highly
   dependent on sequence pairing. However, what hindered the practical use
   of such target prediction is the interplay between competing and
   cooperative microRNA binding that complicates the whole regulatory
   process exceptionally.

Results

   We developed a new method for improved microRNA target prediction based
   on Dirichlet Process Gaussian Mixture Model (DPGMM) using a large
   collection of molecular features associated with microRNA, mRNA, and
   the interaction sites. Multiple validations based on microRNA-mRNA
   interactions reported in recent large-scale sequencing analyses and a
   screening test on the entire human transcriptome show that our model
   outperformed several state-of-the-art tools in terms of promising
   predictive power on binding sites specific to transcript isoforms with
   reduced false positive prediction. Last, we illustrated the use of
   predicted targets in constructing conditional microRNA-mediated gene
   regulation networks in human cancer.

Conclusion

   The probability-based binding site prediction provides not only a
   useful tool for differentiating microRNA targets according to the
   estimated binding potential but also a capability highly important for
   exploring dynamic regulation where binding competition is involved.

Electronic supplementary material

   The online version of this article (10.1186/s12864-018-5029-7) contains
   supplementary material, which is available to authorized users.

   Keywords: MicroRNA, MicroRNA target prediction, Dirichlet process
   Gaussian mixture, Machine learning, Bayesian inference, Dynamic
   microRNA regulation

Background

   MicroRNAs (miRNAs) are important post-transcriptional gene regulators
   that silence messenger RNA (mRNA) targets via mRNA degradation or
   translational repression [[33]1, [34]2]. They hybridize with
   complementary sequences in the 3′-untranslated regions of mRNA,
   particularly in the “seed region” (2nd-8th bases on the 5′ end), for
   their binding [[35]3]. In RNA-induced silencing complex, both miRNA and
   mRNA are degraded if the miRNA nucleotide sequence has a high degree of
   complementarity to the sequence in the mRNA target [[36]4, [37]5];
   otherwise, the binding of miRNA to mRNA will halt mRNA translation
   without causing degradation [[38]5, [39]6]. The large-scale miRNA-mRNA
   interactions detected by sequencing analyses has shown various
   interaction patterns, e.g., many interactions happen via complementary
   sequences in discontinuous regions other than seed region [[40]7],
   indicating different regulatory mechanisms. In addition, compelling
   evidence reveals the dynamic nature of miRNA-mRNA interaction that
   multiple miRNAs can bind to the same mRNA sequence or different copies
   of the same transcript -- cooperative interactions [[41]8], while
   multiple different mRNAs, possibly also with other types of RNA, e.g.,
   long non-coding RNAs and circular RNAs [[42]9], can compete for binding
   to the same miRNA -- competitive interactions [[43]10]. Furthermore,
   other factors, such as genetic mutations [[44]11–[45]14], competition
   with RNA binding proteins [[46]15, [47]16], and conditional expression
   of miRNA and mRNA can also affect the status of miRNA-mRNA
   interactions. All listed above indicate the complexity underlying
   miRNA-mediated gene regulation.

   The molecular mechanisms have been partially clarified by extensive
   studies focusing on miRNA biogenesis and function [[48]17, [49]18],
   which also show the participation of miRNAs in virtually every aspect
   of cellular activities starting with differentiation and development of
   cells. MiRNAs affect normal functioning of the cell including
   metabolism, proliferation and apoptotic cell death as well as
   malfunctions such as viral infection, and tumorigenesis
   [[50]19–[51]24]. In humans, it is estimated that 2500+ miRNAs can
   regulate over 60% of human genes [[52]25]. Research interest in miRNA
   regulation has been dramatically increasing, resulting in numerous
   computational tools such as TargetScan [[53]26], miRDB [[54]27],
   miRanda [[55]28], and mirSVR [[56]29]. The miRNA targets predicted by
   these tools can be used to indirectly infer miRNA function, e.g.,
   through pathway enrichment analysis [[57]23]. However, the complexity
   of miRNA-mRNA binding, especially the cooperative and competitive
   binding modes observed with miRNAs complicates the target prediction
   task. Most methods that focus primarily on finding complementary
   sequences in the seed region fail to address this complexity.

   Despite the challenges faced in computational prediction, novel
   sequencing techniques has facilitated experimental discovery of a large
   number of miRNA-mRNA interactions. For example, the crosslinking,
   ligation, and sequencing of hybrids (CLASH) analysis has identified
   18,514 miRNA-mRNA interactions where 60% of the interactions were
   associated with seed region [[58]7]. Further, the coding regions of
   mRNAs were shown to house ~ 60% of the binding sites. The existing
   algorithms are designed on the assumption of 3’-UTR centric binding,
   new algorithms will need to revise this assumption. Another study using
   covalent ligation of endogenous Argonaute–bound RNAs (CLEAR)-CLIP in
   human hepatoma cells corroborated the above results: ~ 26% of the
   interactions were associated with seed region and ~ 57% are non-3’UTR
   interactions [[59]30].

   In this study, we designed a new computational method for miRNA target
   prediction using Dirichlet Process Gaussian Mixture Model (DPGMM)
   [[60]31], with integration of the large-scale sequencing-detected
   interactions. The main aim is to infer interactions along with
   indicated confidence. Given the large number of interaction patterns
   miRNAs and mRNAs can have (to-be-discovered) and the uncertainty about
   source of similarity, clustering is the tool of choice to group similar
   interactions with respect to either the miRNA or mRNA involved. In
   clustering tasks where the number of clusters are not known ahead of
   observing the data, the non-parametric Bayesian method DPGMM is
   commonly used [[61]32–[62]36]. DPGMM also has advantages in
   accommodating clusters with various sizes and structures, free
   specification of the number of clusters, easy computation, and
   interpretability [[63]20], compared to other multi-class learning
   systems, such as SVM, K-means, and GMM clustering. To accomplish, we
   first considered a large number of molecular features related to
   miRNA-mRNA binding sites including sequence pairing [[64]26],
   evolutionary conservation [[65]37], free energy of the miRNA-mRNA
   heteroduplex [[66]38], target site accessibility [[67]39], and the
   flanking sequence of the target site on mRNA. A few novel features
   possibly associated with binding efficacy were also considered, such as
   AU-rich nucleotide composition near the binding site, proximity to
   sites for co-expressed miRNAs (possibly associated with cooperative
   action), proximity to residues pairing at miRNA nucleotides 13–16,
   positioning away from the center of long UTRs [[68]2, [69]40]. In
   addition, several statistics related to binding site were also
   assessed, e.g., the number of complementary pairs within seed region
   and/or within the whole biding site. Based on these heterogeneous
   features, a feature vector will be constructed for each given
   miRNA-mRNA interaction, as input to the model.

   For each candidate interaction, the new system can output an assignment
   score as posterior probability for each of the clusters. By assessing
   all interactions from the same cluster, one can explore new insights in
   interactive patterns reflected by each identified cluster. In addition,
   based on the sequence information of experimentally-detected
   interactions and aforementioned distinguishing features, this system
   will allow one to assess if one miRNA can bind to a specific splicing
   transcript, a very unique feature highly useful in practice when
   compared to gene-level prediction offered by existing tools. At last,
   we demonstrated in a breast cancer case study the application of
   predicted target information to infer conditional miRNA-gene
   interaction through modeling dynamic gene regulation while considering
   multiple other gene expression regulation mechanisms such as
   transcription factor and copy number variation (CNV).

Methods

Data preparation and feature generation

   Table [70]1 summarizes the miRNA-mRNA interactions used in this study.
   Experimentally identified interactions collected from public databases
   and interactions reported in sequencing analyses constitute the
   training and validation data [[71]41]. For each interaction, an initial
   set of 2059 features were generated. Besides general structure and
   sequence features reported in the literature [[72]2, [73]40] [[74]3,
   [75]26], we explored new features such as the length of an interaction
   and the flanking sequence of the binding site on mRNA. We included the
   frequencies of all possible k-nucleotide combinations (k = 1,…,4) on
   both miRNA and mRNA sequence involved in an interaction. RNAFold was
   used to calculate the minimum free energy, secondary structure, and
   open degree of each binding site [[76]3]. A summary of all features
   compiled in this study is given in Fig. [77]1A, with details provided
   in Additional file [78]1.

Table 1.

   Datasets applied in this study
   Datasets Content
   CLASH data [[79]7] 17,436 interactions on Human kidney cell (HEK293),
   associated with Ago1
   iPAR-CLIP data [[80]8] 10,566 interactions on HEK293, human embryonic
   stem cell, EBV-infected lymphoblastoid cell lines, and primary effusion
   lymphoma cell line, associated with Ago1 and Ago2
   CLEAR-CLIP data [[81]30] 32,711 interactions on Human hepatoma cell,
   associated with Ago
   mirTarbase [[82]20] 11,002 interactions on Human genome, predicted by
   miRanda;
   483 validated interactions by non-sequencing analysis
   RefSeq [[83]42] 56,000 human transcripts
   [84]Open in a new tab

Fig. 1.

   [85]Fig. 1
   [86]Open in a new tab

   Interaction features used in the model. (a) Breakdown of features
   according to categories. (b) Recursive feature elimination process. Top
   panel shows the performance of the feature elimination process on the
   initial set of 2059 features. Bottom panel illustrates the
   distributions of four discriminative features in the positive and
   negative datasets

Classification using mixture model

   Positive training data for the classifier consisted of experimentally
   identified miRNA-mRNA interactions. Negative training data was
   generated to represent miRNA-mRNA pairs that don’t interact. We
   summarize the whole procedures as follows.
    1. Negative Data Generation

   Negative sample generation was done by sliding a k-mer window (k = 22)
   across all known mRNA sequences. The commonly-used negative interaction
   set consisting of a small number of randomly generated interactions is
   often biased and not sufficient to represent the entire negative space.
   To address this problem, we generated a 4-tiered negative set with each
   tier corresponding to a different level of negative potential. Higher
   levels of confidence are captured in higher tiers (Table [87]2). For
   instance, level 1 data includes randomly-generated false binding sites
   among reported miRNA and mRNA pairs, whereas level 4 data represents
   interactions randomly synthesized from unreported miRNAs and mRNAs.
   There are infinitely many possible negative interactions between all
   human miRNAs and mRNAs; in order to keep the size of the negative set
   comparable with the positive set, we selected ~ 8000 interactions from
   each of the four categories randomly as representative interactions of
   that category.
     * 2)
       Building the Classifier

Table 2.

   The positive and negative datasets
   Dataset Statistics
   miRNA/mRNA interaction
   Pos-1: Interactions reported in CLASH data 399
   7000 17,436
   Pos-2: Interactions reported in iPAR-CLIP data 291
   4043 10,567
   Neg-1: Interactions generated on reported miRNA and mRNA pairs 755
   9179 8768
   Neg-2: Interactions generated on reported miRNAs and unreported mRNAs
   755
   20,516 8768
   Neg-3: Interactions generated on unreported miRNAs and reported mRNAs
   1833
   9179 7332
   Neg-4: Interactions generated on unreported miRNAs and unreported mRNAs
   1833
   20,516 7332
   [88]Open in a new tab

   In order to discover unknown binding patterns and properties, a
   classifier was first trained to differentiate positive interactions
   from negative. The rationale is as follows. After feature generation,
   for a given miRNA r and a set of binding sites I related to r, the ith
   binding site is represented with an m-dimensional feature vector x[i]
   based on a set of m features (m = 2059 in the initial analysis).
   Feature vectors corresponding to all the n sites in I are represented
   by {x[1], …, x[i], …, x[n]}. Let n[c] be the number of clusters
   observed in miRNA-mRNA interactions and z[i](i = 1, …, n[c]) represent
   the cluster membership of the interaction x[i]. We then apply Dirichlet
   Process Gaussian Mixture Model (DPGMM) to obtain interaction clusters.
   The Dirichlet Process (DP) is the prior distribution for the mixture
   model specifying a distribution of probability distributions. In this
   setting, these distributions specify the parameters of miRNA-mRNA
   interaction clusters. The parameters of DP are a base distribution
   G[0] and a positive concentration parameter α. Base distribution
   G[0] is the expected value of the process while α determines the
   dispersion of the distributions around G[0].  A small α results in
   distributions that are concentrated around G[0].  As α increases, the
   dispersion of distributions increase.

   In general, in a Gaussian mixture model with K components, the
   likelihood of data is:
   [MATH: <mi>p</mi><mfenced close=")" open="(" separators="|"><mi
   mathvariant="bold-italic">x</mi><mrow><msub><mi>θ</mi><mn>1</mn></msub>
   <mo>,</mo><mo>…</mo><mo>,</mo><msub><mi>θ</mi><mi>K</mi></msub></mrow><
   /mfenced><mo>=</mo><munderover><mo>∑</mo><mrow><mi>j</mi><mo>=</mo><mn>
   1</mn></mrow><mi>K</mi></munderover><msub><mi>π</mi><mi>j</mi></msub><m
   i mathvariant="script">N</mi><mfenced close=")" open="("
   separators="|,"><mi
   mathvariant="bold-italic">x</mi><msub><mi>μ</mi><mi>j</mi></msub><msub>
   <mi>S</mi><mi>j</mi></msub></mfenced> :MATH]

   where π denotes the mixing proportions and θ[j] = {π[j], μ[j], S[j]} is
   the set of parameters; proportion, mean and precision, of a component
   in the mixture. (μ[j], S[j]) are drawn from a distribution G that is in
   turn drawn from a DP(α, G[0]). Fixed covariance and a conjugate
   [MATH: <mi mathvariant="script">N</mi><mfenced close=")" open="("
   separators=","><mn>0</mn><mn>1</mn></mfenced> :MATH]
   prior on the component means were used. The optimum value for α was
   experimentally determined to be 10 (results not shown).

   In the clustering setting, we let z[i] represent the cluster number for
   observation x[i], 0 ≤ z[i] ≤ n[c], the prior on cluster assignments is
   [MATH: <mi>p</mi><mfenced close=")"
   open="("><mrow><msub><mi>z</mi><mi>i</mi></msub><mo>=</mo><mi>j</mi></m
   row></mfenced><mo>=</mo><mfrac><msub><mi>n</mi><mi>j</mi></msub><mrow><
   mi mathvariant="normal">α</mi><mo>+</mo><mi
   mathvariant="normal">n</mi><mo>−</mo><mn>1</mn></mrow></mfrac> :MATH]

   for an existing cluster j and
   [MATH: <mi>p</mi><mfenced close=")"
   open="("><mrow><msub><mi>z</mi><mi>i</mi></msub><mo>=</mo><mi>K</mi><mo
   >+</mo><mn>1</mn></mrow></mfenced><mo>=</mo><mfrac><mi
   mathvariant="normal">α</mi><mrow><mi
   mathvariant="normal">α</mi><mo>+</mo><mi
   mathvariant="normal">n</mi><mo>−</mo><mn>1</mn></mrow></mfrac> :MATH]

   for a new cluster. Here, n is the total number of data points and n[j]
   is the number of data points in cluster j.

   Cluster assignments are done using the normalized log posteriors of
   clusters. The log posterior is:
   [MATH: <mo>log</mo><mfenced close=")"
   open="("><msub><mi>n</mi><mi>j</mi></msub></mfenced><mo>+</mo><mo>log</
   mo><mspace width="0.25em"></mspace><mfrac><mrow><mo>exp</mo><mfenced
   close=")"
   open="("><mrow><mo>−</mo><mfrac><mn>1</mn><mn>2</mn></mfrac><msup><mfen
   ced close=")" open="("><mrow><mi mathvariant="italic">x</mi><mo
   mathvariant="bold">−</mo><mi>μ</mi></mrow></mfenced><mi>T</mi></msup><m
   sup><mi
   mathvariant="normal">Σ</mi><mrow><mo>−</mo><mn>1</mn></mrow></msup><mfe
   nced close=")" open="("><mrow><mi
   mathvariant="italic">x</mi><mo>−</mo><mi>μ</mi></mrow></mfenced></mrow>
   </mfenced></mrow><msqrt><mfenced close="|"
   open="|"><mrow><mn>2</mn><mi>π</mi><mi
   mathvariant="normal">Σ</mi></mrow></mfenced></msqrt></mfrac><mspace
   width="0.25em"></mspace> :MATH]

   for existing clusters and
   [MATH: <mo>log</mo><mfenced close=")" open="("><mi
   mathvariant="normal">α</mi></mfenced><mo>+</mo><mo>log</mo><mfrac><mrow
   ><mo>exp</mo><mfenced close=")"
   open="("><mrow><mo>−</mo><mfrac><mn>1</mn><mn>2</mn></mfrac><msup><mfen
   ced close=")" open="("><mrow><mi
   mathvariant="italic">x</mi><mo>−</mo><mi>μ</mi></mrow></mfenced><mi>T</
   mi></msup><msup><mi
   mathvariant="normal">Σ</mi><mrow><mo>−</mo><mn>1</mn></mrow></msup><mfe
   nced close=")" open="("><mrow><mi
   mathvariant="italic">x</mi><mo>−</mo><mi>μ</mi></mrow></mfenced></mrow>
   </mfenced></mrow><msqrt><mfenced close="|"
   open="|"><mrow><mn>2</mn><mi>π</mi><mi
   mathvariant="normal">Σ</mi></mrow></mfenced></msqrt></mfrac> :MATH]

   for new clusters, where x is the feature vector, μ is the cluster mean
   vector and 힢 is the covariance matrix for the cluster. Normalization
   factor is the sum of the log posteriors. A new observation is assigned
   to the cluster with the highest normalized posterior probability. Ties
   are broken randomly during cluster assignment.

   We trained the DPGMM model on the training set consisting of positive
   and negative interactions until the system converged. 25% of the total
   dataset was kept out of training to be used as test data. Each
   interaction was assigned to the cluster which has the highest posterior
   probability of assignment. This model provides the flexibility of
   assessing the accuracy of the clustering at different levels: 1)
   whether a new interaction can be correctly assigned to a positive or
   negative cluster, or 2) whether a new interaction is assigned to a
   cluster that contains the participating miRNA. Several metrics have
   been used to evaluate the performance including the sensitivity,
   specificity, accuracy and Matthews correlation coefficient (MCC).
   Optimization of the model was done by a grid search of parameters over
   a large range. The main parameter of the DPGMM model is the α
   parameter, for which the values 10, 30, 60, 90 and 100 were considered.
   The model was trained for different number of iterations, namely 10,
   30, 60, 90 and 100. Bayesian Information Criteria (BIC) [[89]42] were
   obtained for each possible combination of the above values of the
   parameters. The combination that resulted in the lowest BIC were used
   in the final model.
     * 3)
       Feature Selection

   A feature elimination analysis was performed on the initial set of 2059
   features to remove unrelated and noisy features and search for the
   minimal set of relevant features that optimize classification
   performance. We performed the first filtering step based on t-test on
   each feature between positive and negative data sets where 44
   non-discriminative features with p-value > 0.05 were removed from our
   initial feature list. At this elimination step, multiple hypothesis
   correction methods were not applied. Next, an independent logistic
   regression classifier was built for each of the remaining features.
   Each feature was associated with the Area Under the Curve (AUC) value
   resulting from its corresponding logistic regression classifier and
   features were ranked according to their AUC values. Then we followed a
   recursive feature elimination (RFE) procedure to fruther remove
   features irrelevant or negligible to our classification goal. The
   procedure is as follows: a DPGMM classifier was built using the
   remaining features after the first elimination step and the 5-fold
   cross validation accuracy of the model was recorded. Next, we removed
   the feature with the lowest AUC value and performed DPGMM
   classification again. We recursively removing the least important
   feature and performing DPGMM classification until a minimal set of
   features, without losing the classification performance, is obtained.
   The remaining features were used for the final model.
     * 4)
       Cascade Model for MiRNA Specific Clusters

   Our model offers different levels of clustering results. Each
   successive level is more specialized than the previous level
   (Fig. [90]2). After the initial clustering, some clusters are expected
   to contain exclusively negative interactions, while others may contain
   both negative and positive interactions. Also, the clusters contain
   different types of mRNA or miRNAs. For those clusters that were
   heterogeneous and had more than 30 examples, we continued clustering
   the examples into sub-clusters with additional DPGMM rounds. Stopping
   criteria for clustering were: 1) homogeneity of miRNA type, 2)
   homogeneity of mRNA type and 3) size of clusters (Fig. [91]2). In case
   of termination due to criterion 3, the sub-clusters were excluded from
   the final model. We construceted a clustering tree using this recursive
   procedure. The root level clusters are the results of the initial
   clustering, and the leaf level clusters are the result of the last
   clustering. `represent the results of the last DPGMM clustering round,
   we expect to see specific clusters that are homogeneous with respect to
   a miRNA or mRNA.

Fig. 2.

   [92]Fig. 2
   [93]Open in a new tab

   Cascade DPGMM model. (a) Flowchart describing training of the model.
   (b) Illustrative example of cascade DPGMM. Rectangles represent mixed
   clusters (M), circles represent homogeneous clusters (H). Clustering
   goes on until homogeneous clusters are obtained

   The root level clusters can be used to evaluate if a new interaction
   object can be correctly predicted as a real interaction or not while
   the lower layers can help us to evaluate if the object can be assigned
   to its own cluster. A certain number of mixed clusters are expected to
   remain in the model corresponding to different miRNAs that follow
   similar binding patterns.

Model evaluation and genome-scale target screening

   In this study, CLASH and iPAR-CLIP data were used to train and test the
   model (75% of the interactions were used as training set and 25% as
   testing set). The final model was decided based on performance using
   the remaining features after feature elimination. For independent
   evaluation, duplicated interactions in CLEAR-CLIP data and mirTarBase
   data were removed and the remaining were used.

   Next, the RefSeq data [[94]43] were used to screen the whole human
   transcriptome for possible binding interactions with miRNAs. To
   determine if a position on mRNA is a candidate site for a given miRNA,
   we locally aligned the sequences of the candidate site and the miRNA.
   An extension of 30 nucleotides on both ends were considered while
   computing features for a given candidate binding site. A candidate
   binding site and a miRNA pair were tested for a possible interaction by
   assigning to a cluster within the clustering tree model. Leaf level
   clusters were considered as final cluster assignments.

   An interaction confidence metric (IC) was used to rank the predicted
   interactions according to how reliable they are. IC for a candidate
   interaction i and the dominant miRNA k in i’s assigned cluster is
   IC[ik] = z[i] * c[k], where z[i] indicates normalized assignment
   probability and c[k] is the proportion of miRNA k in the cluster.

   Last, to demonstrate the use of predicted interactions under the
   context of studying dynamic gene regulation, we have performed a case
   study of inferring conditional miRNA regulation associated with cancer
   progression. Based on a set of genomic data on breast cancer from The
   Cancer Genome Atlas (TCGA) [[95]44] including miRNA and mRNA
   expressions, CNVs, and DNA methylation profiles and a meta-Lasso
   regression model our group has been recently developed [[96]45], miRNA
   regulation associated with different cancer stages were detected.
   Previously, the regression model evaluates the likelihood of
   interactions between a pair of miRNA and mRNA based on a regulatory
   score (RS) which is calculated through aggregation of binding
   probability and binding affinity [[97]45]. In this study, we replace
   the RS score by the DPGMM-derived posterior probability, the highest
   assignment score. Along with other factors such as Transcription factor
   (TF)-gene regulatory potential, Lasso regression was utilized to
   identify the TFs and miRNAs that regulate a specific gene under a given
   condition.

Results

   An initial training of the model using the whole set of 2059 features
   and 75% of the CLASH and iPAR-CLIP data resulted in 34 positive and 38
   negative clusters. The model was tested on the test data (the remaining
   25% of CLASH and iPAR-CLIP) which resulted in 82.0% overall accuracy.
   Table [98]3 shows the promising validation results on several
   independent datasets.

Table 3.

   Prediction performance on the training, testing, and independent
   datasets
   Dataset Performance
   Sensitivity Specificity Accuracy MCC
   Training
   75% of CLASH, iPAR, and negative 0.78 0.86 0.82 0.64
   Testing
   25% of CLASH, iPAR, and negative 0.77 0.86 0.82 0.64
   Validation-1 CLEAR-CLIP 0.80 – 0.80 –
   Validation-2 mirTarbase (validated) 0.61 – 0.61 –
   Validation-3 mirTarbase (predicted) 0.62 – 0.62 –
   [99]Open in a new tab

   Our model performs decently on the majority of data sets where the
   interaction discovery was sequencing-based, with the exception of
   mirTarBase. The sensitivity for mirTarBase data is low, 61.0% for the
   validated interactions and 62.0% for the predicted ones from other
   tools, respectively. The discrepancy can be due to two possible
   explanations. First, CLASH and CLIP interactions were detected by
   genome-wise sequencing where each miRNA-mRNA pair often involves many
   different interactions at different sites while the mirTarBase only
   reports a single binding site on a gene target. Second, predicted
   targets in mirTarBase were less reliable compared to the experimentally
   validated ones.

Selected features

   A subset of the initial 2059 features (Methods) that optimized the
   performance of the model was selected using 5-fold cross validation. A
   minimal set of discriminative features were kept that optimized the
   model by eliminating the noisy features (Fig. [100]1B). Feature
   selection was based on minimum information loss which represents the
   loss of the predictive power. We observed a slight increase in the
   overall accuracy on mirTarBase data from 61 to 62% to 64% accuracy
   after the least important features were eliminated recursively (with
   377 remaining features).

   During feature elimination, we observed that sensitivity and
   specificity were complementary while the accuracy and MCC were
   relatively consistent. The model containing 377 features resulted in
   the highest accuracy and MCC (see Additional file [101]1). We used
   these features in our final model. An additional table lists the top 30
   selected features, ranked by AUC value that reflects the discerning
   power (see Additional file [102]2). For example, the alignment score on
   the binding site, the open degree of the mRNA sequence, and ‘T’ counts
   in both miRNA and mRNA sequences are highly distinguishing for
   miRNA-mRNA interaction (Fig. [103]1B).

Final classification model

   The final model was trained using the selected 377 features and the
   optimal parameters. The performance of our method was evaluated in
   terms of classification performance and clustering coherence as
   follows.

Classification performance

   Table [104]4 summarizes the prediction performance on training,
   testing, and three independent test datasets. Using the root level
   clusters, the prediction results in high accuracies for both the
   training set and CLEAR-CLIP test set. The prediction performance for
   mirTarBase data set was not as desirable. One explanation for this
   discrepancy may be the small size of miRTarBase compared to CLEAR-CLIP
   which is about two times bigger. Our resulting clustering tree had a
   depth of 5. As clustering depth increases, more positive clusters than
   negative are obtained implying specialization of clusters to miRNA or
   mRNA types. The proportion of positive clusters within a layer
   increases as the tree grows in depth. The results of the Leaf layer
   (Table [105]4) shows improvement in classification performance. Highest
   improvement was obtained on CLEAR-CLIP data set from 87% accuracy to
   92%. Both sensitivities and specificities were improved as well as
   accuracy. Both the differentiation of positive and negative clusters
   and separation of positive interactions into different clusters
   according to the sequences involved is more prominent in the deeper
   levels. Accordingly, leaf level clusters were selected to be used for
   prediction.

Table 4.

   Performance based on the 1st and leaf layer clusters
   Dataset Sensitivity Specificity Accuracy MCC
   1st Leaf 1st Leaf 1st Leaf 1st Leaf
   Training
   75% of CLASH, iPAR, negative 0.93 0.94 0.96 0.97 0.95 0.96 0.89 0.91
   Testing
   25% of CLASH, iPAR, negative 0.93 0.94 0.96 0.97 0.94 0.96 0.89 0.91
   Validation-1
   CLEAR-CLIP 0.87 0.92 – – 0.87 0.92 – –
   Validation-2 mirTarbase (val.) 0.63 0.60 – – 0.63 0.60 – –
   Validation-3 mirTarbase (pre.) 0.60 0.58 – – 0.60 0.58 – –
   [106]Open in a new tab

MiRNA/mRNA specific clusters

   At the leaf level, the final model had 281 clusters, 244 of which were
   positive. We observed that 136 (56%) of the positive leaf clusters
   consist of only a single type of miRNA. When the homogeneity level
   required for each cluster (percentage of interactions associated with
   the same miRNA) drops, the proportion of homogeneous clusters
   increases, e.g., up to 73% (178 clusters) at homogeneity threshold at
   80% and up to 167 (68%) clusters at homogeneity level 90%. We show an
   example hierarchy of clusters in Fig. [107]3A. At the root layer, the
   cluster has multiple miRNA types; miR-30c, miR-15b, miR-26a, and
   miR-421 constitute a high proportion of this cluster. In the subsequent
   layer, miR-15b and mir-30c are separated into their own clusters, as
   well as miR-30b and miR-30d. In the deeper layers, both miR-26a and
   miR-421 form their own clusters. In addition, we observed that the
   percentages of the dominant miRNA (miRNA with highest presence) in
   clusters are high (Fig. [108]3B).

Fig. 3.

   [109]Fig. 3
   [110]Open in a new tab

   Clusters resulting from cascade DPGMM model. (a) Clustering tree shown
   at bottom left. A section of the tree is enlarged at the top right.
   Ellipses are homogeneous clusters. Numbers in parenthesis represent
   number of examples in the cluster. Striped boxes are mixed clusters,
   width of each stripe corresponds to the percentage of a miRNA in the
   mixture. (b) (top) the distribution of number of examples in clusters
   over all clusters; (bottom) distribution of percentages of dominant
   miRNAs in each cluster. A dominant miRNA has highest presence in a
   cluster

Validation study

   Our method was compared with other state-of-the-art miRNA target
   prediction tools including TargetScan [[111]26], miRDB [[112]27],
   microT [[113]46], and microT-CDS [[114]47], based on the predicted
   positive interactions (Additional file [115]2). While our method only
   predicted 22,215 interactions among 550 miRNAs and 7529 mRNAs, others
   show significantly higher numbers that are beyond the census
   estimation. For example, TargetScan, microT and microT-CDS all
   predicted over millions of interactions among these miRNAs and mRNAs,
   which implies higher levels of false positives. When comparing the
   average targets per miRNA and average miRNA regulators per mRNA, our
   results are closer to reality. On the other hand, all four exisitng
   tools reported greater number of targets for one miRNA. We used the
   37,539 positive examples in our test data to compare the sensitivity of
   our method with these tools. Our method achieves considerably higher
   sensitivity (see Additional file [116]2).

Transcriptome screening

   Clusters in our model specializes to miRNAs as cluster level becomes
   closer to the leaves. We leveraged this feature to not only predict if
   a candidate pair forms a genuine interaction but also to assess if the
   miRNA involved in this interaction is similar to the dominant miRNA, in
   terms of count, in the cluster the interaction is assigned to. The
   flexibility offered by our model makes screening the transcriptome at
   different levels possible. For example, at the lowest confidence level,
   where we only consider interactions with non-zero IC’s, our model
   predicts at least one miRNA regulator for each gene, and on average 145
   miRNA regulators per gene. Figure [117]4 shows the distribution of
   number target of mRNAs per miRNA and number of miRNA regulator per
   gene. Considering the functional study of miRNA is largely dependent on
   accurate identification of its gene target, our prediction method can
   be highly useful as a reliable resource to facilitate downstream
   studies on miRNA regulation, which is also demonstrated in the next
   case study. The list of all predicted binding sites is available at
   [118]http://sbbi-panda.unl.edu/miRCript/ (Additional online files
   unnecessary for review).

Fig. 4.

   [119]Fig. 4
   [120]Open in a new tab

   Transcriptome prediction statistics. (a) Distribution of number of mRNA
   transcript targets per miRNA. (b) Distribution of number of miRNA
   regulators per gene. y-axis shows the percentage in whole transcriptome

A case study of conditional miRNA regulation in Cancer

   Based on the aforementioned predicted miRNA target information, we have
   applied a regression model [[121]45] with the DPGMM-derived assignment
   score representing the miRNA-mRNA regulatory likelihood. In this case,
   the DPGMM approach allows us to assess the binding potential among
   different biding sites between a gene and different miRNAs, which is
   important to study the competing binding among miRNAs and mRNAs.
   Meanwhile, it also provides a unique feature compared to other binary
   classification strategies. With integration of cancer associated
   genomic data, we were able to examine the conditional miRNA-gene
   regulation that are associated with tumor progression. For example,
   Fig. [122]5 illustrates the dynamic miRNA regulation pattern on gene
   EGFR (epidermal growth factor receptor), an important tyrosine kinase
   involving in cell growth and cancer development. Based on our
   prediction, 44 miRNAs can potentially bind to EGFR but only a subset of
   miRNAs play the roles under a specific condition. In the figure, we
   observed that different sets of miRNAs interact with the target gene
   across different cancer stages; some interactions are active all along
   during the progression (those in red) while others are stage specific
   (those in blue) depending on the availability of the miRNAs under
   different conditions. These findings are in agreement with our
   understanding of the dynamic regulation process.

Fig. 5.

   [123]Fig. 5
   [124]Open in a new tab

   Network topologies of predicted regulators of EGFR gene in Stage I, II,
   and III breast cancer

Discussion

   Emerging sequencing technologies has changed the landscape of research
   on miRNA target identification. Availability of data pertaining to
   genome-scale miRNA interactome facilitated bioinformatics research
   immensely. Particularly, use of transcript-level interaction data
   combined with mRNA specific features in this study allows confidence in
   our model for transcript-level target prediction. To the best of our
   knowledge, splicing transcript specific miRNA binding site prediction
   is a novel feature that is lacking in many existing tools.

   Moreover, a considerable amount of miRNA-mRNA interactions via
   complementary sequences have been discovered in gapped regions
   [[125]11]. These findings suggest that prediction solely depending on
   sequence and/or contextual features such as binding energy, seed match,
   and conservation is not sufficient. As such the static target
   prediction tools that utilize only that information have received
   critical skepticism. We addressed this challenge in our model by
   incorporating data about various types of molecular features that
   differentiate distinct interaction patterns. Meanwhile, high false
   prediction miRNA target prediction rates are still a big concern.

   There are several technical challenges that we encountered during the
   course of this work. We have used publicly available sequencing data.
   In compiling the data derived from different sequencing technologies,
   the raw sequence data needed to be reprocessed and a consolidated
   annotation had to be produced. Another challenge we faced is a common
   one in miRNA target prediction research, i.e., the negative set is huge
   compared to the positive set 10 times more in our case. To make our
   negative set more representative, we devised a new method and
   maintained a comparable size with the positive set.

   The ambiguity around cooperative and competitive binding mechanisms
   adds to the complexity and semi-stochasticity associated with
   miRNA-mediated gene regulation. Our method can be used to infer
   competitive binding, since it assigns likelihoods to the numerous
   potential binding sites of a gene to the same miRNA which can be used
   to evaluate the binding potential. Our recent study shows that several
   miRNAs can affect a given pathway by regulating the same or different
   genes involved in the pathway [[126]45]. For example, miR-18a-3p,
   −320a, −193b-3p, and -92b-3p co-regulate the glycolysis/gluconeogenesis
   and focal adhesion in cancers of kidney, liver, lung, and uterus.
   Similar applications shed light on miRNA regulatory mechanisms and
   novel roles and meanwhile, the functional studies all highlight the
   importance and challenges of reliable miRNA-mRNA interaction
   prediction.

Conclusions

   In this study we developed a new method for predicting human miRNA-mRNA
   interactions reliably. This statistical approach has improved
   prediction performance compared to similar existing tools and includes
   several unique features. Importantly, this tool can address practical
   questions such as common binding properties across miRNAs. Also, the
   interactions are predicted at transcript level which gives a more
   detailed view of interaction than the existing tools that predict
   gene-level binding sites. In our future work, we plan to identify miRNA
   co-binding module by use of conditional mRNA and miRNA genomic data,
   which will take the stochastic nature of miRNA-mRNA interaction into
   consideration. As such, we believe this study lay out the groundwork
   for future research on cooperative miRNA module and dynamic gene
   regulation.

Additional files

   [127]Additional file 1:^ (82.7KB, xlsx)

   (Sheet 1) Whole set of features; (Sheet 2) Remaining features after
   elimination. (XLSX 82 kb)
   [128]Additional file 2:^ (45KB, xls)

   (Sheet 1) Top 30 selected features; (Sheet 2) Performance comparison;
   (Sheet 3) Sensitivity Comparison. (XLS 45 kb)

Acknowledgements